A Simple Algorithm to Discover Patterns in Symbolic Streams
[draft, comments welcome]
by Sergio Navega
snavega@attglobal.net
August 1998
This paper is more an annotation of some ideas that developed during experimentation and should not be considered a finished or error free work.
The idea of discovering regularities in data seems to be the very essence of understanding the world around us. We receive plenty of such regularities through our senses. It is reasonable to think of our brain as a mechanism that, among other tasks, spend some time trying to perceive and categorize these regularities in a meaningful manner.
This paper will present a simple symbolic algorithm which can be used to find such patterns in elementary sequences of characters. My goal is to use this algorithm as a platform to raise important suggestions about the way our brain explores these regularities to construct our knowledge of the world.
The method is not claimed to be unique, nor even to be optimal. In fact, there are several algorithms much more efficient (in terms of memory and processing speed) than the one I will present. What is important in this case is the philosophy of its operation, something that will be clearer with our final comments. I start by proposing the discovery of regularities in a sequence such as this:
21211212122121121212
This (admittedly contrived) sequence will be analyzed looking for similarities and differences, which is obviously related to the perception of regularities. The algorithm starts with the leftmost element. I will not care for a while looking for occurrences of individual elements, so my pointer will go to the next element:
21211212122121121212
^
I will take the element pointed together with its immediate predecessor. This group of elements (21 in this case) will be put in a table. We have now:
21211212122121121212
^
21
Advance the pointer one more and find another group (pointed character plus predecessor, in the order of occurrence, giving 12):
21211212122121121212
^^
I search for this group in my table and as it is new, I will add the group to it. However, instead of advancing the pointer, I form another group with the predecessor of the predecessor. This group (212) is also new, so I put it in the table too:
21211212122121121212
^^^
21
12
212 <--
Advance pointer and find another time group (21). As this group already exists in the table, I merely increment a count number in front of its entry in the table:
21211212122121121212
^
21 (2)
12
212
Again, I will take the predecessor of the predecessor and, following that will add another predecessor, limiting groups to a maximum of 4 characters in this example. By applying this method to the remainder of the pattern, we end up with this table:
21211212122121121212
21 (8)
12 (8)
212 (6)
121 (6)
2121 (4)
11 (2)
211 (2)
1211 (2)
112 (2)
2112 (2)
1121 (2)
1212 (4)
22 (1)
122 (1)
2122 (1)
221 (1)
1221 (1)
2212 (1)
Now comes the first important point. Each group found will be given a "name", that is, it will be transformed into what I will call an "internal concept". Instead of a name, I will use here one letter to simplify the things. Here is the table with the concept names in front of each entry:
21211212122121121212
A 21 (8)
B 12 (8)
C 212 (6)
D 121 (6)
E 2121 (4)
F 11 (2)
G 211 (2)
H 1211 (2)
I 112 (2)
J 2112 (2)
K 1121 (2)
L 1212 (4)
M 22 (1)
N 122 (1)
O 2122 (1)
P 221 (1)
Q 1221 (1)
R 2212 (1)
Now, another important transformation: I will "rebuild" the original pattern using the internal concept name whenever possible. I will do this for each concept discovered. The table, then becomes this (rebuilt patterns are in front of the respective entry in the concept table):
21211212122121121212
A 21 (8)
AA1AA2AA1AA2
B 12 (8) 2B1BBB2B1BBB
C 212 (6) C11C12C11C12
D 121 (6) 2DD2122DD212
E 2121 (4) E1E2E1E2
F 11 (2) 212F21212212F21212
G 211 (2) 21G2121221G21212
H 1211 (2) 2H212122H21212
I 112 (2) 212I1212212I1212
J 2112 (2) 21J121221J1212
K 1121 (2) .... and so on...
L 1212 (4) .................
M 22 (1)
N 122 (1)
O 2122 (1)
P 221 (1)
Q 1221 (1)
R 2212 (1)
One more important transformation: take each one of the new (rebuilt) patterns and apply them recursively to the same process. This will discover other regularities on the patterns discovered. Doing this for the concept A, we find:
AA1AA2AA1AA2
AA
A1
AA1
1A
A1A
AA1A
1AA
A1AA
A2
AA2
1AA2
2A
A2A
AA2A
... and so on ...
Again, I will name the "internal concepts" found, this time with a lower case letter:
AA1AA2AA1AA2
a AA
b A1
c AA1
d 1A
e A1A
f AA1A
g 1AA
h A1AA
i A2
j AA2
k 1AA2
l 2A
m A2A
n AA2A
... and so on ...
And once more, rebuild the original pattern using the discovered concepts:
AA1AA2AA1AA2
a AA a1a2a1a2
b A1 AbAA2AbAA2
c AA1 ... and so on ...
d 1A
e A1A
f AA1A
g 1AA
h A1AA
i A2
j AA2
k 1AA2
l 2A
m A2A
n AA2A
... and so on ...
Let's look closely to concept 'a': a1a2a1a2. If we get this concept again through the same process, we get:
a1a2a1a2
a1
1a
a1a
a2
1a2
a1a2 <--
...
It is obvious now that our method will discover one new "concept" (lets call it 'Y') that will signify the pattern 'a1a2'. This means that:
'a1a2a1a2' is equivalent to YY
which can be further reduced to a single concept Z. The point here is that we have been able to reduce the initial pattern '21211212122121121212' all the way down to a single "compressed concept" Z. It is this reduction that seems interesting and it should be noted that this reduction captures one of the "essences" of the original pattern:
Z
(YY)
((a1a2) (a1a2))
((AA 1 AA 2) (AA 1 AA 2))
((2121) 1 (2121) 2) ((2121) 1 (2121) 2)
It is easy to understand that another of the initial concepts would also give other interesting arrangements, such as this:
(212) (112) (1212) (212) (112) (1212)
There are dozens of "good" arrangements possible (among hundreds of others not interesting) and all of those good arrangements seem to have in common that aspect of reduction, greatest compression, that is typical of redundant signals. Besides discovering paths that lead to maximum compression, this method also produces a lot of internal concepts that carry an interesting property of meaningfulness, respective to the original signal. Keeping track of the number of times one internal concept is used and annotating the situations in which each internal concept appears seems to be relevant to the understanding of what this signal means.
The combinatorial explosion that this method provokes can be alleviated by the progressive dismissal of the concepts that didn't appear to be significant (had few occurrences). Another method to reduce this complexity is the grouping of similar concepts in templates and the formation of rules, a subject that is beyond the scope of this paper.
Conclusions
This simple (although computationally intensive) method seeks for recurrent patterns within an input flux. As I said, it was not meant to be efficient. It serves, however, to demonstrate several important issues:
a) An internal concept is an invented "symbol", meaningful only to the system itself and that codifies the input signals in a different manner, usually with great economization of information. An internal concept may be seen as an attempt of the system in understanding the way in which the input signal repeats itself.
b) The reproduction of the original signal using the internal concepts as a codification method also show patterns that can give rise to other concepts, recursively. If the signal is very redundant, the compression obtained will be maximum, which means there is a lot to learn. If the input signal is not redundant, then the method will dispose most of the internal concepts, in trying to conserve memory.
c) We are looking for the greatest compactation of the input signal. This compression is related to the MDL (Minimum Description Length) method and also shows similarity with Occam's Razor, both important concepts in the discipline of Machine Learning.
d) The method should be applied not only to some sub-patterns as we have done in our example, but to all of them. The process seems to stop when no more compression can be achieved.
e) One point not yet explored in this method is the *learning* of the paths that conducts to good results (the sequence of concept formation and substitution). I suggest that this path is similar to the plastic growth of synaptic connections in our brain, in particular the growth that happens during initial childhood. The bad paths, similarly to the bad synaptic connections, will be atrophied after some time. The good ones, those that receive strong reinforcements, will be used more often and will be prioritarily investigated during parsings of future signals. This seems to ease the process of parsing of complex sequences.
f) The learning of the best paths must start with a purely randomic search (in this case, evaluating all possibilities of combination and recursive substitutions). As the system becomes more knowledgeable about which paths give the best results, the random exploration of some possibilities may be deferred or even dismissed in favor of those that have better perspective of giving good results. This process is one of the basis of the Copycat mechanism (see Hofstadter and Mitchell). This means that the performance of the agent will be better as it learns more, exactly the opposite of conventional computational implementations and similarly to the way our brain appear to work.
g) The creation of concepts in this example is specific to the input signal presented. However, nothing prevents the use of concepts defined in previous runs of the program over other sets of input signals. This will allow the system to use previous knowledge to help solve the current problem, a point that seems vital in intelligent agents. Thus, it is reasonable to suppose that if the agent receives input signals representing information from the *same* world, it will be able to successively "refine" this internal view of the world in terms of those concepts. I am tempted to see this as the process that happens within our brain in relation to the world we perceive. Our vision of the world is in constant refinement, as the result of our use of previous knowledge together with new sensory information.
Future Work
This method is just a toy algorithm whose main purpose is exploring with the fundamental ideas behind learning from raw sensory signals. It is the very beginning of a long road. Among the very next steps we can highlight:
a) Exploration of the role of similarity between concepts. Concept 'A1AA' is similar to 'A2AA'. How can we exploit this? Perhaps by devising one template 'AnAA'. This template process is the first step in the creation of *rules* grounded in sensory explorations. Rules seem the definitive way of economizing on the representation of features.
b) Memorization of the path from the input signal to a successful transformation. This path can be reinforced if it is traversed again in another processing.
c) Processing of one signal using concepts developed during previous runs.
d) Development of a method to group similar concepts, a process known as categorization. This process should be implemented not only with respect to the "shape" of the symbol but also with respect to discovered roles of the symbol (relation with predecessors, successors, single or multiple occurrence in an individual stream, structural similarity, etc.).
e) Exploration of analogies, use of previously learned path shapes to help propose new and interesting ways to solve the current problem.
References
Hofstadter, Douglas (1985) Metamagical Themas: Questing for the Essence of Mind and Pattern. Basic Books.
Hofstadter, Douglas (1995) Fluid Concepts and Creative Analogies. HarperCollins Publishers 1995.
Mitchell, Melanie (1993) Analogy-Making as Perception. Bradford Book, MIT Press.
Mitchell, Tom M., (1997) Machine Learning. McGraw-Hill Comp. Inc.
Churchland, Patricia S., Sejnowski, Terrence J., (1992) The Computational Brain. Bradford Book, MIT Press
Wolff, J. Gerard (1993) Computing, Cognition and Information Compression.
Appendix A - Source code for the main function
Below we present the source code of the main function (AssembleTable) which builds the table of concepts. The function is called by ProcessPattern which should be called from the main program, in charge of handling all necessary I/O.
//
// Simple Pattern Finder
//
//--------- Specific Libs ----------
#include <stdlib.h>
#include <vcl/dstring.h>
#include "cforscn.h"
#include "str.h"
//--------- Configuration -----------//
#define sizetext
10000
// max size of input text
#define sizetable 5000
// size of table
#define maxconceptsize 4
//
maximum size of concepts
#define sizeline 5000
// size line to show
//------------ Globals --------------//
String
tableA[sizetable]; //
first table
String
rebuildA[sizetable]; // rebuilt
elements of first table
long
numconceptsA = 0;
// number discovered concepts
static char
*baseconcept =
"A"; // base for concepts
char
row[sizeline];
// line to show
boolean AssembleTable(String &txt, String tbl[],
String rebld[], long &nconcepts)
//----------------------------------------------------------------------//
// Assembles the Table with all concepts
from input text //
//----------------------------------------------------------------------//
{
long cnt, ii, jj, kk;
String concept;
boolean found;
String rebuild;
size_t posi;
nconcepts = 0;
for (cnt = 1; cnt < txt.Length(); cnt++) // starts 1, case 0 is not
concept
{
for (ii = 1; ii < maxconceptsize; ii++) //
concepts of 1 byte not used
{
concept = "";
// clear previous concept
for (jj = 0; jj <=
ii; jj++)
concept = concept + txt[cnt + jj - ii];
// check to see if concept already exists
found = false;
for (kk = 0; kk <
nconcepts; kk++)
if (concept == tbl[kk])
{
found = true;
break;
}
if (not found)
// add this concept to
table
{
tbl[nconcepts] = concept;
// Rebuild original pattern with learned concept
rebuild = txt; // original pattern
while (rebuild.Find(concept, posi))
{
rebuild.Delete(posi, concept.Length());
rebuild.Insert(posi, (char) ((int)
baseconcept[0] + nconcepts));
}
rebld[nconcepts] = rebuild;
nconcepts++;
if (nconcepts == sizetable)
return false;
}
// stop condition:
we've got to the beginning of the pattern
if ( (cnt - ii) <=
0)
break;
}
}
return true;
}
int ProcessPattern(AnsiString &txt, char bcpt)
//----------------------------------------------------------------------//
// Process text pattern //
//----------------------------------------------------------------------//
{
numconceptsA = 0;
baseconcept[0] = bcpt;
if (AssembleTable(txt.c_str(), tableA, rebuildA, numconceptsA))
return numconceptsA;
else
return -1;
}