An Accelerated Lookup Strategy for Extremely Large Databases


#1

An Accelerated Lookup Strategy for Extremely Large Databases

Summary
The conventional lookup( function in Panorama X has every record in the primary database conduct a sequential search of every record in the target database until it either finds a match or exhausts the possibilities. A sequential search is the least efficient search method devised and, hence, the lookup( function is limited to the comparison of modestly-sized databases. If there are very few successful lookups, the total number of lookup comparisons required approaches the product, m.n, where m is the number of records in the primary database and n is the number in the target database; even if every record has a match the number of comparisons will be of the order of m.n/2. So, if lookup( is used to compare two databases of about 100,000 records each, the process will bog down with serious memory overload long before the search is completed.
The superlookup( function overcomes this limitation to some extent but, for each record where no match exists, it will still search the entire target database and the problem of memory overload still occurs, particularly if there are very few matches between the two databases.
This document describes a strategy whereby databases of virtually unlimited size can be compared with no danger of memory overload and with up to 97% fewer comparisons.

The Strategy
The process is similar to the way you would use the thumb index of a large dictionary. It loads the keyfield and the datafield from the target database into an array and rearranges this array into several hundred sub-arrays, each containing keyfields which begin with a common pair of alpha-numeric characters. In the primary database, all keydata items beginning with a given character set then seek matches only in the sub-array whose elements share those leading characters. Furthermore, large primary database data sets are partitioned into packets of no more than 1,000 records and large target database sub-arrays are partitioned into packets of no more than 10,000 elements. When very large databases are being compared, these techniques save an enormous amount of time and totally avoid any likelihood of memory overload. In theory, there is no limit to the sizes of the databases which can be compared.
If the primary database keydata and the target database combination of keyfield and datafield are compared as text arrays, the arraylookup( function is the most efficient way to go. Unfortunately, that function recognises the * and ? characters as wildcard characters which can lead to erroneous matches. The solution to this is to convert the primary database keydata into binary data, to convert the target database combination of keyfield and datafield into a data dictionary and then to compare them using the getdictionaryvalue( function which outputs a text value to the results field. Despite the overhead of the binary conversions, this technique is actually faster than the conventional text comparison.
Extensive testing has shown that, where the two databases have a high degree of data matching, the actual number of comparisons made is about 7% of the number that would be required in a conventional lookup. If there are very few matches, that figure drops to about 3%. These are remarkable improvements in efficiency.

For the largest test conducted to date, I took the contents of the seven Harry Potter novels and changed every punctuation mark to a carriage return so as to maximise the number of records. I then did the same with the complete works of Mark Twain which I saw as a dataset bound to minimise the number of matches. The Potter database has 160,000 records and the Twain has 355,000 records. There are only 13,000 matching records. A conventional lookup would need to make more than 54 billion comparisons, whereas the accelerated method required less than two billion and took a touch over four hours, with the Activity Monitor showing minimal memory pressure throughout.


#2

FYI, that is going to potentially take up a large amount of memory right there.

On the subject of memory pressure, I made some changes a few weeks ago that should have helped with memory pressure when a procedure does multiple massive operations in a row. I don’t recall hearing anything back from you as to whether you could tell if that made any difference.


#3

Maybe so but it works - that’s the important thing. According to the Activity Monitor, it uses no more than 3.4 GB of RAM.

The last I heard from you on that was that your initial attempt ran into problems. I’ll try a conventional lookup from a 160,000-record database on a 355,000-record database and see how it runs tomorrow - right now, I’m off to bed.


#4

Actually, it doesn’t because I close the target database as soon as the array is created. The array takes about 25MB - not a lot.

On memory pressure, this was your last comment:

But, I had earlier experienced the problem of large successive, discrete computations building up memory pressure unreasonably and, recently, that doesn’t happen. So maybe something’s working.


#5

Yes, my comment was that I had to reverse some of the changes, not all of them.

Good, that is what I hoped for.


#6

A couple of interesting follow-ups to this post.

  1. A conventional lookup from a 5,400-record database on a 5,800-record database with only 105 matches took an average of 108 seconds over several tests. The emulator took an average of 37 seconds. So, although the emulator has lots of overheads designed to cope with very large databases, it’s still faster than the standard lookup of small databases by a factor of three.

  2. I tried an even larger test, a 355,000-record database looking up a modified version of itself with no matches. It reduced the number of comparisons needed from about 126 billion to 3.8 billion and chugged away for close to 16 hours with no memory pressure problems. So it certainly seems that there is no upper limit to the size of the databases that can be compared.