An Accelerated Lookup Strategy for Extremely Large Databases
Summary
The conventional lookup( function in Panorama X has every record in the primary database conduct a sequential search of every record in the target database until it either finds a match or exhausts the possibilities. A sequential search is the least efficient search method devised and, hence, the lookup( function is limited to the comparison of modestly-sized databases. If there are very few successful lookups, the total number of lookup comparisons required approaches the product, m.n, where m is the number of records in the primary database and n is the number in the target database; even if every record has a match the number of comparisons will be of the order of m.n/2. So, if lookup( is used to compare two databases of about 100,000 records each, the process will bog down with serious memory overload long before the search is completed.
The superlookup( function overcomes this limitation to some extent but, for each record where no match exists, it will still search the entire target database and the problem of memory overload still occurs, particularly if there are very few matches between the two databases.
This document describes a strategy whereby databases of virtually unlimited size can be compared with no danger of memory overload and with up to 97% fewer comparisons.
The Strategy
The process is similar to the way you would use the thumb index of a large dictionary. It loads the keyfield and the datafield from the target database into an array and rearranges this array into several hundred sub-arrays, each containing keyfields which begin with a common pair of alpha-numeric characters. In the primary database, all keydata items beginning with a given character set then seek matches only in the sub-array whose elements share those leading characters. Furthermore, large primary database data sets are partitioned into packets of no more than 1,000 records and large target database sub-arrays are partitioned into packets of no more than 10,000 elements. When very large databases are being compared, these techniques save an enormous amount of time and totally avoid any likelihood of memory overload. In theory, there is no limit to the sizes of the databases which can be compared.
If the primary database keydata and the target database combination of keyfield and datafield are compared as text arrays, the arraylookup( function is the most efficient way to go. Unfortunately, that function recognises the * and ? characters as wildcard characters which can lead to erroneous matches. The solution to this is to convert the primary database keydata into binary data, to convert the target database combination of keyfield and datafield into a data dictionary and then to compare them using the getdictionaryvalue( function which outputs a text value to the results field. Despite the overhead of the binary conversions, this technique is actually faster than the conventional text comparison.
Extensive testing has shown that, where the two databases have a high degree of data matching, the actual number of comparisons made is about 7% of the number that would be required in a conventional lookup. If there are very few matches, that figure drops to about 3%. These are remarkable improvements in efficiency.
For the largest test conducted to date, I took the contents of the seven Harry Potter novels and changed every punctuation mark to a carriage return so as to maximise the number of records. I then did the same with the complete works of Mark Twain which I saw as a dataset bound to minimise the number of matches. The Potter database has 160,000 records and the Twain has 355,000 records. There are only 13,000 matching records. A conventional lookup would need to make more than 54 billion comparisons, whereas the accelerated method required less than two billion and took a touch over four hours, with the Activity Monitor showing minimal memory pressure throughout.