Running out of memory with big lookups

michael · August 22, 2017, 3:42am

I’m looking for ways to speed up very large lookups and/or to run extremely large lookups which would not normally be possible.

One of my test pairs comprises a 12,000-record database looking up another of the same size with very few matches. This means that the primary d/b records interrogate almost every target d/b record.

On an i5 iMac with 8Gb of RAM, the straightforward lookup takes just under ten minutes and the Activity Monitor shows that memory pressure is high but not into the red zone. I acknowledge that the Activity Monitor is not a perfect tool but there is no doubt that performance slows down appreciably when AM shows high memory pressure.

In an attempt to reduce the memory pressure, I tried processing the lookup in batches, using a loop to select a subset of primary records at a time. To my surprise, as each group was processed, the time increased, as did the memory pressure. In the extreme case of dividing the d/b into twelve 1,000-record blocks, the lookup took over 16 minutes and the final stages were processed at glacial pace with memory pressure constantly in the red zone.

All of that leads to one simple question: Is there any way to avoid the steady build-up of memory pressure - a memoryflush` statement would be great.

michael · August 23, 2017, 12:42am

I should elaborate on what’s happening here. A large array is created from the contents of the second database. Each element of this array contains a two-dimensional sub-array, each element of which begins with a common pair of characters.

In the primary database, a formulafill statement uses an arraylookup( function in which each keydata item seeks a match only in the sub-array whose elements begin with the same two characters as the keydata item itself. So, a keydata item beginning with “mi” will search only the sub-array whose elements begin with “mi”. The formulafill statement has this form:

execute {field «Results field» formulafill arraylookup([the relevant sub-array],} + chevronquoted([variable containing field name]) + {,Sep2,Sep3,"")}

When the primary database is divided into batches, a loop is iterated for each batch. The loop contains only the formulafill statement. Monitoring the process shows that each loop takes longer than the previous one and more memory is used. In some circumstances, the final one or two loops are never completed because of a lack of memory. MenuMeters shows that the process begins with each of the Mac’s four processors using 20% to 30% of CPU but grinds to a halt with CPU usage at zero to 2% per processor.

Is this normal behaviour?