Extending Wrap speed

JohnB · May 20, 2020, 2:34am

Large database FormulaFill SuperLookup statements can be time consuming. One speedup option PanX offers is the “Wrap” option. I’m trying to leverage it into more circumstances.

Suppose database dbTo (with t records and including fields keyTo and valueTo) is doing a FormulaFill SuperLookup from database dbFrom (with f records including fields keyFrom and valueFrom.) For simplicity assume all records are selected in both databases.

Compare ‘unWrapped’ version:

Window dbTo
Field keyTo
Formulafill superlookup("dbFrom", {keyFrom=««keyTo»»}, {valueFrom},"Wrap",false())

with ‘Wrapped’ version:

Window dbTo
Field keyTo
Formulafill superlookup("dbFrom", {keyFrom=««keyTo»»}, {valueFrom},"Wrap",true())

In unWrapped (leaving out the Wrap option would be the same; also the variations of Lookup would be similar speeds): as dbTo’s record pointer progresses from first to last record, dbFrom’s pointer resets to its first record after reaching either a matching record or its last record. On the average t*f/2 record pointer moves occur. Slow, at least by PanX standards.

In Wrapped, WITH optimized data configuration (no summary records, keyTo and keyFrom sorted in same direction, and a match for every keyTo value existing within keyFrom) dbFrom’s record pointer never returns to its first record. t+f record pointer moves occur. Much faster. BUT without the optimal data configuration dbFrom’s pointer resets to its first record often. Potentially more pointer moves than unWrapped, potentially much slower.

Ungrouping and properly sorting data to allow optimized Wrapped is trivial. Assuring 100% matches is not.

A few unmatched records may be filled via the "Default", true() option but more quickly slows the process. With "Default", false() option unmatches stop with an error, with that record active. Which can be helpful debugging when you’d expected no unmatched records. But if unmatched records are expected, the usual choice would be to employ the unWrapped version and accept the slow speed. However, it should be possible to proactively split the process, using Wrapped for the selection of known to match records and one Fill statement for the reverse known not to match. Each split faster than its unWrapped counterpart. At the cost of writing more complex code and the overhead involved in running it. So it may or not be worthwhile.

This can be done (assuming proper sorting, no summaries and all records selected):

Let matchedKeys=arrayboth(arraybuild(¶, dbFrom, {keyFrom}), arraybuild(¶, dbTo, {keyTo}), ¶)
Window dbTo
Field keyTo
Select arraycontains(matchedKeys, keyTo, ¶)  ;  alas, a bottleneck
Field valueTo
Formulafill superlookup("dbFrom", {keyFrom=««keyTo»»}, {valueFrom},"Wrap",true())
SelectReverse
Fill "unmatched" ; or whatever else you want as default data
SelectAll

By adding temporary placeholder fields, sorting for this then restoring a prior sort order is possible. And with more fields it could be extended to FormulaFill a partial selection of dbTo or from just a partial selection of dbFrom.

The arraybuilds are pretty quick. I believe arrayboth does its work similar to Wrap, sorting each array internally first, then just one element pointer pass for each array, as its output is sorted. It should be quick.

However, if things work as I imagine, Select arraycontains is a bottleneck. In this case matchedKeys and keyTo are sorted the same. But arraycontains doesn’t require sorted operands, can’t presume them, so its element pointer presumably resets to its first element for each new record comparison. It may take as much time as unWrapped alone. Making all this worthless.

If there is a good current PanX bypass for that bottleneck I don’t know it and would like to! If there isn’t, I request a new function, similar to arraycontains, but that doesn’t reset the element pointer for each new record comparison. Like Superlookup's Wrap option, it would require optimized data for best speed. With it, faster superlookups of partially matching records becomes practical. Not as fast as optimized Wrapped with everything matching, but still significantly faster than unWrapped.

Although I don’t know a good current PanX bypass, I do know a bad one. Append extra records to dbFrom, one for each keyTo value (placed in their keyFrom field), with your default value in their valueFrom field and with something in some other field to tag them as temporary records. Sort both databases. Then Wrapped would match all records. Then delete the temporary records. Although that kluge works you have to be desparate to employ it. My experience suggests it’s too easy to corrupt your data.

admin · May 20, 2020, 3:52am

It’s not really practical to make a modified arraycontains function along the lines you are suggesting. However, your post has given me an idea on a possible method that might enable faster lookups when working with formulafill. It’s actually building on an enhancement that was recently added to 10.2. I’ve written this up in the issue tracker so I won’t forget it.

JohnB · May 20, 2020, 10:26pm

I’d hoped it was possible, but then had also thought that if it were you’d probably already have made it. So I’m not surprised. But if my thinking out loud triggers a better idea, great. If looking up via a new kind of dictionary could be faster I’m willing to learn and try it. Until then I’m ok. Although I don’t personally need Server, View Search and some of the other new features listed as coming attractions sound useful. I hope debugging 10.2 goes efficiently.