Data scrubbing video

I frequently get data from external sources that is “dirty”. The data has been manually collected and may contain;
double spaces
leading and trailing spaces
Incorrect spelling
non-printing characters
various keys to indicate a split field.

It would be helpful to create a video or a set of “macros” that can do simple data scrubbing.

Data scrubbing is simple up to a point, beyond which it can get very difficult.

For known, unwanted character strings like double spaces and non-printing characters, it’s just a matter of replacing them with whatever you want. Typical code would be, for a field named Check:

field Check
formulafill replace(Check,"  "," ")

which replaces all occurrences of a double-space with a single space. If you are likely to have three or more contiguous spaces, you might use this:

field Check
formulafill replace(Check,"     "," ")
formulafill replace(Check,"   "," ")
formulafill replace(Check,"  "," ")

which seeks five, three and two spaces.

This statement:

formulafill strip(Check)

will strip all non-printing characters from both ends of a string. If you wanted to retain leading or trailing CRLF characters you would need a more targeted statement.

If you know what the “various keys to indicate a split field” are, you can easily replace them, using code similar to my first example.

If you want to validate every field in a database, check out the formulafillallfields statement.

Incorrect spelling is much more difficult to handle. Unless you are seeking a defined group of possible misspellings, you would need to test every word in the field against a very comprehensive dictionary which is unique to your region. For instance, assuming you are in the US, the word, behaviour would be incorrect for you whereas, for me, it’s correct. Most dictionaries offer both options and, hence, both behavior and behaviour would be acceptable, no matter where you are.

Instead of using a sequence of replace( functions as Michael suggested, you might want to look into the onespace( function, which will replace any number of spaces into a single space.

Another option is the onewhitespace( function, which also handles tabs and line breaks.

There is also a function to strip out nonprintable characters.

Here is a formula that will scrub extra multiple spaces, leading and trailing extra spaces and extra returns as well as replacing line feeds with carriage returns and the windows carriage return & line feed combinations with simple carriage returns. It strips all other non-printable characters as well.

strip(stripprintable(replacemultiple(onespace(theText),cr()+lf()+{; }+lf(),cr()+{;}+lf(),“;”)))

Just a note to point out that this strips carriage returns and replaces them with spaces but does not strip line feeds which could cause confusion.

That’s an excellent point, and that is definitely a bug. It was easier to fix it than to create a bug report, so I went ahead and did that. The fix will be in the next version.

In the mean time, I think your formula might be simpler using the lftocr( function. Gary, please double check me as I am dashing this off. Also, I may be a bit dizzy at this late hour but I think your formula actually converts carriage returns into linefeeds, rather than linefeeds into carriage returns as you stated. Anyway, here’s my attempt.

strip(stripprintable(onespace(lftocr(theText))))

You are correct - of course. I did not notice that the lftocr( function also replaced the carriage return/line feed Windows line breaks with a single carriage return. Maybe a new function here? textscrub(

And another note for Dan:

You might want to adopt different protocols for validating numeric fields, landline and mobile phone fields, zip code fields, and so on. Do you need help with those?

I don’t need help yet. Thank you.

That would be an interesting function. Believe it or not, I use your product as a quick, lost cost app to scrub data.

I just tried to use onespace( and the procedure editor told me it was an “unknown function”. It looked really simple in the help file.

My guess is that you somehow misspelled it. It’s working for me.

If I put a “formulafill” in front of it, it works, but the error message said “Unknown statement: onespace(” which didn’t leave me with much hope right out of the blocks.

The question of statements vs. functions, and where they can be used, has tripped up more than one Panorama user over the years. I’ve taken another crack at explaining this in the Code page of the Panorama X documentation.

The relevant section is Statements vs. Functions, which for convenience I have included below. Please note that everything stated in this section is applicable to both Panorama 6 and Panorama X.


Statements vs. Functions

In some programming languages, all actions are functions that can return a value and be used in formulas. In Panorama, however, statements and functions are distinct. In Panorama, a statement performs an action. A function, on the other hand, returns a value. You can’t use a function where Panorama is expecting a statement, and you can’t use a statement in a formula, where Panorama needs a value.

To illustrate this, consider the message statement.

message "Warning, negative balance"

When used in a program, this statement displays an alert, like this.

But since message is a statement, not a function, it doesn’t return a value and you cannot use it in a formula. So this code will not work:

x = message "Warning, negative balance"

and neither will this:

x = message("Warning, negative balance")

On the other hand, consider the sqr( function, which calculates a square root. Since this is a function and returns a value, it’s perfectly fine to use it in a formula, for example with this assignment.

x = sqr(7)

But the sqr( function cannot be used by itself. It’s not an action, so Panorama doesn’t know what to do with this:

sqr(7)

The basic rule is, if a keyword ends with a (, it’s a function and returns a value, and can be used in formulas. If a keyword doesn’t end with a (, it’s a statement and it performs an action.

Thanks, Jim. One of my Achilles heels, I fear, of long standing.