Data scrubbing video

I frequently get data from external sources that is “dirty”. The data has been manually collected and may contain;
double spaces
leading and trailing spaces
Incorrect spelling
non-printing characters
various keys to indicate a split field.

It would be helpful to create a video or a set of “macros” that can do simple data scrubbing.

Data scrubbing is simple up to a point, beyond which it can get very difficult.

For known, unwanted character strings like double spaces and non-printing characters, it’s just a matter of replacing them with whatever you want. Typical code would be, for a field named Check:

field Check
formulafill replace(Check,"  "," ")

which replaces all occurrences of a double-space with a single space. If you are likely to have three or more contiguous spaces, you might use this:

field Check
formulafill replace(Check,"     "," ")
formulafill replace(Check,"   "," ")
formulafill replace(Check,"  "," ")

which seeks five, three and two spaces.

This statement:

formulafill strip(Check)

will strip all non-printing characters from both ends of a string. If you wanted to retain leading or trailing CRLF characters you would need a more targeted statement.

If you know what the “various keys to indicate a split field” are, you can easily replace them, using code similar to my first example.

If you want to validate every field in a database, check out the formulafillallfields statement.

Incorrect spelling is much more difficult to handle. Unless you are seeking a defined group of possible misspellings, you would need to test every word in the field against a very comprehensive dictionary which is unique to your region. For instance, assuming you are in the US, the word, behaviour would be incorrect for you whereas, for me, it’s correct. Most dictionaries offer both options and, hence, both behavior and behaviour would be acceptable, no matter where you are.

Instead of using a sequence of replace( functions as Michael suggested, you might want to look into the onespace( function, which will replace any number of spaces into a single space.

Another option is the onewhitespace( function, which also handles tabs and line breaks.

There is also a function to strip out nonprintable characters.

Here is a formula that will scrub extra multiple spaces, leading and trailing extra spaces and extra returns as well as replacing line feeds with carriage returns and the windows carriage return & line feed combinations with simple carriage returns. It strips all other non-printable characters as well.

strip(stripprintable(replacemultiple(onespace(theText),cr()+lf()+{; }+lf(),cr()+{;}+lf(),";")))

Just a note to point out that this strips carriage returns and replaces them with spaces but does not strip line feeds which could cause confusion.

That’s an excellent point, and that is definitely a bug. It was easier to fix it than to create a bug report, so I went ahead and did that. The fix will be in the next version.

In the mean time, I think your formula might be simpler using the lftocr( function. Gary, please double check me as I am dashing this off. Also, I may be a bit dizzy at this late hour but I think your formula actually converts carriage returns into linefeeds, rather than linefeeds into carriage returns as you stated. Anyway, here’s my attempt.

strip(stripprintable(onespace(lftocr(theText))))

You are correct - of course. I did not notice that the lftocr( function also replaced the carriage return/line feed Windows line breaks with a single carriage return. Maybe a new function here? textscrub(

And another note for Dan:

You might want to adopt different protocols for validating numeric fields, landline and mobile phone fields, zip code fields, and so on. Do you need help with those?

I don’t need help yet. Thank you.

That would be an interesting function. Believe it or not, I use your product as a quick, lost cost app to scrub data.