Stripprintable( does not strip out non-printable ghost characters

George · January 22, 2022, 8:45pm

Perhaps I am misunderstanding, but the stripprintable( function does not strip unprintable ghost characters in Pan X.

This link is useful for identifying non-printable Unicode characters.

https://www.soscisurvey.de/tools/view-chars.php

As I transition to Pan X some of my procedures do not seamlessly go forward.

I made up a silly test string:

" Pan X See what’s hidden in your string… or behind Face Corn Pot—
L
l
l
boo Hiss "

In Pan 6 the hidden characters are removed. But not for me in Pan X.

There are three hidden characters that are not removed. It seems that the silly string I pasted above still contains the hidden characters.

Should stripprintable( strip out these hidden characters?

Thanks

dave · January 22, 2022, 9:30pm

The documentation says that it doesn’t strip out spaces or carriage returns. If it doesn’t strip out carriage returns, I think it makes sense that it not strip out line feeds. I do think it should be stripping out your zero width space, and zero width non breaking space. Those wouldn’t have existed in the first place in Pan 6. I don’t know what third character you are referring to.

George · January 22, 2022, 9:56pm

Thanks Dave.

I tried the same formula in Pan 6. It strips out those characters. I agree about spaces and carriage returns. I take care of the carriage returns with another statement:

«Company Name»=replace(lftocr(«Company Name»),cr(),"")

The third character is just what that link I referenced gives as a result of the string I put in my original posting.

Hopefully the image of the results came through.

George

admin · January 22, 2022, 10:29pm

The stripprintable( function strips out all “control” characters (characters below 0x20) except for carriage returns. It also strips out all characters that Apple deems “illegal”. I do not know what characters are on Apple’s “illegal” list, and I suppose it could be different in different versions of macOS. As usual Apple’s documentation is pretty sparse.

The illustration below shows that this function DOES strip out line feeds but leaves in carriage returns.

In George’s screen shot, most likely the line feeds were added as part of the copy/paste as the text was transferred to the browser.

Panorama 6 does not use Unicode encoding; it uses an encoding called MacOSRoman. So it’s not at all possible to compare the two.
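
As an illustration of the rule described in this post, here is a minimal Python sketch. It is not Panorama code and not Panorama's implementation; it models only the stated part of the behavior (characters below 0x20 are removed, carriage returns are kept). The function name `strip_control_chars` is made up for this example, and Apple's "illegal" character list is unknown, so it is not modeled at all.

```python
def strip_control_chars(text: str) -> str:
    """Drop code points below 0x20, keeping carriage return (0x0D).

    Models only the rule stated in the post. Apple's "illegal" character
    list is unknown and is deliberately not modeled here.
    """
    return "".join(ch for ch in text if ord(ch) >= 0x20 or ch == "\r")


# Tab and line feed are control characters; U+200B, U+FEFF and U+00A0 are not.
sample = "A\tB\nC\rD\u200bE\ufeffF\u00a0G"
cleaned = strip_control_chars(sample)
print(ascii(cleaned))
```

Under this model the tab and line feed disappear, while the carriage return, the two zero width characters, and the no-break space all survive, which matches the behavior reported in the thread.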

dave · January 22, 2022, 11:48pm

This is the image that I got.

I got that by opening your post for editing, and copying the text that you pasted, rather than Discourse’s rendering of that in the web page.

Somehow I missed the horizontal tab the first time I did this. That’s a control character, and stripprintable( does strip that. U+200B is the zero width space, and U+FEFF is the zero width no break space. Both of those are “legal” Unicode characters, or they wouldn’t have names.

George · January 23, 2022, 1:28am

I may not have made myself clear. In the following test string (forget Pan 6) there are 3 Unicode Ghost Characters.

U+A0
U+200B
U+FEFF

They are not visible.

Here is my new example:

" <spaces this way—(U+A0) is before the next word that is see"See" what’s hidden in your string… "Tab to the left "or behind Carraiage Return Next-CRLF is next here>>>

after 3 carriage returns. U+200B is between “b” and “h” in the word “behind” above.

U+FEFF is right after “behind”. Three spaces here "

This is the display of ghost characters from the URL (View non-printable unicode characters).
The result before AND after stripprintable( contains the same 3 Ghost Characters.

Actually the carriage returns are getting stripped also. Just not the Ghost Characters.

Jim mentioned above that the carriage returns should not be stripped. They are.

My formula is:

«Result»=stripprintable(«Example»)

It is a one liner.

Sorry if I was not clear before.

See the results of the link that shows ghost characters.

George

pcnewble · January 23, 2022, 1:52am

They are all valid, useful characters. U+A0 is the non-breakable space (which is clearly visible in your example, the same width as a normal space, U+20), U+200B and U+FEFF are the zero-width breakable and non-breakable spaces which, by definition, you won’t see, but they serve a purpose.
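
To check which characters are actually in a string without relying on a web page, a short Python sketch along these lines can list them by code point, Unicode category, and name. This is a side illustration outside Panorama; the helper name `describe_invisibles` and the choice of which categories to report are assumptions made for this example.

```python
import unicodedata


def describe_invisibles(text: str) -> list[tuple[str, str, str]]:
    """List control (Cc), format (Cf) and non-ASCII space (Zs) characters.

    Returns (code point, general category, Unicode name) for each hit.
    Control characters have no Unicode name, so a placeholder is used.
    """
    rows = []
    for ch in text:
        cat = unicodedata.category(ch)
        if cat in ("Cc", "Cf") or (cat == "Zs" and ch != " "):
            name = unicodedata.name(ch, "<unnamed>")
            rows.append((f"U+{ord(ch):04X}", cat, name))
    return rows


for row in describe_invisibles("See\u00a0what is be\u200bhind\ufeff here"):
    print(row)
```

All three characters from the test string show up with proper Unicode names, which is consistent with the point that they are defined characters rather than invalid ones.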

I think Jim’s point was that stripprintable( specifically strips characters U+0 to U+1F (what we used to call control characters) and also what Apple calls ‘values in the category of Non-Characters or that have not yet been defined in version 3.2 of the Unicode standard’. I’m not sure what it means by Non-Characters, but those three are all long-standing and useful Unicode characters which have existed in all versions of Unicode and I wouldn’t expect it to strip.

If you want a function to strip a wider range of characters it wouldn’t be hard to write, although it would be slower in operation.
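
As a rough idea of what such a wider function could look like, here is a Python sketch (not Panorama code). It assumes "wider" means removing every character in the Unicode control (Cc) and format (Cf) categories except carriage return, and turning the no-break space into an ordinary space. The name `strip_invisible` and those policy choices are assumptions for illustration only.

```python
import unicodedata


def strip_invisible(text: str, keep: str = "\r") -> str:
    """Remove control (Cc) and format (Cf) characters, except those in keep.

    U+00A0 (no-break space) is visible but is converted to a plain space,
    which is an assumed policy for this sketch, not a requirement.
    """
    out = []
    for ch in text:
        if ch in keep:
            out.append(ch)          # explicitly preserved, e.g. carriage return
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue                # tab, line feed, U+200B, U+FEFF, ...
        elif ch == "\u00a0":
            out.append(" ")         # no-break space becomes a normal space
        else:
            out.append(ch)
    return "".join(out)


print(ascii(strip_invisible("be\u200bhind\ufeff\u00a0x\ty\rz")))
```

One caution with this approach: the Cf category also contains characters such as the zero width joiner (U+200D), which some scripts and emoji sequences depend on, so a blanket removal like this is only appropriate for data where those are not expected.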

admin · January 23, 2022, 2:00am

I guess neither have I. I don’t care what some random Dutch web site says, I’m going to rely on Apple for a definition of what a legal character is. I explained what stripprintable( is designed to do, and it does exactly that. The three characters you mention are perfectly valid Unicode characters, they should not be stripped and they are not stripped.

I didn’t just mention it, I included a screen shot that clearly shows that they are not. You can easily duplicate those results yourself. Also, you seem to have missed my point that the text is changing in the copy/paste process.

Except for 👻, there is no such thing as a “Unicode Ghost” character. I think you made up this term (I couldn’t find it in Google). A few Unicode characters do have zero width, but they are perfectly valid characters. If you want to explicitly remove these characters you would need to use the replace( function to do so, like this:

replace(sometext, chr(0x200B),"", chr(0xFEFF),"")

I didn’t replace the U+A0 character in this formula because that is NOT a zero width character – it looks just like a space but it does not act as a word break. But you could include that if you want.