Importing from webpage - kan't get there from here

designer · November 4, 2021, 6:23pm

I want to import text from a web page into a Panorama database but there are challenges and a mystery. I would like to automate the required actions as much as possible but given the nature of the site, I may have to manually navigate to the page, copy to the clipboard, and maybe more. I’ll have to preprocess the import before it will align with its eventual database home.

The site is: https://www.cryptogram.org/members/digicon/2021/ND2021DigitalCons.txt
But you have to enter a username (ACA - fixed) and password (changes each month) twice to reach that page.

Once there, I can see the text. I’ll paste the HTML behind it (via FireFox browser tools) below.

I have copied the text from that page to the clipboard and tried the following: On the PC, pasting into NotePad or MS Word 2010 and saving as a .txt file, then opening that with Pan6, On the Mac, pasting into MS Word 2011 and BBEdit, saving as text and opening in Pan6 and PanX. In all cases but one, The result is less than desirable for additional editing - additional fields show up, in no specific order, fragmenting the content.

ONE TIME - something I did resulted in a very orderly import. There was only one occurrence of part of the content of field A migrating to field B. All the rest of the data appeared in multiple records in field A in an ordered form that would be easy to parse into the desired structure. That means, for example, the content of puzzle A-1. shows up in three lines (records), A-2 appears in the next record. I could scan those lines and reconstruct their content so all the data that belongs to A-1 is one record. The mystery is, I don’t know what I did for that import to happen. I haven’t been able to reproduce it in Pan6 or PanX. II might have checked a box that asked if I wanted CF/LF - some option like that. Or maybe it was something I did with BBEdit before saving as text.

I can see the HEX code via BBEdit and I could start with the Clipboard and go character by character, building a new structure that way. But that one time - I can’t reproduce - was so close, it would be easy to start from there and concatenate the multiple records into the one record where they belong.

But I’ve never worked with a direct HTML import.
When I viewed the HTML code from FireFox, I didn’t see the body content you see below (which I’ve abbreviated). I was surprised it showed up. But given that, I could navigate to the page with FireFox, have the browser tools display the HTML, copy it, and start by parsing out everything between the body tags.

I can use BBEdit tools but I’m trying to minimize manual effort and let Panorama do most of the work.

Perhaps I just need to apply a REPLACE to the clipboard content and replace the character that is causing the content to jump to the next field with a space.

What is the best way to get started on this?

Thank you.

I removed part of the body content for, you know, brevity

<html><script type="text/javascript" id="__gaOptOutExtension">window["_gaUserPrefs"] = { ioo : function() { return true; } }</script><head><link rel="stylesheet" href="resource://content-accessible/plaintext.css" media=""><style>@media print {#ghostery-tracker-tally {display:none !important}}</style></head><body><pre>ND2021DigitalCons.txt

ND-1 see pdf

ND-2, 3, and 4 see pdf

A-1. Seasonal anagram. K2 [83] WORD WIZARD
"Z *ENJJX *ROJLKMEZK ZFA Z *OZHHX *FNV *XNZJ" VGFANJSQDDX
ZFZPJZEK MG "EZX EZFX Z JNA VJNZMO RZJJX OZHHLFNKK."

A-2. RE: Stricken. K2 [87] H20
EMAV M WSCR DEMOQV. UN KNFC PVDE EN AVVG EZVR NWW EZV
XCMDD. SW KNF XSHV EZVR MO SOQZ, EZVK'YY EMAV EZV IZNYV
KMCU.

X-6. Swedish. K2 Housework. [26 alph. *qwz] (verkligt) [86] OOBOO
SYK SPK ARX ZPW XPC DASSYF BM UFF HUCDIYWF QCP ELEFUS KNC
TUF WNIIUC XGEXMIIEPCQUFUF. YKWUK PH AEE WOC TUF.

C-11. Duodecimal square root. (Two words, B-0) ARIES
MA'SQ'UE gives root IAS; - MK = ASQ; - AKN = STUE; - SMRN = NEO

</pre></body></html>

designer · November 4, 2021, 6:26pm

Hum, it seems pasting in here removed the HTML part of the text copy/paste.

dave · November 4, 2021, 6:32pm

I got it back by formatting it as code.

designer · November 4, 2021, 6:37pm

Thank you Dave, there was a “body” tag at the beginning, but the idea of parsing out everything between the body tags is clear. Given the user/password requirements, I don’t know if Panorama can navigate to the text download page (even if I provide the UserName/Password) and pull out the HTML. If it could, that would be great. Then I’d just have the content between the body tags to deal with. I could find out which character is causing the extra columns to appear and do a REPLACE to squash that. Then go record by record and concatenate those that belong together.

I guess the first question would be, can Panorama, though it’s on web commands or via something embedded - like AppleScript or Python - go to a web page, provide a username/password, and trigger a log in? In this specfic case, the ACA page requires a user log in on the main page, then you navigate to another page via a members menu. The same username/password is required again, then you select a link to the puzzles (text) you want, and they are displayed on the page.

So it’s a general case of pulling data from a webpage - except there are username/password hoops to jump through. I’m not much of an HTML guy so I don’t know what is possible in this area.

JamesCook · November 4, 2021, 8:26pm

The ability to log in via Panorama depends largely on the security the site has in place. Odds are automated logins will be blocked. The cross site forgery preventions I’ve had Panorama build on sites would be enough to block it and that’s only one typical level.

Once in though, Panorama can certainly parse a page for you but your example doesn’t show enough about the page structure. Using tag data(, for instance, you can pick an HTML table to pieces, organizing rows and columns into an array to append to your database.

That’s vague, yes, but so is your description of the scenario at this point. So suffice it to say that Panorama’s ability to import any reliably consistent HTML structure into the appropriate fields and records is very good.

dave · November 4, 2021, 9:15pm

The whole web page appears to be pre-formatted text. Between the <pre> and </pre> tags, it’s simply formatted in paragraphs. These aren’t paragraph tags, they are simply empty lines between the end of one paragraph, and the beginning of the next.

designer · November 4, 2021, 11:32pm

That is correct. The backstory is, in yesteryear, there was just this publication of cipher puzzles to your mailbox every two months. Finally, kicking and screaming, they entered the computer age and created a web page that shows the publication’s puzzle contents as text that you could copy/paste and print out for pencil/paper solving or paste into any solving aid. Some members requested that the page be formatted - like XML or some such - but the old guard is not ready for that step yet. So there is just text - with line feeds - between the HTML body tags.

When I selectall/copy and paste it into something that will create a text file, then import - with the goal of creating a database of the entries (A-1, A-2, etc.) - the line feeds create separate records. If that was all, I could put them back together in appropriate records.

But some of the content creates additional fields. I can investigate what character is doing that - there is nothing visible - and replace it with something like a space while the content is in the Clipboard or maybe do the replace in the program I use if I make the text (.txt) file. The benefit of the text file is I can see what is happening as I go. I suppose I can get things to work with the .txt file and when that is finished, I just pass the clipboard to the parsing routine and change it to work with that - knowing what I have to watch for.

I’m trying to minimize the steps. It looks like that would be manually going to the web page, copying all the text (drag over and copy), then do a replace of what’s causing that field jump. That would get imported (or I continue to work with the clipboard content) and I restructure it so each puzzle entry - which now shows up as multiple lines - is one record. There are additional fields that have to be parsed out, but they all belong to the one record (per puzzle) and I’m left that aspect out for simplicity.

It looks like I need to use some hex display program that shows me the character before the content jumps to another field so I can search and replace.

I was hoping I could just (via dialog) pass Panorama the username/password for the month and turn it loose - resulting in each puzzle (A-1, A-2, …) having its own record and additional fields identifying the issue (month/year) and additional aspects (hints and such - like which language the foreign language ciphers are in) in separate fields in its record.

gary · November 5, 2021, 12:00am

I have a file in the Panorama Database Exchange called Reveal Invisible Text. You can paste your text into this file and it will highlight all the invisible characters contained therein. My quick look at your example shows two line feeds between the paragraphs as well as single line feeds within the paragraphs. I didn’t notice any other invisible characters other than spaces.

admin · November 5, 2021, 1:02am

It looks to me that this page is using standard HTTP authentication. If that’s true, you should be able to access it with the url( function, using the USER option to supply the user name and password. See User Authentication on this help page:

designer · November 5, 2021, 1:44am

Thank you Jim, I’ll look into that url function.
And Thank you Gary for the exchange file showing invisible characters.

Amazing but true, there is only a space (sp) character between one part of the text and a string that is separated out as a second field. embedded in the larger text string is a sequence of double quotes and asterisks. I’ll look into this deeply tomorrow. Something is “encouraging” PanX (and Pan6) to separate that string into a separate field.

I’ll copy just the A-! puzzle and see if, on its own, it breaks the content into additional fields instead of just making a new record on a line feed.

If I recall, Panorama determines how to parse something base on what it sees first - a Tab or a Comma. That might be the fox in the hen house. If there is no Tab, then it makes new records on linefeeds and new fields on commas - something like that.

But in this case there is only a space character - so why jump to another field.

In my earlier post, in the A-1 puzzle, the string VGFANJSQDDX is moved to a second field. Why would that happen?

The string in front of it is:
“Z *ENJJX *ROJLKMEZK ZFA Z *OZHHX *FNV *XNZJ” - the quote marks and asterisks are part of the imported text.

dave · November 5, 2021, 3:30am

My guess would be the quotes are acting as delimiters.

dave · November 5, 2021, 3:42pm

I just did some testing, and quotes do seem to cause that behavior in Panorama 6, but not in Panorama X.

designer · November 5, 2021, 11:55pm

Dave, Yes - it is behaving much better - predictable even - in PanX. All the new field breaks I’m seeing are caused by commas.

Unfortunately, this isn’t a disciplined data entry endeavor on the source text (volunteer operation). I see Tab characters have also occasionally been used in some places. That means I can’t just add a Tab character in front so the commas won’t create new fields - because those random tabs will then cause erroneous fields.

In the olden days, we use to replace commas and/or tabs with some unique character, then import, then, once everything was in its proper place field-wise, we’d replace that unique character with the original comma - that sort of action.

My guess is - and it’s a pretty good guess - that if I take care of the comma and tab issue, then eliminate the single LF’s so a double (LFLF becomes just LF) that my records will be squared away too.

Then I just have to parse out the components of the first field into its appropriate parts. That, as the saying goes, is left as an exercise for the student.

KJM · November 6, 2021, 1:23am

That’s exactly what you can do in PanoramaX with the importtext option “REARRANGE”.

designer · November 6, 2021, 3:35am

With Jim’s mention of the URL function and KJM’s mention of the REARRANGE option, this image forms in my mind of Desi Arnaz saying to Lucille Ball (think “Cuban accent”), “Lucy, you’ve got a lot of reading to do."

Time passes …

The Rearrange option might be handy in some cases, but note how well ordered the imported text is in the example; all records have all fields in the same order. That is not my case. Sometimes, in addition to the puzzle itself, there is a hint, sometimes not. Sometimes that hint comes before the cipher sometimes at the end, sometimes in square brackets, sometimes in parentheses. Sometimes I’m dealing with all numbers, sometimes text, sometimes combinations, depending upon the cipher type. Sometimes some plain text is included, as a hint, where it occurs in the cipher.

But the option “AddFields”, “No” - then accommodating the error when a tab or comma shows up - is worth playing with.

So these are all “cases”. And each cipher type (sometimes not given - part of the challenge), has its variations in what is provided and where.

With observation, I can accommodate most variations and over time, adjust for any format not originally accounted for. It’s fun Sort of like a catcher handling anything the pitcher might throw.

And I haven’t even gotten to the solving part yet (which you will thankfully be spared).