I want to import text from a web page into a Panorama database but there are challenges and a mystery. I would like to automate the required actions as much as possible but given the nature of the site, I may have to manually navigate to the page, copy to the clipboard, and maybe more. I’ll have to preprocess the import before it will align with its eventual database home.
The site is: https://www.cryptogram.org/members/digicon/2021/ND2021DigitalCons.txt
But you have to enter a username (ACA - fixed) and password (changes each month) twice to reach that page.
Once there, I can see the text. I’ll paste the HTML behind it (via the Firefox browser tools) below.
I have copied the text from that page to the clipboard and tried the following. On the PC: pasting into Notepad or MS Word 2010, saving as a .txt file, then opening that with Pan6. On the Mac: pasting into MS Word 2011 and BBEdit, saving as text, and opening in Pan6 and PanX. In all cases but one, the result is less than desirable for further editing: extra fields show up, in no specific order, fragmenting the content.
ONE TIME - something I did resulted in a very orderly import. There was only one occurrence of part of the content of field A migrating to field B. All the rest of the data appeared in multiple records in field A, in an ordered form that would be easy to parse into the desired structure. For example, the content of puzzle A-1. shows up in three lines (records), and A-2 begins in the next record. I could scan those lines and reconstruct their content so that all the data belonging to A-1 is in one record. The mystery is that I don’t know what I did to make that import happen, and I haven’t been able to reproduce it in Pan6 or PanX. I might have checked a box that asked if I wanted CR/LF - some option like that. Or maybe it was something I did with BBEdit before saving as text.
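To make the reconstruction step concrete, here is a minimal sketch in Python (used only for illustration; the same logic could be ported to a Panorama procedure). It assumes each puzzle starts on a line beginning with a label such as "A-1." and that every following line without a label is a continuation of that puzzle. The label pattern and the name `merge_records` are assumptions, not something taken from the actual file.

```python
import re

# Assumption: a new puzzle begins with a label like "A-1." or "E-12."
LABEL = re.compile(r"^[A-Z]{1,2}-\d+\.")

def merge_records(lines):
    """Join continuation lines onto the labelled line that precedes them."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if LABEL.match(line) or not records:
            records.append(line)        # start a new record
        else:
            records[-1] += " " + line   # continuation of the current record
    return records

sample = ["A-1. Title", "ABC DEF", "GHI", "", "A-2. Next", "XYZ"]
merged = merge_records(sample)
```

With the sample above, the three lines of A-1 end up in one record and A-2 starts the next one.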
I can see the hex codes via BBEdit, and I could start with the clipboard and go character by character, building a new structure that way. But that one import I can’t reproduce was so close that it would be easy to start from there and concatenate the multiple records into the one record where they belong.
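Rather than reading the hex dump by eye, a short script can list which non-printing characters occur in the copied text and how often. This is only a sketch in Python; the function name `control_chars` is hypothetical, and it assumes the clipboard text has already been saved to a string or file.

```python
from collections import Counter

def control_chars(text):
    """Count characters that are not ordinary printable text.

    Tabs, carriage returns, line feeds and non-breaking spaces are all
    reported, keyed by their Unicode code point.
    """
    counts = Counter(ch for ch in text if not ch.isprintable())
    return {f"U+{ord(ch):04X}": n for ch, n in counts.items()}

report = control_chars("A\tB\r\nC\tD")
```

A tab shows up as U+0009, a carriage return as U+000D and a line feed as U+000A, which should make it easier to spot whichever character is splitting the content into extra fields.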
But I’ve never worked with a direct HTML import.
When I viewed the HTML code from Firefox, I didn’t see the body content you see below (which I’ve abbreviated). I was surprised it showed up. But given that, I could navigate to the page with Firefox, have the browser tools display the HTML, copy it, and start by parsing out everything between the body tags.
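Pulling out everything between the body tags could look roughly like the sketch below, which uses Python's standard `html.parser` module. It assumes the copied HTML has an ordinary `<body>` element wrapping the text; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect the text found between <body> and </body>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.parts.append(data)

def body_text(html):
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)

page = "<html><head><title>t</title></head><body><pre>A-1. X &amp; Y\nline2</pre></body></html>"
text = body_text(page)
```

The parser drops the tags, keeps the line breaks inside the body, and converts entities such as `&amp;` back to plain characters.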
I can use BBEdit tools but I’m trying to minimize manual effort and let Panorama do most of the work.
Perhaps I just need to apply a REPLACE to the clipboard content and replace the character that is causing the content to jump to the next field with a space.
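As a sketch of that replace idea, assuming the offending character is a tab (an assumption; the hex view would confirm which character it really is), the clean-up could be as small as this Python function. The name `flatten_field_breaks` and the `offenders` parameter are hypothetical.

```python
def flatten_field_breaks(text, offenders="\t"):
    """Replace field-breaking characters with spaces.

    Line endings are normalised to "\n" first, so CR/LF and bare CR
    files behave the same way on import.
    """
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    for ch in offenders:
        text = text.replace(ch, " ")
    return text

cleaned = flatten_field_breaks("A-1.\tTitle\r\nABC\tDEF\rGHI")
```

After this pass every line should land in a single field, one record per line, which is the orderly layout described earlier.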
What is the best way to get started on this?
I removed part of the body content for, you know, brevity.