Inputting a PDF file?

My hobby site, the American Cryptogram Association, provided a mailed publication bimonthly. They also gave members access to a text file (txt) of the bimonthly ciphers. Time passes.

Now they are phasing out the paper publication and strongly encouraging users to subscribe to an emailed .pdf version. And they no longer provide the .txt file. We have to open the .pdf file and copy the text as the ciphers appear in various places of the publication.

I had a start on parsing all the ciphers, by type category, out of the previous text file. I even had plans, partially tested, to go to the txt file page (via the URL command that includes username and password), pull the text info, and file it in a Pan X database.

But now I have to start with a .pdf file.

How would you get .pdf data into PanX? Searching Help for pdf only gave two print statements. If I can open the file, I can navigate through it, sentence-by-sentence, and parse out the ciphers.

The only other way I can think of is having PanX execute something like an AppleScript that would launch a pdf reader to open the file and pass the contents to Panorama.

The ACA has just started these policies with the Jan-Feb issue. My guess is, there will be feedback from members who prefer access to the .txt file. But just in case they stick with the pdf, I’m looking for “next steps”.

You could open the pdf file in Preview and Select All and then paste it into TextEdit. This should get all the text from the pdf.

And that can be done with an AppleScript from within PanX?

I would imagine it could be done with AppleScript. I haven’t used AppleScript for some time now so I don’t have the steps necessary to code it all.

okie-doke - it does verify that I can’t go .pdf → PanX without an intermediary.

Been quite a while since I’ve played with AppleScript, but both Preview and TextEdit have decent support for it. Moreover, its Script Editor is recordable, so you may be able to get a good first approximation of the needed script by recording while you do it manually. How text is stored in pdf files can vary a lot, so extracting all of it might not be straightforward. But then those into cryptography are not easily balked. Just think of it as another layer to your cyphers. If you can figure out how to extract the text manually and get that process into AppleScript, then you should be able to port your code to run everything within PanX. PanX’s Openanything or Openwith statements could also let you open your pdf file in Preview, should you be more comfortable with more PanX and less AppleScript code. You might not even need TextEdit if you can just select and copy all the text to the clipboard with Preview.

There is a better way. Just drop the file (or select all→copy→paste) into Grok or ChatGPT or any AI chatbot and ask it to extract the data and format it as CSV (or whatever). You can then save this to a text file for importing. Some chatbots will let you drag a pdf right into their window.

For me it works perfectly every time.
Your AI might have an API that lets you automate the manual bits like loading and saving out the file. I only had to do this a few times, so I didn’t code for ultimate laziness :)

Two good options are ‘pdftotext’ (part of the ‘poppler’ package) and ‘ocrmypdf’, which are command-line utilities that can be installed using Homebrew, MacPorts, or similar installation software.

‘pdftotext’ requires a PDF with an existing text layer. PDFs are created either with or without text layers, or have a text layer added via an OCR program. If you are using a PDF without a text layer, then ‘ocrmypdf’ will provide the ability to OCR the PDF file and embed a text layer in that file.

Both programs will generate text files or copy the text to the system clipboard, but ‘pdftotext’ does a better (though not perfect) job of maintaining the formatting in the PDF.
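One rough way to tell which tool a given issue needs is to run ‘pdftotext’ first and see whether anything comes back. The sketch below is Python rather than Panorama code; the function names and the 20-character threshold are my own assumptions, and it presumes both ‘pdftotext’ and ‘ocrmypdf’ are installed and on the PATH.

```python
import subprocess


def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess that the PDF has no text layer if almost nothing was extracted.

    The min_chars threshold is an arbitrary assumption; tune it as needed.
    """
    # strip() also removes the form feeds pdftotext emits between pages.
    return len(extracted_text.strip()) < min_chars


def extract_text(pdf_path: str) -> str:
    """Run pdftotext and return the text; '-' sends the output to stdout."""
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_command(pdf_path: str, ocr_pdf_path: str) -> list:
    """Build the ocrmypdf call that writes a copy with an embedded text layer."""
    return ["ocrmypdf", pdf_path, ocr_pdf_path]


def pdf_text_with_ocr_fallback(pdf_path: str, ocr_pdf_path: str) -> str:
    """Try pdftotext first; OCR the file and retry only if it came back empty."""
    text = extract_text(pdf_path)
    if needs_ocr(text):
        subprocess.run(ocr_command(pdf_path, ocr_pdf_path), check=True)
        text = extract_text(ocr_pdf_path)
    return text
```

Calling `pdf_text_with_ocr_fallback("issue.pdf", "issue_ocr.pdf")` (both file names are placeholders) would return the text either way. If the emailed pdf already has a text layer, the OCR branch never runs.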

Panorama should be able to steer that process:

You would write a custom Panorama X procedure that uses a shell script to do the following (thank you to AI):

  1. Specify the PDF file path: Panorama X can store file paths in database fields.

  2. Execute the pdftotext command: The script would call the pdftotext executable, passing the input PDF file path and an output text file path as arguments.

  3. Import the resulting text: The Panorama X procedure can then import the content of the generated text file into a database field for further analysis or storage.
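To make the three steps concrete outside of Panorama, here is a minimal Python sketch of the same flow. It is not Panorama X code; the function names are made up for illustration, it assumes ‘pdftotext’ is on the PATH, and the Panorama procedure would need its own way of running the command and importing the file.

```python
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path):
    """Step 2: the command line, with -layout to keep the page layout."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def split_pages(text):
    """pdftotext separates pages with form feeds; drop any empty pages."""
    return [page for page in text.split("\f") if page.strip()]


def convert_pdf(pdf_path):
    """Steps 1-3: take a PDF path, run pdftotext, read the text back."""
    pdf_path = Path(pdf_path)
    # Write the text file next to the PDF, same name with a .txt extension.
    txt_path = pdf_path.with_suffix(".txt")
    subprocess.run(build_pdftotext_command(pdf_path, txt_path), check=True)
    # pdftotext writes UTF-8 by default; replace any bytes that fail to decode.
    text = txt_path.read_text(encoding="utf-8", errors="replace")
    return split_pages(text)
```

`convert_pdf` returns one string per page, which may make it easier to step through the issue and pick out the ciphers by where they appear. The .txt file left on disk is what a Panorama procedure would import in step 3.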

Thank you all for your suggestions. I hope the Admins at ACA get enough feedback from members so that they will restore the .txt file. But if not, you’ve given me an alternative path.