Accessing a PDF's contents

JamesCook · January 25, 2023, 4:11pm

By chance, has anyone done anything in Panorama to access the contents of a PDF file?

I need to scrape emailed PDF receipts, and my initial attempts within Panorama aren’t promising.

There are Python scripts to convert PDF to Text that I have yet to try, but figured asking here would be a good first step, if by chance…

KJM · January 25, 2023, 5:21pm

This depends on what kind of PDF it is: a generated PDF (that contains text) or just a scanned image.

1. Generated PDFs can be handled with an Automator action that extracts the text from the PDF.
2. Scanned image PDFs need an additional OCR step (unless your scanner application already does this for you). The recognised text is saved as a text layer in the PDF file, after which it should be easy to extract the text as described in point 1.

The new Live Text feature of macOS may also be helpful for copying text from images. This works not only in Photos.app, but in Preview.app, too.
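
For anyone going the Python route instead, the generated-versus-scanned distinction could be checked roughly like this. It is only a sketch: it assumes the third-party pypdf package is installed, and the helper names `page_texts` and `needs_ocr` and the 20-character threshold are made up for illustration.

```python
def page_texts(pdf_path):
    """Return the extracted text of each page in the PDF.

    Assumption: the third-party pypdf package is available
    (pip install pypdf). It is imported lazily so that the
    rest of this file works without it.
    """
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def needs_ocr(texts, min_chars=20):
    """Guess whether a PDF is a scanned image.

    Hypothetical heuristic: if no page yields at least
    min_chars characters of text, there is probably no text
    layer and an OCR step is needed first.
    """
    return all(len(t.strip()) < min_chars for t in texts)
```

Calling `needs_ocr(page_texts("receipt.pdf"))` on a generated PDF should normally give `False`. The threshold is a guess and may need tuning for very short receipts.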

JamesCook · January 25, 2023, 6:15pm

Excellent suggestion, thanks. I never think of Automator, but when I tried it just now it converted the generated PDF to text so fast that I didn’t think it had run.

Aside from Python, this may prove to be a solution. It depends on how I can piece together the rest of this workflow… checking email, downloading the PDFs, extracting the desired info, and saving it to a database. Should be simple enough.
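
If the Python side ends up handling the email and database ends of that workflow, a rough sketch using only the standard library might look like the following. The function names, the `receipts` table, and its two columns are assumptions for illustration. Fetching the raw message from the mail server and turning the PDF into text (Automator or otherwise) are left out.

```python
import sqlite3
from email import message_from_bytes, policy


def pdf_attachments(raw_email):
    """Yield (filename, pdf_bytes) for each PDF attached to a raw email."""
    # policy.default gives the modern EmailMessage API
    msg = message_from_bytes(raw_email, policy=policy.default)
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            # get_content() returns the decoded bytes for binary parts
            yield part.get_filename() or "unnamed.pdf", part.get_content()


def save_receipt(conn, filename, text):
    """Store the extracted text of one receipt (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS receipts (filename TEXT, body TEXT)"
    )
    conn.execute("INSERT INTO receipts VALUES (?, ?)", (filename, text))
    conn.commit()
```

Each `(filename, pdf_bytes)` pair could be written to disk for the text-extraction step, and the resulting text passed to `save_receipt` with a connection from `sqlite3.connect(...)`.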