Take a look at Apache PDFBox. It can extract the text strings from a PDF, which 
may be enough to "mine" the document, provided that what you are looking for is 
present in the text content alone.

Experience with PDFBox suggests that it assembles lines of text coherently. 
Within a PDF, the characters that appear together visually on a line are often 
stored quite jumbled in the actual file data. PDFBox appears to do the reasoning 
needed to return a coherent line of text for what a human reader sees as a 
single line.
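
To make the reordering idea concrete, here is a minimal sketch in plain Java. It is not PDFBox's actual algorithm, only a simplified illustration of the kind of reasoning involved. The names LineAssembler, Fragment, assembleLines and the yTolerance parameter are all hypothetical, and it assumes each fragment carries the x/y position of its left edge with y growing down the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not part of PDFBox.
final class LineAssembler {

    /** A piece of text plus the page position of its left edge. */
    public static final class Fragment {
        public final String text;
        public final double x;
        public final double y;

        public Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments whose y values lie within yTolerance of the first
     * fragment of the current line, then orders each line left to right.
     */
    public static List<String> assembleLines(List<Fragment> fragments, double yTolerance) {
        // Copy so the caller's list is untouched, then order top to bottom.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<String> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        for (Fragment f : sorted) {
            // A jump in y larger than the tolerance starts a new visual line.
            if (!current.isEmpty() && Math.abs(f.y - current.get(0).y) > yTolerance) {
                lines.add(joinByX(current));
                current = new ArrayList<>();
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(joinByX(current));
        }
        return lines;
    }

    /** Concatenates the fragments of one line in left-to-right order. */
    private static String joinByX(List<Fragment> line) {
        line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : line) {
            sb.append(f.text);
        }
        return sb.toString();
    }
}
```

Real PDFs need considerably more than this (font metrics, rotation, columns, inferred word spacing), which is why a library that already does this work is attractive.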




________________________________
From: Costello, Roger L. <[email protected]>
Sent: Thursday, August 22, 2019 8:09 AM
To: [email protected] <[email protected]>
Subject: Anyone created a DFDL schema for PDF documents?

Hello DFDL community,

I may soon be involved in a project that needs to parse and mine PDF documents.

Has anyone created a DFDL schema for PDF?

/Roger
