Take a look at Apache PDFBox. It can pull the text strings out of a PDF, which may be enough to "mine" the document, provided that what you are looking for is present in the text-string content.
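Once the text is out of the PDF, the mining itself is ordinary string processing. Here is a minimal sketch of that step only, assuming the text has already been extracted into a String (for example, the value returned by PDFBox's PDFTextStripper.getText). The class name TextMiner, the sample text, and the date pattern are hypothetical, chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TextMiner {
    // Returns every substring of extractedText that matches regex,
    // in order of appearance.
    public static List<String> findMatches(String extractedText, String regex) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(extractedText);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for text extracted from a PDF.
        String extractedText =
            "Invoice 2019-0042\nDate: 2019-08-22\nDue: 2019-09-21\n";
        // Hypothetical pattern: ISO-style dates (yyyy-mm-dd).
        String datePattern = "\\b\\d{4}-\\d{2}-\\d{2}\\b";
        for (String date : findMatches(extractedText, datePattern)) {
            System.out.println(date);
        }
    }
}
```

This only works for information that survives as text. Anything that exists only in images, vector drawings, or page layout would need a different approach.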
Experience with PDFBox suggests that it assembles lines of text coherently. In a PDF, the characters that appear together visually on a line are often quite jumbled in the underlying file data. PDFBox appears to do the reasoning needed to return a coherent line of text for what a human reader would see as a single line.

________________________________
From: Costello, Roger L. <[email protected]>
Sent: Thursday, August 22, 2019 8:09 AM
To: [email protected] <[email protected]>
Subject: Anyone created a DFDL schema for PDF documents?

Hello DFDL community,

I may soon be involved in a project that needs to parse and mine PDF documents. Has anyone created a DFDL schema for PDF?

/Roger
