On 09.06.25 at 01:45, Leo L te Braake wrote:
> * If somehow I get this text without the graphics in a LO Draw file,
>   will I be able to make a Writes file out of it?
> * Is there a better route between the PDF and a .csv file?

Not sure if I understood your issue completely… If you have a PDF that contains both the scanned bitmap and the plain text from OCR, you can extract the text layer with a command-line tool such as “pdftotext” (on Debian, part of the “poppler-utils” package) or similar.

If the quality of the scan is (as you mentioned) rather poor, but you have access to a higher-quality scan as a PDF, have a look at OCRmyPDF (<https://github.com/ocrmypdf/OCRmyPDF>). It runs Tesseract as the OCR engine on the input PDF and produces a combined bitmap/text PDF. It can also write the OCR text to a separate file (see the “--sidecar” option).

Once you have the plain-text output, it should be feasible to write a script (Python, Perl, whatever) to extract the relevant data as CSV.

HTH,
Albrecht.
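
As a rough illustration of the last step (plain text to CSV), here is a minimal Python sketch. It assumes that “pdftotext” is installed and that the columns in the extracted text are separated by at least two spaces, which may well not hold for your document. The function names are made up for this example, and the splitting rule will need adapting to the real data.

```python
import csv
import io
import re
import subprocess


def extract_text(pdf_path):
    """Return the text layer of a PDF via poppler's pdftotext.

    "-layout" tries to preserve the column alignment of the page;
    "-" as the output file sends the text to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def text_to_rows(text):
    """Split every non-empty line into fields.

    Assumption (hypothetical, adjust to the real document): fields are
    separated by runs of two or more whitespace characters.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialise the rows as CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

Something like `print(rows_to_csv(text_to_rows(extract_text("scan.pdf"))))` would then print the CSV, where “scan.pdf” is just a placeholder file name. If you go the OCRmyPDF route, you can skip `extract_text` and feed the contents of the “--sidecar” text file to `text_to_rows` instead.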
