Re: [libreoffice-users] Scanned and OCR's PDF to text

Albrecht Dreß Mon, 09 Jun 2025 03:03:05 -0700

On 09.06.25 01:45, Leo L te Braake wrote:
>   * If somehow I get this text without the graphics in a LO Draw file,
>     will I be able to make a Writer file out of it?
>   * Is there a better route between the PDF and a .csv file?


Not sure if I understood your issue completely…

If you have a PDF which includes both the scanned bitmap and the plain text 
from OCR, you can use a command-line tool like “pdftotext” (on Debian in the 
“poppler-utils” package) or similar to extract the latter.
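
In case it helps, here is a minimal Python sketch of that step. It assumes 
“pdftotext” is installed and on the PATH; the function names and the file 
names in the usage note are made up for illustration.

```python
import subprocess


def pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Build the argument list for poppler's pdftotext."""
    command = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical column layout,
        # which usually helps when the page contains tabular data.
        command.append("-layout")
    command += [pdf_path, txt_path]
    return command


def extract_text(pdf_path, txt_path, keep_layout=True):
    """Run pdftotext; raises CalledProcessError if the tool fails."""
    subprocess.run(pdftotext_command(pdf_path, txt_path, keep_layout),
                   check=True)
```

Calling extract_text("scan.pdf", "scan.txt") would then be equivalent to 
running “pdftotext -layout scan.pdf scan.txt” in a shell.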

If the quality of the scan is (as you mentioned) rather poor, but you have 
access to a higher-quality scan as a PDF, have a look at OCRmyPDF 
(<https://github.com/ocrmypdf/OCRmyPDF>).  It runs Tesseract as the OCR engine 
on the PDF input and produces a combined bitmap/text PDF.  It can also write 
the OCR text to a separate file (have a look at the “--sidecar” option).
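
A rough sketch of such a call from Python, assuming “ocrmypdf” is installed 
and on the PATH.  The function names and file names are hypothetical, and the 
language code is only an example; it has to match an installed Tesseract 
language pack.

```python
import subprocess


def ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language=None):
    """Build the argument list for an OCRmyPDF run with a sidecar file."""
    # --sidecar writes the recognised text to a separate plain-text file
    # in addition to the combined bitmap/text PDF.
    command = ["ocrmypdf", "--sidecar", sidecar_txt]
    if language:
        # -l selects the Tesseract language, e.g. "eng" (example only).
        command += ["-l", language]
    command += [input_pdf, output_pdf]
    return command


def run_ocr(input_pdf, output_pdf, sidecar_txt, language=None):
    """Run OCRmyPDF; raises CalledProcessError if it reports an error."""
    subprocess.run(
        ocrmypdf_command(input_pdf, output_pdf, sidecar_txt, language),
        check=True,
    )
```

For example, run_ocr("scan.pdf", "scan-ocr.pdf", "scan.txt") would leave the 
plain text in “scan.txt”.  Note that OCRmyPDF may refuse an input which 
already contains a text layer unless told how to handle it, so this sketch 
assumes a bitmap-only scan.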

Once you have the plain-text output, it should be feasible to write a script 
(Python, Perl, whatever) to extract the relevant data as CSV.
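
As I do not know what your data looks like, the following is only a sketch 
under one assumption: the columns in the text file are separated by runs of 
two or more spaces (as layout-preserving text extraction often produces).  
The function names are made up; the real parsing rule will depend on your 
document.

```python
import csv
import io
import re


def text_to_rows(text):
    """Split each non-empty line into fields at runs of 2+ whitespace."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        # Single spaces stay inside a field; wider gaps separate columns.
        rows.append(re.split(r"\s{2,}", line))
    return rows


def rows_to_csv(rows):
    """Serialise rows as CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

You would read the text file, pass its contents through text_to_rows() and 
write the result of rows_to_csv() to a “.csv” file, which Calc can then open 
directly.  OCR output is rarely that regular, so expect to add some clean-up 
rules for misaligned or misrecognised lines.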

Hth,
Albrecht.
