[tesseract-ocr] Is there a way to get end-of-page (FF) encoded in PDF?

ArtmanDC Fri, 14 Jan 2022 10:21:44 -0800

In my project I am scanning images on microfilm, then using Tesseract (v. 
5.0.0) to create a PDF including the OCR'ed text layer.


The input images are typewritten text (monospaced), and I combine several
of them (typically 2-8) into a multipage TIFF.

I use the following command in Windows 10:

tesseract multipage.tif output --psm 4 pdf

This works as expected, producing a multi-page output.pdf. (I added
<--psm 4> after I discovered that when the word spaces in several
consecutive lines lined up vertically, the program interpreted them as a
gap between columns, leading to unwanted results.)

As a check in my workflow, I select everything in the PDF (Ctrl-A) and
copy/paste it into my editor (Notepad++). This pastes the OCR text from all
pages in the document.

The result is reasonably good, except that paragraph and page breaks are
not indicated, although line breaks are.

If I replace the <pdf> with a <txt> in the command, the resulting text file
has a blank line between paragraphs <LF LF> (Linux style, even though I'm
using Windows) and a page break <FF> at the end of each page.
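
To make that structure concrete, here is a minimal Python sketch of how a
script could recover pages and paragraphs from the text output. It assumes
only what is described above: paragraphs separated by a blank line (LF LF)
and each page terminated by a form feed (FF). The function name split_pages
and the sample string are hypothetical, made up for illustration; they are
not part of Tesseract.

```python
def split_pages(text: str) -> list[list[str]]:
    """Split plain-text OCR output into pages, then paragraphs.

    Assumes a form feed ("\f") ends each page and a blank line
    ("\n\n") separates paragraphs, as described in the post.
    """
    pages = []
    for page in text.split("\f"):
        if not page.strip():
            continue  # skip the empty chunk after the final form feed
        paragraphs = [p.strip() for p in page.split("\n\n") if p.strip()]
        pages.append(paragraphs)
    return pages


# Hypothetical sample standing in for the contents of output.txt.
sample = (
    "First paragraph,\nsecond line.\n\n"
    "Second paragraph.\n\f"
    "Page two text.\n\f"
)

pages = split_pages(sample)
for number, paragraphs in enumerate(pages, start=1):
    print(f"page {number}: {len(paragraphs)} paragraph(s)")
```

This only reads the text file; it does not change what ends up in the PDF
text layer.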

I would like my PDF text layer to have the same, more user-friendly layout
that Tesseract produces in a text file.

Is this possible?  If so, how?

Thanks!

