[tesseract-ocr] Re: How to process PDF files line by line with tesseract

Aaron Stewart Fri, 08 Nov 2019 13:40:31 -0800

If you have any suggestions on how to split input images into individual 
text lines, I would appreciate it.  I am able to use Python and OpenCV, but 
I don't have a lot of experience with either.  I can read publications if 
necessary.


I'm using Tesseract 5.0.0-alpha from UB Mannheim on Windows 10 to process 
pages from a directory.  The line spacing is very narrow, and in my project 
increasing the line spacing improves the recognition accuracy.  
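
To illustrate what "increasing line spacing" could look like in code, here is a 
minimal sketch. It assumes the row ranges of the text lines are already known 
and that the page is a grayscale NumPy array (as OpenCV returns). The names 
respaced_layout and respace_image and the default gap of 20 rows are 
hypothetical, chosen only for illustration.

```python
def respaced_layout(bands, gap):
    """Plan new vertical positions for text bands, with `gap` blank rows
    above, between, and below them.

    bands: list of (top, bottom) source row ranges, bottom exclusive.
    Returns (placements, total_height); each placement is
    (src_top, src_bottom, dst_top).
    """
    placements = []
    y = gap  # margin above the first line
    for top, bottom in bands:
        placements.append((top, bottom, y))
        y += (bottom - top) + gap  # band height plus the gap that follows it
    return placements, y


def respace_image(gray, bands, gap=20):
    """Hypothetical helper: copy each band onto a taller white canvas.

    gray: 2-D grayscale image array (assumed white background = 255).
    """
    import numpy as np  # imported here so respaced_layout works without NumPy

    placements, height = respaced_layout(bands, gap)
    canvas = np.full((height, gray.shape[1]), 255, dtype=gray.dtype)
    for top, bottom, dst_top in placements:
        canvas[dst_top:dst_top + (bottom - top), :] = gray[top:bottom, :]
    return canvas
```

The layout step is kept separate from the pixel copying so the arithmetic can 
be checked on its own before any image is involved.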

I believe that splitting the input image into separate lines of text would 
improve the results in my case.  
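
One common way to split a page into text lines is a horizontal projection 
profile: count the dark pixels in each row and treat runs of rows that contain 
ink as lines. The sketch below is only an illustration under assumptions, not a 
tested solution for these pages. The function names, the Otsu threshold, and 
the min_ink, min_height, and pad values are all hypothetical choices. With very 
narrow spacing, descenders may touch the next line, in which case no row is 
completely blank and min_ink would need to be raised.

```python
def find_line_bands(row_ink, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges that contain text.

    row_ink: per-row count of dark pixels (the horizontal projection profile).
    A band is a run of consecutive rows with at least `min_ink` dark pixels;
    runs shorter than `min_height` rows are discarded as noise.
    bottom is exclusive.
    """
    bands = []
    start = None
    for y, ink in enumerate(row_ink):
        if ink >= min_ink and start is None:
            start = y  # entering a text band
        elif ink < min_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None  # leaving the band
    # Close a band that runs to the bottom edge of the image.
    if start is not None and len(row_ink) - start >= min_height:
        bands.append((start, len(row_ink)))
    return bands


def split_page_into_lines(image_path, pad=10):
    """Hypothetical helper: return one padded grayscale crop per text line."""
    import cv2  # imported here so find_line_bands works without OpenCV

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that ink pixels become 255.
    _, ink = cv2.threshold(gray, 0, 255,
                           cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    row_ink = (ink > 0).sum(axis=1).tolist()
    crops = []
    for top, bottom in find_line_bands(row_ink, min_ink=1, min_height=3):
        crop = gray[top:bottom, :]
        # Add a white border so the text does not touch the image edge.
        crops.append(cv2.copyMakeBorder(crop, pad, pad, pad, pad,
                                        cv2.BORDER_CONSTANT, value=255))
    return crops
```

Each crop could then be passed to Tesseract with page segmentation mode 7 
(--psm 7, which treats the image as a single text line).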


=== Original ===
FLOYD. THOMAS J.—La.1,°07; (1°07).
ao LOWNDES = (b’64)-~Ala.2,°90:

=== Spaced ===
FLOYD, THOMAS J.—La.1,"07; (1°07).
HENDRICK. LOWNDES  (b’64)-—~Ala.2,°90:
(1°90).

In the output for the original image, the name HENDRICK is missing, and the 
third line is missing as well.

