[tesseract-ocr] Re: How to process PDF files line by line with tesseract

Alex Giokas Fri, 08 Nov 2019 06:34:08 -0800

If your PDF is not a bitmap (i.e. it already has a text layer), then you 
don't need OCR; simply extract the text.
If the PDF is a bitmap (a scanned page), then convert it to an image format, 
and then OCR it.
You can play with the PSM options (instead of the default 3, try 11 or 12) if 
the text on your pages is sparse; I get better accuracy that way.
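
To make the PSM suggestion concrete, here is a minimal sketch that builds the tesseract command line for a given page segmentation mode. The function names, the file name page.tif and the 300 DPI default are my own assumptions for illustration, not something from the original question; adjust them to your setup.

```python
import subprocess


def build_tesseract_cmd(image_path, psm=11, dpi=300, lang="eng"):
    """Build a tesseract command that prints the recognised text to stdout.

    psm: page segmentation mode (3 is the default, 11 and 12 target sparse text).
    dpi: resolution hint for the input image; 300 is an assumed starting point.
    """
    return [
        "tesseract", image_path, "stdout",
        "--psm", str(psm),
        "--dpi", str(dpi),
        "-l", lang,
    ]


def ocr_with_psm(image_path, psm=11):
    """Run tesseract (must be installed and on PATH) and return its text output."""
    result = subprocess.run(
        build_tesseract_cmd(image_path, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical usage: compare the modes on the same page.
# for psm in (3, 11, 12):
#     print(f"--- psm {psm} ---")
#     print(ocr_with_psm("page.tif", psm))
```

Running the same page through modes 3, 11 and 12 and comparing the outputs side by side is probably the quickest way to see which one keeps words on the same line together.
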
If you really need a line-by-line approach, then you have to use some 
pre-processing algorithm (e.g., use OpenCV to find rows of text, extract 
each row as a ROI, and feed those ROIs to tesseract one at a time).
This is easy to implement, but it increases computation time tremendously, 
since tesseract then runs once per line instead of once per page.
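
A rough sketch of the line-by-line idea, in case it helps. It uses a horizontal projection profile (count the ink pixels in each pixel row, then treat each run of non-empty rows as one text line), which is only one of several ways to find rows and assumes a deskewed, single-column page. The function names, the padding and the minimum height are hypothetical choices; pytesseract is assumed as the Python wrapper.

```python
def find_text_rows(ink_per_row, min_height=3):
    """Return (top, bottom) pairs, bottom exclusive, for runs of rows with ink.

    ink_per_row: one number per pixel row, the count of dark pixels in it.
    Runs shorter than min_height are dropped as noise.
    """
    rows, start = [], None
    for y, ink in enumerate(ink_per_row):
        if ink > 0 and start is None:
            start = y                      # a text row begins
        elif ink == 0 and start is not None:
            if y - start >= min_height:
                rows.append((start, y))    # a text row ends
            start = None
    if start is not None and len(ink_per_row) - start >= min_height:
        rows.append((start, len(ink_per_row)))
    return rows


def ocr_line_by_line(image_path, pad=4):
    """OCR each detected row separately. Needs opencv-python and pytesseract."""
    import cv2
    import pytesseract

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Otsu threshold, inverted so that text pixels become non-zero.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU
    )
    ink_per_row = (binary > 0).sum(axis=1)

    lines = []
    for top, bottom in find_text_rows(ink_per_row):
        roi = gray[max(top - pad, 0):bottom + pad, :]
        # --psm 7 tells tesseract to treat the ROI as a single text line.
        lines.append(pytesseract.image_to_string(roi, config="--psm 7").strip())
    return lines
```

Each ROI costs a separate tesseract call, which is where the extra computation time comes from.
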


Regards,
Alex

On Friday, 8 November 2019 14:15:43 UTC, jcr wrote:
>
> When processing PDF files to obtain text content (convert to TIF with 
> ImageMagick, then run Tesseract 4.1.0 on the output), I observe that in many 
> cases the input is read "vertically", so that words/numbers that are close to 
> each other (e.g. on the same line) in the input are torn apart in the txt output.
>
> Is there any way to prevent this? And are there any recommendations for 
> configuration of DPI etc. when processing PDF to text?
>

