If your PDF is not a bitmap (i.e. it already contains a text layer), then you don't need OCR at all: simply extract the text. If the PDF is a bitmap, convert it to an image format and then OCR it. If your pages are sparse, you can play with the page segmentation mode: instead of --psm 3 (the default, fully automatic page segmentation), try 11 (sparse text) or 12 (sparse text with orientation and script detection). I get better accuracy that way. If you really need a line-by-line approach, then you have to add a pre-processing step (e.g., use OpenCV to find the rows of text, extract each row as a ROI, and feed those ROIs to tesseract one at a time). This is easy to implement, but it increases computation time tremendously, because tesseract is run once per row instead of once per page.
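
The row-by-row idea could look roughly like the sketch below. It is only an illustration under some assumptions: the page has already been binarized into a 2D array of 0/1 values (1 = dark pixel), for example with an OpenCV threshold; the row finder is a plain horizontal projection profile rather than any particular OpenCV routine; and the names find_text_rows, tesseract_cmd and ocr_image, as well as the 300 DPI default, are made up for this example. Each cropped row would be saved to a file and passed to tesseract with --psm 7 (treat the image as a single text line).

```python
import subprocess


def find_text_rows(binary, min_ink=1, min_height=2):
    """Return (top, bottom) row ranges that contain ink.

    `binary` is a 2D sequence of 0/1 values where 1 marks a dark (text) pixel.
    `bottom` is exclusive, so binary[top:bottom] is the band to crop.
    Bands shorter than `min_height` rows are dropped as likely noise.
    """
    bands = []
    start = None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y  # a text band begins here
        elif not has_ink and start is not None:
            if y - start >= min_height:
                bands.append((start, y))
            start = None
    # Close a band that runs to the bottom edge of the page.
    if start is not None and len(binary) - start >= min_height:
        bands.append((start, len(binary)))
    return bands


def tesseract_cmd(image_path, psm=7, dpi=300):
    """Build a tesseract command line that prints the recognised text to stdout.

    The defaults (PSM 7, 300 DPI) are assumptions for this sketch.
    """
    return ["tesseract", image_path, "stdout", "--psm", str(psm), "--dpi", str(dpi)]


def ocr_image(image_path, psm=7, dpi=300):
    """Run tesseract on one image file (requires tesseract on PATH)."""
    result = subprocess.run(
        tesseract_cmd(image_path, psm, dpi),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For whole sparse pages you would skip the row finder and call ocr_image(path, psm=11) directly. The per-row variant starts one tesseract process per text line, which is where the extra computation time comes from.
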
Regards,
Alex

On Friday, 8 November 2019 14:15:43 UTC, jcr wrote:
> when processing PDF files to obtain text content (convert to TIF with
> ImageMagick + run Tesseract 4.1.0 on output), I observe that in many cases,
> the input is read "vertically", such that words/numbers being close to each
> other (e.g. same line) in the input are torn apart in the txt output.
>
> Is there any way to prevent this? And are there any recommendations for
> configuration of DPI etc. when processing PDF to text?

