When processing PDF files to extract their text content (convert to TIF with 
ImageMagick, then run Tesseract 4.1.0 on the output), I observe that in many 
cases the input is read "vertically": words or numbers that are close to each 
other in the input (e.g. on the same line) are torn apart in the txt output.

Is there any way to prevent this? And are there any recommendations for 
configuring DPI etc. when converting PDF to text?
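
For reference, here is a minimal sketch of the kind of pipeline described 
above. It is not the exact setup from this post: the function names, file 
names, the 300 DPI value and the page segmentation mode 6 are all 
assumptions chosen for illustration. It rasterizes the PDF with ImageMagick 
at an explicit density, passes the same resolution to Tesseract via --dpi, 
and sets --psm explicitly, since the page segmentation mode is one setting 
that may influence the order in which blocks of text are read.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_cmd(pdf: Path, tif: Path, dpi: int = 300) -> List[str]:
    """Build an ImageMagick command that rasterizes a PDF to TIFF.

    The 300 DPI default is an assumption, not a value from the post.
    """
    return [
        "convert",
        "-density", str(dpi),    # rasterization resolution; must come before the input
        str(pdf),
        "-background", "white",
        "-alpha", "remove",      # flatten any transparency onto white
        "-depth", "8",
        str(tif),
    ]


def build_tesseract_cmd(tif: Path, out_base: Path,
                        dpi: int = 300, psm: int = 6) -> List[str]:
    """Build a Tesseract command; Tesseract appends '.txt' to out_base.

    psm=6 ("assume a single uniform block of text") is only one mode to
    try; which mode works best depends on the page layout.
    """
    return [
        "tesseract",
        str(tif),
        str(out_base),
        "--dpi", str(dpi),       # tell Tesseract the resolution used above
        "--psm", str(psm),       # page segmentation mode
    ]


def pdf_to_text(pdf: Path, workdir: Path, dpi: int = 300, psm: int = 6) -> Path:
    """Run both steps (hypothetical helper) and return the path of the txt file."""
    tif = workdir / (pdf.stem + ".tif")
    out_base = workdir / pdf.stem
    subprocess.run(build_convert_cmd(pdf, tif, dpi), check=True)
    subprocess.run(build_tesseract_cmd(tif, out_base, dpi, psm), check=True)
    return out_base.with_suffix(".txt")
```

Calling pdf_to_text(Path("scan.pdf"), Path("."), psm=4) would run the same 
steps with mode 4 (a single column of text of variable sizes), which makes it 
easy to compare the output of different modes on the same page.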
