I ran into a similar problem with images, especially ones that use fixed-pitch fonts (as older scanned documents often do). Tesseract seems to group characters vertically, as if it assumes the text is rotated. Switching from PSM 3 to PSM 6 helped somewhat, but it then missed significant portions of the text. I was processing old student records, looking for personal information such as SS# to redact. I ended up running Tesseract multiple times with different PSM modes; each mode picked up some parts and missed others, which was time-consuming.

Is there an engine flag (or could one be added) to force a "no-rotate" policy during layout analysis? I think it would be a tremendous help.

Thanks,
Farhad
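For anyone wanting to try the same multi-pass workaround, here is a rough sketch of how it could be scripted. It is not my exact script. It assumes the `tesseract` command-line tool is on the PATH, that SSNs appear in the `123-45-6789` form, and that PSM 3, 6 and 11 are the modes worth trying; the function names are made up for illustration.

```python
import re
import subprocess

# Assumed SSN layout: three digits, two digits, four digits, separated by hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def build_command(image_path, psm):
    """Build the tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "--psm", str(psm)]


def run_tesseract(image_path, psm):
    """Run one OCR pass with the given page segmentation mode."""
    result = subprocess.run(
        build_command(image_path, psm),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def collect_ssn_candidates(texts_by_psm):
    """Map each SSN-like string to the sorted list of PSM modes that found it."""
    found = {}
    for psm, text in texts_by_psm.items():
        for match in SSN_PATTERN.findall(text):
            found.setdefault(match, set()).add(psm)
    return {ssn: sorted(modes) for ssn, modes in found.items()}


def scan_image(image_path, psm_modes=(3, 6, 11)):
    """OCR the image once per mode and take the union of the hits."""
    texts = {psm: run_tesseract(image_path, psm) for psm in psm_modes}
    return collect_ssn_candidates(texts)
```

Taking the union across modes means a number only has to be read correctly in one pass to be flagged for redaction, at the cost of one full OCR run per mode.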
On Fri, Nov 8, 2019 at 7:33 AM Alex Giokas <[email protected]> wrote:

> If your PDF is not a bitmap, then you don't need OCR, simply extract the
> text.
> If the PDF is a bitmap, then convert it to an image format, and then OCR
> it.
> You can play with PSM options (instead of 3, try 11 or 12) if your PDF is
> sparse, I get better accuracy that way.
> If you really need a line-by-line approach, then you have to use some
> pre-processing algorithm (e.g., use OpenCV to find rows of text, extract
> that as a ROI, and feed that ROI to tesseract one at a time).
> This can be easily achieved, but it increases computation time
> tremendously.
>
> Regards,
> Alex
>
> On Friday, 8 November 2019 14:15:43 UTC, jcr wrote:
>>
>> when processing PDF files to obtain text content (convert to TIF with
>> ImageMagick + run Tesseract 4.1.0 on output), I observe that in many cases,
>> the input is read "vertically", such that words/numbers being close to each
>> other (e.g. same line) in the input are torn apart in the txt output.
>>
>> Is there any way to prevent this? And are there any recommendations for
>> configuration of DPI etc. when processing PDF to text?
>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAAMcHAHMDC2LwFK62BGzCb36xFYALY6ySLSrF7n333SZKKXJ2A%40mail.gmail.com.

