I have scanned some index pages that I would like to OCR for rapid
searching. I am using Tesseract from the command line. The problem is that
Tesseract ignores the whitespace between columns and merges everything
together, essentially fragmenting the contents. From some debug output I
can see that no "columns" are detected. Probably more important, three
"blocks" are detected: one around the first line, one around the last line,
and one encompassing everything in between. Is there a way to train block
detection, or are there parameters I can tweak to improve this?
I have attached the image merely as an abstract representation of the text
layout, to show the types of columns I am dealing with. Ideally, it would
also be nice to know whether tab stops can be trained and used to put each
individual topic on a single line, which I could do in post-processing if I
could get the tab stops.
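
To make the post-processing idea concrete, here is a rough sketch of what I
have in mind. It is not tied to anything Tesseract actually emits: it assumes
(hypothetically) that the recognized text arrives as plain lines in which a
wrapped continuation of a topic starts with a tab character, and the function
name join_topics and the sample data are made up for illustration.

```python
def join_topics(lines, indent="\t"):
    """Merge indented continuation lines into the preceding topic line.

    Assumption: a line beginning with `indent` continues the previous topic;
    any other non-blank line starts a new topic.
    """
    topics = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line.startswith(indent) and topics:
            # Continuation: append to the most recent topic.
            topics[-1] += " " + line.strip()
        else:
            # New topic (or a stray indented line with nothing before it).
            topics.append(line.strip())
    return topics


# Made-up sample resembling a wrapped index column.
sample = [
    "Apples, 12, 47,",
    "\t103, 210",
    "Bananas, 5",
    "Cherries, 8, 19,",
    "\t77",
]

for topic in join_topics(sample):
    print(topic)
```

Something along these lines would only work if the indentation of the
continuation lines survives recognition, which is why I am asking about tab
stops in the first place.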