On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
> I have scanned some index pages that I would like to OCR for rapid
> searching. I am using tesseract from the command line. The problem is that
> tesseract ignores the whitespace between columns and merges everything
> together, essentially fragmenting the contents. Using some debug output I
> see that no "columns" are detected. ...
> I have attached the image merely as an abstract representation of the text
> layout to show the types of columns I am dealing with. Ideally, it would
> also be nice to know if tab stops can be trained and used to put each
> individual topic on one line, which I could do in post-processing if I could get tab stops
Tesseract is probably getting confused by the indents for the entries. It
should be pretty easy to identify the columns using image processing (e.g.,
create a histogram of black pixel counts for each vertical pixel column; the
gutters between columns should show up as runs of near-zero counts).
Why not just do the page segmentation yourself and pass the three columns
to Tesseract separately?
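
Here is a rough sketch of the histogram idea, in case it helps. It is plain
Python and assumes the scan has already been binarized into rows of 0/1
values (1 = black); loading the image and running Tesseract on each crop are
left out. The function names and the min_gap / max_ink thresholds are my own
placeholders, not anything from Tesseract, and would need tuning for a real
scan (noise and skew will blur the gutters).

```python
def column_histogram(binary_rows):
    """Count black pixels in each vertical pixel column.

    binary_rows: list of equal-length rows, 1 = black (ink), 0 = white.
    """
    counts = [0] * len(binary_rows[0])
    for row in binary_rows:
        for x, pixel in enumerate(row):
            counts[x] += pixel
    return counts


def find_text_columns(counts, min_gap=3, max_ink=0):
    """Return (start, end) x-ranges of text columns, end exclusive.

    A run of at least min_gap consecutive pixel columns with a black-pixel
    count <= max_ink is treated as a gutter. Shorter blank runs (spaces
    between words, indents) stay inside the text column.
    Both thresholds are illustrative assumptions.
    """
    columns = []
    start = None      # x where the current text column began
    gap_start = None  # x where the current run of blank pixel columns began
    for x, count in enumerate(counts):
        if count > max_ink:
            if start is None:
                start = x
            gap_start = None
        else:
            if gap_start is None:
                gap_start = x
            if start is not None and x - gap_start + 1 >= min_gap:
                # The blank run is wide enough: close the text column.
                columns.append((start, gap_start))
                start = None
    if start is not None:
        # Page ended while still inside a text column.
        end = gap_start if gap_start is not None else len(counts)
        columns.append((start, end))
    return columns


def crop_columns(binary_rows, columns):
    """Slice the page into one sub-image (list of rows) per text column."""
    return [[row[s:e] for row in binary_rows] for s, e in columns]
```

Because the counts are summed over the full page height, indented lines
should not matter much: the unindented lines in each column still put ink
at its left edge. Each crop can then be saved as its own image and passed
to Tesseract separately.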