Which page segmentation mode (psm) did you try?
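If you have not compared modes yet, the following is a minimal sketch of one way to do it, not a tested recipe for your scans. The file name index-page.png and the helper ocr_with_psm are placeholders, and the option is spelled --psm in Tesseract 4 and later (3.x releases spell it -psm), so adjust for your version.

```shell
#!/bin/sh
# Hypothetical input file; point IMG at your own scan.
IMG="${IMG:-index-page.png}"

# Run tesseract on one image with one page segmentation mode.
# Output goes to <image-basename>-psm<mode>.txt so results can be compared.
# With DRY_RUN set, the command is only printed, not executed.
ocr_with_psm() {
    img=$1
    mode=$2
    out="${img%.*}-psm${mode}"
    if [ -n "${DRY_RUN:-}" ]; then
        echo "tesseract $img $out --psm $mode"
    else
        # Use -psm instead of --psm on Tesseract 3.x.
        tesseract "$img" "$out" --psm "$mode"
    fi
}

# Modes that seem worth comparing on a multi-column index page:
#   3  = fully automatic page segmentation (the default)
#   4  = assume a single column of text of variable sizes
#   6  = assume a single uniform block of text
#   11 = sparse text, find as much text as possible in no particular order
# The loop is skipped when tesseract or the image is missing.
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMG" ]; then
    for mode in 3 4 6 11; do
        ocr_with_psm "$IMG" "$mode"
    done
fi
```

Comparing the resulting text files side by side should show whether any mode keeps the columns apart; which one works best depends on the layout of the scan.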
On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kvan...@gmail.com> wrote:
> I have scanned some index pages that I would like to OCR for rapid
> searching. I am using tesseract from the command line. The problem is that
> tesseract ignores the whitespace between columns and merges everything
> together, essentially fragmenting the contents. From the debug output I
> can see that no "columns" are detected. Probably more important, only three
> "blocks" are detected: one around the first line, one around the last line,
> and one encompassing everything in between. Is there a way to train block
> detection, or some parameters that I can tweak to optimize this?
> I have attached the image merely as an abstract representation of the text
> layout to show the types of columns I am dealing with. Ideally, it would
> also be nice to know whether tab stops can be trained and used to put each
> individual topic on one line, which I could do in post-processing if I could get tabstops
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to firstname.lastname@example.org.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> For more options, visit https://groups.google.com/d/optout.