On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
>
> I have scanned some index pages that I would like to ocr for rapid
> searching. I am using tesseract from the command line. The problem is that
> tesseract ignores the whitespace between columns and merges everything
> together, essentially fragmenting the contents. Using some debug output I
> see that no "columns" are detected. ...
>
> I have attached the image merely as an abstract representation of the text
> layout to show the types of columns I am dealing with. Ideally, it would
> also be nice to know if tab stops can be trained and used to oneline each
> individual topic, which I could do postprocess if I could get tabstops
> printed.
>

Tesseract is probably getting confused by the indents for the entries. It should be pretty easy to identify the columns using image processing (e.g. create a histogram of black pixel counts for each vertical pixel column). Why not just do the page segmentation yourself and pass the three columns to Tesseract separately?

Tom
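
The histogram idea above could be sketched roughly as follows. This is only an illustration, not anything built into Tesseract: the function names, the `min_gap` and `max_ink` parameters, and the assumption that the page has already been binarized into rows of 0/1 values (1 = black) are all mine. Indented entries still leave ink at the left edge on other lines of the same column, so only the wide gutters show up as runs of empty pixel columns.

```python
def column_histogram(rows):
    """Count black pixels (truthy values) in each vertical pixel column."""
    width = len(rows[0]) if rows else 0
    counts = [0] * width
    for row in rows:
        for x, px in enumerate(row):
            if px:
                counts[x] += 1
    return counts


def find_column_spans(counts, min_gap=3, max_ink=0):
    """Return (start, end) x-ranges, end exclusive, of text columns.

    A text column ends once at least `min_gap` consecutive pixel columns
    have `max_ink` or fewer black pixels. Both thresholds are hypothetical
    tuning knobs; real scans with noise will need larger values.
    """
    spans = []
    start = None  # x where the current text column began, if any
    gap = 0       # length of the current run of blank pixel columns
    for x, count in enumerate(counts):
        if count > max_ink:
            if start is None:
                start = x
            gap = 0
        else:
            gap += 1
            if start is not None and gap == min_gap:
                spans.append((start, x - min_gap + 1))
                start = None
    if start is not None:
        # Page ended inside a column; trim any trailing blank run.
        spans.append((start, len(counts) - gap))
    return spans


# Tiny made-up "page": two text columns, the second row is an indented entry.
page = [
    [0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0],
]
spans = find_column_spans(column_histogram(page), min_gap=3)
print(spans)
```

Each resulting x-range could then be cropped out of the scan with whatever imaging tool is at hand and saved as its own image, so that Tesseract is run once per column instead of on the whole page.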