Well, I have used OCRFeeder to mark up the columns individually, but that takes a lot of mouse clicking and copy/pasting. I don't care to do that for 40 pages of index material, considering most of the text will probably never even be looked at. That's why I was hoping to find a line of code I could tweak, so that I could whip up a script to process the whole batch with a single command. I made a few changes in textord/colfind.cpp, but concluded that I was going down a rabbit hole. I did have success with drawing a line freehand between the columns, and I'm currently looking into how to do that with convert.
I like the histogram idea. That sounds like a good feature request.

On Saturday, October 15, 2016 at 9:49:20 PM UTC-4, Tom Morris wrote:
>
> On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
>>
>> I have scanned some index pages that I would like to ocr for rapid
>> searching. I am using tesseract from the command line. The problem is that
>> tesseract ignores the whitespace between columns and merges everything
>> together, essentially fragmenting the contents. Using some debug output I
>> see that no "columns" are detected. ...
>>
>> I have attached the image merely as an abstract representation of the
>> text layout to show the types of columns I am dealing with. Ideally, it
>> would also be nice to know if tab stops can be trained and used to oneline
>> each individual topic, which I could do postprocess if I could get tabstops
>> printed.
>>
>
> Tesseract is probably getting confused by the indents for the entries. It
> should be pretty easy to identify the columns using image processing (e.g.
> create a histogram of black pixel counts for each vertical pixel column).
> Why not just do the page segmentation yourself and pass the three columns
> to Tesseract separately?
>
> Tom
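For anyone who wants to try the histogram idea, here is a minimal sketch of how I understand it. It is not Tesseract code. It assumes the page has already been binarized into rows of 0/1 values (1 = black pixel), and the function names and the min_gap and max_ink parameters are made up for illustration. It counts the black pixels in each vertical pixel column and treats any sufficiently wide run of (nearly) empty pixel columns as a gutter between text columns.

```python
def column_ink_histogram(rows):
    """Count black pixels in each vertical pixel column.

    rows: list of equal-length sequences of 0/1 values, 1 = black pixel.
    """
    if not rows:
        return []
    counts = [0] * len(rows[0])
    for row in rows:
        for x, pixel in enumerate(row):
            counts[x] += pixel
    return counts


def find_text_columns(rows, min_gap=3, max_ink=0):
    """Return (start, end) x-ranges, end exclusive, of the text columns.

    A gutter is a run of at least min_gap pixel columns whose black pixel
    count is <= max_ink. Shorter blank runs (e.g. the space left of an
    indented entry) stay inside the current text column.
    """
    counts = column_ink_histogram(rows)
    columns = []
    start = None  # left edge of the text column being built
    gap = 0       # length of the current run of blank pixel columns
    for x, ink in enumerate(counts):
        if ink > max_ink:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                columns.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Page ended inside a column; drop any trailing blank pixels.
        columns.append((start, len(counts) - gap))
    return columns


# Toy "page": three text columns, second row indented like an index entry.
page = [
    [int(c) for c in "01111000111110001110"],
    [int(c) for c in "00011000001110000110"],
]
print(find_text_columns(page, min_gap=3))  # [(1, 5), (8, 13), (16, 19)]
```

On a real scan the thresholds would need tuning: min_gap has to be wider than the entry indents but narrower than the gutters, and max_ink probably has to be above zero to tolerate specks. Each returned x-range could then be cropped out and passed to tesseract on its own, as Tom suggests.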

