[tesseract-ocr] Re: Failure to recognize columns

fuzzy7k Sun, 23 Oct 2016 18:35:39 -0700

Well, I have used ocrfeeder to draw up the columns individually, but that is a 
lot of mouse clicking and copy/pasting. I don't care to do that for 40 
pages of index material, considering most of the text will probably never 
even be looked at. That's why I was hoping to find a line of code that I 
could tweak, so that I could whip up a script to take on the whole batch 
at the press of a button. I made a few changes in textord/colfind.cpp, 
but concluded that I was chasing a rabbit down a hole. I did have success with 
drawing a line freehand between the columns. I'm currently looking into 
how to do that with convert.
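
For a batch script, the freehand line could be replaced by a generated one. The following is only a sketch in Python rather than a convert command. It assumes the page is already loaded as a list of pixel rows (1 = black, 0 = white) and that the x position of the gap between columns is known. The name draw_separator and the pixel layout are made up for illustration, and whether a generated line helps as much as the hand-drawn one is untested.

```python
from typing import List

Image = List[List[int]]  # rows of pixels; 1 = black, 0 = white (assumed layout)


def draw_separator(image: Image, x: int, thickness: int = 1) -> Image:
    """Return a copy of `image` with a black vertical line.

    The line covers pixel columns x .. x + thickness - 1 over the full
    page height. Columns outside the image are skipped.
    """
    out = [row[:] for row in image]  # copy so the input page is left untouched
    for row in out:
        for dx in range(thickness):
            if 0 <= x + dx < len(row):
                row[x + dx] = 1
    return out


# Tiny blank "page": 3 rows, 5 pixels wide.
page = [[0] * 5 for _ in range(3)]
ruled = draw_separator(page, 2)
```

The same call would be repeated once per gap between columns and once per page.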


I like the histogram idea. That sounds like a good feature request. 

On Saturday, October 15, 2016 at 9:49:20 PM UTC-4, Tom Morris wrote:
>
> On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote:
>>
>> I have scanned some index pages that I would like to ocr for rapid 
>> searching. I am using tesseract from the command line. The problem is that 
>> tesseract ignores the whitespace between columns and merges everything 
>> together, essentially fragmenting the contents. Using some debug output I 
>> see that no "columns" are detected. ...
>>
>> I have attached the image merely as an abstract representation of the 
>> text layout to show the types of columns I am dealing with. Ideally, it 
>> would also be nice to know whether tab stops can be trained and used to put 
>> each individual topic on one line, which I could do in post-processing if I 
>> could get the tab stops printed.
>>
>
> Tesseract is probably getting confused by the indents for the entries. It 
> should be pretty easy to identify the columns using image processing (e.g. 
> create a histogram of black pixel counts for each vertical pixel column). 
> Why not just do the page segmentation yourself and pass the three columns 
> to Tesseract separately?
>
> Tom 
>
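
The histogram approach quoted above can be sketched roughly as follows. This is a minimal illustration, not anything from Tesseract itself: it assumes the page has already been binarized and loaded as a list of pixel rows (1 = black, 0 = white), and the names column_histogram, find_text_columns, crop, min_gutter and max_ink are invented for the example. Because the counts are summed over the full page height, indented entries should not open a false gap as long as some lines in the column are not indented.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of pixels; 1 = black (ink), 0 = white (assumed layout)


def column_histogram(image: Image) -> List[int]:
    """Count the black pixels in each vertical pixel column."""
    if not image:
        return []
    counts = [0] * len(image[0])
    for row in image:
        for x, pixel in enumerate(row):
            counts[x] += pixel
    return counts


def find_text_columns(counts: List[int], min_gutter: int = 3,
                      max_ink: int = 0) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges of text columns, end exclusive.

    A gutter is a run of at least `min_gutter` adjacent pixel columns whose
    count is <= `max_ink`. Narrower gaps (e.g. between letters) do not split.
    """
    columns = []
    start = None  # x where the current text column began, if any
    gap = 0       # length of the current run of blank pixel columns
    for x, count in enumerate(counts):
        if count > max_ink:
            if start is None:
                start = x
            gap = 0
        else:
            gap += 1
            if start is not None and gap >= min_gutter:
                columns.append((start, x - gap + 1))
                start = None
    if start is not None:
        columns.append((start, len(counts) - gap))
    return columns


def crop(image: Image, start: int, end: int) -> Image:
    """Cut the x-range [start, end) out of every row."""
    return [row[start:end] for row in image]


# Tiny example: two "text columns" separated by a 3-pixel gutter.
page = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 1],
]
hist = column_histogram(page)
spans = find_text_columns(hist, min_gutter=3)
pieces = [crop(page, s, e) for s, e in spans]
```

On a real scan, min_gutter would need to be wider than the largest gap between words, and max_ink raised above zero to tolerate specks in the gutter. Each cropped piece could then be saved as its own image and passed to tesseract separately.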
