[tesseract-ocr] Re: Failure to recognize columns

2016-10-23 Thread fuzzy7k
It's less than elegant, but it works: convert -draw "line 800,0 800,1" -draw "line 1500,0 1500,1" index-3.pnm x.pnm On Sunday, October 23, 2016 at 9:35:21 PM UTC-4, fuzzy7k wrote: > > Well, I have used ocrfeeder to draw up columns individually, but that is a > l

[tesseract-ocr] Re: Failure to recognize columns

2016-10-23 Thread fuzzy7k
into how to do that with convert. I like the histogram idea. That sounds like a good feature request. On Saturday, October 15, 2016 at 9:49:20 PM UTC-4, Tom Morris wrote: > > On Wednesday, October 12, 2016 at 5:21:17 PM UTC-4, fuzzy7k wrote: >> >> I have scanned some index pages
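The histogram idea is not spelled out in this excerpt. One possible reading is a vertical projection profile: count the ink pixels in each pixel column and treat long runs of empty columns as gutters between text columns. The sketch below is only an illustration under that assumption. The function name `column_gaps`, the `min_gap` parameter, and the list-of-lists input format are hypothetical, and nothing here is part of tesseract or convert.

```python
def column_gaps(rows, min_gap=3):
    """Return (start, end) x-ranges of blank vertical strips.

    rows: list of equal-length lists, 1 = ink, 0 = background
          (a hypothetical input format chosen for this sketch).
    min_gap: minimum width, in pixels, for a blank strip to count.
    """
    if not rows:
        return []
    width = len(rows[0])
    # Vertical projection profile: ink pixels per pixel column.
    profile = [sum(row[x] for row in rows) for x in range(width)]

    gaps = []
    start = None  # x where the current blank run began, if any
    for x, count in enumerate(profile):
        if count == 0:
            if start is None:
                start = x
        elif start is not None:
            if x - start >= min_gap:
                gaps.append((start, x - 1))
            start = None
    # A blank run may extend to the right edge of the image.
    if start is not None and width - start >= min_gap:
        gaps.append((start, width - 1))
    return gaps


if __name__ == "__main__":
    page = [
        [1, 1, 0, 0, 0, 1, 1],
        [1, 0, 0, 0, 0, 0, 1],
    ]
    print(column_gaps(page))
```

Blank page margins are reported as gaps too, so a caller would probably discard ranges that touch the left or right edge. Real scans are rarely perfectly clean, so a small noise threshold in place of the `count == 0` test may be needed.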

Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
hat I want unless I can get tesseract to draw "blocks" vertically around the individual columns. On Thursday, October 13, 2016 at 8:30:05 PM UTC-4, fuzzy7k wrote: > > 6 gives the exact same results as 3 (i.e. no column separation). 11 & 12 > are essentially the same in that

Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
tps://github.com/tesseract-ocr/tesseract/issues/434 > > On 13 Oct 2016 1:13 p.m., "fuzzy7k" <kva...@gmail.com > > wrote: > >> I tried psm 0-3 >> >> On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote: >>> >>> Which page segmen

Re: [tesseract-ocr] Failure to recognize columns

2016-10-13 Thread fuzzy7k
I tried psm 0-3 On Thursday, October 13, 2016 at 1:46:45 AM UTC-4, shree wrote: > > Which page segmentation mode (psm) did you try? > > On 12 Oct 2016 11:21 p.m., "fuzzy7k" <kva...@gmail.com > > wrote: > >> I have scanned some index pages that I would l

[tesseract-ocr] Failure to recognize columns

2016-10-12 Thread fuzzy7k
I have scanned some index pages that I would like to ocr for rapid searching. I am using tesseract from the command line. The problem is that tesseract ignores the whitespace between columns and merges everything together, essentially fragmenting the contents. Using some debug output I see

[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-04 Thread fuzzy7k
My earlier successes were definitely font related. Use a blacklist or whitelist: -c tessedit_char_blacklist=fifl https://groups.google.com/d/topic/tesseract-ocr/jO_4ZMMK9xw/discussion On Saturday, September 3, 2016 at 1:45:21 PM UTC-4, fuzzy7k wrote: > > It's a language thing:

[tesseract-ocr] Re: text2image creates char boxes for 'fi' and 'fl'. Why?

2016-09-03 Thread fuzzy7k
It's a language thing: https://en.wikipedia.org/wiki/Typographic_ligature Try specifying a specific language? This parameter seems like a possible association (due to the description containing glyph): segment_penalty_dict_nonword 1.25 Score multiplier for glyph fragment segmentations

[tesseract-ocr] Re: Unrecognized lines using psm 3

2016-09-02 Thread fuzzy7k
I found the function that puts everything on the table, with regard to the scrollview blob debug window... ccstruct/blobbox.cpp: ScrollView::Color BLOBNBOX::TextlineColor(BlobRegionType region_type, BlobTextFlowType flow_type) { switch (region_type) {

[tesseract-ocr] Unrecognized lines using psm 3

2016-09-02 Thread fuzzy7k
Ever so frequently I will get a page where one line on the whole page is not recognized. I think I've tracked the problem to blob recognition, but don't know where to go from here. The attached images are of an index page and they are obtained using textord_tabfind_show_images. The line that is