[tesseract-ocr] Page segmentation finding wrong columns

Daniel Bonniot de Ruisselet Wed, 27 Aug 2014 13:32:20 -0700

Hi,

First of all, thanks for this very useful piece of software!


Here's an issue I'm seeing on 3.03 and git HEAD. On the attached image, 
page segmentation (-psm 3, which is also the default) finds several valid 
columns but also one invalid one. Going through the output:

10
15
20
25
30
35

EP 2 377 850 A1

This is a good detection of the narrow column on the left, and of the top 
line.

1-(2-(dimethylamino)-4-(trifluoromethyl)benzyl)-3-(2,3-dihydro-2—oxo-1H-benzo[d]imidazo|—4-yl)ure
1-(4-(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro-2—oxo—1H-benzo[d]imidazol-4-yl)urea
[...]

Also good.

1-( 
-(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro—2—oxobenzo[d]oxazo|—4-y|)urea
1-( 
-(trif|uoromethyl)-2—(piperidin-1-y|)benzyl)-3-(2,3-dihydro-2—oxobenzo[d]oxazol-4-yl)urea
[...]

Here one character (the 4) is missing from each line.

4
4
[...]

The 4s seem to have been detected as a separate column, which is not 
desired. It seems to me that no column should be detected here, both 
because the 4s are close to the neighbouring characters (there is no column 
gap) and because this column largely overlaps with the main (widest) one.
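
To make the overlap argument concrete, here is a minimal sketch in Python of 
the kind of check I have in mind. It is not Tesseract's actual column-finding 
code: the Column type, the filter_overlapping_columns helper, the 0.5 
threshold and the pixel coordinates are all hypothetical and only illustrate 
the idea of rejecting a narrow candidate that lies mostly inside a wider 
column.

```python
from typing import List, NamedTuple


class Column(NamedTuple):
    """Horizontal extent of a candidate column, in pixels (hypothetical type)."""
    left: int
    right: int

    @property
    def width(self) -> int:
        return self.right - self.left


def overlap_fraction(candidate: Column, other: Column) -> float:
    """Fraction of the candidate's width that lies inside the other column."""
    if candidate.width <= 0:
        return 0.0
    shared = min(candidate.right, other.right) - max(candidate.left, other.left)
    return max(shared, 0) / candidate.width


def filter_overlapping_columns(columns: List[Column],
                               max_overlap: float = 0.5) -> List[Column]:
    """Drop candidates that are mostly covered by an already accepted wider column.

    The 0.5 threshold is an arbitrary assumption chosen for illustration.
    """
    kept: List[Column] = []
    # Visit the widest candidates first so narrow ones are tested against them.
    for col in sorted(columns, key=lambda c: c.width, reverse=True):
        if all(overlap_fraction(col, k) <= max_overlap for k in kept):
            kept.append(col)
    # Return the survivors in reading order (left to right).
    return sorted(kept, key=lambda c: c.left)


# Made-up coordinates: a narrow left column, the wide main column, and a
# spurious "4" column that sits entirely inside the main one.
candidates = [Column(0, 40), Column(60, 1000), Column(80, 100)]
result = filter_overlapping_columns(candidates)
```

With these made-up numbers the narrow left column and the main column are 
kept, while the candidate covering only the 4s is rejected because it lies 
entirely inside the main column.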

Would someone familiar with the code be able to check why this is 
happening? If pointed in the right direction, I could have a try as well :)

Cheers,

Daniel

