Hi,

First of all, thanks for this very useful piece of software!
Here's an issue I'm seeing on 3.03 and git HEAD. On the attached image, page segmentation (-psm 3, which is also the default) finds some valid columns but also one invalid one. Going through the output:

10 15 20 25 30 35
EP 2 377 850 A1

This is a good detection of the narrow column on the left and of the top line.

1-(2-(dimethylamino)-4-(trifluoromethyl)benzyl)-3-(2,3-dihydro-2—oxo-1H-benzo[d]imidazo|—4-yl)ure
1-(4-(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro-2—oxo—1H-benzo[d]imidazol-4-yl)urea
[...]

Also good.

1-( -(trif|uoromethyl)-2—(pyrrolidin-1-y|)benzy|)-3-(2,3-dihydro—2—oxobenzo[d]oxazo|—4-y|)urea
1-( -(trif|uoromethyl)-2—(piperidin-1-y|)benzyl)-3-(2,3-dihydro-2—oxobenzo[d]oxazol-4-yl)urea
[...]

Here one character (the 4) is missing from each line.

4 4
[...]

The 4s seem to have been detected as a separate column, which is not desired. It seems to me that a column should not be detected here, for two reasons: the 4s are close to the neighbouring characters (there is no column gap), and this column lies almost entirely inside the main (widest) one.

Would someone familiar with the code be able to check why this is happening? If pointed in the right direction, I could have a try as well :)

Cheers,
Daniel
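
P.S. To make the overlap argument concrete, here is a rough Python sketch of the kind of check I have in mind. It is not taken from the Tesseract sources, and I don't know how the column finder actually decides this. The `Box` type, the function names, the 0.5 threshold and the pixel coordinates are all made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Horizontal extent of a candidate column, in pixels (hypothetical type)."""
    left: int
    right: int


def horizontal_overlap_ratio(candidate: Box, main: Box) -> float:
    """Fraction of the candidate's width that lies inside the main column."""
    width = candidate.right - candidate.left
    if width <= 0:
        return 0.0
    shared = min(candidate.right, main.right) - max(candidate.left, main.left)
    return max(0, shared) / width


def is_plausible_column(candidate: Box, main: Box, max_overlap: float = 0.5) -> bool:
    """Accept a candidate only if most of it lies outside the main column.

    The 0.5 threshold is an arbitrary value chosen for this sketch.
    """
    return horizontal_overlap_ratio(candidate, main) <= max_overlap


# Made-up coordinates, loosely modelled on the page described above.
main_column = Box(left=120, right=1500)    # the wide text block
margin_numbers = Box(left=40, right=90)    # the 10/15/20/... line numbers
stray_fours = Box(left=150, right=170)     # the 4s split off from their lines

print(is_plausible_column(margin_numbers, main_column))
print(is_plausible_column(stray_fours, main_column))
```

With these numbers the margin column is accepted because it does not overlap the main block at all, while the column of 4s is rejected because it sits entirely inside it.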