Thanks, but as far as I can see, this problem has been open since 2017 and there is still no clear solution.
Now I tried to get the recognition result via the iterator API, and the output is really strange. All the characters are listed, and the "duplicates" share the same coordinates as the correct ones but have different confidence values. My first idea was to sort them by X coordinate and keep only the best-fit values, BUT the X coordinates returned by TessPageIteratorBoundingBox turn out *to be totally invalid*. It looks like a critical bug in Tesseract!

Take a line containing "1234567890". The result returned by the iterator is:

>> 1 Conf: 98,65 Box: 1805, 771, 1843, 813
>> 2 Conf: 99,00 Box: 1811, 771, 1875, 813
>> 3 Conf: 99,00 Box: 1843, 771, 1927, 813
>> 4 Conf: 99,00 Box: 1890, 771, 1964, 813
>> 5 Conf: 99,00 Box: 1927, 771, 2001, 813
   *<<< Damn, what is going on here?! Why is "5" reported with an X coordinate right after "3", when it really comes after "4"?!*
>> 6 Conf: 99,02 Box: 1805, 771, 2195, 813
   <<< This one is even more amazing: "6" is reported at the position of "1", and its size is 30+ mm!
>> 7 Conf: 98,99 Box: 2005, 771, 2090, 813
>> 8 Conf: 98,96 Box: 2053, 771, 2127, 813
>> 9 Conf: 99,01 Box: 2095, 771, 2158, 813
>> 0 Conf: 98,98 Box: 2126, 771, 2190, 813

On Thursday, 4 July 2019 at 15:09:13 UTC+3, shree wrote:
>
> This is an open issue - see
> https://github.com/tesseract-ocr/tesseract/issues/1060
> and other related issues
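
For anyone who wants to check their own iterator output for the same problem, here is a minimal sketch in Python of a sanity check over the boxes listed above. It is not part of Tesseract: the names SymbolBox and find_suspicious_boxes and the max_width_ratio threshold are made up for illustration, and it assumes the four box values are in (left, top, right, bottom) order. It flags a symbol when its left edge does not advance past the left edges already seen, or when its box is much wider than the median box on the line.

```python
import statistics
from typing import NamedTuple


class SymbolBox(NamedTuple):
    """One symbol from the iterator output (hypothetical container)."""
    text: str
    conf: float
    left: int
    top: int
    right: int
    bottom: int


# Values copied from the output above, in reading order "1234567890".
# Confidence is written with a decimal point instead of a comma.
BOXES = [
    SymbolBox("1", 98.65, 1805, 771, 1843, 813),
    SymbolBox("2", 99.00, 1811, 771, 1875, 813),
    SymbolBox("3", 99.00, 1843, 771, 1927, 813),
    SymbolBox("4", 99.00, 1890, 771, 1964, 813),
    SymbolBox("5", 99.00, 1927, 771, 2001, 813),
    SymbolBox("6", 99.02, 1805, 771, 2195, 813),
    SymbolBox("7", 98.99, 2005, 771, 2090, 813),
    SymbolBox("8", 98.96, 2053, 771, 2127, 813),
    SymbolBox("9", 99.01, 2095, 771, 2158, 813),
    SymbolBox("0", 98.98, 2126, 771, 2190, 813),
]


def find_suspicious_boxes(boxes, max_width_ratio=2.0):
    """Return (text, reasons) for boxes that look inconsistent on one line.

    max_width_ratio is an arbitrary threshold chosen for this sketch.
    """
    typical_width = statistics.median(b.right - b.left for b in boxes)
    flagged = []
    max_left_seen = None  # largest left edge among the symbols seen so far
    for b in boxes:
        reasons = []
        if max_left_seen is not None and b.left <= max_left_seen:
            reasons.append("left edge does not advance")
        if (b.right - b.left) > max_width_ratio * typical_width:
            reasons.append("box much wider than median")
        if reasons:
            flagged.append((b.text, reasons))
        if max_left_seen is None or b.left > max_left_seen:
            max_left_seen = b.left
    return flagged


if __name__ == "__main__":
    for text, reasons in find_suspicious_boxes(BOXES):
        print(text, "->", ", ".join(reasons))
```

On the data above this check only flags "6", whose box starts at the left edge of "1" and spans the whole line. The other boxes overlap their neighbours, but their left edges still increase from symbol to symbol.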

