Hi Stef,

Thanks for the reply (here and on SO). The fix mostly works, but unfortunately I am still seeing that Tesseract sometimes ignores the unicharambigs file I set for it.
For example, I have the following two images:

<https://lh3.googleusercontent.com/-DviQndEfN4U/V2Guw9Vnz_I/AAAAAAAAays/CVNEoOO7BSYSu442aDBDE2YTB6kVvdMVwCLcB/s1600/djh5_trim.png>

and:

<https://lh3.googleusercontent.com/-LmhBq6IVGE0/V2Gu46UNj1I/AAAAAAAAay0/gnLIr-dUGngoqhbDdNCCPueBsemUu_HIQCLcB/s1600/djh5_trim_larger_border.png>

The only difference between the files is the border around them. In my eng.unicharambigs file I have added the following lines:

    3 : I I 3 : / / 1
    3 : / I 3 : / / 1
    3 : I / 3 : / / 1
    5 . c o m l 5 . c o m / 1
    3 : / l 3 : / / 1
    3 : l / 3 : / / 1

When I run Tesseract on the file without spacing, I get the following output:

    http:II11111111111111111111111111111111111111111 1111111111111111111.com/

When I run Tesseract on the file with spacing, I get the correct output:

    http://11111111111111111111111111111111111111111 1111111111111111111.com/

Another example of spacing (or something else?) making a difference:

Smaller border:
<https://lh3.googleusercontent.com/-1zpwtv5-dCo/V2Gw2EEtngI/AAAAAAAAazU/q8CAfPO1uwE8jnv7KM61qrGKhY6qiKM0QCLcB/s1600/djh7_small_border.png>

Larger border:
<https://lh3.googleusercontent.com/-N0rpjxGgZB8/V2Gw52DggDI/AAAAAAAAazc/derCJqYiH30NRrggg32_3igODaoAw3DzwCLcB/s1600/djh7_large_border.png>

Both of these files have spacing around the text, with the first image having less spacing. (The font also differs between the two images, though only very slightly.)

Running Tesseract on the first file gives the correct result, except for 6 -> G and the last / becoming l:

    http://alphaGl.com/primenumbershittingbearl

On the second image I get this output:

    http://alpha61.comIprimenumbershittingbearl

It seems as if the unicharambigs file is ignored for the .com/ case: the substitution I specified is not applied.

Can you think of anything that would fix this problem?

On Friday, 3 June 2016 18:39:38 UTC+2, Stef wrote:
>
> Here you are: SO answer.
> <http://stackoverflow.com/questions/37533524/tweak-tesseract-for-better-detection-of-urls-in-image/37602220#37602220>
>
> On Friday, 3 June 2016 18:31:47 UTC+2, John Muccigrosso wrote:
>>
>> On Thursday, June 2, 2016 at 5:21:51 PM UTC-4, Stef wrote:
>>>
>>> You can resolve the ambiguity using the unicharambigs file, for details
>>> see my SO answer to your SO question.
>>>
>>> Stef
>>
>> I'm curious about this as well. Could you post a link to this discussion?
>>
>> Thanks.
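P.S. To rule out a plain formatting problem in the entries themselves, here is a rough sketch of a sanity check. It assumes the v1 unicharambigs layout as I understand it from the Tesseract training documentation: five tab-separated fields (source count, space-separated source characters, target count, space-separated target characters, and a type of 0 or 1). The function name check_ambig_line is made up for this sketch and is not part of Tesseract. It only checks the syntax of a line; it says nothing about why the border size changes the result.

```python
def check_ambig_line(line):
    """Return a list of problems found in one v1-style unicharambigs entry.

    Assumed layout (not verified against every Tesseract version):
    count <TAB> chars <TAB> count <TAB> chars <TAB> type
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated fields, found {len(fields)}"]

    src_count, src, dst_count, dst, kind = fields
    problems = []

    # Each declared count must match the number of space-separated characters.
    for label, count, chars in (("source", src_count, src),
                                ("target", dst_count, dst)):
        listed = len(chars.split())
        if not count.isdigit() or int(count) != listed:
            problems.append(
                f"{label} count is {count!r} but {listed} characters are listed"
            )

    # Type 1 is a mandatory substitution, type 0 an optional one.
    if kind not in ("0", "1"):
        problems.append(f"type must be 0 or 1, found {kind!r}")

    return problems


if __name__ == "__main__":
    # The entries from this message, written with explicit tabs.
    entries = [
        "3\t: I I\t3\t: / /\t1",
        "3\t: / I\t3\t: / /\t1",
        "3\t: I /\t3\t: / /\t1",
        "5\t. c o m l\t5\t. c o m /\t1",
        "3\t: / l\t3\t: / /\t1",
        "3\t: l /\t3\t: / /\t1",
    ]
    for entry in entries:
        print(repr(entry), check_ambig_line(entry) or "ok")
```

A line whose fields were separated by spaces instead of tabs (easy to do when pasting) would be reported as having the wrong number of fields, which is the main thing this is meant to catch.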

