Hi Stef, 
Thanks for the reply (here and on SO).

The fix mostly works, but unfortunately I am still seeing that tesseract 
sometimes ignores the unicharambigs file I set for it.

For example I have the following two images:


<https://lh3.googleusercontent.com/-DviQndEfN4U/V2Guw9Vnz_I/AAAAAAAAays/CVNEoOO7BSYSu442aDBDE2YTB6kVvdMVwCLcB/s1600/djh5_trim.png>

And:

<https://lh3.googleusercontent.com/-LmhBq6IVGE0/V2Gu46UNj1I/AAAAAAAAay0/gnLIr-dUGngoqhbDdNCCPueBsemUu_HIQCLcB/s1600/djh5_trim_larger_border.png>



The only difference between the files is the border around them.


In my eng.unicharambigs file I have added the following lines:


3    : I I    3    : / /    1
3    : / I    3    : / /    1
3    : I /    3    : / /    1
5    . c o m l    5    . c o m /    1
3    : / l    3    : / /    1
3    : l /     3    : / /    1
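As a sanity check on the lines above, here is a minimal sketch that parses each entry and confirms the declared character counts match the characters that follow. It is not part of tesseract. The function name `parse_ambig_line` is made up for illustration, and it assumes fields can be split on any whitespace (as far as I understand the documented v1 format, the real file is expected to use tabs between fields, which may not have survived pasting here).

```python
def parse_ambig_line(line):
    """Split one unicharambigs-style line into (source, target, type)."""
    tokens = line.split()  # assumption: any whitespace separates fields
    if not tokens:
        raise ValueError("empty line")
    n_src = int(tokens[0])
    if len(tokens) < n_src + 2:
        raise ValueError("too few fields for the declared source count")
    n_dst = int(tokens[1 + n_src])
    expected = n_src + n_dst + 3  # two counts, the characters, one type flag
    if len(tokens) != expected:
        raise ValueError(f"expected {expected} fields, found {len(tokens)}")
    src = tokens[1:1 + n_src]
    dst = tokens[2 + n_src:2 + n_src + n_dst]
    return src, dst, int(tokens[-1])


# The six entries exactly as quoted in this message.
LINES = [
    "3    : I I    3    : / /    1",
    "3    : / I    3    : / /    1",
    "3    : I /    3    : / /    1",
    "5    . c o m l    5    . c o m /    1",
    "3    : / l    3    : / /    1",
    "3    : l /     3    : / /    1",
]

for line in LINES:
    src, dst, kind = parse_ambig_line(line)
    print("".join(src), "->", "".join(dst), "type", kind)
```

A line whose counts do not match its characters raises `ValueError`, which at least rules out a malformed entry as the cause.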


When I run tesseract on the file without the extra border (the first image), I get the following output:


http:II11111111111111111111111111111111111111111
1111111111111111111.com/


When I run tesseract on the file with the larger border (the second image), I get the correct output:


http://11111111111111111111111111111111111111111
1111111111111111111.com/


Another example of spacing (or something else?) making a difference:


Smaller border:


<https://lh3.googleusercontent.com/-1zpwtv5-dCo/V2Gw2EEtngI/AAAAAAAAazU/q8CAfPO1uwE8jnv7KM61qrGKhY6qiKM0QCLcB/s1600/djh7_small_border.png>



Larger border:

<https://lh3.googleusercontent.com/-N0rpjxGgZB8/V2Gw52DggDI/AAAAAAAAazc/derCJqYiH30NRrggg32_3igODaoAw3DzwCLcB/s1600/djh7_large_border.png>





Both of these files have spacing around the text, with the first image having 
less.  (The font also differs very slightly between the two images.)


Running tesseract on the first file gives a mostly correct result: 
http://alphaGl.com/primenumbershittingbearl (except for 6 -> G and the last / 
becoming l).


On the second image I get the output 
http://alpha61.comIprimenumbershittingbearl.  It seems as if the 
unicharambigs file is ignored for the .com/ case.  It doesn't do the 
substitution as specified.
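To narrow this down, here is a rough sketch that checks which of the listed rules have a source sequence that actually occurs in a given OCR output. It is only a plain, case-sensitive substring test on the final text. Tesseract applies ambiguities internally during recognition, so this is a hint rather than a faithful model. `RULES` and `matching_rules` are names invented for this example.

```python
# (source, target) pairs copied from the unicharambigs lines quoted above,
# with the space-separated characters joined together.
RULES = [
    (":II", "://"),
    (":/I", "://"),
    (":I/", "://"),
    (".coml", ".com/"),
    (":/l", "://"),
    (":l/", "://"),
]


def matching_rules(text, rules=RULES):
    """Return the rules whose source sequence occurs verbatim in text."""
    return [(src, dst) for src, dst in rules if src in text]


# Output for the first pair of images (small border).
print(matching_rules("http:II11111111111111111111111111111111111111111"))
# Output for the second pair of images (larger border).
print(matching_rules("http://alpha61.comIprimenumbershittingbearl"))
```

On the text as pasted in this message, the first call finds the `:II` rule, so a matching rule exists there and was apparently not applied. The second call finds nothing: the listed rule covers `.com` followed by a lowercase `l`, while this output appears to contain an uppercase `I`. If the real file and output match what is pasted here, a separate `.comI` entry may be needed for that case.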


Is there anything you can think of to fix this problem?












On Friday, 3 June 2016 18:39:38 UTC+2, Stef wrote:
>
> Here you are: SO answer. 
> <http://stackoverflow.com/questions/37533524/tweak-tesseract-for-better-detection-of-urls-in-image/37602220#37602220>
>  
>
> On Friday, 3 June 2016 18:31:47 UTC+2, John Muccigrosso wrote:
>>
>> On Thursday, June 2, 2016 at 5:21:51 PM UTC-4, Stef wrote:
>>>
>>> You can resolve the ambiguity using the unicharambigs file, for details 
>>> see my SO answer to your SO question.
>>>
>>> Stef
>>>
>>
>> I'm curious about this as well. Could you post a link to this discussion?
>>
>> Thanks. 
>>
>
