Hi Bojidar,
Thanks for your reply. Yes, you are right: I can rely on the well-defined 
structure of URLs to work around this problem. But having set the values in 
the unicharambigs file, I expected tesseract's output to be reliable, and 
I'm disappointed that it is not. If another user has the same problem but 
not such easy-to-fix output strings, she will be more stuck than I am.

It still looks like a tesseract bug to me.

Anyway, thanks again for the effort.

Regards



On Thursday, 16 June 2016 12:36:37 UTC+2, Bojidar Stanchev wrote:
>
> Well, you could just run a simple program on the output of tesseract to 
> find and correct those mistakes.
> In your case, if you expect http:// and you see http:II, then it should be 
> a no-brainer to just change it to http://. It's an easy case because those 
> two slashes are always there.
> Another thing is that after .com there is probably either nothing or a 
> slash, so there are not many cases there either.
> Give it a quick search; maybe there is already a program that checks 
> URLs.
> tl;dr: with URLs being standardised, it is pretty easy to create a program 
> that detects errors in links and corrects them.
>
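
For anyone who finds this thread later, here is a rough sketch of the kind of post-processing suggested in the quoted message. It is only an illustration, not part of tesseract: the function name fix_url_scheme and the set of characters treated as slash look-alikes are my own assumptions, and it only repairs the separator right after "http:" or "https:".

```python
import re

# Characters that OCR may produce in place of "/" (an assumed set,
# adjust it to whatever misreads you actually observe).
_SLASH_LOOKALIKES = "/Il1|"

# "http:" or "https:" followed by exactly two slash look-alikes.
_SCHEME_RE = re.compile(
    r"\b(https?):[" + re.escape(_SLASH_LOOKALIKES) + r"]{2}"
)


def fix_url_scheme(text: str) -> str:
    """Rewrite misread scheme separators such as 'http:II' to 'http://'."""
    return _SCHEME_RE.sub(r"\1://", text)
```

Calling fix_url_scheme on the OCR output leaves correctly recognised URLs unchanged and rewrites only the two characters after the scheme. It does nothing for misreads elsewhere in the URL, which is why I would still prefer tesseract to honour unicharambigs.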
