Re: Training individual characters in an existing language

Attila Sukosd Mon, 22 Apr 2013 07:10:28 -0700

Hi again,

I've looked at the unicharambigs file, but I think the problem is elsewhere.


<https://lh4.googleusercontent.com/-XrDllWLRSN4/UXUnzmx4JNI/AAAAAAAAAGE/5L4CqAnuXbQ/s1600/boundingbox.png>
In the attached image, you can see that the last word is "omkommet", but 
tesseract recognises it as "onkonnet". To me it looks like the bounding 
boxes are incorrect, mostly because the "mm" and "mk" have no character 
spacing in between them.
Is there a way to train this scenario to work better?

Cheers,

Attila




On Monday, April 22, 2013 1:54:11 PM UTC+2, Attila Sukosd wrote:
>
> Wow, thank you for the detailed reply! I will give it a try! :)
>
> Best,
>
> Attila
>
> On Monday, April 22, 2013 11:04:32 AM UTC+2, sdk wrote:
>>
>> Please look at the unicharambigs file for your language. You can add 
>> these substitutions to the same and recombine the traineddata without 
>> needing to do any additional training. 
>>
>> Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>> (section "The last file (unicharambigs)"):
>>
>>> The final data file that Tesseract uses is called unicharambigs. It 
>>> represents the intrinsic ambiguity between characters or sets of 
>>> characters, and is currently entirely manually generated. To understand the 
>>> file format, look at the following example: 
>>>
>>> v1
>>> 3       I I 0   2       u o     3
>>>
>>> 3       I - I   1       H       2
>>> 2       ' '     1       "       1
>>>
>>>
>>> 2       ಕೊ 6    1       ಕೋ     1
>>> 1       m       2       r n     0
>>> 3       i i i   1       m       0
>>>
>>> The first line is a version identifier. The remaining lines consist of 5 
>>> tab-separated fields. The first field is the number of strings in the 
>>> second field. The 3rd field is the number of strings in the 4th field, and 
>>> the 5th field is a type indicator. The 2nd and 4th fields consist of a 
>>> number of space-separated strings. As with the other files, this is a UTF-8 
>>> format file, and therefore each string is a UTF-8 string. Each of these 
>>> strings must match the first field of some line in the unicharset file, i.e. 
>>> it must be a recognizable unit. 
>>>
>>
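To make the five-field layout described in the quoted wiki text concrete, here is a minimal Python sketch that reads and writes a single rule line. The names `AmbigRule`, `parse_ambig_line` and `format_ambig_line` are made up for illustration and are not part of Tesseract; the sketch only assumes what the quote states (two counted groups of strings followed by a type indicator) and does not check the strings against a unicharset file.

```python
from typing import List, NamedTuple


class AmbigRule(NamedTuple):
    first: List[str]       # strings of the 2nd field (counted by the 1st field)
    second: List[str]      # strings of the 4th field (counted by the 3rd field)
    type_indicator: int    # 5th field, passed through unchanged


def parse_ambig_line(line: str) -> AmbigRule:
    """Parse one rule line: <n> <n strings> <m> <m strings> <type>."""
    # Splitting on any whitespace works whether the fields are separated by
    # tabs (as in the real file) or by spaces (as in the mail archive).
    tokens = line.split()
    if not tokens:
        raise ValueError("empty line")
    n = int(tokens[0])
    if len(tokens) < n + 2:
        raise ValueError(f"line too short: {line!r}")
    first = tokens[1:1 + n]
    m = int(tokens[1 + n])
    second = tokens[2 + n:2 + n + m]
    rest = tokens[2 + n + m:]
    if len(second) != m or len(rest) != 1:
        raise ValueError(f"field counts do not match: {line!r}")
    return AmbigRule(first, second, int(rest[0]))


def format_ambig_line(rule: AmbigRule) -> str:
    """Serialize a rule back into five tab-separated fields."""
    return "\t".join([
        str(len(rule.first)), " ".join(rule.first),
        str(len(rule.second)), " ".join(rule.second),
        str(rule.type_indicator),
    ])


if __name__ == "__main__":
    rule = parse_ambig_line("1\tm\t2\tr n\t0")
    print(rule)
    print(format_ambig_line(rule))
```

Deriving the counts from the lists when writing a line keeps the 1st and 3rd fields in step with the 2nd and 4th, which is easy to get wrong when editing the file by hand. The version line (`v1`) is not a rule and has to be skipped before calling the parser.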
>> If that doesn't work, you can try post-processing the OCR output. VietOCR 
>> allows a user-defined substitution file for this.
>> See http://vietocr.sourceforge.net/usage.html - section on 
>> post-processing
>>
>>> In addition to the built-in text postprocessing algorithm, you can add 
>>> your own custom text replacement scheme via a text file named 
>>> x.DangAmbigs.txt, where x is the ISO639-3 language code. The 
>>> UTF-8-encoded file should contain equal sign-delimited 
>>> oldValue=newValue pairs.  
>>>
>>
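The same kind of post-processing can be tried outside VietOCR. The sketch below reads equal sign-delimited `oldValue=newValue` pairs, as described in the quoted usage page, and applies them to OCR output as plain string replacements. The function names are hypothetical, and treating each pair as a literal (non-regex) replacement applied in file order is an assumption, not a description of VietOCR's own implementation. For Danish the file would be called `dan.DangAmbigs.txt` under the naming scheme quoted above.

```python
from typing import List, Tuple


def parse_replacements(text: str) -> List[Tuple[str, str]]:
    """Collect (old, new) pairs from lines of the form oldValue=newValue."""
    pairs = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("=", 1)  # split on the first '=' only
        if old:
            pairs.append((old, new))
    return pairs


def apply_replacements(ocr_text: str, pairs: List[Tuple[str, str]]) -> str:
    """Apply each replacement in order, as a literal substring substitution."""
    for old, new in pairs:
        ocr_text = ocr_text.replace(old, new)
    return ocr_text


if __name__ == "__main__":
    # Hypothetical rule for the word discussed in this thread.
    rules = parse_replacements("onkonnet=omkommet\n")
    print(apply_replacements("han er onkonnet", rules))
```

Whole-word rules such as the one above are safer than a blanket `nn=mm` rule, which would also rewrite words where "nn" is correct.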
>> Shree Devi Kumar
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>  
>>
>> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to run some OCR on some old-ish Danish datasets from 1970+, 
>>> and it seems like some of the characters are consistently recognized incorrectly:
>>>
>>> å => á
>>> mm => nn
>>> : => e
>>> l => 1
>>>
>>> Is there any way to improve on the recognition of these individual 
>>> characters without having to retrain the complete font?
>>> I've found a lot of documents on how to train a completely new font, but 
>>> not a lot on how to improve on existing ones.
>>>
>>> Best,
>>>
>>> Attila
>>>
>>> -- 
>>> -- 
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>  
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>  
>>>  
>>>
>>
>>
