Re: Training individual characters in an existing language

Attila Sukosd Mon, 22 Apr 2013 04:57:02 -0700

Wow, thank you for the detailed reply! I will give it a try! :)

Best,


Attila

On Monday, April 22, 2013 11:04:32 AM UTC+2, sdk wrote:
>
> Please look at the unicharambigs file for your language. You can add these 
> substitutions to the same and recombine the traineddata without needing to 
> do any additional training. 
>
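A minimal sketch of the "recombine without retraining" step, assuming the `combine_tessdata` tool that ships with Tesseract and its `-u` (unpack) and `-o` (overwrite component) options; check both against the help output of your installed version. The file names `dan.traineddata` and `work/dan.` are hypothetical placeholders. The code only builds the command lines and does not run them.

```python
import shlex


def unpack_command(traineddata, prefix):
    """Command that unpacks every component of a traineddata file to prefix*."""
    return ["combine_tessdata", "-u", traineddata, prefix]


def overwrite_command(traineddata, component):
    """Command that writes an edited component back into the traineddata file."""
    return ["combine_tessdata", "-o", traineddata, component]


# Hypothetical paths: adjust to wherever your Danish traineddata lives.
steps = [
    unpack_command("dan.traineddata", "work/dan."),
    # ... edit work/dan.unicharambigs by hand here ...
    overwrite_command("dan.traineddata", "work/dan.unicharambigs"),
]

for step in steps:
    print(" ".join(shlex.quote(part) for part in step))
```

Working on a copy of the traineddata file keeps the original available if the edited ambiguities make recognition worse.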
> Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
> section "The last file (unicharambigs)":
>
> The final data file that Tesseract uses is called unicharambigs. It 
>> represents the intrinsic ambiguity between characters or sets of 
>> characters, and is currently entirely manually generated. To understand the 
>> file format, look at the following example: 
>>
>> v1
>> 3       I I 0   2       u o     3
>>
>> 3       I - I   1       H       2
>> 2       ' '     1       "       1
>>
>>
>> 2       ಕೊ 6    1       ಕೋ     1
>> 1       m       2       r n     0
>> 3       i i i   1       m       0
>>
>> The first line is a version identifier. The remaining lines consist of 5 
>> tab-separated fields. The first field is the number of strings in the 
>> second field. The 3rd field is the number of strings in the 4th field, and 
>> the 5th field is a type indicator. The 2nd and 4th fields consist of a 
>> number of space-separated strings. As with the other files, this is a UTF-8 
>> format file, and therefore each string is a UTF-8 string. Each of these 
>> strings must match the first field of some line in the unicharset file, i.e. 
>> it must be a recognizable unit. 
>>
>
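The five-field layout described above can be sketched in a few lines of Python. This is an illustration only, not a Tesseract tool: the helper names are made up, and the type indicator `1` below is a placeholder, since the quoted text does not say what each value means. The pair `n n` to `m m` comes from the "mm => nn" confusion reported in the original question.

```python
def ambig_line(source, target, type_indicator):
    """Build one unicharambigs line: 5 tab-separated fields.

    source and target are lists of strings, each of which must also
    appear in the unicharset file for the language.
    """
    fields = [
        str(len(source)),      # field 1: number of strings in field 2
        " ".join(source),      # field 2: space-separated strings
        str(len(target)),      # field 3: number of strings in field 4
        " ".join(target),      # field 4: space-separated strings
        str(type_indicator),   # field 5: type indicator
    ]
    return "\t".join(fields)


def parse_ambig_line(line):
    """Split a line back into (source, target, type) and check the counts."""
    count1, src, count2, dst, kind = line.rstrip("\n").split("\t")
    source, target = src.split(" "), dst.split(" ")
    if int(count1) != len(source) or int(count2) != len(target):
        raise ValueError("count fields do not match the number of strings")
    return source, target, int(kind)


# Placeholder type indicator 1: look up the right value in the documentation.
line = ambig_line(["n", "n"], ["m", "m"], 1)
print(repr(line))
print(parse_ambig_line(line))
```

The file must be saved as UTF-8 with real tab characters between fields, which is why the sketch joins on `"\t"` rather than on spaces.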
> If that doesn't work, you can try post-processing the OCR output. VietOCR 
> allows a user-defined substitution file for the same.
> See http://vietocr.sourceforge.net/usage.html - section on post-processing
>
> In addition to the built-in text postprocessing algorithm, you can add 
>> your own custom text replacement scheme via a text file named 
>> x.DangAmbigs.txt, where x is the ISO639-3 language code. The 
>> UTF-8-encoded file should contain equal sign-delimited 
>> oldValue=newValue pairs.
>>
>
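For readers not using VietOCR, the same oldValue=newValue idea can be approximated with a short script. This is a sketch of plain literal replacement, not VietOCR's actual implementation; the function names are hypothetical, and the single rule `á=å` is taken from the confusions listed in the original question.

```python
def load_pairs(text):
    """Parse oldValue=newValue lines into a list of (old, new) tuples."""
    pairs = []
    for line in text.splitlines():
        if not line.strip() or "=" not in line:
            continue  # skip blank or malformed lines
        old, new = line.split("=", 1)
        if old:
            pairs.append((old, new))
    return pairs


def apply_pairs(text, pairs):
    """Apply each literal replacement in order."""
    for old, new in pairs:
        text = text.replace(old, new)
    return text


rules = load_pairs("á=å\n")
print(apply_pairs("pá gárden", rules))
```

Blind replacement is only safe for pairs like á to å, where the wrong character should not occur in Danish text at all. Rules such as nn to mm or 1 to l would also rewrite correct words and digits, so they need context that a simple substitution file cannot express.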
> Shree Devi Kumar
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>  
>
> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <[email protected]> wrote:
>
>> Hi all,
>>
>> I'm trying to run OCR on some old-ish Danish datasets from 1970 onwards, 
>> and some of the characters are consistently misrecognized:
>>
>> å => á
>> mm => nn
>> : => e
>> l => 1
>>
>> Is there any way to improve on the recognition of these individual 
>> characters without having to retrain the complete font?
>> I've found a lot of documents on how to train a completely new font, but 
>> not a lot on how to improve on existing ones.
>>
>> Best,
>>
>> Attila
>>
>> -- 
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>  
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>>  
>>  
>>
>
>

