Re: Training individual characters in an existing language

Shree Devi Kumar Tue, 23 Apr 2013 03:11:56 -0700

I have only recently started experimenting with Tesseract myself, so I passed
on the information I had found in the documentation pages.


The experts on the forum may suggest the next steps.


On Mon, Apr 22, 2013 at 5:40 PM, Attila Sukosd
<[email protected]> wrote:

> Hi again,
>
> I've looked at the unicharambigs file, but I think the problem is
> elsewhere.
>
>
> <https://lh4.googleusercontent.com/-XrDllWLRSN4/UXUnzmx4JNI/AAAAAAAAAGE/5L4CqAnuXbQ/s1600/boundingbox.png>
> In the attached image, you can see that the last word is "omkommet", but
> tesseract recognises it as "onkonnet". To me it looks like the bounding
> boxes are incorrect, mostly because the "mm" and "mk" have no character
> spacing in between them.
> Is there a way to train this scenario to work better?
>
> Cheers,
>
> Attila
>
>
>
>
> On Monday, April 22, 2013 1:54:11 PM UTC+2, Attila Sukosd wrote:
>>
>> Wow, thank you for the detailed reply! I will give it a try! :)
>>
>> Best,
>>
>> Attila
>>
>> On Monday, April 22, 2013 11:04:32 AM UTC+2, sdk wrote:
>>>
>>> Please look at the unicharambigs file for your language. You can add
>>> these substitutions to that file and recombine the traineddata without
>>> needing to do any additional training.
>>>
>>> Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
>>> section "The last file (unicharambigs)":
>>>
>>>> The final data file that Tesseract uses is called unicharambigs. It
>>>> represents the intrinsic ambiguity between characters or sets of
>>>> characters, and is currently entirely manually generated. To understand the
>>>> file format, look at the following example:
>>>>
>>>> v1
>>>> 3       I I 0   2       u o     3
>>>>
>>>> 3       I - I   1       H       2
>>>> 2       ' '     1       "       1
>>>>
>>>>
>>>>
>>>> 2       ಕೊ 6    1       ಕೋ     1
>>>> 1       m       2       r n     0
>>>> 3       i i i   1       m       0
>>>>
>>>> The first line is a version identifier. The remaining lines consist of
>>>> 5 tab-separated fields. The first field is the number of strings in the
>>>> second field. The 3rd field is the number of strings in the 4th field, and
>>>> the 5th field is a type indicator. The 2nd and 4th fields consist of a
>>>> number of space-separated strings. As with the other files, this is a UTF-8
>>>> format file, and therefore each string is a UTF-8 string. Each of these
>>>> strings must match the first field of some line in the unicharset file, i.e.
>>>> it must be a recognizable unit.
>>>>
>>>
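
The format described in the quoted documentation can be checked mechanically
before the traineddata is recombined. The following is a minimal Python
sketch, not part of Tesseract: the function names parse_ambig_line and
format_ambig_line are hypothetical, and the code relies only on the five
tab-separated fields described in the excerpt. The type indicator is passed
through unchanged, since the excerpt does not explain its values.

```python
def parse_ambig_line(line):
    """Split one unicharambigs entry into (source, target, type_indicator).

    Assumes the layout from the quoted documentation: five tab-separated
    fields, where fields 1 and 3 count the space-separated strings in
    fields 2 and 4.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    n_src, src, n_dst, dst, kind = fields
    src_parts = src.split(" ")
    dst_parts = dst.split(" ")
    if int(n_src) != len(src_parts):
        raise ValueError(f"field 1 says {n_src} strings, field 2 has {len(src_parts)}")
    if int(n_dst) != len(dst_parts):
        raise ValueError(f"field 3 says {n_dst} strings, field 4 has {len(dst_parts)}")
    return src_parts, dst_parts, int(kind)


def format_ambig_line(src_parts, dst_parts, kind):
    """Build one tab-separated entry; the counts are derived from the lists."""
    return "\t".join([
        str(len(src_parts)), " ".join(src_parts),
        str(len(dst_parts)), " ".join(dst_parts),
        str(kind),
    ])


# Hypothetical entry for the "mm" read as "nn" case from this thread.
# The type value 1 is only a placeholder; check the training documentation
# for the meaning of each type before using it.
print(format_ambig_line(["n", "n"], ["m", "m"], 1))
```

Deriving the counts from the lists avoids the most likely hand-editing
mistake, a count that no longer matches its strings. Each string must still
appear in the unicharset file, which this sketch does not check.
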
>>> If that doesn't work, you can try post-processing the OCR output.
>>> VietOCR allows a user-defined substitution file for this purpose.
>>> See http://vietocr.sourceforge.net/usage.html, section on post-processing:
>>>
>>>> In addition to the built-in text postprocessing algorithm, you can add
>>>> your own custom text replacement scheme via a text file named
>>>> x.DangAmbigs.txt, where x is the ISO639-3 language code. The
>>>> UTF-8-encoded file should contain equal sign-delimited
>>>> oldValue=newValue pairs.
>>>>
>>>
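
The oldValue=newValue scheme quoted from the VietOCR page is simple enough to
prototype outside VietOCR. Below is a minimal Python sketch, assuming plain
substring replacement applied in file order. That behaviour is an assumption,
not VietOCR's actual implementation, and the function names are hypothetical.

```python
def load_replacements(text):
    """Parse equal-sign-delimited oldValue=newValue pairs, one per line."""
    pairs = []
    for line in text.splitlines():
        if "=" not in line:
            continue  # blank line or no delimiter: nothing to parse
        old, new = line.split("=", 1)
        if old == "":
            continue  # an empty oldValue would match everywhere
        pairs.append((old, new))
    return pairs


def apply_replacements(ocr_text, pairs):
    """Apply each pair in order as a plain substring replacement."""
    for old, new in pairs:
        ocr_text = ocr_text.replace(old, new)
    return ocr_text


# Example pairs based on the misrecognitions reported in this thread.
rules = load_replacements("onkonnet=omkommet\ná=å\n")
print(apply_replacements("onkonnet pá vej", rules))
```

Whole-word pairs such as onkonnet=omkommet are safer than a blanket nn=mm
rule, which would also alter correctly recognized words that contain "nn".
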
>>> Shree Devi Kumar
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>>
>>> On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd <
>>> [email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm trying to run OCR on some old-ish Danish datasets from 1970+,
>>>> and it seems that some of the characters are consistently recognized incorrectly:
>>>>
>>>> å => á
>>>> mm => nn
>>>> : => e
>>>> l => 1
>>>>
>>>> Is there any way to improve on the recognition of these individual
>>>> characters without having to retrain the complete font?
>>>> I've found a lot of documents on how to train a completely new font,
>>>> but not a lot on how to improve on existing ones.
>>>>
>>>> Best,
>>>>
>>>> Attila
>>>>
>>>> --
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to
>>>> tesseract-oc...@googlegroups.com
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>
>>>>
>>>>
>>>

