Re: Training individual characters in an existing language

Shree Devi Kumar Mon, 22 Apr 2013 03:31:16 -0700

Please look at the unicharambigs file for your language. You can add these
substitutions to it and recombine the traineddata without doing any
additional training.
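
A rough sketch of that unpack / edit / recombine cycle, written in Python
around the combine_tessdata tool that ships with Tesseract 3.x. It assumes
the -u (unpack all components) and -o (overwrite a component) options behave
as described in the combine_tessdata man page. The language code "dan", the
directory name and the helper names are only illustrative, so please check
the options against your installed version and back up the traineddata first.

```python
import subprocess

def recombine_commands(lang="dan", tessdata_dir="."):
    """Build the two combine_tessdata calls (assumed Tesseract 3.x syntax)."""
    traineddata = f"{tessdata_dir}/{lang}.traineddata"
    prefix = f"{tessdata_dir}/{lang}."
    # Unpack every component next to the traineddata, e.g. dan.unicharambigs
    unpack = ["combine_tessdata", "-u", traineddata, prefix]
    # After editing, write only the unicharambigs component back
    overwrite = ["combine_tessdata", "-o", traineddata, prefix + "unicharambigs"]
    return unpack, overwrite

def run(cmd):
    """Run one command and raise if the tool reports an error."""
    subprocess.run(cmd, check=True)
```

Typical use would be run(unpack), then editing the extracted unicharambigs
file by hand, then run(overwrite).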


Please see http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3,
section "The last file (unicharambigs)":

> The final data file that Tesseract uses is called unicharambigs. It
> represents the intrinsic ambiguity between characters or sets of
> characters, and is currently entirely manually generated. To understand the
> file format, look at the following example:
>
> v1
> 3       I I 0   2       u o     3
> 3       I - I   1       H       2
> 2       ' '     1       "       1
>
> 2       ಕೊ 6    1       ಕೋ     1
> 1       m       2       r n     0
> 3       i i i   1       m       0
>
> The first line is a version identifier. The remaining lines consist of 5
> tab-separated fields. The first field is the number of strings in the
> second field. The 3rd field is the number of strings in the 4th field, and
> the 5th field is a type indicator. The 2nd and 4th fields consist of a
> number of space-separated strings. As with the other files, this is a UTF-8
> format file, and therefore each string is a UTF-8 string. Each of these
> strings must match the first field of some line in the unicharset file, i.e.
> it must be a recognizable unit.
>
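
To make the five-field layout concrete, here is a small Python sketch that
builds and checks one unicharambigs line. The function names are
hypothetical, and the type value is simply passed through: which value to
use (the sample below uses 1) should be checked against the wiki page. Note
that the real file needs literal tab characters between fields; the example
quoted above shows them as spaces because of the mail formatting.

```python
def format_ambig_line(src_units, dst_units, kind):
    """Return one tab-separated unicharambigs line."""
    return "\t".join([
        str(len(src_units)), " ".join(src_units),   # fields 1 and 2
        str(len(dst_units)), " ".join(dst_units),   # fields 3 and 4
        str(kind),                                  # field 5: type indicator
    ])

def parse_ambig_line(line):
    """Split a line into (source units, target units, type) and check counts."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError("expected 5 tab-separated fields")
    n_src, src, n_dst, dst, kind = fields
    src_units, dst_units = src.split(" "), dst.split(" ")
    if int(n_src) != len(src_units) or int(n_dst) != len(dst_units):
        raise ValueError("count field does not match number of strings")
    return src_units, dst_units, int(kind)

# Hypothetical entry for the "mm read as nn" case from the question
line = format_ambig_line(["n", "n"], ["m", "m"], 1)
```

Every unit used this way ("n", "m", "á", "å", ...) still has to exist in the
unicharset of the language, as the quoted text says.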

If that doesn't work, you can try post-processing the OCR output. VietOCR
allows a user-defined substitution file for this purpose.
See http://vietocr.sourceforge.net/usage.html, section on post-processing:

> In addition to the built-in text postprocessing algorithm, you can add your
> own custom text replacement scheme via a text file named x.DangAmbigs.txt,
> where x is the ISO639-3 language code. The UTF-8-encoded file should
> contain equal sign-delimited oldValue=newValue pairs.
>
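
If you would rather post-process outside VietOCR, the same oldValue=newValue
idea can be sketched in a few lines of Python. This is not VietOCR's code:
the function names are made up, and treating each pair as a plain substring
replacement applied in file order is an assumption on my part.

```python
def load_rules(text):
    """Parse oldValue=newValue lines; lines without '=' are skipped."""
    rules = []
    for raw in text.splitlines():
        entry = raw.strip()
        if "=" not in entry:
            continue
        old, new = entry.split("=", 1)   # split on the first '=' only
        if old:
            rules.append((old, new))
    return rules

def apply_rules(text, rules):
    """Apply each replacement in order to the OCR output."""
    for old, new in rules:
        text = text.replace(old, new)
    return text

# Hypothetical rule for the Danish case: á was recognized where å was meant
rules = load_rules("á=å\n")
fixed = apply_rules("pá gárden", rules)
```

A blanket rule such as nn=mm would also change words where "nn" is correct,
so rules like that are better written against whole words.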

Shree Devi Kumar
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


On Mon, Apr 22, 2013 at 2:00 PM, Attila Sukosd
<[email protected]> wrote:

> Hi all,
>
> I'm trying to run some OCR on some old-ish Danish datasets from 1970+, and
> it seems like some of the characters are consistently recognized wrong:
>
> å => á
> mm => nn
> : => e
> l => 1
>
> Is there any way to improve on the recognition of these individual
> characters without having to retrain the complete font?
> I've found a lot of documents on how to train a completely new font, but
> not a lot on how to improve on existing ones.
>
> Best,
>
> Attila
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

