Re: Tesseract Training

Dmitry Silaev Sun, 16 Jan 2011 00:48:11 -0800

Dear Sochenda,

I'm not sure what the ultimate goal of your code assignment is, but the formal
answer to your question is "Yes". You can assign "k001" or "k002" to a
bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
character sequence. In Tess version 3.0x (current) the only restriction is a
24-byte limit on the length of the entire character sequence. This means you
can use not only an abstract code like "k001" but also a meaningful character
sequence from your real language (e.g. the well-known "fi" ligature in some
Latin fonts), which relieves you of the need for pre- and
post-processing.
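
To illustrate the length restriction, here is a minimal Python sketch that
flags box-file labels longer than 24 bytes once encoded as UTF-8. It assumes
that each .box line starts with the label followed by whitespace-separated
coordinates, and that a label of exactly 24 bytes is still acceptable; the
function name and the sample lines are hypothetical, not part of Tesseract.

```python
MAX_LABEL_BYTES = 24  # limit quoted above for Tess 3.0x (assumed inclusive)


def oversized_labels(box_lines, limit=MAX_LABEL_BYTES):
    """Return (line_number, label) pairs whose UTF-8 label exceeds the limit.

    Assumes the label is the first whitespace-separated field of each line.
    """
    bad = []
    for number, line in enumerate(box_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        label = fields[0]
        if len(label.encode("utf-8")) > limit:
            bad.append((number, label))
    return bad


# Hypothetical sample: an abstract code, a real sequence, and a long label.
sample = [
    "k001 10 20 30 40 0",
    "fi 12 20 28 44 0",
    "\u1780" * 9 + " 5 5 50 50 0",  # 9 characters, 3 bytes each = 27 bytes
]
problems = oversized_labels(sample)
```

Note that the limit is counted in bytes, not characters, so scripts whose
characters take several bytes in UTF-8 reach it sooner than Latin text does.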


If you still prefer abstract codes, the pre-/post-processing can be
done without tinkering with Tess's code. Since both training and
recognition produce output files, you can write a couple
of command-line file-processing utilities and call them alongside
the Tesseract executable from shell scripts (or .bat files on
Windows).
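
As a rough sketch of such a post-processor, the Python below replaces
abstract codes in recognized text with the character sequences they stand
for. The code table, its contents, and the function name are hypothetical;
your own table would list the combinations from your script.

```python
import re

# Hypothetical table: abstract code -> real character sequence.
CODE_TABLE = {"k001": "fi", "k002": "fl"}


def decode(text, table=CODE_TABLE):
    """Replace every abstract code found in text with its mapped sequence."""
    if not table:
        return text
    # Try longer codes first so a code that is a prefix of another
    # code cannot shadow it.
    codes = sorted(table, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))
    return pattern.sub(lambda match: table[match.group(0)], text)


decoded = decode("k001nd the k002ag")
```

The matching pre-processor would apply the inverse table to the labels in
your .box files. One thing to watch: a code such as "k001" must never occur
as ordinary text in your documents, otherwise it will be decoded by mistake.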

For further details you should definitely study the
"TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract
3.00") documents thoroughly (
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are not
easy to search, but they contain all the info you might need.

Warm regards,
Dmitry Silaev




On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:

>
> Dear Dmitry,
>
> Thank you very much for the comprehensive explanation.
> To get straight to the point: is it OK to assign a code like 'k001'
> or 'k002' to a glyph obtained from Tesseract's segmentation?
>
> For post-processing that touches the Tesseract code, could you please point
> out which files I should modify? Please also advise whether the latest
> version of Tesseract will do.
>
> Thank you very much in advance for your time and response back.
>
> Best Regards,
>
> Sochenda
>
>
> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>
>> Chenda,
>>
>> In fact Tesseract doesn't care whether you train it on a real language's
>> letter or which language that letter belongs to. Put simply, Tess
>> only saves the mapping from feature sets obtained in training to Unicode
>> ids. This implies that during training you can assign virtually any
>> character code to virtually any glyph (to be exact, to a connected component
>> or to a set of connected components).
>>
>> If your language's script consists of a reasonable number of joint
>> character combinations, then during training you can assign every such
>> combination a predefined Unicode id (some restrictions apply). Later, when
>> running recognition, you should do some post-processing to decode your
>> predefined ids into the real language's character sequences.
>>
>> For good results, all this requires you to develop a training file
>> pre-processor (mapping: language char combinations -> provisional ids) and a
>> recognition result post-processor (mapping: provisional ids -> language char
>> sequences). I'm not sure, but this may also require correcting the character
>> property bit masks in the unicharset file (I don't know exactly how this
>> information is used by Tess, as I don't need it in my project).
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <[email protected]> wrote:
>>
>>> Dear Tesseract Team,
>>>
>>> In the step of training a new language, we have to assign a Unicode value
>>> to each box.
>>> I would like to know what to do when a shape is composed of several Unicode
>>> characters.
>>> Is there any way to assign only an id to each box in Tesseract?
>>>
>>> Thank you very much in advance for your response.
>>>
>>> Best Regards,
>>> Chenda
>>>
>>>  --
>>> You received this message because you are subscribed to the Google Groups
>>> "tesseract-ocr" group.
>>> To post to this group, send email to [email protected].
>>> To unsubscribe from this group, send email to
>>> [email protected]<tesseract-ocr%[email protected]>
>>> .
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>
>>
>
