Re: Tesseract Training

KHEM Sochenda Sun, 16 Jan 2011 20:01:56 -0800

Dear Dmitry,

Thank you again for a very quick response.


I am going to train Tesseract for the Khmer language, in which there are many
ligatures similar to the "fi" ligature in some Latin fonts.
The attachment shows an example of a one-line Khmer sentence; please
count the boxes from left to right. You can see that some glyphs sit above
others. The first glyph is formed of two Unicode characters, whereas the
third glyph and the fifth glyph together form a single Unicode character.
This is the reason why I wish to give each glyph its own ID and then do
post-processing afterward.
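
A minimal sketch of such a post-processing step, in Python, assuming the
recognised text contains fixed-width IDs of the form "k" plus three digits
(as in the "k001" example discussed in this thread). The table
GLYPH_TO_UNICODE and the function decode_ids are hypothetical names, and the
Khmer code points are placeholders for illustration, not a real glyph
inventory:

```python
import re

# Hypothetical mapping from provisional glyph IDs to Unicode sequences.
# The entries are placeholders for illustration only.
GLYPH_TO_UNICODE = {
    "k001": "\u1780\u17B6",  # one glyph made of two characters (KA + vowel sign AA)
    "k002": "\u1781",        # KHA
}

# Assumed ID format: the letter "k" followed by exactly three digits.
ID_PATTERN = re.compile(r"k\d{3}")


def decode_ids(text):
    """Replace each known glyph ID with its Unicode sequence.

    IDs that are missing from the table are left unchanged so that they
    stay visible in the output and can be fixed later.
    """
    return ID_PATTERN.sub(
        lambda m: GLYPH_TO_UNICODE.get(m.group(0), m.group(0)), text
    )
```

For example, decode_ids("k001k002") would return the three characters KA,
vowel sign AA and KHA in that order.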

Regarding two glyphs that overlap each other, as in the case of the
7th and 8th glyphs, how will Tesseract segment them? How should I
specify the positions of the boxes?
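
For reference, a minimal Python sketch for finding overlapping boxes in a
.box file, assuming the Tesseract 3 layout of one box per line in the form
"<label> <left> <bottom> <right> <top> <page>", with coordinates measured
from the bottom-left corner of the image. The names Box, parse_box_line,
overlaps and overlapping_pairs are hypothetical helpers, not part of
Tesseract. The sketch says nothing about how Tesseract itself segments
overlapping glyphs; it only flags such boxes so they can be reviewed by hand.
The 24-byte label limit is the one quoted in this thread for version 3.0x.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    label: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


MAX_LABEL_BYTES = 24  # limit quoted in this thread for Tesseract 3.0x


def parse_box_line(line: str) -> Box:
    """Parse one line of the assumed '<label> l b r t page' format."""
    # Split from the right so that only the last five fields are numeric.
    label, left, bottom, right, top, page = line.strip().rsplit(None, 5)
    if len(label.encode("utf-8")) > MAX_LABEL_BYTES:
        raise ValueError("label longer than %d bytes: %r" % (MAX_LABEL_BYTES, label))
    return Box(label, int(left), int(bottom), int(right), int(top), int(page))


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share some area on the same page."""
    return (
        a.page == b.page
        and a.left < b.right and b.left < a.right
        and a.bottom < b.top and b.bottom < a.top
    )


def overlapping_pairs(boxes: List[Box]) -> List[Tuple[str, str]]:
    """Return the labels of every pair of boxes that overlap."""
    return [
        (a.label, b.label)
        for i, a in enumerate(boxes)
        for b in boxes[i + 1:]
        if overlaps(a, b)
    ]
```

Running overlapping_pairs over the parsed lines of a box file lists the
pairs (such as the 7th and 8th glyphs here) whose rectangles intersect.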


Thank you very much in advance for your response.


Best Regards,

Sochenda


On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:

> Dear Sochenda,
>
> I'm not sure what the ultimate goal of your code assignment is, but the
> formal answer to your question is "Yes". You can assign "k001" or "k002" to
> a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
> character sequence. In Tess version 3.0x (current) the only restriction is a
> 24-byte limit on the length of the entire character sequence. This also
> allows you to use not only an abstract code like "k001" but also a
> meaningful character sequence from your real language (e.g. the well-known
> "fi" ligature in some Latin fonts), which then relieves you of the need for
> pre- and post-processing.
>
> If you still prefer using abstract codes, then pre-/post-processing can be
> done without tinkering with Tess's code. Since both training and
> recognition generate output files, you can develop a couple of
> file-processing command-line utilities that can be used along with
> calls to the Tesseract executable within shell scripts (or .bat files on
> Windows).
>
> For further details you should definitely study the
> "TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract
> 3.00") documents thoroughly (
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
> not easy to search, but they contain all the info you might need.
>
> Warm regards,
> Dmitry Silaev
>
>
>
>
>
> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:
>
>>
>> Dear Dmitry,
>>
>> Thank you very much for a comprehensive explanation.
>> To get straight to the point: is it OK to assign a code like 'k001'
>> or 'k002' to a glyph obtained from Tesseract's segmentation?
>>
>> For post-processing that involves touching the Tesseract code, could you
>> please point out which files I should modify? Please also advise whether
>> the latest version of Tesseract will do.
>>
>> Thank you very much in advance for your time and response.
>>
>> Best Regards,
>>
>> Sochenda
>>
>>
>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>
>>> Chenda,
>>>
>>> In fact, Tesseract doesn't care whether you train it on a letter from a
>>> real language, or which language that letter belongs to. Put simply, Tess
>>> only saves the mapping from feature sets obtained in training to Unicode
>>> ids. This implies that during training you can assign virtually any
>>> character code to virtually any glyph (to be exact, to a connected component
>>> or to a set of connected components).
>>>
>>> If your language's script comprises a reasonable number of joined
>>> character combinations, then while training you can assign every such
>>> combination a predefined Unicode id (some restrictions apply). Later, when
>>> running recognition, you should do some post-processing to decode your
>>> predefined ids into the real language's character sequences.
>>>
>>> For good results, all this requires you to develop a training file
>>> pre-processor (mapping: language char combinations -> provisional ids) and a
>>> recognition result post-processor (mapping: provisional ids -> language char
>>> sequences). I'm not sure, but this may also require correcting the character
>>> property bit masks in the unicharset file (I don't know exactly how this
>>> information is used by Tess, as I don't need it in my project).
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>>
>>>
>>>
>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda
>>> <[email protected]> wrote:
>>>
>>>> Dear Tesseract Team,
>>>>
>>>> In the step of training a new language, we have to assign a Unicode value
>>>> to each box.
>>>> I would like to know what to do if a shape is composed of several Unicode
>>>> characters.
>>>> Is there any way to assign only an ID to each box in Tesseract?
>>>>
>>>> Thank you very much in advance for your response.
>>>>
>>>> Best Regards,
>>>> Chenda
>>>>
>>>>  --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected].
>>>> To unsubscribe from this group, send email to
>>>> [email protected]<tesseract-ocr%[email protected]>
>>>> .
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>
>>>
>>
>

<<attachment: example of Khmer sentence.TIF>>
