Re: Tesseract Training

KHEM Sochenda Sun, 16 Jan 2011 22:25:37 -0800

I drew the boxes in the image manually.

On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
[email protected]> wrote:


> Which tool did you use to create the boxes? Please also upload the box file
> you generated.
>
>
> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <[email protected]> wrote:
>
>> Dear Dmitry,
>>
>> Thank you again for a very quick response.
>>
>> I am going to train Tesseract for the Khmer language, in which many
>> ligatures behave like the "fi" ligature in some Latin fonts.
>> The attachment shows an example of a one-line Khmer sentence; please
>> count the boxes from left to right. You can see that some glyphs sit above
>> others. The first glyph is formed of two Unicode characters, whereas the
>> third glyph and the fifth glyph together form a single Unicode character.
>> This is the reason why I wish to give each glyph its own ID and then do
>> post-processing afterward.
>>
>> Regarding two glyphs that overlap each other, as in the case of the
>> 7th and 8th glyphs, how will Tesseract segment them? How should I
>> specify the positions of the boxes?
>>
>>
>> Thank you very much in advance for your response.
>>
>>
>> Best Regards,
>>
>> Sochenda
>>
>>
>>
>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:
>>
>>> Dear Sochenda,
>>>
>>> I'm not sure what the ultimate goal of your code assignment is, but the
>>> formal answer to your question is "Yes". You can assign "k001" or "k002" to
>>> a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
>>> character sequence. In Tess version 3.0x (current) the only restriction is a
>>> 24-byte limit on the length of the entire character sequence. This also lets
>>> you use not only an abstract code like "k001" but also a meaningful character
>>> sequence from your real language (e.g. the well-known "fi" ligature in some
>>> Latin fonts), which then relieves you from pre- and post-processing.
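
A minimal sketch of checking the 24-byte limit mentioned above before training. It assumes the Tesseract 3.0x box-file line layout `<label> <left> <bottom> <right> <top> <page>`; the function name and the sample lines are hypothetical. Whether exactly 24 bytes is accepted is not stated in the thread, so this check treats 24 as allowed.

```python
MAX_LABEL_BYTES = 24  # limit quoted in the message above for Tess 3.0x


def oversized_labels(box_lines, limit=MAX_LABEL_BYTES):
    """Return (line_number, label, byte_length) for labels over the limit."""
    problems = []
    for number, line in enumerate(box_lines, start=1):
        line = line.strip()
        if not line:
            continue
        # Assumed layout: the label comes first, followed by integers.
        label = line.split(" ", 1)[0]
        size = len(label.encode("utf-8"))  # the limit is in bytes, not characters
        if size > limit:
            problems.append((number, label, size))
    return problems


# Hypothetical box lines: an abstract code, a Latin ligature written out,
# a short Khmer sequence, and one deliberately oversized label.
sample = [
    "k001 10 20 40 60 0",
    "fi 45 20 70 62 0",
    "ក្ក 75 10 110 64 0",
    "x" * 25 + " 1 2 3 4 0",
]
print(oversized_labels(sample))
```

Counting bytes rather than characters matters here because each Khmer code point takes three bytes in UTF-8, so a label of only a few visible glyphs can approach the limit quickly.
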
>>>
>>> If you still prefer using abstract codes, then pre-/post-processing can be
>>> done without tinkering with Tess's code. Since both training and
>>> recognition produce output files, you can develop a couple of
>>> file-processing command-line utilities, which can then be used along with
>>> calls to the Tesseract executable within shell scripts (or .bat files in
>>> Windows).
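
A sketch of what the recognition-side utility could look like. It assumes the provisional codes have the form "k" plus three digits and appear verbatim in the recognized text; the table contents and the function name are hypothetical, using the "fi" ligature from the thread as the example.

```python
import re

# Hypothetical table: provisional code -> real character sequence.
CODE_TO_TEXT = {"k001": "fi", "k002": "fl"}

# Assumed code shape: the letter "k" followed by exactly three digits.
CODE_PATTERN = re.compile(r"k\d{3}")


def decode(text, table=CODE_TO_TEXT):
    """Replace every known provisional code; leave unknown codes untouched."""
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


print(decode("k001nd the k002ag, k999"))  # prints "find the flag, k999"
```

One caveat with this design: if the real text can legitimately contain something shaped like a code (for example "k123"), the codes need a form that cannot collide with it.
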
>>>
>>> For further details you should definitely study the
>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract
>>> 3.00") documents thoroughly (
>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
>>> not easy to search, but they contain all the info you might need.
>>>
>>> Warm regards,
>>> Dmitry Silaev
>>>
>>>
>>>
>>>
>>>
>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:
>>>
>>>>
>>>> Dear Dmitry,
>>>>
>>>> Thank you very much for a comprehensive explanation.
>>>> To get straight to the point: is it OK to assign a code like
>>>> 'k001' or 'k002' to a glyph obtained from Tesseract's segmentation?
>>>>
>>>> For post-processing that involves touching Tesseract's code, could you
>>>> please point out which files I should modify? Please also advise whether
>>>> the latest version of Tesseract will do.
>>>>
>>>> Thank you very much in advance for your time and response back.
>>>>
>>>> Best Regards,
>>>>
>>>> Sochenda
>>>>
>>>>
>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>>>
>>>>> Chenda,
>>>>>
>>>>> In fact, Tesseract doesn't care whether you train on a letter of a real
>>>>> language, or which language that letter belongs to. Put simply, Tess only
>>>>> saves the mapping from feature sets obtained in training to Unicode ids.
>>>>> This implies that during training you can assign virtually any character
>>>>> code to virtually any glyph (to be exact, to a connected component or to a
>>>>> set of connected components).
>>>>>
>>>>> If your language's script comprises a reasonable number of joined
>>>>> character combinations, then while training you can assign every such
>>>>> combination a predefined Unicode id (some restrictions apply). Later, when
>>>>> running recognition, you should do some post-processing to decode your
>>>>> predefined ids into the real language's character sequences.
>>>>>
>>>>> For good results, all this requires you to develop a training file
>>>>> pre-processor (mapping: language char combinations -> provisional ids) and
>>>>> a recognition result post-processor (mapping: provisional ids -> language
>>>>> char sequences). I'm not sure, but this may also require correcting the
>>>>> character property bit masks in the unicharset file (I don't know exactly
>>>>> how this information is used by Tess as I don't need it in my project).
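
The training-side mapping described above could be sketched roughly as follows. This assumes box lines of the form `<label> <left> <bottom> <right> <top> <page>` and provisional ids of the form "k001", "k002", and so on; the function names and sample data are hypothetical, and the unicharset question raised above is not addressed.

```python
def build_tables(sequences, prefix="k"):
    """Assign k001, k002, ... to distinct sequences in first-seen order."""
    to_code = {}
    for seq in sequences:
        if seq not in to_code:
            to_code[seq] = f"{prefix}{len(to_code) + 1:03d}"
    # The reverse table is what a recognition-side post-processor would use.
    to_text = {code: seq for seq, code in to_code.items()}
    return to_code, to_text


def relabel(box_lines, to_code):
    """Swap the leading label of each box line for its provisional id."""
    out = []
    for line in box_lines:
        label, rest = line.split(" ", 1)
        out.append(f"{to_code[label]} {rest}")
    return out


# Hypothetical box lines labelled with real character sequences.
boxes = ["fi 45 20 70 62 0", "fl 75 20 99 62 0", "fi 120 20 140 62 0"]
to_code, to_text = build_tables(line.split(" ", 1)[0] for line in boxes)
print(relabel(boxes, to_code))
print(to_text)
```

Building both tables from the same pass keeps the pre-processor and the post-processor consistent, since the reverse table is derived rather than maintained by hand.
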
>>>>>
>>>>> Warm regards,
>>>>> Dmitry Silaev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Dear Tesseract Team,
>>>>>>
>>>>>> In the step of training a new language, we have to assign a Unicode value
>>>>>> to each box.
>>>>>> I would like to know what to do if a shape is composed of several Unicode
>>>>>> characters.
>>>>>> Is there any way to assign only an ID to each box in Tesseract?
>>>>>>
>>>>>> Thank you very much in advance for your response.
>>>>>>
>>>>>> Best Regards,
>>>>>> Chenda
>>>>>>
>>>>>>  --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected].
>>>>>> To unsubscribe from this group, send email to
>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>> .
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>

