Re: Tesseract Training

Sriranga(78yrsold) Sun, 16 Jan 2011 21:16:38 -0800

Which tool did you use to create the boxes? Please also upload the box file
you generated.


On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <[email protected]> wrote:

> Dear Dmitry,
>
> Thank you again for a very quick response.
>
> I am going to train Tesseract for the Khmer language, in which there are many
> ligatures similar to the "fi" ligature in some Latin fonts.
> The attachment shows an example of a one-line Khmer sentence; please
> count the boxes from left to right. You can see that some glyphs sit above
> others. The first glyph is formed of two Unicode characters, while the
> third glyph and the fifth glyph together form one Unicode character. This is
> the reason why I wish to give each glyph its own ID and then do the
> post-processing afterward.
>
> Regarding two glyphs that overlap each other, as in the case of the
> 7th and 8th glyphs, how will Tesseract segment them? How should I
> give the positions of the boxes?
>
>
> Thank you very much in advance for your response.
>
>
> Best Regards,
>
> Sochenda
>
>
>
> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:
>
>> Dear Sochenda,
>>
>> I'm not sure what the ultimate goal of your code assignment is, but the
>> formal answer to your question is "Yes". You can assign "k001" or "k002" to a
>> bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
>> character sequence. In Tess version 3.0x (current) the only restriction is a
>> 24-byte limit on the length of the entire character sequence. This also
>> allows you to use not only an abstract code like "k001" but also a meaningful
>> character sequence from your real language (e.g. the well-known "fi" ligature
>> in some Latin fonts), which then relieves you of the need for pre- and
>> post-processing.
>>
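A rough sketch of how box-file labels could be checked against the 24-byte
limit mentioned above. It assumes each box-file line has the form
"<label> <left> <bottom> <right> <top> <page>" and that a label of exactly
24 bytes is still acceptable; both are assumptions, and the function name and
sample lines are made up for illustration.

```python
# Sketch: flag box-file labels whose UTF-8 encoding exceeds a byte limit.
# Assumed line format: "<label> <left> <bottom> <right> <top> <page>".
MAX_LABEL_BYTES = 24  # limit quoted in the thread for Tesseract 3.0x


def overlong_labels(box_lines, limit=MAX_LABEL_BYTES):
    """Return (line_number, label, byte_length) for labels over the limit."""
    problems = []
    for number, line in enumerate(box_lines, start=1):
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or malformed line; ignored in this sketch
        label = fields[0]
        size = len(label.encode("utf-8"))  # bytes, not code points
        if size > limit:
            problems.append((number, label, size))
    return problems


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real training set.
    sample = [
        "k001 10 20 30 40 0",
        "fi 35 20 50 45 0",
        "\u1780\u17d2\u1780 55 20 80 45 0",
        "abcdefghijklmnopqrstuvwxyz 85 20 120 45 0",
    ]
    for number, label, size in overlong_labels(sample):
        print(f"line {number}: label {label!r} is {size} bytes")
```

Note that the check counts bytes rather than characters: each Khmer code
point takes three bytes in UTF-8, so a long cluster reaches the limit much
sooner than an ASCII code such as "k001".
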
>> If you still prefer using abstract codes, then pre-/post-processing can be
>> done without tinkering with Tess's code. Since both training and
>> recognition produce output files, you can develop a couple of command-line
>> file-processing utilities and call them alongside the Tesseract executable
>> from shell scripts (or .bat files on Windows).
>>
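A minimal sketch of such a post-processing utility, assuming the recognized
text is a UTF-8 file and that every abstract code is "k" followed by exactly
three digits. The table entries, function names, and file names are
hypothetical; a real table would list the actual Khmer sequences.

```python
import re

# Hypothetical table: provisional ids -> real character sequences.
ID_TO_TEXT = {
    "k001": "\u1780\u17d2\u1780",  # placeholder Khmer cluster
    "k002": "\u1781",              # placeholder Khmer letter
}

# Assumption: every provisional id is "k" plus exactly three digits, and the
# real text never contains such a string for any other reason.
ID_PATTERN = re.compile(r"k\d{3}")


def decode_ids(text, table=ID_TO_TEXT):
    """Replace each known provisional id; unknown ids are left untouched."""
    return ID_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


def decode_file(src_path, dst_path, table=ID_TO_TEXT):
    """Read recognized text from src_path and write the decoded text."""
    with open(src_path, encoding="utf-8") as src:
        decoded = decode_ids(src.read(), table)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(decoded)
```

A shell script could run Tesseract first and then call something like
decode_file("page.txt", "page.decoded.txt") on its text output. Fixed-width
codes matter here because the recognized ids arrive with no separator
between them.
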
>> For further details you should definitely study the "TrainingTesseract3"
>> and "ReadMe" (section "Installation Notes - Tesseract 3.00") documents
>> thoroughly (
>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
>> not easy to search, but they contain all the info you might need.
>>
>> Warm regards,
>> Dmitry Silaev
>>
>>
>>
>>
>>
>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:
>>
>>>
>>> Dear Dmitry,
>>>
>>> Thank you very much for a comprehensive explanation.
>>> To get straight to the point: is it OK to assign a code like 'k001'
>>> or 'k002' to a glyph obtained from Tesseract's segmentation?
>>>
>>> For post-processing that touches the Tesseract code, could you please point
>>> out which files I should modify? Please also advise whether the latest
>>> version of Tesseract will do.
>>>
>>> Thank you very much in advance for your time and your response.
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>>
>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>>
>>>> Chenda,
>>>>
>>>> In fact, Tesseract doesn't care whether you train it on a letter from a
>>>> real language, or which language that letter belongs to. Put simply, Tess
>>>> only saves the mapping from the feature sets obtained in training to
>>>> Unicode ids. This implies that during training you can assign virtually any
>>>> character code to virtually any glyph (to be exact, to a connected
>>>> component or to a set of connected components).
>>>>
>>>> If your language's script comprises a reasonable number of joined
>>>> character combinations, then while training you can assign every such
>>>> combination a predefined Unicode id (some restrictions apply). Later, when
>>>> running recognition, you should do some post-processing to decode your
>>>> predefined ids into the real language's character sequences.
>>>>
>>>> For good results, all this requires you to develop a training file
>>>> pre-processor (mapping: language char combinations -> provisional ids) and
>>>> a recognition result post-processor (mapping: provisional ids -> language
>>>> char sequences). I'm not sure, but this may also require correcting the
>>>> character property bit masks in the unicharset file (I don't know exactly
>>>> how this information is used by Tess, as I don't need it in my project).
>>>>
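One possible shape for the pre-processor side, sketched under assumptions:
box-file lines start with the label followed by a space, and the table
entries and function names below are invented for illustration. Building
the reverse table from the same dictionary keeps the two mappings in step.

```python
# Hypothetical table: language char combinations -> provisional ids.
TEXT_TO_ID = {
    "\u1780\u17d2\u1780": "k001",  # placeholder Khmer cluster
    "\u1781": "k002",              # placeholder Khmer letter
}


def encode_box_line(line, table=TEXT_TO_ID):
    """Swap the label of one box-file line for its provisional id, if any.

    Assumes the label is the first space-separated field of the line.
    """
    label, sep, rest = line.partition(" ")
    return table.get(label, label) + sep + rest


def invert(table):
    """Build the ids -> char sequences table, refusing ambiguous ids."""
    inverse = {}
    for text, code in table.items():
        if code in inverse:
            raise ValueError(f"id {code!r} is assigned twice")
        inverse[code] = text
    return inverse
```

Labels that have no entry in the table pass through unchanged, so ordinary
single characters can stay as they are in the training data.
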
>>>> Warm regards,
>>>> Dmitry Silaev
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <[email protected]> wrote:
>>>>
>>>>> Dear Tesseract Team,
>>>>>
>>>>> In the step of training a new language, we have to assign a Unicode
>>>>> value to each box.
>>>>> I would like to know what to do if a shape is composed of several Unicode
>>>>> characters.
>>>>> Is there any way to assign only an id to each box in Tesseract?
>>>>>
>>>>> Thank you very much in advance for your response.
>>>>>
>>>>> Best Regards,
>>>>> Chenda
>>>>>
>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected].
>>>>> To unsubscribe from this group, send email to
>>>>> [email protected].
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
