Re: Tesseract Training

Sriranga(78yrsold) Sun, 16 Jan 2011 22:41:28 -0800

As per the wiki instructions, the box file has to be generated from the
command line as follows:
tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
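
If this step has to be repeated for many training images, the call above can be wrapped in a few lines of Python. This is only a sketch: the helper names makebox_command and run_makebox and the sample file name are made up for illustration, and it assumes the tesseract executable is on the PATH.

```python
import subprocess
from pathlib import Path


def makebox_command(tif_path):
    """Build the argument list for the box-file generation call."""
    tif = Path(tif_path)
    # The output base is the image name without its .tif extension;
    # Tesseract appends the extension of the file it writes.
    output_base = tif.with_suffix("")
    return ["tesseract", str(tif), str(output_base), "batch.nochop", "makebox"]


def run_makebox(tif_path):
    """Run the call; assumes a tesseract executable is on the PATH."""
    subprocess.run(makebox_command(tif_path), check=True)


if __name__ == "__main__":
    # Hypothetical image name following the <lang.fontname.number.tif> pattern.
    print(" ".join(makebox_command("khm.myfont.0.tif")))
```

Building the command separately from running it makes it easy to print and check the exact call before any files are written.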



On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda <[email protected]> wrote:

> I drew the boxes in the image manually.
>
> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
> [email protected]> wrote:
>
>> Which tool did you use to create the boxes? Please also upload the box
>> file you generated.
>>
>>
>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <[email protected]> wrote:
>>
>>> Dear Dmitry,
>>>
>>> Thank you again for a very quick response.
>>>
>>> I am going to train Tesseract for the Khmer language, in which many
>>> ligatures behave like the "fi" ligature in some Latin fonts.
>>> The attachment shows an example of a one-line Khmer sentence; please
>>> count the boxes from left to right. You can see that some glyphs sit
>>> above others. The first glyph is formed of two Unicode characters, while
>>> the third and fifth glyphs together form a single Unicode character. This
>>> is why I wish to give each glyph its own ID and then do post-processing
>>> afterward.
>>>
>>> Regarding two glyphs that overlap each other, as with the 7th and 8th
>>> glyphs: how will Tesseract segment them? How should I give the positions
>>> of the boxes?
>>>
>>>
>>> Thank you very much in advance for your response.
>>>
>>>
>>> Best Regards,
>>>
>>> Sochenda
>>>
>>>
>>>
>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:
>>>
>>>> Dear Sochenda,
>>>>
>>>> I'm not sure what's the ultimate goal of your code assignment but a
>>>> formal answer to your question is "Yes". You can assign "k001" or "k002" to
>>>> a bounding box in a .box file. Moreover, you can assign any UTF-8 encoded
>>>> character sequence. In Tess version 3.0x (current) the only restriction is
>>>> a 24 byte limit for the entire char sequence length. This also allows you to
>>>> use not only an abstract code like "k001" but a meaningful character
>>>> sequence from your real language (e.g. a well-known "fi" ligature in some
>>>> Latin fonts) which then relieves you from using the pre- and
>>>> post-processing.
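
The 24 byte limit mentioned above can be checked before training with a small script. The sketch below is not part of Tesseract; the function names are invented here, and it assumes that the limit means "at most 24 bytes of UTF-8" and that each box-file line starts with the label followed by whitespace-separated coordinates.

```python
MAX_LABEL_BYTES = 24  # limit quoted above for Tess 3.0x; inclusive bound is an assumption


def label_fits(label, limit=MAX_LABEL_BYTES):
    """Return True if the UTF-8 encoding of a box label is within the limit."""
    return len(label.encode("utf-8")) <= limit


def oversized_labels(box_lines, limit=MAX_LABEL_BYTES):
    """Yield (line_number, label) for box-file lines whose label is too long.

    Assumes lines of the form "<label> <left> <bottom> <right> <top> ...".
    """
    for number, line in enumerate(box_lines, start=1):
        fields = line.split()
        if fields and not label_fits(fields[0], limit):
            yield number, fields[0]
```

Note that the limit counts bytes, not characters: a Khmer letter such as U+1780 takes three bytes in UTF-8, so a label of real Khmer text can hold far fewer characters than an ASCII code like "k001".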
>>>>
>>>> If you still prefer using abstract codes then pre-/post-processing can
>>>> be done without tinkering with Tess's code. Since training as well as
>>>> recognition result in generation of output files, you can develop a couple
>>>> of file processing command-line utilities which then can be used along with
>>>> calls to the Tesseract executable within shell scripts (or .bat files in
>>>> Windows).
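
A post-processing utility of the kind described above could look roughly like the following. It is a sketch under assumptions: the codes follow the "k001" pattern (the letter k plus three digits), the table entries are illustrative Khmer sequences that do not come from this thread, and the names CODE_TO_TEXT and decode are made up.

```python
import re

# Hypothetical table: abstract training codes -> real character sequences.
CODE_TO_TEXT = {
    "k001": "\u1780\u17d2\u1781",  # illustrative Khmer cluster only
    "k002": "\u1780\u17b6",        # illustrative Khmer sequence only
}

# Assumes every code is the letter "k" followed by exactly three digits.
CODE_PATTERN = re.compile(r"k\d{3}")


def decode(text, table=CODE_TO_TEXT):
    """Replace every known code in recognized text; leave unknown codes as-is."""
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)
```

Such a function would be applied to the text file that recognition produces. If the real text can itself contain something that looks like a code, a less ambiguous code format would be needed.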
>>>>
>>>> For further details you definitely should study thoroughly the
>>>> "TrainingTesseract3" and "ReadMe" (section "Installation Notes - Tesseract
>>>> 3.00") documents (
>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
>>>> not easy to search, but they contain all the info you might need.
>>>>
>>>> Warm regards,
>>>> Dmitry Silaev
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <[email protected]> wrote:
>>>>
>>>>>
>>>>> Dear Dmitry,
>>>>>
>>>>> Thank you very much for a comprehensive explanation.
>>>>> To get straight to the point: is it OK to assign a code like
>>>>> 'k001' or 'k002' to each glyph obtained from Tesseract's segmentation?
>>>>>
>>>>> For post-processing that involves touching Tesseract's code, could you
>>>>> please point out which files I should modify? Please also advise
>>>>> whether the latest version of Tesseract will do.
>>>>>
>>>>> Thank you very much in advance for your time and response back.
>>>>>
>>>>> Best Regards,
>>>>>
>>>>> Sochenda
>>>>>
>>>>>
>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>>>>
>>>>>> Chenda,
>>>>>>
>>>>>> In fact Tesseract doesn't care whether you train on a real language's
>>>>>> letter, or which language that letter belongs to. Put simply, Tess only
>>>>>> saves the mapping from feature sets obtained during training to Unicode
>>>>>> ids. This implies that during training you can assign virtually any
>>>>>> character code to virtually any glyph (to be exact, to a connected
>>>>>> component or to a set of connected components).
>>>>>>
>>>>>> If your language's script consists of a reasonable number of joined
>>>>>> character combinations, then while training you can assign every such
>>>>>> combination a predefined Unicode id (some restrictions apply). Later,
>>>>>> when running recognition, you should do some post-processing to decode
>>>>>> your predefined ids into the real language's character sequences.
>>>>>>
>>>>>> For good results all this requires you to develop a training file
>>>>>> pre-processor (mapping: language char combinations -> provisional ids)
>>>>>> and a recognition result post-processor (mapping: provisional ids ->
>>>>>> language char sequences). I'm not sure, but this may also require
>>>>>> correcting character property bit masks in the unicharset file (I don't
>>>>>> know exactly how this information is used by Tess as I don't need it in
>>>>>> my project).
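
The two mappings described above are inverses of each other, so one table is enough. The sketch below shows the pre-processor side on box-file lines; the function names, the sample table, and the assumption that the label is everything before the first space on a line are illustrative, not taken from Tesseract.

```python
def encode_box_lines(box_lines, text_to_code):
    """Swap the label of each box line for its provisional id, if it has one."""
    encoded = []
    for line in box_lines:
        # Assumes the label is everything before the first space on the line.
        label, sep, rest = line.partition(" ")
        encoded.append(text_to_code.get(label, label) + sep + rest)
    return encoded


def invert(text_to_code):
    """Derive the post-processor table (provisional id -> char sequence)."""
    code_to_text = {code: text for text, code in text_to_code.items()}
    if len(code_to_text) != len(text_to_code):
        raise ValueError("two character sequences share one provisional id")
    return code_to_text
```

Deriving the decoding table from the encoding table, rather than maintaining both by hand, keeps the pre-processor and post-processor from drifting apart.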
>>>>>>
>>>>>> Warm regards,
>>>>>> Dmitry Silaev
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Dear Tesseract Team,
>>>>>>>
>>>>>>> In the step for training a new language, we have to assign a Unicode
>>>>>>> value to each box.
>>>>>>> I would like to know what to do when a shape is composed of several
>>>>>>> Unicode characters.
>>>>>>> Is there any way to assign just an id to each box in Tesseract?
>>>>>>>
>>>>>>> Thank you very much in advance for your response.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Chenda
>>>>>>>
>>>>>>>  --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To post to this group, send email to [email protected].
>>>>>>> To unsubscribe from this group, send email to
>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>> .
>>>>>>> For more options, visit this group at
>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>

