Re: Tesseract Training

KHEM Sochenda Sun, 16 Jan 2011 22:43:39 -0800

I know how to do it in Tesseract; the image is only meant to show you how the
glyphs should be boxed.

I can send you the box file generated by Tesseract anyway.

Regards,

Sochenda

On Mon, Jan 17, 2011 at 1:41 PM, Sriranga(78yrsold) <[email protected]> wrote:

> As per the wiki instructions, the box file has to be generated from the
> command line as follows:
> tesseract <lang.fontname.number.tif> <lang.fontname.number> batch.nochop makebox
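
The wiki command quoted above could be assembled from a script roughly as follows. This is only a sketch: the helper name `makebox_command` and the example base name `khm.myfont.exp0` are made up for illustration, and only the arguments `batch.nochop makebox` come from the message itself.

```python
def makebox_command(base_name, image_ext="tif"):
    """Build the argument list for the box-generation call quoted above.

    base_name follows the lang.fontname.number pattern from the message.
    Hypothetical helper: it only builds the command, it does not run it.
    """
    image = f"{base_name}.{image_ext}"
    return ["tesseract", image, base_name, "batch.nochop", "makebox"]


if __name__ == "__main__":
    # The base name below is an assumption, not a file from this thread.
    print(" ".join(makebox_command("khm.myfont.exp0")))
```

With Tesseract installed, the returned list can be handed to `subprocess.run` unchanged.
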
>
>
>
> On Mon, Jan 17, 2011 at 11:55 AM, KHEM Sochenda <[email protected]> wrote:
>
>> In the image, I drew the boxes manually.
>>
>> On Mon, Jan 17, 2011 at 12:16 PM, Sriranga(78yrsold) <
>> [email protected]> wrote:
>>
>>> Which tool did you use to create the boxes? Please also upload the box file
>>> you generated.
>>>
>>>
>>> On Mon, Jan 17, 2011 at 9:31 AM, KHEM Sochenda <[email protected]> wrote:
>>>
>>>> Dear Dmitry,
>>>>
>>>> Thank you again for a very quick response.
>>>>
>>>> I am going to train Tesseract for the Khmer language, in which many
>>>> ligatures behave like the "fi" ligature in some Latin fonts.
>>>> The attachment shows an example of a one-line Khmer sentence;
>>>> please count the boxes from left to right. You can see that some glyphs sit
>>>> above others. The first glyph is formed of two Unicode characters, while
>>>> the third and fifth glyphs together form a single Unicode character. This
>>>> is why I wish to give each glyph its own ID and then do the
>>>> post-processing afterward.
>>>>
>>>> Regarding two glyphs that overlap each other, as with the 7th and 8th
>>>> glyphs, how will Tesseract segment them?
>>>> How should I give the positions of their boxes?
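
To make the question about box positions concrete: as far as I understand the Tesseract 3.0x box format, each line reads `<label> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The sketch below writes two such lines and checks whether the rectangles overlap. The `Box` type, the `overlaps` helper and all coordinates are invented for illustration, and the thread does not say how Tesseract itself treats overlapping boxes.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One box-file entry (field order assumed from the 3.0x format)."""
    label: str
    left: int
    bottom: int
    right: int
    top: int
    page: int = 0

    def to_line(self) -> str:
        # Fields are separated by single spaces, label first.
        return f"{self.label} {self.left} {self.bottom} {self.right} {self.top} {self.page}"


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area on the same page."""
    return (a.page == b.page
            and a.left < b.right and b.left < a.right
            and a.bottom < b.top and b.bottom < a.top)


if __name__ == "__main__":
    # Made-up coordinates standing in for the 7th and 8th glyphs.
    g7 = Box("k007", 310, 40, 352, 95)
    g8 = Box("k008", 340, 20, 380, 60)
    print(g7.to_line())
    print(g8.to_line())
    print("overlap:", overlaps(g7, g8))
```
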
>>>>
>>>>
>>>> Thank you very much in advance for your response.
>>>>
>>>>
>>>> Best Regards,
>>>>
>>>> Sochenda
>>>>
>>>>
>>>>
>>>> On Sun, Jan 16, 2011 at 3:48 PM, Dmitry Silaev <[email protected]> wrote:
>>>>
>>>>> Dear Sochenda,
>>>>>
>>>>> I'm not sure what the ultimate goal of your code assignment is, but the
>>>>> formal answer to your question is "Yes". You can assign "k001" or "k002"
>>>>> to a bounding box in a .box file. Moreover, you can assign any UTF-8
>>>>> encoded character sequence. In Tesseract 3.0x (current), the only
>>>>> restriction is a 24-byte limit on the length of the entire character
>>>>> sequence. This also allows you to use not only an abstract code like
>>>>> "k001" but also a meaningful character sequence from your real language
>>>>> (e.g. the well-known "fi" ligature in some Latin fonts), which then
>>>>> relieves you from pre- and post-processing.
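
The 24-byte restriction applies to the UTF-8 encoding, not to the number of characters, which matters for Khmer because each Khmer code point takes three bytes. A minimal check might look like the sketch below. The function name `label_fits` is hypothetical, and treating the limit as "at most 24 bytes" is an assumption about how the limit is applied.

```python
MAX_LABEL_BYTES = 24  # limit quoted above for Tesseract 3.0x


def label_fits(label: str, limit: int = MAX_LABEL_BYTES) -> bool:
    """Return True if the label's UTF-8 encoding is at most `limit` bytes.

    Assumption: the limit is inclusive (exactly 24 bytes is accepted).
    """
    return len(label.encode("utf-8")) <= limit


if __name__ == "__main__":
    for candidate in ["k001", "fi", "\u1780" * 8, "\u1780" * 9]:
        size = len(candidate.encode("utf-8"))
        print(repr(candidate), size, "bytes, fits:", label_fits(candidate))
```

Under that assumption a label can hold at most eight Khmer code points, while an ASCII code such as "k001" uses only four bytes.
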
>>>>>
>>>>> If you still prefer abstract codes, the pre-/post-processing can be done
>>>>> without tinkering with Tesseract's code. Since both training and
>>>>> recognition produce output files, you can develop a couple of
>>>>> file-processing command-line utilities and call them alongside the
>>>>> Tesseract executable from shell scripts (or .bat files on Windows).
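
A post-processing utility of the kind described here could be quite small. The sketch below replaces abstract codes in recognised text with real character sequences. Everything in it is an assumption made for illustration: the `kNNN` code pattern, the `CODE_TO_TEXT` table and its entries, and the function name `decode_codes`.

```python
import re

# Hypothetical table: abstract code -> real character sequence.
CODE_TO_TEXT = {
    "k001": "\u1780\u17d2\u1781",  # placeholder Khmer sequence, illustration only
    "k002": "fi",
}

# Assumes every code is the letter "k" followed by exactly three digits.
CODE_PATTERN = re.compile(r"k\d{3}")


def decode_codes(text, table=CODE_TO_TEXT):
    """Replace each known code in recognised text; unknown codes stay as-is."""
    return CODE_PATTERN.sub(lambda m: table.get(m.group(0), m.group(0)), text)


if __name__ == "__main__":
    print(decode_codes("of k002ce k999"))
```

Wrapped in a script that reads the text file Tesseract writes and saves the decoded result, this can be chained after the recognition call in a shell script or .bat file. Note that a fixed pattern like this would also rewrite genuine text that happens to look like a code, so the codes should be chosen to avoid such collisions.
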
>>>>>
>>>>> For further details you should definitely study the "TrainingTesseract3"
>>>>> and "ReadMe" (section "Installation Notes - Tesseract 3.00") documents
>>>>> thoroughly (
>>>>> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
>>>>> http://code.google.com/p/tesseract-ocr/wiki/ReadMe). These documents are
>>>>> not easy to search, but they contain all the information you might need.
>>>>>
>>>>> Warm regards,
>>>>> Dmitry Silaev
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Jan 16, 2011 at 10:42 AM, KHEM Sochenda <
>>>>> [email protected]> wrote:
>>>>>
>>>>>>
>>>>>> Dear Dmitry,
>>>>>>
>>>>>> Thank you very much for a comprehensive explanation.
>>>>>> To go straight to the point: is it OK to assign a code like
>>>>>> 'k001' or 'k002' to each glyph obtained from Tesseract's segmentation?
>>>>>>
>>>>>> As for post-processing by modifying the Tesseract code, could you please
>>>>>> point out which files I should work on? Please also advise whether the
>>>>>> latest version of Tesseract will do.
>>>>>>
>>>>>> Thank you very much in advance for your time and response back.
>>>>>>
>>>>>> Best Regards,
>>>>>>
>>>>>> Sochenda
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 15, 2011 at 3:05 AM, Dmitry Silaev <[email protected]> wrote:
>>>>>>
>>>>>>> Chenda,
>>>>>>>
>>>>>>> In fact, Tesseract doesn't care whether you train it on a letter from a
>>>>>>> real language, or which language that letter belongs to. Put simply,
>>>>>>> Tesseract only saves the mapping from feature sets obtained in training
>>>>>>> to Unicode ids. This implies that during training you can assign
>>>>>>> virtually any character code to virtually any glyph (to be exact, to a
>>>>>>> connected component or to a set of connected components).
>>>>>>>
>>>>>>> If your script consists of a reasonable number of joined character
>>>>>>> combinations, then during training you can assign every such
>>>>>>> combination a predefined Unicode id (some restrictions apply). Later,
>>>>>>> when running recognition, you should do some post-processing to decode
>>>>>>> your predefined ids into the real language's character sequences.
>>>>>>>
>>>>>>> For good results, all this requires you to develop a training file
>>>>>>> pre-processor (mapping: language char combinations -> provisional ids)
>>>>>>> and a recognition result post-processor (mapping: provisional ids ->
>>>>>>> language char sequences). I'm not sure, but this may also require
>>>>>>> correcting the character property bit masks in the unicharset file (I
>>>>>>> don't know exactly how Tesseract uses this information, as I don't need
>>>>>>> it in my project).
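
The training-side pre-processor mentioned above might be sketched as follows. It rewrites the label at the start of each box-file line, leaving the coordinates untouched. The function name `encode_box_labels`, the mapping and the sample lines are all hypothetical, and the sketch assumes that a label never contains a space.

```python
def encode_box_labels(box_lines, text_to_code):
    """Swap each line's label for its provisional id, where one is defined.

    Assumes the label is everything before the first space on the line.
    """
    encoded = []
    for line in box_lines:
        label, sep, rest = line.partition(" ")
        encoded.append(text_to_code.get(label, label) + sep + rest)
    return encoded


if __name__ == "__main__":
    # Hypothetical mapping: language char combination -> provisional id.
    text_to_code = {"fi": "k002"}
    lines = ["fi 10 20 30 40 0", "a 45 20 60 40 0"]
    for out in encode_box_labels(lines, text_to_code):
        print(out)
    # Inverting the same table gives the decoding direction for recognition.
    code_to_text = {code: text for text, code in text_to_code.items()}
    print(code_to_text)
```

Keeping one table and inverting it for the recognition side avoids the two mappings drifting apart.
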
>>>>>>>
>>>>>>> Warm regards,
>>>>>>> Dmitry Silaev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Dear Tesseract Team,
>>>>>>>>
>>>>>>>> In the step of training a new language, we have to assign a Unicode
>>>>>>>> value to each box.
>>>>>>>> I would like to know what to do if a shape is composed of several
>>>>>>>> Unicode characters.
>>>>>>>> Is there any way to assign just an id to each box in Tesseract?
>>>>>>>>
>>>>>>>> Thank you very much in advance for your response.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Chenda
>>>>>>>>
>>>>>>>>  --
>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>> To post to this group, send email to [email protected]
>>>>>>>> .
>>>>>>>> To unsubscribe from this group, send email to
>>>>>>>> [email protected]<tesseract-ocr%[email protected]>
>>>>>>>> .
>>>>>>>> For more options, visit this group at
>>>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en.
>>>>>>>>
