Chenda,

In fact, Tesseract doesn't care whether you train it on a real language's
letters, or which language a given letter belongs to. Put simply, Tess only
stores a mapping from the feature sets obtained during training to Unicode
ids. This means that during training you can assign virtually any character
code to virtually any glyph (to be exact, to a connected component or to a
set of connected components).

If your language's script consists of a reasonable number of joined
character combinations, then during training you can assign each such
combination a predefined Unicode id (some restrictions apply). Later, when
running recognition, you need a post-processing step to decode your
predefined ids back into the real language's character sequences.
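
As a rough illustration of the idea, here is a minimal Python sketch. It
assumes that code points from the Unicode Private Use Area (starting at
U+E000) are acceptable as provisional ids; that choice, and the names
build_tables and decode_text, are my own placeholders and not anything
Tesseract prescribes.

```python
# Hypothetical sketch: the U+E000 start and all names here are assumptions.
PUA_START = 0xE000  # first code point of the Unicode Private Use Area


def build_tables(combinations, start=PUA_START):
    """Assign one provisional code point to each distinct combination."""
    encode = {}
    for combo in combinations:
        if combo not in encode:
            # The next free id is the start plus the number already assigned.
            encode[combo] = chr(start + len(encode))
    decode = {pid: combo for combo, pid in encode.items()}
    return encode, decode


def decode_text(text, decode):
    """Replace provisional ids in recognized text with the real sequences."""
    return "".join(decode.get(ch, ch) for ch in text)
```

The decode table is just the inverse of the encode table, so characters
that were never remapped pass through unchanged.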

For good results, this requires you to develop a training file
pre-processor (mapping: language char combinations -> provisional ids) and a
recognition result post-processor (mapping: provisional ids -> language char
sequences). I'm not sure, but it may also require correcting the character
property bit masks in the unicharset file (I don't know exactly how Tess
uses this information, as I don't need it in my project).
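
The pre-processor half could be sketched as below. This assumes box file
lines in which the label is the first space-separated field, followed by the
box coordinates; the function name encode_box_lines and the sample mapping
are hypothetical, so adapt them to your own files.

```python
# Hypothetical sketch: assumes each box line is "<label> <coordinates...>".
def encode_box_lines(lines, encode):
    """Swap each box label for its provisional id, leaving the rest as is."""
    out = []
    for line in lines:
        # Split off the label; the coordinates stay untouched in `rest`.
        label, sep, rest = line.partition(" ")
        out.append(encode.get(label, label) + sep + rest)
    return out


# Example mapping from one multi-character combination to one provisional id.
sample_encode = {"\u1780\u17d2\u1781": "\ue000"}
sample_lines = ["\u1780\u17d2\u1781 10 20 30 40 0", "a 1 2 3 4 0"]
encoded_lines = encode_box_lines(sample_lines, sample_encode)
```

Labels that are not in the mapping are left alone, so ordinary
single-character boxes need no special handling.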

Warm regards,
Dmitry Silaev




On Fri, Jan 14, 2011 at 10:25 AM, KHEM Sochenda <khemsoche...@gmail.com> wrote:

> Dear Tesseract Team,
>
> In the step of training a new language, we have to assign a Unicode value
> to each box.
> What if a shape is composed of several Unicode characters?
> Is there any way to assign only an id to each box in Tesseract?
>
> Thank you very much in advance for your response.
>
> Best Regards,
> Chenda
>
>  --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> To unsubscribe from this group, send email to
> tesseract-ocr+unsubscr...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
