Yes, this is possible, at least in theory. In box files you can map
arbitrary glyphs to character sequences, so a whole word image can be
treated as a single "glyph". However, chances are high that you'll
run into accuracy problems. Off the top of my head I can name two.
First, although Tesseract is somewhat tolerant of glyph variation,
the variation can be quite large in handwritten text. Second,
Tesseract rescales every glyph internally (this is called
normalization), so word glyphs that look obviously different to a
human eye can end up being recognized as the same word. For the same
reason Tess may confuse word glyphs if their lengths vary a lot and
some of the words are very long. What counts as "vary a lot" and
"very long" has to be determined experimentally, though.
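
To make the box file idea concrete, here is a rough sketch of turning
your "transcription plus word coordinates" file into box lines. I am
assuming the Tesseract 3.x box format of "<text> <left> <bottom>
<right> <top> <page>" with the origin at the bottom-left corner of the
image, and that your own coordinates are (x, y, width, height) measured
from the top-left. The function name and the input layout are made up
for illustration, and I have not verified how well training copes with
whole-word boxes.

```python
def to_box_lines(words, image_height, page=0):
    """Convert word annotations to Tesseract-style box file lines.

    words: iterable of (text, x, y, width, height), with (x, y) the
           top-left corner in image coordinates (y grows downwards).
           This input layout is an assumption for the example.
    image_height: height of the page image in pixels.
    page: zero-based page index written as the last field.
    """
    lines = []
    for text, x, y, width, height in words:
        if not text or any(ch.isspace() for ch in text):
            # Fields in a box line are space separated, so the text
            # itself must not contain whitespace.
            raise ValueError("word text must be non-empty without spaces")
        left = x
        right = x + width
        # Flip the vertical axis: box files count from the bottom edge.
        top = image_height - y
        bottom = image_height - (y + height)
        lines.append(f"{text} {left} {bottom} {right} {top} {page}")
    return lines
```

You would write the returned lines, one per row, into the .box file
that sits next to the training image.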

BTW I suppose you mean that your historic documents use a connected
script, as not all cursive is necessarily connected, see
http://en.wikipedia.org/wiki/Cursive. With letters that are only
sloppy but not connected, the problem is much easier, and imho it
makes sense to spend some time devising a good segmentation algorithm
and pre- and post-processing logic, so that you can use Tess in the
more traditional, character-by-character way.
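
As a starting point for such segmentation, a vertical projection
profile is a common baseline: count the ink pixels in every column and
cut wherever the count drops to zero. The sketch below is only that
baseline, not something Tesseract does for you, and it assumes an
already binarized line image given as rows of 0/1 values (1 = ink).
The function name and the min_gap parameter are mine.

```python
def split_columns(image, min_gap=1):
    """Split a binarized text line into column spans that contain ink.

    image: list of rows, each a list of 0/1 values (1 = ink).
    min_gap: number of consecutive empty columns needed for a cut.
    Returns a list of (start, end) column indices, end exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection profile: ink pixels per column.
    profile = [sum(row[x] for row in image) for x in range(width)]

    spans = []
    start = None  # first column of the span being built
    gap = 0       # empty columns seen since the last ink column
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The span ends right after the last ink column.
                spans.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a span that runs to the right edge, minus trailing gap.
        spans.append((start, width - gap))
    return spans
```

Raising min_gap keeps letters with small internal gaps together, at
the cost of merging neighbours that stand close to each other, so it
needs tuning on your own scans.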

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com





On Wed, Jul 6, 2011 at 7:42 PM, Raj Julha <[email protected]> wrote:
> Hi
>
> I'm planning to train Tesseract on handwritten text, from mainly
> historical documents. Because of the cursive nature of the handwritten
> text it is difficult to isolate single characters so I was planning to
> create images of words and then use a list of words as training
> source. Alternatively I could create a text file with the handwritten
> transcription and the coordinates of each word on the image. Can I use
> that as input for tesseract training? I'm mainly interested in using
> the command line version.
>
> Cheers
>
> Raj
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
