Yes, this is possible, at least in theory. In box files you can map arbitrary glyphs to character sequences, so a box can cover a whole word rather than a single character. However, chances are high that you'll run into accuracy problems. Off the top of my head I can name two. First, although Tesseract is somewhat tolerant of glyph variations, these can be quite large in handwritten text. Second, Tesseract internally rescales every glyph (this is called normalization), so many word glyphs that look obviously different to the human eye can be recognized as the same word. For the same reason Tess may confuse word glyphs if their lengths vary a lot and some of the words are very long. What counts as "vary a lot" and "very long" has to be determined experimentally, though.
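To make the box-file idea concrete, here is a minimal Python sketch that turns word rectangles plus their transcriptions into box-file lines. It assumes the Tesseract 3.x box layout of "<text> <left> <bottom> <right> <top> <page>" with the origin at the bottom-left corner of the image, so please check that against the training docs for your version. The function name word_boxes_to_box_lines and the sample rectangles are made up for illustration, and whether training on such word-level boxes gives usable accuracy is exactly the open question above.

```python
# Sketch only: word-level box lines, assuming the Tesseract 3.x box layout
# "<text> <left> <bottom> <right> <top> <page>" with the origin at the
# bottom-left corner of the image (verify for your Tesseract version).

def word_boxes_to_box_lines(words, image_height, page=0):
    """Convert (text, x, y, width, height) rectangles, measured from the
    top-left corner of the image, into box-file lines."""
    lines = []
    for text, x, y, w, h in words:
        # A box line is space-separated, so the transcription itself
        # must not be empty or contain whitespace.
        if not text or any(ch.isspace() for ch in text):
            raise ValueError(f"invalid transcription: {text!r}")
        left = x
        right = x + w
        # Flip the vertical axis: top-left origin -> bottom-left origin.
        top = image_height - y
        bottom = image_height - (y + h)
        lines.append(f"{text} {left} {bottom} {right} {top} {page}")
    return lines


# Hypothetical word rectangles on a 100-pixel-high image.
words = [("the", 10, 20, 60, 30), ("manor", 80, 18, 120, 34)]
box_lines = word_boxes_to_box_lines(words, image_height=100)
print("\n".join(box_lines))
```

The output would be written to a .box file next to the page image. The axis flip matters if your word coordinates come from an image editor or annotation tool, since those usually measure y from the top of the image.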
BTW, I suppose you mean that your historic documents use a connected script, since not all cursive is necessarily connected; see http://en.wikipedia.org/wiki/Cursive. With letters that are only sloppy but not connected, the problem is much easier, and IMHO it makes sense to spend some time devising a good segmentation algorithm plus pre- and post-processing logic, so that you can use Tess in a more traditional, character-based way.

HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Wed, Jul 6, 2011 at 7:42 PM, Raj Julha <[email protected]> wrote:
> Hi
>
> I'm planning to train Tesseract on handwritten text, from mainly
> historical documents. Because of the cursive nature of the handwritten
> text it is difficult to isolate single characters so I was planning to
> create images of words and then use a list of words as training
> source. Alternatively I could create a text file with the handwritten
> transcription and the coordinates of each word on the image. Can I use
> that as input for tesseract training? I'm mainly interested in using
> the command line version.
>
> Cheers
>
> Raj
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

