Thanks for your input, Dmitri.

Raj

On Jul 6, 10:31 pm, Dmitri Silaev <[email protected]> wrote:
> Yes, this is possible, at least in theory. In box files you can map
> arbitrary glyphs to character sequences. However, chances are high
> that you'll run into some difficulties with accuracy. Two come to
> mind at the moment. First, although Tesseract is somewhat immune to
> glyph variations, these can be quite large in the case of handwritten
> text. Second, Tesseract uses internal scaling for every glyph (called
> normalization), so that many word glyphs that are obviously different
> to the human eye can be recognized as the same word. For the same
> reason Tess may confuse word glyphs if their lengths vary much and
> there are very long words. What counts as "vary much" and "very long"
> should be determined experimentally, though.
>
> BTW I suppose you mean that your historic documents use a connected
> script, as not all cursive is necessarily connected, see
> http://en.wikipedia.org/wiki/Cursive. With letters that are only
> sloppy but not connected, the problem is much easier, and imho it
> makes sense to spend some time devising a good segmentation algo and
> pre- and post-processing logic to use Tess in a more traditional way.
>
> HTH
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Wed, Jul 6, 2011 at 7:42 PM, Raj Julha <[email protected]> wrote:
> > Hi
> >
> > I'm planning to train Tesseract on handwritten text, mainly from
> > historical documents. Because of the cursive nature of the
> > handwritten text it is difficult to isolate single characters, so I
> > was planning to create images of words and then use a list of words
> > as the training source. Alternatively, I could create a text file
> > with the handwritten transcription and the coordinates of each word
> > on the image. Can I use that as input for Tesseract training? I'm
> > mainly interested in using the command line version.
> >
> > Cheers
> >
> > Raj

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
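
For illustration only, here is a minimal Python sketch of the idea discussed in the quoted reply: turning a list of word transcriptions with coordinates into box-file lines, one word per box. It assumes the usual box-file layout of "symbol left bottom right top page" with the y axis measured from the bottom of the image, and it assumes the annotations are given as (text, left, top, width, height) with a top-left origin. The function names and the annotation layout are hypothetical, not part of Tesseract, and whether whole-word boxes train well is exactly the open question raised above.

```python
def to_box_line(text, left, top, width, height, image_height, page=0):
    """Convert one word annotation (top-left origin) to a box-file line.

    Assumed output layout: "<symbol> <left> <bottom> <right> <top> <page>",
    with y coordinates measured from the bottom edge of the image.
    """
    if not text or any(ch.isspace() for ch in text):
        # Fields in a box line are space separated, so the label
        # itself must not be empty or contain whitespace.
        raise ValueError("label must be a single non-empty token: %r" % (text,))
    right = left + width
    box_bottom = image_height - (top + height)  # flip the y axis
    box_top = image_height - top
    return "%s %d %d %d %d %d" % (text, left, box_bottom, right, box_top, page)


def words_to_box(annotations, image_height, page=0):
    """Build box-file text from (text, left, top, width, height) tuples."""
    lines = [
        to_box_line(text, left, top, width, height, image_height, page)
        for (text, left, top, width, height) in annotations
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical annotations for a 200-pixel-high page image.
    sample = [("hello", 10, 20, 100, 30), ("world", 120, 22, 90, 28)]
    print(words_to_box(sample, image_height=200), end="")
```

The output would be saved next to the page image under the same base name with a .box extension before running the usual training steps. The y-axis flip is the detail most likely to go wrong when the coordinates come from an image editor or annotation tool that counts from the top.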

