Which version of Tesseract do you want to train: Tesseract 2.0 or the Tesseract 3.0 series?

On Fri, Jul 20, 2012 at 2:33 PM, Nick White <[email protected]> wrote:
> Hi Nikola,
>
> I suggest you don't try training it. Training is mostly for adding
> new languages, or at least significantly different fonts. As your
> input is English, and a common font, I doubt it would help much over
> the standard English training file.
>
> The results I got from running Tesseract 3 on your sample were
> pretty good, though. I'll attach them here. Using -psm 6 made a big
> improvement as it meant the table cells were on the correct row. So
> I ran:
>
>   tesseract ocr1.png outtest2 -psm 6
>
> The problems remaining in the output are 7 being consistently
> recognised as ?, and m being regularly misrecognised as r'n or r‘n.
> I have suggestions for this.
>
> If your input data will never contain ?, create an ambig rule which
> always changes a ? to a 7 (and similar for the r'n issues). The best
> way to do this would be:
>
> 1) unpack the English training data:
>
>   combine_tessdata -u eng.traineddata eng.
>
> 2) add the following lines to the end of eng.unicharambigs:
>
>   1 ? 1 7 1
>   3 r ' n 1 m 1
>   3 r ‘ n 1 m 1
>
> 3) recombine the training data:
>
>   combine_tessdata eng.
>
> And the eng.traineddata file will contain the extra ambig rules.
>
> Hope this helps, and let us know how you get on.
>
> Nick
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>

--
Regards
---------------------------------------------------------------------------------------
Ankur Rana (ਅੰਕੁਰ ਰਾਣਾ)
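
For reference, the three steps Nick describes could be scripted roughly as below. This is only a sketch, not a tested recipe: the helper name append_ambigs, the backup copy, and the trailing-newline check are my own additions, and the rule lines are copied from the message with plain spaces between fields. Compare them with the entries already in your eng.unicharambigs and match whatever field separator that file uses.

```shell
#!/bin/sh
# Sketch of the unpack / edit / recombine cycle from the message above.

# append_ambigs FILE  (hypothetical helper, not part of Tesseract)
# Appends the three ambig rules quoted in the message to FILE.
append_ambigs() {
    # If the file has content but no final newline, add one first so the
    # new rules do not get glued onto the last existing entry.
    if [ -s "$1" ] && [ -n "$(tail -c 1 "$1")" ]; then
        printf '\n' >> "$1"
    fi
    printf '%s\n' \
        "1 ? 1 7 1" \
        "3 r ' n 1 m 1" \
        "3 r ‘ n 1 m 1" >> "$1"
}

# Only run the full cycle when the training tools are installed and
# eng.traineddata is present in the current directory.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f eng.traineddata ]; then
    cp eng.traineddata eng.traineddata.orig    # backup (my addition)
    combine_tessdata -u eng.traineddata eng.   # step 1: unpack
    append_ambigs eng.unicharambigs            # step 2: add the rules
    combine_tessdata eng.                      # step 3: recombine
fi
```

Run it from a working copy of the tessdata directory rather than the installed one, so the original eng.traineddata stays untouched until the new rules have been checked against your sample image.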

