Hi,

I have also trained Tesseract for English on my own, and on some images I got better results than with the stock eng.traineddata. Here is what I did:

- I ran eng.traineddata on my images and noted the misrecognized characters (e.g. T -> ' I ' and the like).
- I created an eng.unicharambigs file from the confusions I had noted.
- I found a 240,000-word English dictionary via Google and generated the case variants of each word, such as "and", "And", "AND", which gave a dictionary file of approx. 720,000 words (eng.words_list -> eng.words-dawg).
- I collected nearly 4,000 frequently used English words (eng.freq_word_list -> eng.freq-dawg).
- Then I followed the procedure at http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and that's it.

Hope it helps...
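In case it is useful, the case-variant step for the word list could be sketched roughly like this in Python. This is only an illustration, not the exact script I used: the function names are made up for the example, and it assumes the input is one word per entry. Note that capitalize() lowercases everything after the first letter, so mixed-case names would need special handling.

```python
def case_variants(word):
    """Return the lowercase, capitalized and uppercase forms of a word,
    in that order, without duplicates."""
    variants = []
    for form in (word.lower(), word.capitalize(), word.upper()):
        if form not in variants:
            variants.append(form)
    return variants


def expand_word_list(words):
    """Expand every word into its case variants, skipping blank entries
    and keeping only the first occurrence of each resulting form."""
    seen = set()
    expanded = []
    for word in words:
        word = word.strip()
        if not word:
            continue
        for form in case_variants(word):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


if __name__ == "__main__":
    # Hypothetical sample; a real run would read the dictionary file
    # line by line and write the result to eng.words_list.
    for entry in expand_word_list(["and", "the"]):
        print(entry)
```

With roughly three forms per word, a 240,000-word list grows to about 720,000 entries, which matches the size mentioned above.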
-haydar

On Jun 24, 7:14 am, Sandeep Parmar <[email protected]> wrote:
> Hi all,
>
> I am evaluating Tesseract for my project and I found that it is very good
> compared to other free OCRs. However, I have some doubts regarding
> training Tesseract 3.0 for new font types. I did two things while
> training Tesseract.
>
> 1) I made a text document containing all the alphabets, numbers and ASCII
> characters in different fonts such as Times New Roman, Arial, Verdana,
> Comic Sans etc. I printed them all out and then scanned the printouts to
> make TIF images. I then followed the steps for training Tesseract 3.0 on
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> But the result I got from my trained data was not comparable to the
> 'eng.traineddata' provided by default; it was very poor.
>
> 2) Then I decided to make a traineddata from the TIF & BOX files for
> Tesseract 2.04 provided by Tesseract at
>
> http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-...
>
> I successfully created my 'eng.traineddata' from this and got an
> improved result compared to my first approach. But the output of the
> second approach differed slightly from the output I got from the
> original 'eng.traineddata'.
>
> Also, the size of my trained data was less than that of the
> 'eng.traineddata' provided by Tesseract3.0.exe (Windows installer).
>
> Please suggest what could be the reason for such differences.
>
> Regards
> Sandeep

