Re: Regarding Tesseract 3.0 training

Haydar Fri, 24 Jun 2011 08:08:53 -0700

Hi,
I have also trained Tesseract for English on my own, and on some images
I got better results than with the stock eng.traineddata. Here is what I
did:
- I ran eng.traineddata on my images and noted the characters that were
recognized incorrectly (e.g. T -> ' I ' and similar).
- I created an eng.unicharambigs file from the errors I noted down.
- I found a 240,000-word English dictionary via Google and generated
the case variants of each word, such as "and", "And", "AND", which
resulted in a dictionary file of approximately 720,000 words.
(eng.words_list -> eng.words-dawg)
- I collected nearly 4,000 frequently used English words.
(eng.freq_word_list -> eng.freq-dawg)
- Then I followed the procedure at
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
that's it.
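
The case-variant step in the list above could be sketched in Python
roughly as follows. This is only an illustration, not the script that
was actually used: the function names and the input file name
english_words.txt are made up for the example, and it assumes a
plain-text word list with one word per line.

```python
def case_variants(word):
    """Return the lowercase, Capitalized and UPPERCASE forms of a word.

    Duplicates are dropped, so a one-letter word yields two forms, not three.
    Note that str.capitalize() lowercases everything after the first letter,
    so mixed-case words such as "McDonald" are not preserved as written.
    """
    variants = []
    for form in (word.lower(), word.capitalize(), word.upper()):
        if form not in variants:
            variants.append(form)
    return variants


def expand_word_list(words):
    """Expand every word into its case variants, keeping first-seen order."""
    seen = set()
    expanded = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip blank lines
        for form in case_variants(word):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


def write_expanded(src_path, dst_path):
    """Read one word per line from src_path and write the expanded list.

    Both paths are hypothetical; returns the number of words written.
    """
    with open(src_path, encoding="utf-8") as src:
        expanded = expand_word_list(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(expanded) + "\n")
    return len(expanded)
```

Calling write_expanded("english_words.txt", "eng.words_list") would
produce the expanded list, which would then still have to be converted
to eng.words-dawg as described on the training wiki.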
Hope this helps.


-haydar

On Jun 24, 7:14 am, Sandeep Parmar wrote:
> Hi all,
>
> I am evaluating Tesseract for my project and I found that it's very good
> compared to other free OCR engines. However, I have some
> doubts regarding training Tesseract 3.0 for new font types. I did two things
> while training Tesseract:
>
> 1) I made a text document containing all the letters, numbers and ASCII
> characters in different fonts like Times New Roman,
>     Arial, Verdana, Comic Sans etc. I printed them all out and then scanned
> them to make TIF images. Then I followed the steps for
>     training Tesseract 3.0 at
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
>     But the results I got from my trained data were very poor, not
> comparable to those from the 'eng.traineddata' provided by default.
>
> 2) Then I decided to build a traineddata file from the TIF and BOX files for
> Tesseract 2.04 provided by the Tesseract project at
>
> http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-...
>      I successfully created my own 'eng.traineddata' from these and got
> better results than with my first approach. But the output of
>      the second approach still differed slightly from the output I got from
> the original 'eng.traineddata'.
>
>      Also, my trained data was smaller than the
> 'eng.traineddata' provided by Tesseract3.0.exe (the Windows installer).
>
> Please suggest what could be the reason for such differences.
>
> Regards
> Sandeep

