Hi Sandeep,

I trained with the same TIF and box files you posted in the
second link.
And yes, on some images the accuracy is better. You can search Google
for a dictionary. Then, after you train, test your results and
create your unicharambigs file from the errors you see.
Then train again, including that unicharambigs file.
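
In case it helps, here is a rough Python sketch of how the two text
inputs could be prepared: the case-expanded word list ("and", "And",
"AND") and the lines of a unicharambigs file. The function names are
made up for illustration, and the tab-separated "v1" line layout
(source count, source characters, target count, target characters,
type flag) is how I understand the format from the training wiki, so
please check it against the wiki before relying on it.

```python
def case_variants(word):
    """Return the lowercase, Capitalized and UPPERCASE forms of a word,
    in that order, without duplicates (e.g. "a" has only two forms)."""
    variants = []
    for form in (word.lower(), word.capitalize(), word.upper()):
        if form not in variants:
            variants.append(form)
    return variants


def expand_word_list(words):
    """Expand every non-empty word into its case variants, keeping the
    first occurrence of each variant and the original order."""
    seen = set()
    expanded = []
    for raw in words:
        word = raw.strip()
        if not word:
            continue
        for form in case_variants(word):
            if form not in seen:
                seen.add(form)
                expanded.append(form)
    return expanded


def ambig_line(recognized, intended, mandatory=False):
    """Format one unicharambigs entry (assumed v1 layout, tab-separated).

    recognized: the characters the OCR engine produced
    intended:   the characters that should have been produced
    mandatory:  True writes type flag 1, False writes type flag 0
    """
    src = list(recognized)
    dst = list(intended)
    fields = [
        str(len(src)), " ".join(src),
        str(len(dst)), " ".join(dst),
        "1" if mandatory else "0",
    ]
    return "\t".join(fields)


def build_unicharambigs(pairs):
    """Build the whole file text from (recognized, intended) pairs,
    starting with the assumed "v1" header line."""
    lines = ["v1"] + [ambig_line(rec, good) for rec, good in pairs]
    return "\n".join(lines) + "\n"
```

For example, expand_word_list(["and"]) gives the three forms to write
into eng.words_list, and build_unicharambigs([("rn", "m")]) gives the
text for eng.unicharambigs. The pair ("rn", "m") is only an
illustration; use the confusions you actually noted on your images.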
Regards,

-haydar

On Jun 25, 6:46 am, Sandeep Parmar <[email protected]>
wrote:
> Hi Haydar,
>
> Thanks for replying, but I have the following queries:
>
>    1. Did your trained data improve the overall accuracy of OCR?
>    2. Did your trained data give correct results for the images which
>    the original trained data identified correctly?
>    3. Can you send me the TIF and BOX files that you used for
>    training?
>
> Regards
> Sandeep
>
> On Fri, Jun 24, 2011 at 8:05 PM, Haydar <[email protected]> wrote:
> > Hi,
> > I have also trained Tesseract for English on my own, and on some
> > images I got more successful results than with eng.traineddata.
> > Here is what I did:
> > - I tried eng.traineddata on my images and noted the wrongly
> > recognized characters (e.g. T -> ' I ').
> > - I created an eng.unicharambigs file from those I noted down.
> > - Then I found a 240000-word English dictionary via Google and
> > created all the possibilities of each word, such as "and", "And",
> > "AND", which resulted in an approx. 720000-word dictionary file
> > (eng.words_list -> eng.words-dawg).
> > - I found nearly 4000 frequently used English words
> > (eng.freq_word_list -> eng.freq-dawg).
> > - Then I followed the procedure from the link
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 and
> > that's it.
> > Hope it helps...
>
> > -haydar
>
> > On Jun 24, 7:14 am, Sandeep Parmar <[email protected]>
> > wrote:
> > > Hi all,
>
> > > I am evaluating Tesseract for my project and I found that it's very
> > > good compared to other free OCRs. However, I have some doubts
> > > regarding training Tesseract 3.0 for new font types. I did two
> > > things while training Tesseract:
>
> > > 1) I made a text document containing all the alphabets, numbers and
> > >    ASCII characters in different fonts like Times New Roman, Arial,
> > >    Verdana, Comic Sans etc. I printed them all out and then scanned
> > >    them to make TIF images. Then I followed the steps for training
> > >    Tesseract 3.0 at
> > >    http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> > >    But the result I got from my trained data was not comparable to
> > >    the 'eng.traineddata' provided by default; it was very poor.
>
> > > 2) Then I decided to make a traineddata from the TIF & BOX files for
> > >    Tesseract 2.04 provided by Tesseract at
>
> > >    http://code.google.com/p/tesseract-ocr/downloads/detail?name=boxtiff-...
> > >    I successfully created my own 'eng.traineddata' from these and
> > >    got improved results compared to my first approach. But the
> > >    output of the second approach differed slightly from the output
> > >    I got from the original 'eng.traineddata'.
>
> > >    Also, the size of my trained data was less than that of the
> > >    'eng.traineddata' provided by Tesseract3.0.exe (Windows installer).
>
> > > Please suggest what could be the reason for such differences.
>
> > > Regards
> > > Sandeep
>
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> >http://groups.google.com/group/tesseract-ocr?hl=en
