Re: updating traineddata in Tesseract 3.00

yem Thu, 11 Aug 2011 19:55:13 -0700

Dmitri

Thanks for the response. I was planning to head down this path
originally, but I wanted to see how everyone else was getting theirs
to work. Your answer was very much appreciated.
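
For anyone following along, the path Dmitri describes in the quoted reply below (retrain from *all* image/box pairs, old and new, then repack) could be sketched roughly as follows. This is only a sketch under assumptions: the helper name training_commands and the file stems eng.oldfont.exp0 / eng.newfont.exp0 are hypothetical, and the command lines follow the Tesseract 3.00 steps as I understand them from the TrainingTesseract3 wiki page linked in the thread, so check them against that page before relying on them. The script only lists the commands; it does not run anything.

```python
def training_commands(lang, stems):
    """List the training commands for a set of image/box pairs.

    `stems` are hypothetical base names such as "eng.oldfont.exp0", each
    assumed to have a matching .tif and .box file. Nothing is executed.
    """
    commands = []

    # Generate a .tr feature file from every image/box pair, old and new.
    for stem in stems:
        commands.append(["tesseract", stem + ".tif", stem, "nobatch", "box.train"])

    # Build the character set from all box files together.
    commands.append(["unicharset_extractor"] + [s + ".box" for s in stems])

    # Cluster the features over all .tr files together.
    tr_files = [s + ".tr" for s in stems]
    commands.append(
        ["mftraining", "-U", "unicharset", "-O", lang + ".unicharset"] + tr_files
    )
    commands.append(["cntraining"] + tr_files)

    # Give the training outputs the "lang." prefix that combine_tessdata expects.
    for name in ("inttemp", "normproto", "pffmtable"):
        commands.append(["mv", name, lang + "." + name])

    # Pack everything into lang.traineddata.
    commands.append(["combine_tessdata", lang + "."])
    return commands


if __name__ == "__main__":
    for command in training_commands("eng", ["eng.oldfont.exp0", "eng.newfont.exp0"]):
        print(" ".join(command))
```

The point of listing every stem in each step is that the character set and the clustered prototypes are rebuilt from the full set of samples, which is why the old source image/box pairs are needed and not just the new ones.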


On Aug 10, 9:31 pm, Dmitri Silaev <[email protected]> wrote:
> It is a known limitation with traineddata files. You cannot *update* a
> traineddata file, you can just *overwrite* some component within it.
> To *add* your new trained samples, you need the old source image/box
> file pairs as well as the new ones, then run "mftraining", and so on
> as usual. Since Google is holding back source data files for English,
> you have no other way to achieve what you want except training for
> *all* characters by yourself.
>
> HTH
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Thu, Aug 11, 2011 at 5:14 AM, yem <[email protected]> wrote:
> > Hey everyone. I've spent the last week learning how to use
> > Tesseract by following this guide, and have found it very good and
> > useful:
>
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> > The only problem is that I'm trying to update the traineddata I
> > downloaded from the download area, but I can't get it to update. The
> > file name is:
>
> > eng.traineddata.gz
>
> > I've used "combine_tessdata eng." and the new traineddata works; I
> > tested it by putting it in the tessdata directory. The only problem
> > is that I can't update tessdata/eng.traineddata correctly with my new
> > traineddata. I tried the following:
>
> > 1. combine_tessdata eng. // to see if I can generate the traineddata
> > 2. combine_tessdata -u eng.traineddata eng. // I want to unpack the
> > files so I know which ones I can use with the "overwrite" command
> > 3. combine_tessdata -o eng.traineddata eng.file1 eng.file2 .... // I
> > take the files that were unpacked. I've tried taking some files or all
> > files, but it won't update the traineddata correctly. I know this is an
> > overwrite command.
>
> > The original image I was working with is read correctly after I
> > overwrote the traineddata with the new files. But when I read other
> > images, it takes whatever character it has available to fill in the
> > boxes. For example:
>
> > "TEST5" was changed to "TESTS" // changed the number '5' to the letter
> > 'S'. The output came out as TESTS, just as expected.
>
> > For another image, I used tesseract with the new traineddata and I
> > get:
>
> > "5 DOLLARS" will be read as S ESESESES // which is understandable
> > since the new character set has been limited to whatever I just
> > defined
>
> > But I want to continue updating the current training data and not just
> > overwrite what already works. How would I update the current
> > traineddata with new traineddata? Which files would I need to
> > overwrite? Thank you for your responses.
>
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en

