Re: updating traineddata in Tesseract 3.00

yem Mon, 15 Aug 2011 20:00:20 -0700

Hello, I am now having trouble using the "mftraining" command to
cluster ".tr" files. I created traineddata from each box/img
file pair and its .tr file individually, and both worked. I then ran
through all the steps again to put them together, such as
"unicharset_extractor" for the box0 and box1 files, and those were
successful as well.


When I tried to run "mftraining" with the .tr files, I got the error
shown below:


C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp0.tr eng.lucidaconsole.exp1.tr
Reading eng.lucidaconsole.exp0.tr ...
lucidaconsole has no defined properties.
Reading eng.lucidaconsole.exp1.tr ...

Writing Merged Microfeat ...Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file .\intproto.cpp, line 1268

mftraining then crashes. How can I bring all of these .tr files
together? I've tried passing the files in both orders:

C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp0.tr eng.lucidaconsole.exp1.tr

and

C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp1.tr eng.lucidaconsole.exp0.tr
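
For reference, here is a minimal sketch (not from the original post) of
assembling the same mftraining call for any number of .tr files. The
helper name build_mftraining_args is hypothetical. The -U and -O flags
and the file names are taken from the commands above. The sketch only
builds the argument list; it does not run mftraining, and it does not
address the assertion failure itself.

```python
def build_mftraining_args(tr_files,
                          unicharset="unicharset",
                          out_unicharset="eng.unicharset"):
    """Build the argument list for the mftraining call shown in the post.

    Hypothetical helper for illustration only. The flags (-U, -O) and the
    default file names mirror the commands quoted in the thread.
    """
    if not tr_files:
        raise ValueError("at least one .tr file is required")

    # Reject anything that is not a .tr feature file, so a stray .box or
    # image name is caught before the command is assembled.
    not_tr = [name for name in tr_files if not name.endswith(".tr")]
    if not_tr:
        raise ValueError("not .tr files: " + ", ".join(not_tr))

    # Sort so the same set of files always yields the same command line.
    return ["mftraining", "-U", unicharset, "-O", out_unicharset,
            *sorted(tr_files)]


if __name__ == "__main__":
    args = build_mftraining_args([
        "eng.lucidaconsole.exp1.tr",
        "eng.lucidaconsole.exp0.tr",
    ])
    # Print the command as it would be typed at the prompt.
    print(" ".join(args))
```

Sorting the file names only makes the call reproducible; as the two
attempts above show, swapping the order did not change the outcome.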



On Aug 11, 12:01 pm, yem <[email protected]> wrote:
> Dmitri
>
> Thanks for the response. I was planning to head down this path
> originally, but I wanted to see how everyone else was getting theirs
> to work. Your answer was very much appreciated.
>
> On Aug 10, 9:31 pm, Dmitri Silaev <[email protected]> wrote:
>
> > It is a known limitation of traineddata files. You cannot *update* a
> > traineddata file; you can only *overwrite* some component within it.
> > To *add* your new trained samples, you need the old source image/box
> > file pairs as well as the new ones, then run "mftraining", and so on
> > as usual. Since Google is holding back source data files for English,
> > you have no other way to achieve what you want except training for
> > *all* characters by yourself.
>
> > HTH
>
> > Warm regards,
> > Dmitri Silaev
> > www.CustomOCR.com
>
> > On Thu, Aug 11, 2011 at 5:14 AM, yem <[email protected]> wrote:
> > > Hey everyone. I've spent the last week learning how to use
> > > Tesseract by following this guide, and I've found it to be very
> > > good and useful:
>
> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> > > The only problem is that I'm trying to update the traineddata I
> > > downloaded from the download area, but I can't update it. The file's
> > > name is:
>
> > > eng.traineddata.gz
>
> > > I've used "combine_tessdata eng." and the new traineddata works; I
> > > tested it by putting it in the tessdata directory. The only
> > > problem is that I can't update tessdata/eng.traineddata correctly
> > > with my new traineddata. I tried the following:
>
> > > 1. combine_tessdata eng. // to see if I can generate the traineddata
> > > 2. combine_tessdata -u eng.traineddata eng. // to unpack the
> > > files so I know which ones I can pass to the "overwrite" command
> > > 3. combine_tessdata -o eng.traineddata eng.file1 eng.file2 .... // I
> > > take the files that were unpacked. I've tried taking some of the
> > > files or all of them, but it won't update the traineddata correctly.
> > > I know this is an overwrite command.
>
> > > The original image I was working with is read correctly after I
> > > overwrote the traineddata with the new files. But when I read other
> > > images, it uses whatever characters it has available to fill in the
> > > boxes. For example:
>
> > > "TEST5" was changed to "TESTS" // I changed the number '5' to the
> > > letter 'S'. The output came out as TESTS, just as expected.
>
> > > For another image, I used tesseract with the new traineddata and I
> > > get:
>
> > > "5 DOLLARS" is read as "S ESESESES" // which is understandable,
> > > since the new character set has been limited to whatever I just
> > > defined.
>
> > > But I want to continue updating the current training data and not just
> > > overwrite what already works. How would I update the current
> > > traineddata with new traineddata? Which files would I need to
> > > overwrite? Thank you for your responses.
>
> > > --
> > > You received this message because you are subscribed to the Google
> > > Groups "tesseract-ocr" group.
> > > To post to this group, send email to [email protected]
> > > To unsubscribe from this group, send email to
> > > [email protected]
> > > For more options, visit this group at
> > > > http://groups.google.com/group/tesseract-ocr?hl=en

