Hello, I am now having trouble using the "mftraining" command to cluster ".tr" files. I've created traineddata from each box/image pair and its .tr file individually, and each one worked. I then ran through all the steps again to put them together, such as running "unicharset_extractor" on the box0 and box1 files, and those steps were successful as well.
When I tried to run "mftraining" on the .tr files, I got the error shown below:

C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp0.tr eng.lucidaconsole.exp1.tr
Reading eng.lucidaconsole.exp0.tr ...
lucidaconsole has no defined properties.
Reading eng.lucidaconsole.exp1.tr ...
Writing Merged Microfeat ...Class->NumConfigs == this->fontset_table_.get(Class->font_set_id).size:Error:Assert failed:in file .\intproto.cpp, line 1268

mftraining then crashes. How can I bring all of these .tr files together? I've tried different orders, such as:

C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp0.tr eng.lucidaconsole.exp1.tr

and

C:\Program Files (x86)\Tesseract-OCR>mftraining -U unicharset -O eng.unicharset eng.lucidaconsole.exp1.tr eng.lucidaconsole.exp0.tr

On Aug 11, 12:01 pm, yem <[email protected]> wrote:
> Dmitri
>
> Thanks for the response. I was planning to head down this path
> originally but I wanted to see how everyone else was getting theirs
> to work. Your answer was very much appreciated.
>
> On Aug 10, 9:31 pm, Dmitri Silaev <[email protected]> wrote:
> > It is a known limitation with traineddata files. You cannot *update* a
> > traineddata file, you can just *overwrite* some component within it.
> > To *add* your new trained samples, you need the old source image/box
> > file pairs as well as the new ones, then run "mftraining", and so on
> > as usual. Since Google is holding back source data files for English,
> > you have no other way to achieve what you want except training for
> > *all* characters by yourself.
> >
> > HTH
> >
> > Warm regards,
> > Dmitri Silaev
> > www.CustomOCR.com
> >
> > On Thu, Aug 11, 2011 at 5:14 AM, yem <[email protected]> wrote:
> > > Hey everyone.
> > > I've spent the last week learning how to use
> > > Tesseract and found it to be very good and useful, following this
> > > guide:
> > >
> > > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> > >
> > > The only problem is I'm trying to update the traineddata I downloaded
> > > from the download area but I can't update it. The file's name is:
> > >
> > > eng.traineddata.gz
> > >
> > > I've used "combine_tessdata eng." and the new traineddata works, as I
> > > have tested it by putting it in the tessdata directory. The only
> > > problem is I can't update tessdata/eng.traineddata correctly with
> > > my new traineddata. I tried the following:
> > >
> > > 1. combine_tessdata eng. // to see if I can generate the traineddata
> > > 2. combine_tessdata -u eng.traineddata eng. // I want to unpack the
> > > files so I know which ones I can use with the "overwrite" command
> > > 3. combine_tessdata -o eng.traineddata eng.file1 eng.file2 .... // I
> > > take the files that were unpacked. I've tried taking some files or all
> > > files but it won't update the traineddata correctly. I know this is an
> > > overwrite command.
> > >
> > > The original image I was working with is read correctly after I
> > > overwrote the traineddata with the new files. But when I read other
> > > images it takes whatever character it has available to fill in the
> > > boxes. For example:
> > >
> > > "TEST5" was changed to "TESTS" // changed the number '5' to the letter
> > > 'S'. The output came out as TESTS just as expected.
> > >
> > > For another image I used tesseract with the new traineddata and I
> > > get:
> > >
> > > "5 DOLLARS" will be read as S ESESESES // which is understandable
> > > since the new character set has been limited to whatever I just
> > > defined
> > >
> > > But I want to continue updating the current training data and not just
> > > overwrite what already works. How would I update the current
> > > traineddata with new traineddata? Which files would I need to
> > > overwrite?
> > > Thank you for your responses.
> > >
> > > --
> > > You received this message because you are subscribed to the Google
> > > Groups "tesseract-ocr" group.
> > > To post to this group, send email to [email protected]
> > > To unsubscribe from this group, send email to
> > > [email protected]
> > > For more options, visit this group at
> > > http://groups.google.com/group/tesseract-ocr?hl=en
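
For anyone following the thread: the "lucidaconsole has no defined properties" message in the mftraining run above is what mftraining prints when it finds no font_properties entry for the font. The TrainingTesseract3 guide describes a font_properties file (one line per font: name, then italic, bold, fixed, serif, fraktur as 0 or 1) that is passed to mftraining with -F. Whether the missing entry is also what triggers the assert in intproto.cpp is not confirmed here, so treat this as a sketch to try, not a verified fix. The Python below only builds the command lines for one full retraining pass over all box/image pairs, as Dmitri suggests; it does not run anything. The helper names, the .tif image extension, and the 0/1 flags chosen for lucidaconsole are assumptions for illustration.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """Build one font_properties entry:
    <fontname> <italic> <bold> <fixed> <serif> <fraktur>, each flag 0 or 1."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(bool(f))) for f in flags])


def training_commands(lang, font, exps, image_ext="tif"):
    """Return the command lines (as argument lists) for one retraining pass.

    lang, font and exps describe files named <lang>.<font>.exp<N>.*,
    e.g. eng.lucidaconsole.exp0.box. image_ext is an assumption; use
    whatever image format your box files were made from.
    """
    bases = [f"{lang}.{font}.exp{n}" for n in exps]
    boxes = [b + ".box" for b in bases]
    trs = [b + ".tr" for b in bases]

    cmds = []
    # 1. Regenerate a .tr file from every image/box pair.
    for b in bases:
        cmds.append(["tesseract", f"{b}.{image_ext}", b, "nobatch", "box.train"])
    # 2. Build one unicharset from all box files together.
    cmds.append(["unicharset_extractor"] + boxes)
    # 3. Cluster all .tr files in a single mftraining call.
    #    -F font_properties follows the TrainingTesseract3 guide; it is an
    #    assumption that the installed build accepts it.
    cmds.append(["mftraining", "-F", "font_properties",
                 "-U", "unicharset", "-O", f"{lang}.unicharset"] + trs)
    # 4. Character normalization training on the same .tr files.
    cmds.append(["cntraining"] + trs)
    return cmds


if __name__ == "__main__":
    # Lucida Console is a monospaced sans-serif font, hence fixed=1.
    print(font_properties_line("lucidaconsole", fixed=1))
    for cmd in training_commands("eng", "lucidaconsole", [0, 1]):
        print(" ".join(cmd))
```

After these steps, the guide has you rename the output files (inttemp, normproto, pffmtable) with the "eng." prefix and then run "combine_tessdata eng." as already used earlier in this thread.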

