This is a known limitation of traineddata files. You cannot *update* a traineddata file; you can only *overwrite* individual components within it. To *add* your new training samples, you need the old source image/box file pairs as well as the new ones, and then you run "mftraining" and the remaining training steps on the combined set as usual. Since Google has not released the source data files for English, the only way to achieve what you want is to train *all* characters yourself.
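As a rough sketch of what "old pairs plus new pairs, then mftraining and so on" looks like in practice, the function below only prints the commands from the TrainingTesseract3 guide for a combined set of pages. The function name plan_retrain, the LANG_CODE variable, and the page base names are placeholders of mine, not part of Tesseract. The mftraining flags follow the 3.00 form of the guide; later releases take additional arguments (3.01 adds a font_properties file), so check the guide for your version before running anything.

```shell
#!/bin/sh
# Sketch only: prints (does not run) the Tesseract 3.0x retraining commands
# for a combined set of old and new training pages.
# LANG_CODE and the base names passed in are hypothetical placeholders.

LANG_CODE="${LANG_CODE:-eng}"

plan_retrain() {
    boxes=""
    trs=""
    for base in "$@"; do
        # Each base needs both <base>.tif and <base>.box on disk.
        echo "tesseract $base.tif $base nobatch box.train"
        boxes="$boxes $base.box"
        trs="$trs $base.tr"
    done
    # Character set is rebuilt from ALL box files, old and new together.
    echo "unicharset_extractor$boxes"
    # Flags as in the 3.00 guide; newer versions need extra arguments.
    echo "mftraining -U unicharset -O $LANG_CODE.unicharset$trs"
    echo "cntraining$trs"
    # Give the generated files the language prefix combine_tessdata expects.
    for f in inttemp normproto pffmtable; do
        echo "mv $f $LANG_CODE.$f"
    done
    echo "combine_tessdata $LANG_CODE."
}
```

For example, `plan_retrain eng.oldfont.exp0 eng.newfont.exp0` lists the commands for one old and one new page; once the list looks right for your setup, it can be piped to `sh`. The point is that every step sees the full set of pages, which is why the original English sources would be needed.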
HTH

Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Thu, Aug 11, 2011 at 5:14 AM, yem <[email protected]> wrote:
> Hey everyone. I've spent the last week learning how to use
> tesseract and found it to be very good and useful, following this
> guide:
>
> http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>
> The only problem is I'm trying to update the traineddata I downloaded
> from the download area but I can't update it. The file's name is:
>
> eng.traineddata.gz
>
> I've used "combine_tessdata eng." and the new traineddata works, as I
> have tested it by putting it in the tessdata directory. The only
> problem is I can't update the tessdata/eng.traineddata correctly with
> my new traineddata. I tried the following:
>
> 1. combine_tessdata eng. // to see if I can generate the traineddata
> 2. combine_tessdata -u eng.traineddata eng. // I want to unpack the
> files so I know what I can use the "overwrite" command to get these
> 3. combine_tessdata -o eng.traineddata eng.file1 eng.file2 .... // I
> take the files that were unpacked. I've tried taking some files or all
> files but it won't update the traineddata correctly. I know this is an
> overwrite command.
>
> The original image I was working with is read correctly after I
> overwrote the traineddata with the new files. But when I read other
> images it takes whatever character it has available to fill in the
> boxes. For example:
>
> "TEST5" was changed to "TESTS" // changed the number '5' to the letter
> 'S'. The output came out as TESTS just as expected
>
> For another image I used tesseract with the new traineddata and I
> get:
>
> "5 DOLLARS" will be read as S ESESESES // which is understandable
> since the new character set has been limited to whatever I just
> defined
>
> But I want to continue updating the current training data and not just
> overwrite what already works. How would I update the current
> traineddata with new traineddata? Which files would I need to
> overwrite?
>
> Thank you for your responses.
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en

