Dmitri, thanks for the response. I was planning to head down this path originally, but I wanted to see how everyone else was getting theirs to work. Your answer was very much appreciated.

On Aug 10, 9:31 pm, Dmitri Silaev <[email protected]> wrote:
> It is a known limitation with traineddata files. You cannot *update* a
> traineddata file, you can just *overwrite* some component within it.
> To *add* your new trained samples, you need the old source image/box
> file pairs as well as the new ones, then run "mftraining", and so on
> as usual. Since Google is holding back source data files for English,
> you have no other way to achieve what you want except training for
> *all* characters by yourself.
>
> HTH
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
> On Thu, Aug 11, 2011 at 5:14 AM, yem <[email protected]> wrote:
> > Hey everyone. I've spent the last week learning how to use
> > tesseract and found it to be very good and useful, following this
> > guide:
> >
> > http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
> >
> > The only problem is I'm trying to update the traineddata I downloaded
> > from the download area but I can't update it. The file's name is:
> >
> > eng.traineddata.gz
> >
> > I've used "combine_tessdata eng." and the new traineddata works, as I
> > have tested it by putting it in the tessdata directory. The only
> > problem is I can't update the tessdata/eng.traineddata correctly with
> > my new traineddata. I tried the following:
> >
> > 1. combine_tessdata eng. // to see if I can generate the traineddata
> > 2. combine_tessdata -u eng.traineddata eng. // I want to unpack the
> > files so I know which ones I can use with the "overwrite" command
> > 3. combine_tessdata -o eng.traineddata eng.file1 eng.file2 .... // I
> > take the files that were unpacked. I've tried taking some files or all
> > files but it won't update the traineddata correctly. I know this is an
> > overwrite command.
> >
> > The original image I was working with is read correctly after I
> > overwrote the traineddata with the new files. But when I read other
> > images it takes whatever character it has available to fill in the
> > boxes. For example:
> >
> > "TEST5" was changed to "TESTS" // changed the number '5' to the letter
> > 'S'. The output came out as TESTS just as expected.
> >
> > For another image I used tesseract with the new traineddata and I
> > get:
> >
> > "5 DOLLARS" will be read as S ESESESES // which is understandable
> > since the new character set has been limited to whatever I just
> > defined
> >
> > But I want to continue updating the current training data and not just
> > overwrite what already works. How would I update the current
> > traineddata with new traineddata? Which files would I need to
> > overwrite? Thank you for your responses.
> >
> > --
> > You received this message because you are subscribed to the Google
> > Groups "tesseract-ocr" group.
> > To post to this group, send email to [email protected]
> > To unsubscribe from this group, send email to
> > [email protected]
> > For more options, visit this group at
> > http://groups.google.com/group/tesseract-ocr?hl=en
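
For anyone following the thread, here is a rough sketch of the retraining flow Dmitri describes: gather the old image/box pairs together with the new ones and run the whole training sequence over the combined set. It only builds and prints the command lines, it does not run anything. The function name and the file base names are made up for illustration, and the command lines follow the TrainingTesseract3 wiki as I understand it, so check them against your Tesseract version (newer versions may also want a font_properties file passed to mftraining). Optional dictionary steps are left out.

```python
def build_training_commands(lang, old_bases, new_bases):
    """Build the training command lines for the union of old and new samples.

    Each base name (e.g. "eng.myfont.exp0", hypothetical) is assumed to have
    a matching .tif image and .box file next to it.
    """
    # Training must cover ALL samples, old and new, not only the additions.
    bases = sorted(set(old_bases) | set(new_bases))
    cmds = []

    # 1. Produce a .tr feature file from every image/box pair.
    for base in bases:
        cmds.append(["tesseract", base + ".tif", base, "nobatch", "box.train"])

    # 2. Build the character set from every box file.
    cmds.append(["unicharset_extractor"] + [b + ".box" for b in bases])

    # 3. Cluster features over all .tr files.
    tr_files = [b + ".tr" for b in bases]
    cmds.append(["mftraining", "-U", "unicharset",
                 "-O", lang + ".unicharset"] + tr_files)
    cmds.append(["cntraining"] + tr_files)

    # 4. Manual step not shown: prefix the generated files
    #    (such as inttemp, pffmtable, normproto) with "<lang>." first.
    cmds.append(["combine_tessdata", lang + "."])
    return cmds


if __name__ == "__main__":
    # Hypothetical base names, for illustration only.
    for cmd in build_training_commands(
            "eng", ["eng.oldfont.exp0"], ["eng.newfont.exp0"]):
        print(" ".join(cmd))
```

The point of taking the union first is that the resulting unicharset and prototypes then cover every character, which is what avoids the "5 DOLLARS" read as "S ESESESES" effect described above. Without the original English source pairs, the old set has to be recreated by hand.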

