While I do not appreciate the prickliness of your response, you did help me see what the problem was by providing a working example based on mine, so thank you nonetheless. The y-coordinates are swapped in the auto-generated box file produced by TesseractTrainer. It should have been obvious from what I was seeing in cowboxer; I guess I was staring at it too long.
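
For anyone who lands here with the same symptom, here is a minimal sketch of the kind of fix involved. It is not the actual change made to TesseractTrainer. It assumes the six-field box-file layout used by Tesseract 3.0x (`glyph left bottom right top page`, with the origin at the bottom-left of the image), and it reads "swapped" as "the two y fields are in the wrong order". If the values were instead measured from the top edge of the image, each y would have to be subtracted from the image height. The function names are hypothetical.

```python
def normalize_box_line(line: str) -> str:
    """Return one box-file line with its two y fields ordered bottom <= top.

    Assumes the six-field layout: glyph left bottom right top page.
    """
    # rsplit from the right so the glyph field is left untouched.
    glyph, left, y1, right, y2, page = line.strip().rsplit(" ", 5)
    bottom, top = sorted((int(y1), int(y2)))
    return f"{glyph} {left} {bottom} {right} {top} {page}"


def normalize_box_text(text: str) -> str:
    """Apply normalize_box_line to every non-empty line of a box file."""
    return "\n".join(
        normalize_box_line(line) for line in text.splitlines() if line.strip()
    )
```

Running `normalize_box_text` over the contents of a generated `.box` file and checking the result in cowboxer is a quick way to confirm whether field order is really the problem before touching the generator itself.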
The preprocessing, image size, cropping and compression had nothing to do with it; it works just as well on an uncompressed .tif as on a compressed .png. I also discovered that TesseractTrainer ignores the baseline location of the font in the boxes it generates, so I modified it to subtract that offset, and presto: working auto-generated training imagery. Pretty slick. Kudos go to Balthazar Rouberol, who wrote this little gem.

PS: Last I checked, broadband connections and Google's file servers are easily capable of handling a 1.6MB tiff, which I included because that is natively what TesseractTrainer generates (uncompressed .tif). If you are still on a 56k modem, then you have my apologies for eating your bandwidth...

On Monday, October 28, 2013 2:58:27 PM UTC-6, zdenop wrote:
> Well, if you searched this forum profusely, you could already know that
> training for "common" font is a waste of time ;-)
> There is nobody (including the most experienced members of this forum) who got
> better OCR results by re-training a "common" font (like arial, times). If
> this statement is not true, then let me know and send the proof :-) I will
> create a tesseract hall-of-fame for you ;-)
>
> I would suggest you focus on image pre-processing (=> making it optimal
> for OCR) rather than tesseract training.
>
> Next: if you get strange output, check if it is not because of the input.
> See what simple cropping of the image can do: eng.arial.exp1.png
>
> then:
>
> 1. tesseract eng.arial.exp1.png eng.arial.exp1 makebox
> 2. check&edit box file
> 3. tesseract eng.arial.exp1.png eng.arial.exp1 -psm 7 box.train
>
> And here we are:
> Tesseract Open Source OCR Engine v3.02 with Leptonica
> APPLY_BOXES:
> Boxes read from boxfile: 31
> Found 31 good blobs.
> TRAINING ... Font name = arial
> Generated training data for 8 words
>
> PS: Sending an image with one short line of text at 1.6 MB is not a very good
> idea. Using compression or a better image format would be more efficient.
> See size of eng.arial.exp1.png
>
> Zdenko
>
> On Mon, Oct 28, 2013 at 3:29 AM, Jonathan Nikkel <[email protected]> wrote:
>
>> Hey there,
>>
>> I am a Tesseract novice, and would like to solicit some help/advice from
>> you smart folks. I will preface by saying that I have read the FAQ,
>> searched this forum profusely, read all of the topics, and tried all the
>> suggestions/advice I found, with no luck so far. This is probably not a
>> difficult one, I assume I must be missing something stupid, but hey, that
>> is why we have forums like these =).
>>
>> What I am using:
>> Windows 7 box
>> Tesseract v3.02
>> TesseractTrainer (auto-generated .tif's based on input training text,
>> automates the training process)
>>
>> I am able to successfully train the off-the-shelf arial training data
>> included with the Tesseract dev files.
>>
>> I am now trying to train a custom data set with the Arial font (no mods,
>> standard installed with Windows) using this setup to make sure I understand
>> this training process/code, and am setting things up correctly, before
>> moving on to more complex fonts.
>>
>> I am getting 100% failures in blob recognition/box resegmentation, and am
>> puzzled as to why. I have tried numerous combinations of character
>> spacing, line spacing, font size, image bit depth (I am now using a binary
>> image), and DPI (using 300 dpi, 3600x3600 now, to be consistent with the
>> example trainings), and am trying to home in on a font size that
>> achieves an xheight of 25 pixels. I have checked the box file accuracy
>> using cowboxer, and it appears I am getting accurate boxes.
>>
>> Attached are some example files; I have tried alternative character
>> spacings from nearly touching, up to about double what you see here. I
>> have tried all of the pageseg modes, using *{prefix}.tif {prefix}
>> nobatch box.train* parameters. Pageseg mode 4 crashes, the rest
>> generate 100% resegmentation errors.
>>
>> Where am I going wrong?
>> Anyone have a working example setup with TesseractTraining they can share?
>>
>> Regards,
>>
>> -Jon
>>
>> --
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to [email protected]
>> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.

