Well, if you had searched this forum profusely, you would already know that training for a "common" font is a waste of time ;-) Nobody (including the most experienced members of this forum) has gotten better OCR results by re-training a "common" font (like Arial or Times). If this statement is not true, then let me know and send the proof :-) I will create a tesseract hall of fame for you ;-)
I would suggest you focus on image pre-processing (making the image optimal for OCR) rather than on tesseract training.

Next: if you get strange output, check whether the input is the cause. See what simple cropping of the image can do: eng.arial.exp1.png

Then:
1. tesseract eng.arial.exp1.png eng.arial.exp1 makebox
2. check & edit the box file
3. tesseract eng.arial.exp1.png eng.arial.exp1 -psm 7 box.train

And here we are:

Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile: 31
   Found 31 good blobs.
TRAINING ... Font name = arial
Generated training data for 8 words

PS: Sending a 1.6 MB image that contains one short line of text is not a very good idea. Compression or a better image format would be more efficient. See the size of eng.arial.exp1.png

Zdenko

On Mon, Oct 28, 2013 at 3:29 AM, Jonathan Nikkel <[email protected]> wrote:
> Hey there,
>
> I am a Tesseract novice, and would like to solicit some help/advice from
> you smart folks. I will preface by saying that I have read the FAQ,
> searched this forum profusely, read all of the topics, and tried all the
> suggestions/advice I found, with no luck so far. This is probably not a
> difficult one, I assume I must be missing something stupid, but hey, that
> is why we have forums like these =).
>
> What I am using:
> Windows 7 box
> Tesseract v3.02
> TesseractTrainer (auto-generated .tif's based on input training text,
> automates the training process)
>
> I am able to successfully train the off-the-shelf arial training data
> included with the Tesseract dev files.
>
> I am now trying to train a custom data set with the Arial font (no mods,
> standard installed with windows) using this setup to make sure I understand
> this training process/code, and am setting things up correctly, before
> moving on to more complex fonts.
>
> I am getting 100% failures in blob recognition/box resegmentation, and am
> puzzled as to why.
> I have tried numerous combinations of character
> spacing, line spacing, font size, image bit depth (I am now using a binary
> image), DPI (using 300 dpi, 3600x3600 now, to be consistent with the
> example trainings), and am trying to home in using a font size that
> achieves an xheight of 25 pixels. I have checked the box file accuracy
> using cowboxer, and am getting accurate boxes it appears.
>
> Attached are some example files; I have tried alternative character
> spacings from nearly touching, up to about double what you see here. I
> have tried all of the pageseg modes, using *{prefix}.tif {prefix} nobatch
> box.train* parameters. Pageseg mode 4 crashes, the rest generate 100%
> resegmentation errors.
>
> Where am I going wrong? Anyone have a working example setup with
> TesseractTraining they can share?
>
> Regards,
>
> -Jon
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
Attachments: eng.arial.exp1.png, eng.arial.exp1.box
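The box-file check in step 2 of the reply can be partly automated before running box.train. The following is only a rough sketch, not part of Tesseract or of the original reply: it assumes the Tesseract 3.x box format of one character per line, written as `<char> <left> <bottom> <right> <top> <page>` with the origin at the bottom-left of the image. The names `Box`, `parse_box_lines`, and `suspicious_boxes`, and the `min_size` threshold, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (assumed Tesseract 3.x layout)."""
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(lines: Iterable[str]) -> List[Box]:
    """Parse box lines, silently skipping blank or malformed ones."""
    boxes = []
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # not '<char> <left> <bottom> <right> <top> <page>'
        try:
            left, bottom, right, top, page = (int(p) for p in parts[1:])
        except ValueError:
            continue  # a coordinate field was not an integer
        boxes.append(Box(parts[0], left, bottom, right, top, page))
    return boxes


def suspicious_boxes(boxes: Iterable[Box], min_size: int = 2) -> List[Box]:
    """Return boxes that are inverted or narrower/shorter than min_size pixels.

    min_size is an arbitrary illustrative threshold, not a Tesseract setting.
    """
    return [
        b for b in boxes
        if (b.right - b.left) < min_size or (b.top - b.bottom) < min_size
    ]
```

To try it on a real file, something like `parse_box_lines(open("eng.arial.exp1.box", encoding="utf-8"))` should work. The number of parsed boxes can then be compared with the "Boxes read from boxfile" count that tesseract reports, and anything returned by `suspicious_boxes` is worth inspecting by hand in a box editor before training.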

