On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP wrote:
>
> On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> wrote:
> >> As an added step, you could might consider: rendering to grayscale,
> >> slightly blurring (optional), adding a bit of noise, and then
> >> re-converting to b&w to simulate what physical scanners do? Maybe do
> >> this at 1200dpi and also downsample to 300 dpi.
> >
> > I wouldn't have thought adding random noise would be helpful; it
> > will just distort the shapes which Tesseract will use to match, and
> > as it will always get different noise to the type I generated, it
> > would only hinder it further. At least that's what I had assumed. Am
> > I wrong about this? Has anybody tested whether adding random noise
> > to an otherwise clean training improves things?
>
> I only suggested this because of the following quote from [1]:
>
>   Next print and scan (or use some electronic rendering method) to
>   create an image of your training page. Up to 32 training files can
>   be used (of multiple pages). It is best to create a mix of fonts and
>   styles (but in separate files), including italic and bold.
>
>   NOTE: training from real images is actually quite hard, due to the
>   spacing-out requirements. This will be improved in a future release.
>   For now it is much easier if you can print/scan your own training
>   text.
>
> The reoccurrence of the word SCAN led me to believe that they are
> suggesting actually physically scanning (which implies adding a bit of
> noise).
>
It depends on your document set, but if the input documents are *not* crystal clear (crystal clear being, e.g., 600 dpi images generated from PDFs), then it helps. For one of our test document sets we measured:

  without noise: 92% character accuracy
  with noise:    97% character accuracy

The interesting question is which algorithm(s) to use...

jozef

ps: my company plans to publish an article about Tesseract, including training and noise adding, where our experience and expertise will be described in more detail.

> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
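
For anyone who wants to experiment with the degradation steps Nick lists in the quoted message (grayscale, slight blur, a bit of noise, back to b&w, 1200 dpi down to 300 dpi), here is a minimal pure-Python sketch. It is not the method used by Tesseract or by anyone in this thread: the box blur, the Gaussian noise model, the default parameter values and all function names (`box_blur`, `add_gaussian_noise`, `downsample`, `threshold`, `simulate_scan`) are assumptions made for illustration, and downsampling before the final threshold is only one possible ordering.

```python
import random


def box_blur(img, radius=1):
    """Blur a grayscale image (list of rows of floats, 0..255) with a box filter."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # Average over the neighbourhood, clipped at the image border.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def add_gaussian_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to 0..255."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def downsample(img, factor):
    """Average factor x factor blocks (factor=4 maps 1200 dpi to 300 dpi)."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def threshold(img, cutoff=128):
    """Convert back to black (0) and white (255)."""
    return [[255 if p >= cutoff else 0 for p in row] for row in img]


def simulate_scan(bw, sigma=20.0, blur_radius=1, factor=4, cutoff=128, seed=0):
    """Hypothetical pipeline: blur, add noise, downsample, re-binarize."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    gray = [[float(p) for p in row] for row in bw]
    gray = box_blur(gray, blur_radius)
    gray = add_gaussian_noise(gray, sigma, rng)
    gray = downsample(gray, factor)
    return threshold(gray, cutoff)


# Toy "page": 16x16 white image with an 8x8 black square in the middle.
page = [[0 if 4 <= y < 12 and 4 <= x < 12 else 255 for x in range(16)]
        for y in range(16)]
scanned = simulate_scan(page)  # 4x4 black-and-white result
```

Note that averaging 4x4 blocks also averages away much of the per-pixel noise, so it may be worth trying the noise step after the downsampling as well. On real page images a library such as Pillow or NumPy would be far faster than these nested loops.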

