Hi Jozef,

Did your company ever publish the article about "tesseract including training and noise adding where our experience and expertise will be described in more detail"?

On Tuesday, October 16, 2012 9:54:48 PM UTC+5:30, jm wrote:
>
> On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP wrote:
>>
>> On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> wrote:
>> >> As an added step, you might consider: rendering to grayscale,
>> >> slightly blurring (optional), adding a bit of noise, and then
>> >> re-converting to b&w to simulate what physical scanners do? Maybe do
>> >> this at 1200dpi and also downsample to 300 dpi.
>> >
>> > I wouldn't have thought adding random noise would be helpful; it
>> > will just distort the shapes which Tesseract will use to match, and
>> > as it will always get different noise to the type I generated, it
>> > would only hinder it further. At least that's what I had assumed. Am
>> > I wrong about this? Has anybody tested whether adding random noise
>> > to an otherwise clean training improves things?
>>
>> I only suggested this because of the following quote from [1]:
>>
>> Next print and scan (or use some electronic rendering method) to create
>> an image of your training page. Up to 32 training files can be used (of
>> multiple pages). It is best to create a mix of fonts and styles (but in
>> separate files), including italic and bold.
>>
>> NOTE: training from real images is actually quite hard, due to the
>> spacing-out requirements. This will be improved in a future release. For
>> now it is much easier if you can print/scan your own training text.
>>
>> The reoccurrence of the word SCAN led me to believe that they are
>> suggesting actually physically scanning (which implies adding a bit of
>> noise).
>>
>
> It depends on your document set, but if the input document set is *not*
> crystal clear (e.g., 600dpi generated from pdfs) then it helps.
>
> For one of our test document sets:
>   without noise: 92% character accuracy
>   with noise:    97% character accuracy
>
> The interesting question is which algorithm(s) to use...
>
> jozef
>
> ps: my company plans to publish an article about tesseract including
> training and noise adding where our experience and expertise will be
> described in more detail
>
>>
>> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
>>
>

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en
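
For anyone who wants to try the idea from the quoted thread (render to grayscale, blur slightly, add a bit of noise, re-convert to b&w, optionally downsample from 1200 dpi to 300 dpi), here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the function names, the 3x3 box blur, the Gaussian noise with a default sigma of 20, the fixed threshold of 128, and the choice to downsample before the final threshold so the output stays strictly black and white. It is not the algorithm jozef's company used, which the thread does not describe.

```python
import random


def box_blur(img):
    """3x3 mean filter with edge clamping; img is a list of rows of 0-255 values."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total / 9.0)
        out.append(row)
    return out


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel, clipped to 0-255."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def downsample(img, factor):
    """Average factor x factor blocks (factor 4 takes 1200 dpi to 300 dpi).

    Trailing rows/columns that do not fill a whole block are dropped.
    """
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def binarize(img, threshold=128):
    """Convert grayscale back to pure black (0) and white (255)."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]


def simulate_scan(img, sigma=20.0, blur=True, factor=1, threshold=128, seed=None):
    """Hypothetical pipeline: optional blur, noise, optional downsample, threshold."""
    rng = random.Random(seed)  # pass a seed to get reproducible noise
    if blur:
        img = box_blur(img)
    img = add_noise(img, sigma, rng)
    if factor > 1:
        img = downsample(img, factor)
    return binarize(img, threshold)


if __name__ == "__main__":
    # Tiny synthetic "glyph": a black square on a white page.
    page = [[255] * 8 for _ in range(8)]
    for y in range(2, 6):
        for x in range(2, 6):
            page[y][x] = 0
    for row in simulate_scan(page, sigma=40.0, seed=1):
        print("".join("#" if p == 0 else "." for p in row))
```

For real training pages you would load the rendered image with an imaging library instead of using nested lists, and tune sigma and the threshold against a held-out set of actual scans, since the thread leaves open which noise model works best.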

