Re: Thoughts on having the training process take font files directly

jm Tue, 16 Oct 2012 10:01:23 -0700


On Tuesday, October 16, 2012 12:27:43 PM UTC+2, TP wrote:
>
> On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> wrote:
> >> As an added step, you might consider: rendering to grayscale, 
> >> slightly blurring (optional), adding a bit of noise, and then 
> >> re-converting to b&w to simulate what physical scanners do?  Maybe do 
> >> this at 1200dpi and also downsample to 300 dpi. 
> > 
> > I wouldn't have thought adding random noise would be helpful; it 
> > will just distort the shapes which Tesseract will use to match, and 
> > as it will always get different noise to the type I generated, it 
> > would only hinder it further. At least that's what I had assumed. Am 
> > I wrong about this? Has anybody tested whether adding random noise 
> > to an otherwise clean training improves things? 
>
> I only suggested this because of the following quote from [1]: 
>
>     Next print and scan (or use some electronic rendering method) to
>     create an image of your training page. Up to 32 training files can
>     be used (of multiple pages). It is best to create a mix of fonts and
>     styles (but in separate files), including italic and bold.
>
>     NOTE: training from real images is actually quite hard, due to the
>     spacing-out requirements. This will be improved in a future release.
>     For now it is much easier if you can print/scan your own training
>     text.
>
> The reoccurrence of the word SCAN led me to believe that they are 
> suggesting actually physically scanning (which implies adding a bit of 
> noise). 
>


It depends on your document set, but if the input documents are *not* 
crystal clear (unlike, e.g., 600dpi images generated from PDFs), then 
adding noise helps. 

For one of our test document sets:
  without noise: 92% character accuracy
  with noise:    97% character accuracy

The interesting question is which algorithm(s) to use...
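
To make the discussion concrete, here is a minimal sketch of the kind of
pipeline suggested earlier in the thread (blur, add noise, downsample,
convert back to b&w). It is only an illustration, not the method behind the
numbers above. Everything in it is an assumption: the function names, the
box blur, the Gaussian noise model, and the default values (sigma, cutoff,
a factor of 4 for 1200dpi to 300dpi). It thresholds after downsampling so
the result stays strictly black and white; the image is a plain list of rows
of 0..255 gray values, so nothing beyond the standard library is needed.

```python
import random


def box_blur(img, radius=1):
    """Average each pixel with its neighbours in a (2r+1)x(2r+1) window."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            # The window is clipped at the image borders.
            for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp the result to 0..255."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row]
            for row in img]


def downsample(img, factor):
    """Shrink by averaging each factor x factor block of pixels."""
    h, w = len(img) // factor, len(img[0]) // factor
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            block = [img[y * factor + dy][x * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def threshold(img, cutoff=128):
    """Convert gray values to pure black (0) or white (255)."""
    return [[0 if p < cutoff else 255 for p in row] for row in img]


def simulate_scan(img, blur_radius=1, sigma=20.0, factor=4, cutoff=128, seed=0):
    """Hypothetical 'fake scanner': blur, add noise, downsample, binarize."""
    rng = random.Random(seed)  # fixed seed makes the noise reproducible
    blurred = box_blur(img, blur_radius)
    noisy = add_noise(blurred, sigma, rng)
    small = downsample(noisy, factor)
    return threshold(small, cutoff)


if __name__ == "__main__":
    # Toy 16x16 "page": black left half, white right half.
    page = [[0] * 8 + [255] * 8 for _ in range(16)]
    for row in simulate_scan(page, sigma=40.0, seed=1):
        print("".join("#" if p == 0 else "." for p in row))
```

Which noise model and parameters are actually worth using is exactly the
open question; the values here are placeholders to be tuned against real
scans.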

jozef

PS: My company plans to publish an article about Tesseract, covering 
training and noise addition, that will describe our experience and 
expertise in more detail.

 

>
> [1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 
>
