Re: Thoughts on having the training process take font files directly

TP Tue, 16 Oct 2012 03:27:44 -0700

On Mon, Oct 15, 2012 at 2:27 PM, Nick White <[email protected]> wrote:
>> As an added step, you might consider: rendering to grayscale,
>> slightly blurring (optional), adding a bit of noise, and then
>> re-converting to b&w to simulate what physical scanners do?  Maybe do
>> this at 1200dpi and also downsample to 300 dpi.
>
> I wouldn't have thought adding random noise would be helpful; it
> will just distort the shapes which Tesseract will use to match, and
> as it will always get different noise to the type I generated, it
> would only hinder it further. At least that's what I had assumed. Am
> I wrong about this? Has anybody tested whether adding random noise
> to an otherwise clean training image improves things?


I only suggested this because of the following quote from [1]:

    Next print and scan (or use some electronic rendering method) to create
    an image of your training page. Up to 32 training files can be used (of
    multiple pages). It is best to create a mix of fonts and styles (but in
    separate files), including italic and bold.

    NOTE: training from real images is actually quite hard, due to the
    spacing-out requirements. This will be improved in a future release. For
    now it is much easier if you can print/scan your own training text.

The repeated use of the word "scan" led me to believe that they are
suggesting an actual physical scan (which implies adding a bit of
noise).
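
For concreteness, here is a rough sketch of the degradation steps I
described in my earlier message (blur, add noise, downsample, re-convert
to b&w). It is not anything Tesseract ships, and I have not tested
whether it helps training. Everything in it is an assumption made for
illustration: the function names, the 3x3 box blur, Gaussian noise with
sigma 0.05, the 0.5 cutoff, and a downsampling factor of 4 (1200 dpi to
300 dpi). The image is a plain list of rows, with 0 for paper and 1 for
ink.

```python
import random


def box_blur(img):
    """3x3 box blur; edge pixels average only the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total, count = 0.0, 0
            for yy in range(max(0, y - 1), min(h, y + 2)):
                for xx in range(max(0, x - 1), min(w, x + 2)):
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp each pixel to [0, 1]."""
    return [[min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in row]
            for row in img]


def downsample(img, factor):
    """Average each factor x factor block into one pixel."""
    h, w = len(img) // factor, len(img[0]) // factor
    area = factor * factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / area
             for x in range(w)]
            for y in range(h)]


def threshold(img, cutoff):
    """Convert grayscale back to b&w: 1 (ink) at or above the cutoff."""
    return [[1 if v >= cutoff else 0 for v in row] for row in img]


def simulate_scan(img, factor=4, sigma=0.05, cutoff=0.5, seed=0):
    """Hypothetical pipeline: blur, noise, downsample, threshold."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    gray = [[float(v) for v in row] for row in img]
    gray = box_blur(gray)
    gray = add_noise(gray, sigma, rng)
    gray = downsample(gray, factor)
    return threshold(gray, cutoff)
```

Calling simulate_scan() with a different seed for each copy of a page
would give each copy different noise, which is the situation Nick is
asking about.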

[1] http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3

