Re: Thoughts on having the training process take font files directly

Tom Morris Thu, 11 Oct 2012 11:21:03 -0700

On Wednesday, October 10, 2012 12:59:49 PM UTC-4, Nick White wrote:

> So I've been tossing an idea around in my head for a while now, and 
> I think it deserves discussion. 
>
> As I understand it, the box/tif steps basically reduce varying 
> character shapes to basic simplifications, for each font, which can 
> be quickly and smartly compared with the blobs tesseract later 
> reads. 
>
> The recommended way to create box/tif files has been through scans, 
> but this is time consuming, and not practical for languages with a 
> significant number of characters. So some of us have opted to create 
> the image and box files with a few different programs (my lazytrain, 
> jTessBoxEditor I think does, and one or two python based programs, 
> from memory). I can see that theoretically this may be inferior, as 
> it won't capture common scanning distortions of letters, but in 
> practise it seems to work well. 
>
> If we accept that as a valid way to train, it seems like a more 
> sensible idea to extract the character shape prototypes straight 
> from font files. They have the ideal shapes embedded in them, so it 
> shouldn't be particularly difficult, and would make training easier, 
> faster, and I imagine the training files would be smaller, as there 
> would only be one prototype for each character. 
>
> Is there anything I'm missing with this proposal? Does it sound 
> sensible? If so I'll open a ticket for it, and will have a stab at 
> doing it. 
>


In addition to the lack of scanning distortion/noise on the input side, 
you're also likely to be missing out on some of the more sophisticated 
stuff that font machinery does.

Some of the things that come to mind include:

   - font 'hints' which cause the glyph to be rendered differently at 
   different resolutions
   - kerning information which affects glyph placement relative to its 
   neighbors 
   - probably a bunch of other stuff that I'm not familiar with
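
To make the kerning point concrete, here is a minimal sketch of how pair 
kerning shifts the boxes you would write to a box file. Everything in it is 
an assumption for illustration: the metrics are made-up pixel values rather 
than anything read from a real font, box_lines is a hypothetical helper and 
not part of tesseract, and treating the advance width as the box width is a 
simplification. The only thing taken as given is the box file line layout 
(character, left, bottom, right, top, page, with the origin at the 
bottom-left of the image).

```python
# Hypothetical glyph metrics in pixels; real values would come from a font file.
ADVANCE = {"A": 14, "V": 14, "T": 12, "o": 10}  # horizontal advance per glyph
HEIGHT = {"A": 20, "V": 20, "T": 20, "o": 12}   # glyph height above the baseline
KERNING = {("A", "V"): -3, ("T", "o"): -2}      # adjustment for specific pairs


def box_lines(text, baseline=10, left=5, use_kerning=True, page=0):
    """Return box-file style lines: 'char left bottom right top page'.

    Coordinates use a bottom-left origin. The advance width stands in for
    the glyph's box width, which is a simplification.
    """
    lines = []
    x = left
    prev = None
    for ch in text:
        # Kerning moves the pen position based on the previous glyph.
        if use_kerning and prev is not None:
            x += KERNING.get((prev, ch), 0)
        right = x + ADVANCE[ch]
        top = baseline + HEIGHT[ch]
        lines.append(f"{ch} {x} {baseline} {right} {top} {page}")
        x = right
        prev = ch
    return lines


if __name__ == "__main__":
    print("\n".join(box_lines("AV", use_kerning=False)))
    print("\n".join(box_lines("AV", use_kerning=True)))
```

With these made-up numbers, the left edge of the "V" in "AV" moves from 19 
to 16 once kerning is applied, so its box overlaps the "A". Boxes computed 
from isolated glyph outlines would not show that overlap.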

It's probably harder to accurately extract all the necessary font 
information than it is just to rasterize the font.  Given that there are 
any number of ways to rasterize a font or text/font combination and that 
the native input to an OCR program is a raster image, what are you gaining 
by doing this?  Remember also that the goal isn't to extract "ideal" 
character shapes, but rather *representative* shapes.

Tom

