You may be able to improve accuracy by training with your own data --
make sure you follow the training instructions and use black-and-white
images at 200-300 dpi. You can't just separate a single font out of
the existing trained data, but you can specify a character subset.
Training data from older versions is not compatible with 3.00. The
config files are very similar, but not exactly alike. The difference
in accuracy between the versions may be due to compiler differences,
but it is a known issue.

You can also post-process the OCR output to improve accuracy -- look
for recurring error patterns in your data and fix them with
programmatic search and replace, etc. Commercial OCR engines use
tricks like that as well.
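
As a rough illustration of the search-and-replace idea, here is a
minimal Python sketch. The rules and the word list are made-up
examples (they are not part of Tesseract); the assumption is that you
have looked at your own OCR output and found confusions such as the
letter O for a zero inside numbers, or a few words that are misread
the same way every time.

```python
import re

# Hypothetical pattern rules: replace them with the confusions that
# actually occur in your own OCR output.
PATTERN_RULES = [
    # A letter O with a digit on each side is treated as a zero.
    (re.compile(r"(?<=\d)[Oo](?=\d)"), "0"),
    # A lowercase l or capital I with a digit on each side is treated as a one.
    (re.compile(r"(?<=\d)[lI](?=\d)"), "1"),
]

# Hypothetical whole-word fixes for words that are misread consistently.
WORD_FIXES = {
    "lnvoice": "Invoice",
    "Tota1": "Total",
}


def postprocess(text):
    """Apply the pattern rules first, then the whole-word fixes."""
    for pattern, replacement in PATTERN_RULES:
        text = pattern.sub(replacement, text)
    for wrong, right in WORD_FIXES.items():
        # \b keeps the fix from touching longer words that contain `wrong`.
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text


if __name__ == "__main__":
    print(postprocess("lnvoice 1O5 Tota1 2l7"))
```

Since all of your documents share one font and presumably one layout,
a small rule table like this is probably easier to maintain than
anything more general, but check each rule against real output so it
does not "fix" text that was already correct.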
--Sven


On Thu, Aug 11, 2011 at 4:45 PM, dustin <[email protected]> wrote:
> I'm running into nearly the same issues Philip mentioned in the post
> below (2.04 being far more accurate than 3.0, yet less stable than
> 3.0):
>
> http://groups.google.com/group/tesseract-ocr/browse_thread/thread/10466ace326a6c88/1c392c1f0fb7fd2c?lnk=gst&q=accuracy+of+3.0#1c392c1f0fb7fd2c
>
> Luckily, the scope of my OCR project is much smaller than Philip's
> sounds.  Mine involves OCR'ing documents (and then also parsing the
> results and putting them into a SQL database) that all consist of
> text in the exact same font.
>
> Would accuracy be improved by training solely on this single font?  I
> believe I know which font my documents use (or at least a font that
> very closely resembles it).  Is there a way to manually go through the
> existing language data and pull out all other fonts?
>
> Barring this, is the training data between 2.04 and 3.00 compatible?
> That is, could I simply copy over some appropriate config and/or
> data files from my 2.04 installation into my 3.00 installation and
> get comparable accuracy?
>
> I am not yet familiar with tesseract's config/data files or its
> training procedure, so please forgive me if this should be obvious...
>
> Thanks,
> Dustin
>
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>



-- 
``All that is gold does not glitter,
  not all those who wander are lost;
the old that is strong does not wither,
  deep roots are not reached by the frost.
From the ashes a fire shall be woken,
  a light from the shadows shall spring;
renewed shall be blade that was broken,
  the crownless again shall be king.”
