I looked at that but there is precious little I can use from there.

I am grabbing English text from a program (approx 8 pt bitmap font), 
writing it to an image file, and passing the image file through tesseract. 
Following the instructions I have scaled the image up. That gets me from 
about 5% accuracy (original size) to about 10-12% accuracy (scaled up). The 
characters are clear, distinct, and exactly the same every single time. I 
had expected tesseract would do a decent job of it, but this has not proven 
to be the case.
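
For anyone wanting to compare numbers like these: the percentages above are my own rough figures. Below is a minimal sketch of one way to put a number on it, assuming you have the known-correct text to compare against. The function name char_accuracy is made up for illustration, and the definition (matched characters divided by expected characters) is only one of several reasonable choices.

```python
from difflib import SequenceMatcher


def char_accuracy(expected: str, recognized: str) -> float:
    """Fraction of the expected characters that appear, in order, in the OCR output.

    This is an illustrative metric, not anything Tesseract itself reports.
    """
    if not expected:
        # Nothing to recognize: perfect only if the OCR output is empty too.
        return 1.0 if not recognized else 0.0
    # autojunk=False keeps difflib from discarding frequent characters as "junk".
    matcher = SequenceMatcher(None, expected, recognized, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(expected)
```

Calling char_accuracy(ground_truth, ocr_output) on the same sample before and after a change (scaling, training) gives comparable numbers, provided the ground truth stays the same.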

I have *once* (and only once) managed to get the training steps to work, 
completely by accident, and I have not been able to recreate that since. 
While editing the .box file for my first font I wondered if I was going 
through all this trouble for nothing, so I took what I had and ran through 
the steps to train it. I noticed a significant increase in accuracy 
(70-80%), even with an incomplete/inaccurate .tif/.box file as a basis.

Satisfied this was worth the effort I finished editing the .box file and 
proceeded to work through the steps again. I'll be damned if I can now get 
through the steps successfully. So I am stuck with a completed .tif/.box 
combination I spent a *lot* of time on that is doing nothing for me since I 
can't get tesseract to train on the blasted thing. Frustrating indeed.

Side note: Contrary to the instructions in the "Run Tesseract for Training" 
section, unicharset_extractor would crash and burn every single time for me 
unless I edited the file and replaced UnknownFont with the name of the font.
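
The hand edit described above can be scripted, roughly as follows. This is only a sketch: the function name is hypothetical, the file path and font name are whatever applies in your setup, and it assumes the font name is plain ASCII. It works on raw bytes so nothing else in the file is touched.

```python
from pathlib import Path


def replace_font_placeholder(path: str, font_name: str,
                             placeholder: str = "UnknownFont") -> int:
    """Replace every `placeholder` in the file with `font_name`.

    Returns the number of occurrences replaced. The file is rewritten
    only if at least one occurrence was found.
    """
    target = Path(path)
    data = target.read_bytes()
    old = placeholder.encode("ascii")
    new = font_name.encode("ascii")  # assumption: ASCII font name
    count = data.count(old)
    if count:
        target.write_bytes(data.replace(old, new))
    return count
```

Keep a backup of the file before running it, since the replacement is done in place.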

On Friday, January 10, 2014 6:03:03 AM UTC-4, Nick White wrote:
>
> On Thu, Jan 09, 2014 at 11:46:17AM -0800, Doug . wrote: 
> > And I am still not clear why I have to create a new "language"? I have a 
> > number of bitmap (not truetype) English fonts that Tesseract does a 
> > mediocre job on "out of the box". 
>
> How different are these fonts you're using from ordinary English 
> fonts? Unless they're substantially different you're unlikely to get 
> large gains from training for the new fonts, and your time would be 
> better spent checking the common issues at this page: 
> https://code.google.com/p/tesseract-ocr/wiki/PoorQuality 
>

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
