I looked at those instructions, and there was little in there I could do 
other than scale the image up, which raised accuracy from about 5% to 
10-15%.

Let me start this explanation over...

I am doing a screen capture of a program, processing it to remove the 
background, leaving me with just the text I am interested in as an image 
file. Writing this image out to a file and passing it through Tesseract 
using the default English language *should* give me the text. (Since the 
text in the image is approx. 8 pt, I scaled it up as per the suggestions 
before writing it to file.) The individual characters are clear, crisp, and exactly the same 
each and every time. I expected decent results "out of the box". This is 
not the case.
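
To make the scaling step concrete, here is a minimal sketch in plain Python. 
It assumes a 96 DPI screen and the roughly 300 DPI that the Tesseract 
documentation commonly suggests; both numbers, the function names, and the 
choice of nearest-neighbour scaling are assumptions for illustration, not a 
record of what was actually run. Smooth interpolation may well do better.

```python
import math


def upscale_factor(screen_dpi=96, target_dpi=300):
    """Smallest integer factor that lifts screen_dpi to at least target_dpi."""
    return math.ceil(target_dpi / screen_dpi)


def upscale_nearest(bitmap, factor):
    """Nearest-neighbour upscale of a 2D list of pixels by an integer factor."""
    scaled = []
    for row in bitmap:
        # Repeat every pixel horizontally, then repeat the widened row vertically.
        wide_row = [pixel for pixel in row for _ in range(factor)]
        scaled.extend(list(wide_row) for _ in range(factor))
    return scaled


# Hypothetical 2x2 "glyph": 1 is ink, 0 is background.
glyph = [[0, 1],
         [1, 0]]
factor = upscale_factor()
big = upscale_nearest(glyph, factor)
print(factor, len(big), len(big[0]))
```

Nearest-neighbour keeps the edges of a bitmap font hard, which is why it is 
used here; an imaging library would normally do this step on the real capture.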

I do have all the characters of the font in a single image file, which I 
thought to use as a basis for creating my own training file. Not 
surprisingly the generated .box file for this image contained a lot of 
"wrong" guesses on what letter is represented by each individual character. 
Which meant some "quality" time with jTessBoxEditor to correct the file.

Partway through this process I thought to "test" this to see if it was even 
worthwhile. I was successful in following the steps (with some alterations 
that, to my regret, I did not write down) and the results were amazing. Even 
with the only partially corrected .box file, accuracy shot up to around 
70-80%. I have since finished editing the .box file with jTessBoxEditor, so 
what is in the .box file matches what is in the source image.

And now I will be damned if I can get through the steps to create the 
training file. Several attempts later I am well and truly frustrated. I do 
recall I had to deviate from the "official" instructions to make it work, 
but not what those changes were. Which is why I asked: if I have these files 
named like this, what are the commands I have to execute to make this 
process work?
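
For reference, this is the sequence that, as far as I can tell, the 
Tesseract 3.0x training wiki describes. It is a sketch that assumes 
version 3.02 and only assembles the commands without running them. The 
names "eng" and "myfont", the exp0 suffix, and the all-zero font_properties 
flags are placeholders, not the actual file names.

```python
def font_properties_line(font, italic=0, bold=0, fixed=0, serif=0, fraktur=0):
    """One line of the font_properties file: name followed by five 0/1 flags."""
    return f"{font} {italic} {bold} {fixed} {serif} {fraktur}"


def training_commands(lang, font, exp=0):
    """Commands for one image/box pair named <lang>.<font>.exp<N>.tif/.box."""
    base = f"{lang}.{font}.exp{exp}"
    return [
        # Training-mode run: reads base.tif plus base.box, writes base.tr.
        ["tesseract", f"{base}.tif", base, "nobatch", "box.train"],
        # Collect the character set from the corrected box file.
        ["unicharset_extractor", f"{base}.box"],
        # Feature clustering; needs a font_properties file in the same directory.
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", f"{base}.tr"],
        ["cntraining", f"{base}.tr"],
        # Final step, run only after the renames listed below.
        ["combine_tessdata", f"{lang}."],
    ]


def renames(lang):
    """Outputs that need the language prefix before combine_tessdata."""
    names = ["inttemp", "normproto", "pffmtable", "shapetable"]
    return [(name, f"{lang}.{name}") for name in names]


cmds = training_commands("eng", "myfont")
for cmd in cmds:
    print(" ".join(cmd))
print(font_properties_line("myfont"))
```

The renames have to happen between cntraining and combine_tessdata, and 
shapetable may not exist on versions older than 3.02.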

On Friday, January 10, 2014 6:03:03 AM UTC-4, Nick White wrote:
>
> On Thu, Jan 09, 2014 at 11:46:17AM -0800, Doug . wrote: 
> > And I am still not clear why I have to create a new "language"? I have a 
> > number of bitmap (not truetype) English fonts that Tesseract does a 
> > mediocre job on "out of the box". 
>
> How different are these fonts you're using from ordinary English 
> fonts? Unless they're substantially different you're unlikely to get 
> large gains from training for the new fonts, and your time would be 
> better spent checking the common issues at this page: 
> https://code.google.com/p/tesseract-ocr/wiki/PoorQuality 
>
