Hi, I am using Tesseract to generate Unicode mappings for 'corrupt' font 
files. While I have complete control over rendering of the characters 
(size, positioning, colors), I am having trouble with accuracy. Mainly, 
Tesseract seems to prefer digits over letters. In particular, lower case 
'l's often get detected as vertical bars or ones. Also, Latin 'o's and 
zeros get switched around.

For example, the attached png has the text "ByJamesMorApil20" but after 
running the following code I get "ByJamesM0rApi12O" as the result from 
GetUTF8Text.

Notice that the lower case 'o' became a zero, the zero became an upper case 
'o', and lower case 'l' became a one.
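
To make the substitutions explicit, here is a minimal sketch that compares the rendered text with the OCR result position by position. The names Mismatch and FindMismatches are made up for this illustration and are not part of Tesseract. It assumes one output character per rendered character and ignores anything past the shorter string.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One position where the OCR output disagrees with the rendered text.
struct Mismatch {
    std::size_t index;  // zero-based position in the string
    char expected;      // character that was rendered
    char actual;        // character the OCR engine returned
};

// Compare the two strings position by position and collect the differences.
// Assumption: characters line up one-to-one; extra trailing characters
// (for example a trailing newline in the OCR result) are ignored.
std::vector<Mismatch> FindMismatches(const std::string& expected,
                                     const std::string& actual) {
    std::vector<Mismatch> result;
    const std::size_t n = std::min(expected.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (expected[i] != actual[i]) {
            result.push_back(Mismatch{i, expected[i], actual[i]});
        }
    }
    return result;
}
```

For the strings above, FindMismatches("ByJamesMorApil20", "ByJamesM0rApi12O") reports three positions: index 8 ('o' read as '0'), index 13 ('l' read as '1'), and index 15 ('0' read as 'O').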

    TessBaseAPI api;
    api.SetPageSegMode(PSM_SINGLE_LINE);
    api.Init("path_to_trained_files", NULL);  // NULL language defaults to "eng"
    api.SetImage((const unsigned char*)bmp, width, height, bpp, stride);
    char* text = api.GetUTF8Text();  // caller owns the returned buffer
    std::string ocr_results(text);
    delete[] text;


I have complete control over how the characters and the image are rendered 
(any size, spacing, colors, dpi), but I am still unable to get any better 
accuracy than this so far.

The only restriction is that the input characters are never going to form 
'real' words or sentences; they are just characters in random order.

I originally tried PSM_SINGLE_CHAR mode, but that caused a lot more errors, 
mainly with capitalization.

Any help on increasing accuracy would be appreciated!

Thanks


<<attachment: T3.png>>
