Image Processing. Improving OCR.

Сергей Якушевич Fri, 22 Nov 2013 00:30:38 -0800

Hello guys,
I am facing the following issue.
I asked this question on Stack Overflow but did not get a sufficient answer
(stackoverflow<http://stackoverflow.com/questions/20115382/ios-tesseract-ocr-why-recognition-is-so-pure-engine-principle>).

Here is the text:


I have a question about how Tesseract OCR works. As far as I understand, 
after shape detection the symbols (their forms) are scaled (resized) to a 
specific font size, which is determined by the trained data. 
Basically, the trained set defines the symbols (their geometry and shape), 
and maybe their representation.

I am using Tesseract 3.01 (the latest version) on the iOS platform. I checked 
the Tesseract FAQ and looked at the forum, but I do not understand why 
recognition quality is low for some images.

It is said that the font should be larger than 12pt and the image should have 
more than 300 DPI. I did all the necessary preprocessing, such as blurring 
(where needed) and contrast enhancement. I even tried the other engine in 
Tesseract OCR, which is called CUBE.
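
To make the 12pt / 300 DPI rule of thumb concrete, here is a minimal sketch of the arithmetic behind it. It only uses the fact that 1 pt = 1/72 inch. The helper names (pointsToPixels, scaleForTarget) are hypothetical and not part of Tesseract, and the measured text height is assumed to come from your own preprocessing:

```cpp
#include <cmath>

// Approximate pixel height of text set at `pointSize` points when the
// image is sampled at `dpi` dots per inch (1 pt = 1/72 inch).
double pointsToPixels(double pointSize, double dpi) {
    return pointSize * dpi / 72.0;
}

// Hypothetical helper: scale factor that brings text measured at
// `measuredPx` pixels tall to the pixel height of `targetPt` text at
// `targetDpi`. The defaults are the rule of thumb quoted above.
double scaleForTarget(double measuredPx,
                      double targetPt = 12.0,
                      double targetDpi = 300.0) {
    if (measuredPx <= 0.0) {
        return 1.0;  // nothing measured, leave the image unscaled
    }
    return pointsToPixels(targetPt, targetDpi) / measuredPx;
}
```

With these numbers, 12pt text at 300 DPI is about 50 pixels tall, so text measured at 25 pixels would need roughly a 2x upscale to reach that size.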

But for some images, even though they are large (MIN(width, height) > 1000; 
I rescale them for Tesseract), I get bad recognition results:

http://goo.gl/l9uJMe

However, on another set of images the results are better:

http://goo.gl/cwA9DC

Those images are smaller, and I do not resize them (I just convert them to grayscale).
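
For reference, the grayscale step can be sketched as below. This is an illustration rather than the exact code I run: it assumes an interleaved 8-bit RGBA buffer and the common Rec. 601 luma weights (0.299, 0.587, 0.114), and the function names are hypothetical:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Luma of one pixel using Rec. 601 weights, in integer arithmetic
// with rounding to the nearest value.
std::uint8_t luma(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return static_cast<std::uint8_t>(
        (299u * r + 587u * g + 114u * b + 500u) / 1000u);
}

// Convert an interleaved RGBA buffer (4 bytes per pixel, alpha ignored)
// to a single-channel grayscale buffer (1 byte per pixel).
std::vector<std::uint8_t> rgbaToGray(const std::vector<std::uint8_t>& rgba) {
    std::vector<std::uint8_t> gray;
    gray.reserve(rgba.size() / 4);
    for (std::size_t i = 0; i + 3 < rgba.size(); i += 4) {
        gray.push_back(luma(rgba[i], rgba[i + 1], rgba[i + 2]));
    }
    return gray;
}
```

The resulting buffer has one byte per pixel, so it would be handed to the OCR engine with 1 byte per pixel and a row stride equal to the image width.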

If what I wrote about the engine is correct, suppose the trained set is based 
on a font of size 14pt. Symbols from the pictures are resized to that specific 
size, so I do not see any reason why they would not be recognised.

I also tried custom dictionaries to penalise non-dictionary words, but this 
did not improve recognition much.

tesseract = new tesseract::TessBaseAPI();
// Set user_words_suffix at init time to point Tesseract at the custom word list.
GenericVector<STRING> variables_name(1), variables_value(1);
variables_name.push_back("user_words_suffix");
variables_value.push_back("user-words");
int retVal = tesseract->Init(
    [self.tesseractDataPath cStringUsingEncoding:NSUTF8StringEncoding],
    NULL, tesseract::OEM_TESSERACT_ONLY, NULL, 0,
    &variables_name, &variables_value, false);
// Combine the checks with &= so that any single failure is reported.
bool ok = (retVal == 0);
ok &= tesseract->SetVariable("language_model_penalty_non_dict_word", "0.2");
ok &= tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "0.2");
if (!ok) {
    NSLog(@"Error initializing tesseract!");
}

So my question is: should I train Tesseract on another font?

And, honestly speaking, why should I train it? With the default trained data I 
get good recognition on text from the Internet or from a PC (Mac) screen.

I also checked the original Tesseract English trained data. It has 38 tiff 
files that belong to the following families: 1) arial 2) verdana 3) trebuc 
4) times 5) georgia 6) cour

It seems that the font in my images does not belong to this set.

What is wrong with the images? Is it because the text is placed on bottles 
(non-horizontal), or something else?
