Hello guys, I am facing the following issue. I asked the question on Stack Overflow but did not get a sufficient answer (http://stackoverflow.com/questions/20115382/ios-tesseract-ocr-why-recognition-is-so-pure-engine-principle). Here is the text:
I have a question about how the Tesseract OCR engine works. As far as I understand, after shape detection the symbols (their forms) are scaled (resized) to some specific font size, and that font size is based on the trained data. Basically, the training set defines the symbols (their geometry and shape), and maybe their representation.

I am using Tesseract 3.01 (the latest version) on the iOS platform. I checked the Tesseract FAQ and looked at the forum, but I do not understand why recognition quality is low for some images. It is said that the font should be bigger than 12pt and the image should have more than 300 DPI. I did all the necessary preprocessing, such as blurring (where needed) and contrast enhancement. I even used the other engine in Tesseract OCR, called CUBE.

But for some images I get bad recognition results, even though they are large (MIN(width, height) > 1000; I rescale them for Tesseract): http://goo.gl/l9uJMe

However, on another set of images the results are better: http://goo.gl/cwA9DC
Those images are smaller and I do not resize them (I just convert them to grayscale).

If what I wrote about the engine is correct, suppose the training set is based on a 14pt font. Symbols from the pictures are resized to some specific size, so I do not see any reason why they are not recognised in that case.

I also tried custom dictionaries to penalise non-dictionary words, but that did not improve recognition much:
    tesseract = new tesseract::TessBaseAPI();

    // Pass the custom word list through the user_words_suffix init parameter.
    GenericVector<STRING> variables_name(1), variables_value(1);
    variables_name.push_back("user_words_suffix");
    variables_value.push_back("user-words");

    int retVal = tesseract->Init([self.tesseractDataPath cStringUsingEncoding:NSUTF8StringEncoding],
                                 NULL, tesseract::OEM_TESSERACT_ONLY, NULL, 0,
                                 &variables_name, &variables_value, false);

    // Assumes `ok` is a bool declared earlier and initialised to true.
    // &= (rather than |=) makes a single failure enough to trigger the error below.
    ok &= (retVal == 0);
    ok &= tesseract->SetVariable("language_model_penalty_non_dict_word", "0.2");
    ok &= tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "0.2");
    if (!ok) {
        NSLog(@"Error initializing tesseract!");
    }

So my question is: should I train Tesseract on another font? And, honestly speaking, why should I train it? With the default trained data I get good recognition on text from the Internet or from a PC (Mac) screen.

I also checked the original Tesseract English training data. It has 38 TIFF files that belong to the following font families:
1) Arial
2) verdana
3) trebuc
4) times
5) georgia
6) cour

It seems that the font in my images does not belong to this set.

What is wrong with the images? Is it because the text is placed on bottles (non-horizontal), or something else?

