You might want to look at the dewarp process in Leptonica, which is used to correct text that is warped on a copied page.
On Friday, November 22, 2013 2:58:21 AM UTC-5, Сергей Якушевич wrote:
>
> Hello guys,
> I am facing the following issue.
> I asked the question on Stack Overflow, but did not get a sufficient answer
> (stackoverflow <http://stackoverflow.com/questions/20115382/ios-tesseract-ocr-why-recognition-is-so-pure-engine-principle>).
>
> Here is the text:
>
> I have a question about how the Tesseract OCR engine works. As far as I
> understand, after shape detection, symbols (their forms) are scaled
> (resized) to some specific font size. That font size is based on the
> trained data. Basically, the trained set defines the symbols (their
> geometry and shape), and maybe their representation.
>
> I am using Tesseract 3.01 (the latest version) on the iOS platform. I
> checked the Tesseract FAQ and looked at the forum, but I do not understand
> why recognition quality is low for some images.
>
> It is said that the font should be bigger than 12pt and the image should
> have more than 300 DPI. I did all the necessary preprocessing, such as
> blurring (where needed) and contrast enhancement. I even used the other
> engine in Tesseract OCR, which is called CUBE.
>
> But for some images (even though they are large, with MIN(width, height)
> > 1000, and I rescale them for Tesseract), I get bad recognition results:
>
> http://goo.gl/l9uJMe
>
> However, on another set of images the results are better:
>
> http://goo.gl/cwA9DC
>
> Those images are smaller, and I do not resize them (I just convert them to
> grayscale).
>
> Suppose what I wrote about the engine is correct and the trained set is
> based on a 14pt font. Symbols from the pictures are resized to some
> specific size, so I do not see any reason why they are not recognised in
> that case.
>
> I also tried custom dictionaries, to penalise non-dictionary words. It did
> not give much benefit to recognition.
>
> tesseract = new tesseract::TessBaseAPI();
> // Pass the user-words file suffix as an init-time variable.
> GenericVector<STRING> variables_name(1), variables_value(1);
> variables_name.push_back("user_words_suffix");
> variables_value.push_back("user-words");
> int retVal = tesseract->Init([self.tesseractDataPath cStringUsingEncoding:NSUTF8StringEncoding],
>                              NULL, tesseract::OEM_TESSERACT_ONLY, NULL, 0,
>                              &variables_name, &variables_value, false);
> ok |= retVal == 0;
> // Penalise words that are not in the dictionary.
> ok |= tesseract->SetVariable("language_model_penalty_non_dict_word", "0.2");
> ok |= tesseract->SetVariable("language_model_penalty_non_freq_dict_word", "0.2");
> if (!ok) {
>     NSLog(@"Error initializing tesseract!");
> }
>
> So my question is: should I train Tesseract on another font?
>
> And, honestly speaking, why should I train it? With the default trained
> data I get good recognition on text from the Internet or from a PC (Mac)
> screen.
>
> I also checked the original Tesseract English trained data. It has 38 TIFF
> files that belong to the following families: 1) arial 2) verdana 3) trebuc
> 4) times 5) georgia 6) cour
>
> It seems that the font in the images does not belong to this set.
>
> What is wrong with the images? Is it because the text is placed on bottles
> (non-horizontal), or something else?
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

