I am new to OCR, but so far I have achieved some good results: I am able to extract text from images fairly well. What concerns me now is the time required for the processing. I am using tesseract 3.02.02 and leptonica-1.71. My script does the following:
1. Receive a 2 MB JPEG image from a URL.
2. Resize the image to a width of 1000 pixels, with the height scaled proportionally to the new width.
3. Convert the resized image to greyscale. NOTE: my image is now only 60 KB.
4. Create 4 copies of the grey image and apply one of 4 default PIL filters to each: 'SHARPEN', 'SMOOTH', 'UnsharpMask'(radius=2, percent=150, threshold=3), 'AutoContrast'.
5. Binarize each filtered image like this:
   image = image.point(lambda x: 0 if x<128 else 255, '1')
   (see "Convert RGB to black OR white", <http://stackoverflow.com/questions/18777873/convert-rgb-to-black-or-white>)
6. Pass the images one by one to OCR with: text = pytesseract.image_to_string(image)
7. Finally, do some text cleanup to verify valid tokens and apply some forced replacements.

What is taking so long? Where can I improve or speed things up a little? It takes about 10 seconds to run the whole script and show the results.
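One way to see where the 10 seconds go would be to time each stage separately. Below is a minimal sketch of steps 2 to 6 with timing around each stage. It assumes Pillow and pytesseract are installed and that the image has already been saved to a local file (the download in step 1 is left out). The names timed, make_variants and profile_pipeline are made up for this example and are not part of either library.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Add the wall-clock time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start


def make_variants(grey):
    """Return the four filtered, binarized copies from steps 4 and 5."""
    from PIL import ImageFilter, ImageOps  # Pillow, imported lazily

    filtered = {
        "sharpen": grey.filter(ImageFilter.SHARPEN),
        "smooth": grey.filter(ImageFilter.SMOOTH),
        "unsharp": grey.filter(
            ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3)
        ),
        "autocontrast": ImageOps.autocontrast(grey),
    }
    # Same fixed threshold at 128 as in step 5.
    return {
        name: img.point(lambda x: 0 if x < 128 else 255, "1")
        for name, img in filtered.items()
    }


def profile_pipeline(path, width=1000):
    """Run steps 2 to 6 on a local file and return (texts, timings)."""
    from PIL import Image
    import pytesseract

    timings = {}
    with timed("load+resize+grey", timings):
        img = Image.open(path)
        height = round(img.height * width / img.width)
        grey = img.resize((width, height)).convert("L")
    with timed("filters+binarize", timings):
        variants = make_variants(grey)
    texts = {}
    for name, variant in variants.items():
        # One separate OCR pass per filtered copy, as in step 6.
        with timed("ocr:" + name, timings):
            texts[name] = pytesseract.image_to_string(variant)
    return texts, timings
```

Calling texts, timings = profile_pipeline("page.jpg") (the file name is a placeholder) and printing timings would show how the total splits between image preparation and the four OCR calls. If the "ocr:" entries turn out to dominate, reducing the number of filtered copies sent to OCR would probably help more than tuning the filters.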

