I am new to OCR, but so far I have achieved some good results. I am able to 
extract text from images fairly well. What concerns me now is the time 
required for the processing. I am using tesseract 3.02.02 and 
leptonica-1.71. My script does the following:

   1. receive a 2 MB JPEG image from a URL
   2. resize the image to a width of 1000 pixels, with the height scaled 
   in proportion to the new width
   3. convert the resized image to greyscale. NOTE: the image is now 
   only 60 KB
   4. create 4 copies of the grey image and apply one of 4 default PIL 
   filters to each: 'SHARPEN', 'SMOOTH', 
   'UnsharpMask' (radius=2, percent=150, threshold=3), 'AutoContrast'
   5. binarize each filtered image like this: 
   image = image.point(lambda x: 0 if x<128 else 255, '1') 
   (see "Convert RGB to black OR white", 
   <http://stackoverflow.com/questions/18777873/convert-rgb-to-black-or-white>)
   6. pass the images one by one to OCR with: 
   text = pytesseract.image_to_string(image)
   7. finally, do some text cleanup to verify valid tokens and apply 
   some forced replacements
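
For reference, steps 2 to 6 could be sketched roughly as below. This is only 
an illustration under assumptions, not the actual script: the function names 
(resize_to_width, make_variants, binarize, preprocess, ocr_all) are made up 
for the example, and the download (step 1) and text cleanup (step 7) are 
left out.

```python
from PIL import Image, ImageFilter, ImageOps

TARGET_WIDTH = 1000  # width used in step 2


def resize_to_width(image, width=TARGET_WIDTH):
    """Resize to the given width, keeping the aspect ratio (step 2)."""
    old_width, old_height = image.size
    height = max(1, round(old_height * width / old_width))
    return image.resize((width, height))


def make_variants(grey):
    """Return the four filtered copies of the greyscale image (step 4)."""
    return {
        "sharpen": grey.filter(ImageFilter.SHARPEN),
        "smooth": grey.filter(ImageFilter.SMOOTH),
        "unsharp": grey.filter(
            ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3)
        ),
        "autocontrast": ImageOps.autocontrast(grey),
    }


def binarize(image, threshold=128):
    """Map every pixel to pure black or pure white (step 5)."""
    return image.point(lambda x: 0 if x < threshold else 255, "1")


def preprocess(image):
    """Steps 2 to 5: resize, convert to greyscale, filter, binarize."""
    grey = resize_to_width(image).convert("L")
    return {name: binarize(img) for name, img in make_variants(grey).items()}


def ocr_all(variants):
    """Step 6: run Tesseract once per variant, four calls per input image."""
    # Imported here so the preprocessing can be tried without Tesseract.
    import pytesseract

    return {
        name: pytesseract.image_to_string(img)
        for name, img in variants.items()
    }
```

Written this way, it is easy to see that every input image results in four 
separate calls to pytesseract.image_to_string.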

What is taking so long? Where can I improve or speed things up a little? The 
whole script takes 10 seconds to run and show results.
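
One way to find out where the 10 seconds go is to time each stage 
separately. The snippet below is a minimal sketch using only the standard 
library; the helper names (timed, report) and the stage labels are 
assumptions made for the example, and the sleep call only stands in for a 
real stage such as the download or the OCR call.

```python
import time
from contextlib import contextmanager

# Accumulated wall-clock seconds per stage label.
timings = {}


@contextmanager
def timed(label):
    """Add the time spent inside the with-block to timings[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


def report():
    """Return (label, seconds) pairs, slowest stage first."""
    return sorted(timings.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    with timed("ocr"):
        time.sleep(0.05)  # stand-in for pytesseract.image_to_string(image)
    for label, seconds in report():
        print("%-12s %.3f s" % (label, seconds))
```

Wrapping the download, the filtering and each of the four OCR calls in 
their own timed(...) block should show which stage dominates the total.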
