[tesseract-ocr] Tesseract is ignoring numbers

Anish Radhakrishnan Nair Tue, 02 Jun 2015 23:11:03 -0700


I have to read text from screenshots of speed test results and extract the 
upload and download speeds from them. Most of the images I have tested are 
of very high quality, and I have binarized them and corrected skew where 
necessary, but the results are still only around 60% accurate. The 
biggest issue is that, even after preprocessing, some images in which the 
numbers are clearly distinguishable are not read well. As an example, I have 
attached a test image after preprocessing, along with the result of 
Tesseract performing OCR on it.


<https://lh3.googleusercontent.com/-kCTaPk5xzeE/VW6Wk2pyN7I/AAAAAAAAAIQ/D7z6oyM3igA/s1600/bwResult.png>

The result of performing OCR on this picture, as a single line, is:
000003 4G 15:41 4 83% - / OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps 
Mbps L,» SHARE ‘ ”‘ “\ ‘ 5M I” 1°“ \\ I 2M 20M ‘ I I 1M 0M | , ‘ ‘ ‘ 
1,3,!Ht‘u‘z‘gssz‘:}::;\ ..;~,-. ~‘ ‘ ' 'mmW" 50 ,

Note how the Mbps shows up but the number is completely ignored. How do I 
improve this result?
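
To make the failure concrete, here is a minimal sketch of the extraction step, 
assuming the speeds are pulled out of the OCR text with a regular expression 
that looks for a number directly followed by "Mbps". The names `SPEED_PATTERN` 
and `extract_speeds` are illustrative assumptions and not part of Tesseract; 
the second sample string is a hypothetical well-read result, not real output.

```python
import re

# Illustrative pattern (an assumption): a decimal number followed by the "Mbps" unit.
SPEED_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*Mbps", re.IGNORECASE)


def extract_speeds(ocr_text):
    """Return every number that directly precedes an 'Mbps' token, as floats."""
    return [float(value) for value in SPEED_PATTERN.findall(ocr_text)]


# Fragment of the OCR output quoted above: both "Mbps" tokens have no number.
actual = "OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps Mbps"
print(extract_speeds(actual))

# Hypothetical text in which the digits were recognized.
hoped_for = "DOWNLOAD 12.34 Mbps UPLOAD 5.6 Mbps"
print(extract_speeds(hoped_for))
```

On the fragment of the real output nothing is extracted, because the only 
number in it (49) belongs to the "ms" ping value and no digits precede either 
"Mbps".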

