I have to read text from screenshots of speed test results and extract the upload and download speeds. Most of the images I have tested are of very high quality, and I binarize them and correct skew where necessary, but accuracy is still only around 60%. The biggest issue is that some images are read poorly even though the numbers are clearly distinguishable after preprocessing. As an example, I have attached a test image after preprocessing, along with the result of Tesseract performing OCR on it.
<https://lh3.googleusercontent.com/-kCTaPk5xzeE/VW6Wk2pyN7I/AAAAAAAAAIQ/D7z6oyM3igA/s1600/bwResult.png>

The result of running OCR on this picture, as a single line, is:

000003 4G 15:41 4 83% - / OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps Mbps L,» SHARE ‘ ”‘ “\ ‘ 5M I” 1°“ \\ I 2M 20M ‘ I I 1M 0M | , ‘ ‘ ‘ 1,3,!Ht‘u‘z‘gssz‘:}::;\ ..;~,-. ~‘ ‘ ' 'mmW" 50 ,

Note how "Mbps" shows up but the numbers themselves are completely ignored. How do I improve this result?
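For context, here is a minimal sketch of the extraction step I want to run on the OCR text once the numbers are actually recognized. The helper name `extract_speeds` and the regular expression are only illustrative assumptions, not part of Tesseract. It also assumes each speed is printed immediately before its "Mbps" unit, with download first and upload second.

```python
import re

# Illustrative pattern (an assumption): a number, optional decimals,
# optional whitespace, then the unit "Mbps".
MBPS_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*Mbps", re.IGNORECASE)


def extract_speeds(ocr_text):
    """Return (download, upload) in Mbps; a value is None if it was not found.

    Assumes the download figure appears before the upload figure in the text.
    """
    values = [float(v) for v in MBPS_PATTERN.findall(ocr_text)]
    download = values[0] if len(values) > 0 else None
    upload = values[1] if len(values) > 1 else None
    return download, upload


# With the OCR output above, the units are present but the digits are not,
# so neither value can be recovered.
print(extract_speeds("OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps Mbps"))
```

On the output above this finds nothing, because the digits in front of "Mbps" are missing, so the problem is in the recognition step rather than in the parsing.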

