This is not really an answer, but I would experiment with a higher-resolution image, and maybe also with masking the image using GraphicsMagick before running OCR. The mask would cover the 'ms', the first 'Mbps', and the second 'Mbps'. Good luck!
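To make the masking idea a bit more concrete, here is a rough Python sketch that only builds a `gm convert` command line: it paints filled rectangles over the unit labels and then enlarges the image. The helper name `build_gm_command`, the file name `masked.png`, and all pixel coordinates are made up for illustration; you would have to measure the real label positions in your own screenshots. It also assumes the binarized image has dark text on a white background, so that white boxes blend in.

```python
import shlex


def build_gm_command(src, dst, mask_boxes, scale_percent=200):
    """Build (but do not run) a GraphicsMagick command that masks and upscales.

    mask_boxes is a list of (x0, y0, x1, y1) rectangles in the pixel
    coordinates of the ORIGINAL image. The rectangles are drawn before
    the resize so those coordinates stay valid.
    """
    cmd = ["gm", "convert", src, "-fill", "white"]
    for x0, y0, x1, y1 in mask_boxes:
        # One filled rectangle per label to hide ('ms', 'Mbps', 'Mbps').
        cmd += ["-draw", f"rectangle {x0},{y0} {x1},{y1}"]
    # Enlarge afterwards to give Tesseract bigger glyphs to work with.
    cmd += ["-resize", f"{scale_percent}%", dst]
    return cmd


# Hypothetical boxes for the three labels; replace with measured values.
boxes = [(60, 330, 110, 360), (250, 330, 330, 360), (450, 330, 530, 360)]
cmd = build_gm_command("bwResult.png", "masked.png", boxes)

# Show the shell command; to actually run it (GraphicsMagick must be
# installed), use: subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Running Tesseract on the masked, enlarged output and comparing it with the result on the untouched image should show fairly quickly whether the labels are what is confusing the layout analysis.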
On Wednesday, June 3, 2015 at 2:10:07 AM UTC-4, Anish Radhakrishnan Nair wrote:
>
> I have to read text from screenshots of speed test results and extract the
> upload and download speeds from them. Most of the images I have tested have
> been of very high quality and I have binarized and also corrected skew if
> necessary, but the results are still only at around 60% accuracy. The
> biggest issue is that after preprocessing some images in which the numbers
> are very clearly distinguishable are not read well. As an example, I have
> attached a test image after preprocessing, and the result of Tesseract
> performing OCR on it.
>
> <https://lh3.googleusercontent.com/-kCTaPk5xzeE/VW6Wk2pyN7I/AAAAAAAAAIQ/D7z6oyM3igA/s1600/bwResult.png>
>
> The result I have received after performing OCR on this picture, in a
> single line is-
> 000003 4G 15:41 4 83% - / OOKLA SPEEDTEST PWG DOWNLOAD UPLOAD 49 ms Mbps
> Mbps L,» SHARE ‘ ”‘ “\ ‘ 5M I” 1°“ \\ I 2M 20M ‘ I I 1M 0M | , ‘ ‘ ‘
> 1,3,!Ht‘u‘z‘gssz‘:}::;\ ..;~,-. ~‘ ‘ ' 'mmW" 50 ,
>
> Note how the Mbps shows up but the number is completely ignored. How do I
> improve this result?

