On Mon, Aug 19, 2013 at 10:12 PM, Sumana Harihareswara <suma...@wikimedia.org> wrote:

> Is there a central list of the problems that OCR software (especially
> open source OCR software) has with text written in Indic languages? If
> so, I could help encourage people to fix those problems, as volunteers,
> via a Google Summer of Code/Outreach Program for Women internship, via a
> grant-funded project (such as https://meta.wikimedia.org/wiki/Grants:IEG ),
> or via some other method.

<http://www.google-melange.com/gsoc/org/google/gsoc2013/ankur_india> shows that two of the projects being undertaken in this iteration of GSoC pertain to OCR and IR (information retrieval). Additionally, those who want to keep up with progress in this space should make sure they are in touch with the group organizing <http://www.isical.ac.in/~fire/>.

Over the past decade I've heard many esteemed research organizations in India talk about how they have OCR systems which are 80-88% accurate. At a large scale, that accuracy is effectively worthless. Add to this the fact that none of the code bases of those systems are in the public domain (even where the original research was done with public funds), which in turn rules out any attempt to validate the accuracy claims or to undertake iterative improvement.

<http://www.amazon.com/Guide-OCR-Indic-Scripts-Recognition/dp/1848003293> : "Guide to OCR for Indic Scripts: Document Recognition and Retrieval" (Advances in Computer Vision and Pattern Recognition) was published in 2009, but it does a good job of summing up the problems in the OCR space pertaining to Indic scripts, as well as the (then) state of the art.

OCR and IR are very interesting to talk about (also, great ideas to raise funds for!). I've rarely seen a serious attempt to take the challenges head on (barring Debayan's attempt with Tesseract).

/s

--
sankarshan mukhopadhyay
<https://twitter.com/#!/sankarshan>