We sort of use IA's data already, because many Wikisource texts are OCR'ed on IA. If we manage to feed OCR improvements back into the DjVu files, it shouldn't be too difficult to re-upload those DjVus to their items, and then IA could do whatever they want with them.

> OCRs generally work by finding lines of text on a page, splitting the
> lines into letters, then recognizing each letter separately. So, an OCR
> would know, for each letter of the recognized text, what its bounding
> box on the page is.
>
> However, to my knowledge there is not a single OCR that exports this
> data, nor is there a standard format for it. If an open source OCR
> could be modified to do this, then it would be easy to inject data
> retrieved from captchas back into OCR-ed text. And it could be used
> for so much more :)

I don't understand: what data are you talking about? DjVu is an open format and can store character mappings; that is what the wikicaptcha proof of concept is based on. There's also hOCR (https://en.wikipedia.org/wiki/HOCR), and IA uses a proprietary ABBYY XML format which AFAIK can be read and converted to hOCR.
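To illustrate the point that per-word coordinates are already exported in an open format: hOCR is just (X)HTML where each recognized word is a span with class "ocrx_word" and a title attribute carrying "bbox x0 y0 x1 y1". A minimal stdlib-only sketch (the sample fragment below is made up for illustration, not taken from a real scan):

```python
# Minimal sketch: pull word bounding boxes out of hOCR markup.
# hOCR words are spans like:
#   <span class="ocrx_word" title="bbox x0 y0 x1 y1; ...">word</span>
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span."""
    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title", ""))
            if m:
                self._bbox = tuple(int(n) for n in m.groups())

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

# Hypothetical hOCR fragment, just to show the shape of the data:
sample = '''
<span class="ocrx_word" title="bbox 36 92 165 131; x_wconf 92">Wikisource</span>
<span class="ocrx_word" title="bbox 171 92 240 131; x_wconf 88">texts</span>
'''

parser = HocrWords()
parser.feed(sample)
for text, bbox in parser.words:
    print(text, bbox)
```

The same kind of (character, bounding box) mapping is what DjVu's hidden text layer stores, so converting between the two is mostly a serialization exercise.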

The real problem is the character training data that could be used for subsequent OCR runs. I doubt we can do much here, because everyone uses ABBYY, and even tesseract users don't seem to share such data in any way.

Nemo

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
