* Andrea Zanni wrote:
>At the moment, Wikisource could be an interesting corpus and laboratory for
>improving and enhancing OCR,
>as the OCR-generated text is always proofread and corrected by humans.
>As part of our project (
>http://wikisource.org/wiki/Wikisource_vision_development), Micru was
>looking for a GSoC candidate for studying the reinsertion of proofread text
>into DjVu files [1], but at the moment didn't find any interested student. We
>have some contacts with people at Google working on Tesseract, and they
>were available for mentoring.
>[1] We thought about this both for OCR enhancement purposes and for updating
>files on Commons and Internet Archive (which is off topic here).

I built various tools that could be fairly easily adapted for this; my notes
are available at
<http://www.google.com/search?q=site:lists.w3.org+intitle:hoehrmann+ocr>.
One of the tools, for instance, is a diff tool; see the image at
<http://lists.w3.org/Archives/Public/www-archive/2012Apr/0031>.
--
Björn Höhrmann · mailto:bjo...@hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/