Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we have founded a user group [1] to promote international coordination and partnerships with fellow organizations. We have thousands of high-quality, volunteer-proofread pages [2] matched to scans in about 50 different languages [3]. Our editing interface for a single page looks like this [4]; the same work can also be viewed as an "index" [5] or as text with all pages together [6]. There are several verification levels. The most important are "yellow", which means that one contributor has proofread the page, and "green", which means that a second person has verified the proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we discussed technical and social issues from several Wikisource language communities. One of the most serious issues was raised by the Belarusian community, which uses two different scripts with no commercial OCR support. This means that the volunteers have to type each word manually. We wondered whether it would be possible to train Tesseract to recognize these old texts using the text that has already been typed.

We would like to know if you would be interested in exploring collaboration possibilities. I imagine that with your guidance we could prepare training data not only in different languages, but also from different time periods, scripts, etc. At the moment it is not very clear to us how to achieve this.

Please let us know if you would like to have a hangout/Skype conversation any day next week.

Cheers,
Micru

[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4] https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf/2
[5] https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf
[6] https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs

