Dear Tesseracters,

At Wikisource, the free digital library and sister project of Wikipedia, we
have founded a user group [1] to promote international coordination and
partnerships with fellow organizations. We have thousands of high-quality,
volunteer-proofread pages [2] matched to scans in ca. 50 different
languages [3]. The editing interface for a single page looks like this
[4]; the same work can also be viewed as an "index" [5] or as running text
with all pages together [6]. There are several verification levels. The most
important are "yellow", which means that one contributor has proofread the
page, and "green", which means that a second person has verified the
proofread text.

This past weekend at Wikimania '14 in London we had a meeting where we
discussed technical and social issues from several Wikisource language
communities. One of the most serious issues was raised by the Belarusian
community, which uses two different scripts with no commercial OCR support.
This means that the volunteers have to type each word manually. We wondered
whether it would be possible to train Tesseract to recognize these old texts
using the text that has already been typed.

We would like to know if you would be interested in exploring collaboration
possibilities. I imagine that with your guidance we could prepare training
data not only in different languages, but also from different time periods,
scripts, etc. At the moment it is not clear to us how to achieve this.

Please let us know if you would like to have a Hangout/Skype conversation
any day next week.

Cheers,
Micru


[1] https://meta.wikimedia.org/wiki/Wikisource_Community_User_Group
[2] https://wikisource.org/wiki/Wikisource:ProofreadPage_Statistics
[3] http://stats.wikimedia.org/wikisource/EN/Sitemap.htm
[4] https://en.wikisource.org/wiki/Page%3ATyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf/2
[5] https://en.wikisource.org/wiki/Index:Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs.pdf
[6] https://en.wikisource.org/wiki/Tyrannosaurus_and_Other_Cretaceous_Carnivorous_Dinosaurs
