Hi,

I don't know where things stand with OCR for non-Latin scripts, so maybe this is no longer relevant. The last time I looked into it, the Google service had limitations that were a problem, notably for Indic languages. Well, yesterday we had a contribution day around Alsatian and Franconian dialects <https://fr.wikipedia.org/wiki/Discussion_Projet:Alsace#Journ.C3.A9e_contributive_alsacien.2Ffrancique_20_avril_2016> where I had the opportunity to talk with some linguists. One of them told me that Google in fact uses Tesseract <https://github.com/tesseract-ocr>, which is open source, for its OCR service. According to what she told me (or at least what I remember of it), it relies on a training mechanism that is not tied to any one script: you define the mapping between image samples and characters, and it goes from there. Looking quickly at the langdata repository, I see that there is material for Devanagari, which I believe is a script used for at least part of the Indic texts, isn't it?

Hope this helps,
mathieu
_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
