[Wikisource-l] Tesseract Open Source OCR Engine

mathieu stumpf guntz Wed, 20 Apr 2016 22:23:53 -0700

Hi,

I don't know where things stand with OCR for non-Latin scripts, so maybe this is not relevant anymore. Last time I looked into it, the Google service had limitations that were a problem, notably for Indic languages. Yesterday we had a contribution day around Alsatian and Franconian dialects <https://fr.wikipedia.org/wiki/Discussion_Projet:Alsace#Journ.C3.A9e_contributive_alsacien.2Ffrancique_20_avril_2016> where I had the opportunity to talk with some linguists. One of them told me that Google in fact uses Tesseract <https://github.com/tesseract-ocr>, which is open source, for its OCR service. According to what she told me (or at least what I remember of it), it relies on a training mechanism that works across scripts: you define the matching between picture samples and characters, and off it goes. Looking quickly at the langdata repository, I see there is material for Devanagari, which I believe is a script used for at least some Indic texts, isn't it?
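
To make the "matching between picture sample and character" idea a bit more concrete: as far as I understand, Tesseract training data pairs each page image with a "box file", a plain-text file where each line gives a glyph followed by its bounding box (left, bottom, right, top, with the origin at the bottom-left of the image) and a page number. Below is a minimal Python sketch that only parses such lines. The `Box` type, the `parse_box_line` helper and the sample coordinates are my own invention for illustration; they are not part of Tesseract or of the langdata repository.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """One glyph-to-region mapping, as found on a single box-file line."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse a line shaped like '<glyph> <left> <bottom> <right> <top> <page>'.

    Hypothetical helper: splitting from the right keeps the glyph field
    intact even when it consists of several code points.
    """
    glyph, *nums = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in nums)
    return Box(glyph, left, bottom, right, top, page)


# Made-up sample: two Devanagari letters with invented pixel coordinates.
sample = "क 36 92 53 122 0\nम 55 92 74 122 0\n"
boxes = [parse_box_line(line) for line in sample.splitlines() if line.strip()]

for box in boxes:
    width = box.right - box.left
    height = box.top - box.bottom
    print(f"{box.glyph}: {width}x{height} px on page {box.page}")
```

The real training workflow involves more steps than this; the sketch only shows what the picture-to-character matching could look like as data.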


Hope that may help,
mathieu

_______________________________________________
Wikisource-l mailing list
Wikisource-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikisource-l
