Hi Yizhen,

On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote:
> I am working on a volunteer project to digitize the Sutra and all related
> materials, most of them in Tibetan.

Sounds like a great project :)

> Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new
> on both OCR and Tesseract and the only programming language I know is R.) I
> have no idea how to get started, training Tesseract for a new language?

Are you sure Tesseract doesn't already support the Tibetan language you
need? I know almost nothing about Tibetan, but the langdata[0] repository
(which is used to build the official training files) contains a
Tibetan.unicharset file, which suggests it probably does have support.
Look in the tessdata repository[1] for the ISO 639 code of the
language(s) you're interested in. I quickly compared the ISO 639 codes
from this Wikipedia page[2] with tessdata, and bod (Lhasa Tibetan) is the
only one I see available there. But maybe it's the language you want
anyway?

> And what if the image contains both Chinese and Tibetan? Please
> give me some hints.

Tesseract can be told to expect multiple languages in an image, using a
plus in the language argument (e.g. '-l eng+spa').

Hope that's helpful.

Nick

0. https://github.com/tesseract-ocr/langdata
1. https://github.com/tesseract-ocr/tessdata
2. https://en.wikipedia.org/wiki/Central_Tibetan_language
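
P.S. In case it helps to see the multi-language argument written out, here
is a rough Python sketch that joins language codes with a plus and passes
them to the tesseract command line via '-l'. The file names (page.png,
page) and the helper names are made up for illustration, and I am assuming
that 'bod' plus 'chi_sim' (Simplified Chinese; 'chi_tra' would be
Traditional) are the traineddata files you have installed. I haven't tried
this on Tibetan text, so treat it as a starting point only.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, languages):
    """Return the argument list for a multi-language Tesseract run.

    Language codes are joined with '+', e.g. ['bod', 'chi_sim'] -> 'bod+chi_sim'.
    """
    if not languages:
        raise ValueError("at least one language code is required")
    return ["tesseract", image_path, output_base, "-l", "+".join(languages)]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract; the recognised text is written to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("the tesseract executable was not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, languages),
        check=True,  # raise if tesseract exits with an error
    )


if __name__ == "__main__":
    # Hypothetical file names; only prints the command, does not run it.
    command = build_tesseract_command("page.png", "page", ["bod", "chi_sim"])
    print(" ".join(command))
```

Calling run_ocr("page.png", "page", ["bod", "chi_sim"]) would then run the
same command for real, provided both language files are in your tessdata
directory.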

