Hi Yizhen,

On Tue, Nov 24, 2015 at 07:08:24PM -0800, Yizhen Hai wrote:
> I am working on a volunteer project to digitize the Sutra and all related
> materials, most of them in Tibetan.

Sounds like a great project :)

> Therefore, I wonder how I can get help to use Tesseract for Tibetan. (I am new
> on both OCR and Tesseract and the only programming language I know is R.) I
> have no idea how to get started, training Tesseract for a new language?

Are you sure Tesseract doesn't already support the Tibetan language 
you need? I know almost nothing about Tibetan, but the langdata[0] 
repository (which is used to build the official training files) 
contains a Tibetan.unicharset file, which suggests it probably does 
have support. Look in the tessdata repository[1] for the ISO 639 
code of the language(s) you're interested in.

I quickly compared the ISO 639 codes from this Wikipedia page[2] 
with the tessdata, and bod (Lhasa Tibetan) is the only one I see 
available there. But maybe it's the language you want anyway?

> And what if the image contains both Chinese and Tibetan? Please 
> give me some hints.

Tesseract can be told to expect multiple languages in an image by 
joining the language codes with a plus in the language argument 
(e.g. '-l eng+spa').
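
For your Chinese plus Tibetan case, here is a rough sketch of how the 
command could be put together from a script. I've written it in Python 
rather than R, and a few things are assumptions on my part: the file 
names page.png and page are made up, the helper names are hypothetical, 
and I've guessed chi_sim (Simplified Chinese); use chi_tra instead if 
your text is Traditional. Both traineddata files would need to be 
installed for it to work.

```python
import subprocess


def tesseract_command(image_path, output_base, languages):
    """Build a Tesseract command line that enables several languages."""
    if not languages:
        raise ValueError("at least one language code is required")
    # Tesseract takes multiple languages as a single '-l' value,
    # with the individual codes joined by '+'.
    lang_arg = "+".join(languages)
    return ["tesseract", image_path, output_base, "-l", lang_arg]


def run_ocr(image_path, output_base, languages):
    """Run Tesseract; it writes the recognised text to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, languages),
                   check=True)


if __name__ == "__main__":
    # Hypothetical file names; chi_sim and bod are assumed to be installed.
    print(" ".join(tesseract_command("page.png", "page", ["chi_sim", "bod"])))
```

Calling run_ocr("page.png", "page", ["chi_sim", "bod"]) would then run 
the same command that the script prints.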

Hope that's helpful.

Nick

0. https://github.com/tesseract-ocr/langdata
1. https://github.com/tesseract-ocr/tessdata
2. https://en.wikipedia.org/wiki/Central_Tibetan_language
