Hi,
Sorry for my English; I hope you can help me.
I've trained Tesseract for Persian by running the following commands (see
https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract):

training/text2image --text=training_text.txt --outputbase=per.Arial.exp0 \
  --font='Arial' --fonts_dir=/home/bita/TrainingPersian

tesseract per.Arial.exp0.tif per.Arial.exp0 box.train

unicharset_extractor per.Arial.exp0.box

set_unicharset_properties per.Arial.exp0.box
(After reading this issue,
https://github.com/tesseract-ocr/tesseract/issues/318, I put
Arabic.unicharset and Arabic.xheights in the script_dir path.)
set_unicharset_properties -U unicharset -O new_unicharset -X xheights \
  --script_dir=/home/bita/langdata

mv unicharset unicharset_Old

mv new_unicharset unicharset

shapeclustering -F font_properties -U unicharset per.Arial.exp0.tr

mftraining -F font_properties -U unicharset -X xheights -O per.unicharset \
  per.Arial.exp0.tr

cntraining per.Arial.exp0.tr

wordlist2dawg frequent_words_list per.freq-dawg per.unicharset

wordlist2dawg words_list per.word-dawg per.unicharset

mv shapetable per.shapetable

mv normproto per.normproto

mv inttemp per.inttemp

mv pffmtable per.pffmtable

combine_tessdata per.


To test the result, I took a screenshot of one part of my training text and
increased its resolution to 300 dpi in GIMP (I tried to make an image without
noise), but the accuracy is not good at all.
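
For reference, the test run looks roughly like the sketch below. The names
test.png and out are placeholders, not the real file names. It assumes that
per.traineddata (the file written by combine_tessdata) is in the current
directory and that the installed tesseract accepts --tessdata-dir (3.04 or
later, as far as I know).

```shell
#!/bin/sh
# Sketch only: test.png and out are placeholder names, not from the real run.
# Assumes per.traineddata is in the current directory and that this
# tesseract build supports --tessdata-dir (an assumption, 3.04 or later).

# Print the recognition command for a given image and output base.
ocr_cmd() {
    printf 'tesseract %s %s -l per --tessdata-dir .\n' "$1" "$2"
}

# Run the OCR only when tesseract is installed and the image exists;
# otherwise just show the command that would be used.
if command -v tesseract >/dev/null 2>&1 && [ -f test.png ]; then
    tesseract test.png out -l per --tessdata-dir .   # text goes to out.txt
else
    ocr_cmd test.png out
fi
```

On older versions without --tessdata-dir, setting the TESSDATA_PREFIX
environment variable should serve the same purpose.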

How can I increase the accuracy?
Which font size should I choose when I take the screenshot?
The structure of the Persian language is very different from English. For
example, the shape of a character changes depending on where it is located in
a word (first, middle, or last), but in the unicharset only the main character
is recognized for all of these forms.
Also, the characters are connected within words (something like handwriting in
English).

So does Tesseract work for languages like Persian or Arabic?

