Hi, sorry for my English; I hope you can help me. I've trained Tesseract for Persian by running the following commands (based on https://github.com/tesseract-ocr/tesseract/wiki/Training-Tesseract):

    training/text2image --text=training_text.txt --outputbase=per.Arial.exp0 --font='Arial' --fonts_dir=/home/bita/TrainingPersian
    tesseract per.Arial.exp0.tif per.Arial.exp0 box.train
    unicharset_extractor per.Arial.exp0.box
    set_unicharset_properties per.Arial.exp0.box

(After reading this issue: https://github.com/tesseract-ocr/tesseract/issues/318, I put Arabic.unicharset and Arabic.xheights in the script_dir path.)

    set_unicharset_properties -U unicharset -O new_unicharset -X xheights --script_dir=/home/bita/langdata
    mv unicharset unicharset_Old
    mv new_unicharset unicharset
    shapeclustering -F font_properties -U unicharset per.Arial.exp0.tr
    mftraining -F font_properties -U unicharset -X xheights -O per.unicharset per.Arial.exp0.tr
    cntraining per.Arial.exp0.tr
    wordlist2dawg frequent_words_list per.freq-dawg per.unicharset
    wordlist2dawg words_list per.word-dawg per.unicharset
    mv shapetable per.shapetable
    mv normproto per.normproto
    mv inttemp per.inttemp
    mv pffmtable per.pffmtable
    combine_tessdata per.

To test the result, I took a screenshot of one part of my training text and increased its resolution to 300 dpi with GIMP (I tried to make an image that has no noise), but the accuracy is not good at all. How can I increase the accuracy? Which font size should I choose when I take the screenshot?

The structure of the Persian language is very different from English. For example, the shape of a character changes depending on where it is located in a word (first, middle, last), but in the unicharset only the main character is recorded for all of these forms. The characters are also connected within words (something like handwriting in English). So does Tesseract work for languages like Persian or Arabic?
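To illustrate the point about positional shapes, here is a small sketch in Python. It is not part of the Tesseract training tools. It only shows that Unicode stores one code point per letter, while the initial, medial, and final shapes also exist as separate "presentation form" code points that normalize back to the same base letter. The letter beh is used purely as an example, and the helper name `base_letter` is hypothetical. Whether this is the reason the unicharset lists only the main character is an assumption, not something confirmed here.

```python
import unicodedata

# The four positional (contextual) glyphs of the letter BEH, taken from the
# Unicode "Arabic Presentation Forms-B" block. BEH is only an example letter.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(glyph: str) -> str:
    """Map a presentation-form glyph back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", glyph)


# Print each positional form next to the base code point it normalizes to.
for position, glyph in BEH_FORMS.items():
    base = base_letter(glyph)
    print(f"{position:8s} U+{ord(glyph):04X} -> U+{ord(base):04X} {unicodedata.name(base)}")
```

All four shapes map to the single code point U+0628, so ordinary text files such as a training text would normally contain only that base letter, with the shape chosen by the font at rendering time.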

