Hi Serkan,

*"I wonder if the existing language models generated for Arabic and/or Farsi"*

Yes, there is one for Arabic and one for Farsi. The files are named lang-name.traineddata, for example ara.traineddata and eng.traineddata, and you can download them from GitHub here <https://github.com/tesseract-ocr/tessdata/>. I tried the Arabic and Japanese models and they were really good, and the Tesseract developers keep improving both the engine and the models. I expect the Farsi model to be good as well: I once used it on a small Farsi script to answer a question on GitHub and it gave a good result.
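In case it helps, here is a minimal sketch of how one of those downloaded models could be selected from a script. It only builds the command line for the standard `tesseract` CLI (`-l` to pick the language, `--tessdata-dir` to point at the folder holding the .traineddata files). The helper names, the file names `page.png` and `page`, and the assumption that the Farsi model is installed as fas.traineddata are mine, not something from your setup.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="fas", tessdata_dir=None):
    """Build the argument list for a Tesseract CLI run.

    lang is the traineddata name without the extension, e.g. "fas" or "ara";
    several models can be combined with "+", e.g. "fas+ara".
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Folder that contains the downloaded *.traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def run_ocr(image_path, output_base, lang="fas", tessdata_dir=None):
    """Run Tesseract if it is on PATH; return the output text file path or None."""
    if shutil.which("tesseract") is None:
        return None  # Tesseract is not installed on this machine.
    cmd = build_tesseract_command(image_path, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)
    return output_base + ".txt"


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the resulting command.
    print(" ".join(build_tesseract_command("page.png", "page", lang="fas")))
```

Running the same page once with `lang="fas"` and once with `lang="ara"` would be a cheap way to compare the two models on your Ottoman scans.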
*"Ottoman alphabet has 34 letters"*

From the links you shared, the Ottoman alphabet has some letters, such as "Nef", that do not exist in Arabic, and I think it does not exist in Farsi either. I can confirm that several Ottoman letters are not in Arabic. I don't know about Farsi, but I think it does not contain all the Ottoman letters. So your best bet would be the Farsi model, although the extra letters will be recognized incorrectly.

*"Please find the picture and tesseract result attached"*

The image is in good condition, and I would put the accuracy of the result text at about 90% or above. Note, though, that the far right side of the image is slightly bent, like the inner edge of a book, and that affects the accuracy. Words such as "چنتلک" and "په ده" are examples: I believe Tesseract would detect them easily if they were in the middle of the page. Other words like "اومایان" should already be solved in the Arabic model. Refer to this issue <https://github.com/tesseract-ocr/tesseract/issues/840>, which I opened earlier on GitHub and which is fixed in Tesseract version 4. I don't know whether it is fixed, or whether it even exists, in the Farsi model.

*"the ocr using Farsi model . I wonder if the training set for the model of any one from Arabic or Farsi may be modified and used to create Ottoman language model or should I work to collect training data for the ottoman language model from scratch ?"*

Here I'm just spitballing. As far as I know, every letter of any alphabet is represented on a computer by a code in a character set (for Arabic-script letters that means Unicode, since ASCII itself only covers basic Latin). So an Ottoman letter that has no such code could not be written as output. Does the Ottoman writing system exist in Office Word, for example?
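One way to check the encoding question is to look up each letter's code point. The sketch below uses only Python's standard `unicodedata` module and lists the code point and Unicode name of every character in a string. I am assuming here that the Ottoman "Nef" is the character ڭ (U+06AD); please verify that against your sources. The function name is made up for illustration.

```python
import unicodedata


def describe_letters(text):
    """Return (char, code point, Unicode name, is_ascii) for each non-space character."""
    rows = []
    for ch in text:
        if ch.isspace():
            continue
        code_point = f"U+{ord(ch):04X}"
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, code_point, name, ord(ch) < 128))
    return rows


if __name__ == "__main__":
    # Assumption: "Nef" is U+06AD; the second sample is a word from the scanned page.
    for sample in ("\u06AD", "چنتلک"):
        for ch, code_point, name, is_ascii in describe_letters(sample):
            print(ch, code_point, name, "ASCII" if is_ascii else "non-ASCII")
```

If every letter you need shows up with a code point and a name, the text can be typed and stored, and the open question is only whether a given model was trained to output that letter.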
As I understand it, training a model in Tesseract improves the engine's shape detection by introducing other fonts and other potential shapes; you can read about it in the Tesseract articles under tuning or training. Adding letters to an existing model, though, I don't think exists as an option from the user side. I recommend asking the people who work on Tesseract about this, such as Shree or Ray Smith (I believe he is the one responsible for the Tesseract algorithm).

*"The fonts for the Arabic or Farsi model guess does not contains all the letters of 35 and this may be a problem"*

In that paragraph I meant the font type, like "Calibri" or "Arial", not the alphabet. In Arabic, as in English, some letters change their shape depending on the font. It is all explained in the GitHub issue I shared above.

*"Afak, hand writing is separate phase of OCR which is also very hard but may be getting easier using ML technics like DL"*

It is probably doable, but I'm not that well informed about machine learning, and I haven't used Tesseract with handwritten documents myself. I remember a question in this group asking whether Tesseract is a good fit for handwriting, and people answered "no, not really".