Hi Serkan, if the Ottoman letters have codes to represent them (Unicode code points), then yes, it's doable.
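
One rough way to check this is to list the code point and Unicode name of each letter you need. The sketch below is only an illustration: the helper names `describe` and `unnamed` are made up for this example, and the sample letters are a handful of Arabic-script letters, not the full Ottoman alphabet. A character for which `unicodedata.name` finds no name (for example one that a custom font places in the Private Use Area) is probably one you would have to map to some other letter.

```python
import unicodedata


def describe(text):
    """Return (character, 'U+XXXX', Unicode name or None) for each character."""
    rows = []
    for ch in text:
        # name() returns the default (None) for characters with no assigned
        # name, such as Private Use Area code points.
        rows.append((ch, f"U+{ord(ch):04X}", unicodedata.name(ch, None)))
    return rows


def unnamed(text):
    """Return the code points in text that have no standard Unicode name."""
    return [code for _, code, name in describe(text) if name is None]


# Sample letters only (an assumption for illustration, not the Ottoman set):
# PEH, TCHEH, JEH, GAF, NG
sample = "\u067e\u0686\u0698\u06af\u06ad"

for ch, code, name in describe(sample):
    print(code, name)

# A Private Use Area code point stands in for a font-specific glyph here.
print(unnamed(sample + "\ue000"))
```

If the second list comes back empty for your whole alphabet, every letter has a standard code point and can at least be written into training text.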

On Friday, December 20, 2019 at 12:06:56 AM UTC+2, Serkan Taş wrote:
> Hi Ibrahim,
>
> Following Shree's advice, I am going to work on training for some time.
> Before that, of course, I am going to work on the alphabet and other
> symbols in the Arabic and Farsi datasets that are shared with Ottoman.
> I am still not sure how to fine-tune the existing data set, but I am
> going to try to understand it.
>
> As for MS Word: when I install the TTF prepared for the Ottoman alphabet,
> yes, I can see all 34 Ottoman letters in a document.
>
> On Thu, Dec 19, 2019 at 11:10 AM Ibr <ibr....@gmail.com> wrote:
>
>> Hi Serkan,
>>
>> My pleasure brother, any time :)
>>
>> *"Do I need a new model for Ottoman, what you think?"* Of course, I
>> think it would help you a lot, but honestly I have no clue how to
>> create trained data for Ottoman or any other language. That's why your
>> best shot may be the Farsi trained data, unless of course you know how
>> to create Ottoman trained data.
>>
>> *"I understand that if any letter that does not have ASCII correspondence
>> can not be recognized and converted to text. Right? If yes, can we
>> say that those letters can never be contained in OCR?"* Theoretically
>> yes, if I understand this matter correctly. Why did I mention Unicode
>> and ASCII in the first place? Because I have faced this issue before and
>> opened an issue about it; refer to this issue
>> <https://github.com/tesseract-ocr/tesstrain/issues/128> and you can see
>> how each character has its own corresponding code. That's why I asked
>> you whether the Ottoman writing system is recognized by other editors
>> such as MS Office. According to Shree's comment, *"If all required Ottoman
>> characters do not have a Unicode codepoint, then you may have to assign
>> some random letter instead"*, it seems that any Ottoman letter that
>> doesn't have its own code won't be recognized. Again, I think if you look
>> deeper into the Farsi alphabet and compare it with the Ottoman alphabet,
>> you might conclude that Farsi should do, since Tesseract doesn't work on
>> meaning, only on characters. Unfortunately I can't help you with this,
>> since I know only a little Farsi; you need someone specialized in Farsi,
>> or a native speaker such as an Iranian or Azerbaijani.
>>
>> Good thing that Shree is here; this guy is an expert in this matter and
>> helpful as well, especially now that Unicode and ASCII representation
>> and creating trained data have been brought to the table. He knows this
>> stuff better than I do.
>>
>> Again, you should pay attention to the quality of the images. Some images
>> might not give good results because of imperfections in the images
>> themselves, like old lines or dots, so some image enhancement will give
>> better results.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/cafa2e12-24d2-4080-9347-3f5204050de1%40googlegroups.com.