Hi İbrahim,

I looked into the subject and found some existing work and documents confirming that all Ottoman letters have Unicode code points. I am going to fine-tune the trained models, as Shree advises.
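A quick way to check a claim like this is to ask Python's standard `unicodedata` module for each character's name. The sketch below is only an illustration, using the five code points quoted in this message; the helper name `describe` is made up for the example. It also prints the NFKC-normalized form, on the assumption that it is useful to know which base letter each one corresponds to.

```python
import unicodedata

# Code points quoted in this message. They fall in the
# Arabic Presentation Forms-A block (U+FB50..U+FDFF).
SAMPLES = [0xFB56, 0xFB7A, 0xFB8A, 0xFB92, 0xFBD3]


def describe(cp):
    """Return (character, Unicode name, NFKC base code points) for cp."""
    ch = chr(cp)
    # name() falls back to the default when the code point is unassigned.
    name = unicodedata.name(ch, "<unassigned>")
    # NFKC folds compatibility characters onto their base letters.
    base = unicodedata.normalize("NFKC", ch)
    return ch, name, [f"U+{ord(c):04X}" for c in base]


for cp in SAMPLES:
    ch, name, base = describe(cp)
    print(f"U+{cp:04X} {ch} {name} -> {' '.join(base)}")
```

All five are assigned, but they are isolated presentation forms; NFKC maps U+FB56, for example, to the base letter U+067E. It may be worth checking which of the two forms the training text actually contains before fine-tuning.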
Here are five sample letters and their Unicode values:

ﭖ : %uFB56
ﭺ : %uFB7A
ﮊ : %uFB8A
ﮒ : %uFB92
ﯓ : %uFBD3

Regards

On Tue, Dec 24, 2019 at 6:19 PM Ibr <ibr.ham...@gmail.com> wrote:

> Hi Serkan,
>
> If the Ottoman letters have codes to represent them, then yes, it's doable.
>
> On Friday, December 20, 2019 at 12:06:56 AM UTC+2, Serkan Taş wrote:
>
>> Hi Ibrahim,
>>
>> Following Shree's advice, I am going to work on training for some time. Before that, of course, I am going to work on the alphabet and other symbols in the Arabic and Farsi datasets that are common with Ottoman. I am still not sure how to fine-tune the existing data set, but I am going to try to understand.
>>
>> For MS Word: when I install the TTF prepared for the Ottoman alphabet, yes, I can see all 34 letters of Ottoman in a document.
>>
>> On Thu, Dec 19, 2019 at 11:10 AM Ibr <ibr....@gmail.com> wrote:
>>
>>> Hi Serkan,
>>>
>>> My pleasure brother, any time :)
>>>
>>> *"Do I need a new model for ottoman, what you think ?"* Of course I think it would help you a lot, but honestly I really have no clue how to create trained data for Ottoman or any other language. That's why maybe your best shot is the Farsi trained data, unless of course you know how to create Ottoman trained data.
>>>
>>> *"I understand that if any letter that does not have ASCII correspondence can not be recognized and converted to text. Right ? if yes can we say that that letters can never be contained in OCR ?"* Theoretically yes, if I understand this matter correctly. Why did I mention Unicode and ASCII in the first place? Because I have faced this issue before and opened an issue about it; refer to this issue <https://github.com/tesseract-ocr/tesstrain/issues/128> and you can see how each character has its own corresponding code.
>>> That's why I asked you if the Ottoman writing system is recognized by other editors such as MS Office. According to Shree's comment, *"If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead"*, it seems that any Ottoman letter without its own code won't be recognized. Again, I think if you look deeper into the Farsi alphabet and compare it with the Ottoman alphabet, you might conclude that Farsi should do, since Tesseract works on characters only, not on meaning. Unfortunately I can't help you with this, since I know only a little Farsi; you need someone specialized in Farsi, or a native speaker such as an Iranian or Azerbaijani.
>>>
>>> Good thing that Shree is here. He is an expert in this matter and helpful as well, especially now that Unicode and ASCII representation and creating trained data have been brought to the table; he knows this stuff better than I do.
>>>
>>> Again, you should pay attention to the quality of the images. Some images might not give good results because of imperfections in the images themselves, like old lines or dots, so some image enhancement will give better results.
>>>
>>> --
>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send an email to tesser...@googlegroups.com.
>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b32cb1c-65f1-4fc8-a763-fc42e9d58cca%40googlegroups.com.