Hi Ibrahim,

Following Shree's advice, I am going to work on training for some time. Before that, of course, I will go through the alphabet and other symbols in the Arabic and Farsi datasets that are shared with Ottoman. I am still not sure how to fine-tune an existing dataset, but I am going to try to understand it.
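To compare the alphabets, something like the following could list the Unicode code point and name of each letter. This is only a minimal sketch: the sample letters are an illustrative assumption, not the full 34-letter Ottoman alphabet, and `describe` and `in_arabic_block` are hypothetical helper names, not part of Tesseract or tesstrain.

```python
import unicodedata

# Hypothetical sample of Arabic-script letters (alef, beh, peh, tcheh, jeh, gaf, ng).
# Assumption: this is NOT the complete Ottoman alphabet, only an illustration.
SAMPLE_LETTERS = ["\u0627", "\u0628", "\u067e", "\u0686", "\u0698", "\u06af", "\u06ad"]


def describe(ch):
    """Return the code point (as 'U+XXXX') and the Unicode name of one character."""
    name = unicodedata.name(ch, "<no Unicode name>")
    return f"U+{ord(ch):04X}", name


def in_arabic_block(ch):
    """True if the character lies in the basic Arabic block (U+0600..U+06FF)."""
    return 0x0600 <= ord(ch) <= 0x06FF


for ch in SAMPLE_LETTERS:
    code, name = describe(ch)
    block = "Arabic block" if in_arabic_block(ch) else "other block"
    print(code, name, block)
```

Any letter that prints `<no Unicode name>` would be a candidate for the substitute-letter workaround discussed in the thread.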
As for MS Word: yes, when I install the TTF font prepared for the Ottoman alphabet, I can see all 34 Ottoman letters in a document.

On Thu, Dec 19, 2019 at 11:10 AM Ibr <ibr.ham...@gmail.com> wrote:

> Hi Serkan,
>
> My pleasure brother, any time :)
>
> *"Do I need a new model for ottoman, what you think ?"* Of course, I think it would help you a lot, but honestly I have no clue how to create trained data for Ottoman or any other language. That is why your best shot may be the Farsi trained data, unless of course you know how to create Ottoman trained data.
>
> *"I understand that if any letter that does not have ASCII correspondence can not be recognized and converted to text. Right ? if yes can we say that that letters can never be contained in OCR ?"* Theoretically yes, if I understand this matter correctly. Why did I mention Unicode and ASCII in the first place? Because I have faced this issue before and opened an issue about it; refer to this issue <https://github.com/tesseract-ocr/tesstrain/issues/128> and you can see how each character has its own corresponding code. That is why I asked you whether the Ottoman writing system is recognized by other editors such as MS Office. According to Shree's comment, *"If all required Ottoman characters do not have a Unicode codepoint, then you may have to assign some random letter instead"*, it seems that any Ottoman letter without its own code won't be recognized. Again, I think that if you look deeper into the Farsi alphabet and compare it with the Ottoman alphabet, you might conclude that Farsi should do, since Tesseract works only on characters, not on meaning. Unfortunately I can't help you with this, since I know only a little Farsi; you need someone specialized in Farsi or a native speaker, such as an Iranian or an Azerbaijani.
>
> Good thing that Shree is here; he is an expert in this matter and helpful as well, especially now that Unicode and ASCII representation and creating trained data have been brought to the table. He knows this stuff better than I do.
>
> Again, you should pay attention to the quality of the images. Some images might not give good results because of imperfections in the image itself, such as old lines or dots, so some image enhancement will give better results.
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/5b32cb1c-65f1-4fc8-a763-fc42e9d58cca%40googlegroups.com.