Hi İbrahim,

I looked into the subject and found some existing work and documents confirming
that all Ottoman letters have corresponding Unicode code points. I am going to
fine-tune the trained models as Shree advises.

Here are samples of five of the letters and their Unicode values:
ﭖ : U+FB56
ﭺ : U+FB7A
ﮊ : U+FB8A
ﮒ : U+FB92
ﯓ : U+FBD3
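
In case it helps, here is a minimal Python sketch for checking these values. The five code points are the ones listed above; they sit in the Arabic Presentation Forms-A block (isolated forms). Using NFKC normalization to map each one to its base Arabic letter is my own assumption about what might be useful here. I have not verified which form the Tesseract training tools expect, so treat this as a way to inspect the characters, not as a required step.

```python
import unicodedata

# The five sample letters listed above (isolated presentation forms).
samples = ["\uFB56", "\uFB7A", "\uFB8A", "\uFB92", "\uFBD3"]


def describe(ch):
    """Return (code point, Unicode name, NFKC-normalized code points) for one character."""
    # Assumption: NFKC folds a presentation form to its base letter.
    base = unicodedata.normalize("NFKC", ch)
    base_codes = " ".join(f"U+{ord(c):04X}" for c in base)
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), base_codes)


rows = [describe(ch) for ch in samples]
for code, name, base_codes in rows:
    print(code, name, "->", base_codes)
```

If the base letters are what the training text ends up containing, the right-hand column shows the code points to look for in the Arabic and Farsi data.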

Regards

On Tue, Dec 24, 2019 at 6:19 PM Ibr <ibr.ham...@gmail.com> wrote:

> Hi Serkan,
>
> If the Ottoman letters have code points to represent them, then yes, it's
> doable.
>
> On Friday, December 20, 2019 at 12:06:56 AM UTC+2, Serkan Taş wrote:
>
>> Hi Ibrahim,
>>
>> Following Shree's advice, I am going to work on training for some time.
>> Before that, of course, I am going to work on the alphabet and other
>> symbols in the Arabic and Farsi datasets that are shared with Ottoman. I am
>> still not sure how to fine-tune an existing dataset, but I am going to try
>> to understand.
>>
>> As for MS Word: when I install a TTF font prepared for the Ottoman
>> alphabet, yes, I can see all 34 Ottoman letters in a document.
>>
>> On Thu, Dec 19, 2019 at 11:10 AM Ibr <ibr....@gmail.com> wrote:
>>
>>> Hi Serkan,
>>>
>>> My pleasure brother, any time :)
>>>
>>> *"Do I need a new model for Ottoman, what do you think?"* Of course, I
>>> think it would help you a lot, but honestly I have no clue how to create
>>> trained data for Ottoman or any other language. That is why your best shot
>>> is probably the Farsi trained data, unless of course you know how to
>>> create Ottoman trained data.
>>>
>>>
>>> *"I understand that any letter that does not have an ASCII
>>> correspondence cannot be recognized and converted to text. Right? If yes,
>>> can we say that those letters can never be contained in OCR?"*
>>> Theoretically yes, if I understand this matter correctly. Why did I
>>> mention Unicode and ASCII in the first place? Because I faced this issue
>>> before and opened an issue about it; refer to this issue
>>> <https://github.com/tesseract-ocr/tesstrain/issues/128> and you can see
>>> how each character has its own corresponding code. That is why I asked you
>>> whether the Ottoman writing system is recognized by other editors such as
>>> MS Office. According to Shree's comment, *"If all required Ottoman
>>> characters do not have a Unicode codepoint, then you may have to assign
>>> some random letter instead"*, it seems that any Ottoman letter without its
>>> own code will not be recognized.
>>>
>>> Again, I think if you look deeper into the Farsi alphabet and compare it
>>> with the Ottoman alphabet, you might conclude that Farsi should do, since
>>> Tesseract works on characters only, not on meaning. Unfortunately I can't
>>> help you with this, since I know only a little Farsi; you need someone
>>> specialized in Farsi or a native speaker, such as an Iranian or
>>> Azerbaijani.
>>>
>>> Good thing that Shree is here; he is an expert in this matter and helpful
>>> as well, especially now that we have brought Unicode and ASCII
>>> representation and the creation of trained data to the table. He knows
>>> this stuff better than I do.
>>>
>>> Again, you should pay attention to the quality of the images. Some
>>> images might give poor results because of imperfections in the images
>>> themselves, such as old lines or dots, so some image enhancement will
>>> give better results.
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/5b32cb1c-65f1-4fc8-a763-fc42e9d58cca%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/5b32cb1c-65f1-4fc8-a763-fc42e9d58cca%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
