Re: Ynt: [tesseract-ocr] Re: How to use Tesseract Arabic OCR.

Ibr Tue, 17 Dec 2019 01:04:56 -0800

>
> Hi Serkan,
>

Tesseract works roughly like this: each language or writing system has its
own model, which it relies on to recognize the characters in an image. I
guess it uses something like the stroke width transform to detect shapes,
but I am not sure about the internals (as far as I know, the default engine
since version 4 is an LSTM network that reads a whole text line at a time).
Conceptually, when it scans an image and finds a shape (a letter) that the
model already knows, it assigns the matching letter and writes it to the
output text, then moves on to the next shape, and so on. (A model in ML is
like the brain that decides the result from the input.)

Why am I telling you all of this? To give you an idea of how it works, and
to let you know that you can't be conclusive about the results. Even with
great accuracy you may still get some errors. That is how machine learning
works in general, and it is why people usually train the model further to
improve its accuracy.
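
Since some errors are always expected, it can help to put a number on them.
The sketch below is only an illustration, not anything Tesseract provides:
it computes a character error rate (edit distance divided by the length of
the correct text) between an OCR result and a transcription you typed by
hand. The function names are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, truth: str) -> float:
    """Edit distance normalised by the length of the hand-typed truth."""
    if not truth:
        raise ValueError("ground-truth text must not be empty")
    return edit_distance(ocr_text, truth) / len(truth)
```

A result of 0.0 means the OCR output matches the transcription exactly, and
a letter read with the wrong number of dots counts as one substitution.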


About the Ottoman writing system, you said *"The language is Turkish
originally"*. Tesseract doesn't care about the meaning of the text, only
the shapes. You also said *"alphabet is some kind of mixture of Arabic and
Farsi alphabet"*. I'm a native Arabic speaker, and I can read the first
image that you shared ("*ödev*") without knowing what it means (except for
a few words I already know in Turkish). I can read Farsi as well.

The problem is that the Farsi alphabet contains extra letters that don't
exist in Arabic. They are very close but slightly different: for example,
(چ) is the same as (ج) but with three dots. Both letters exist in Farsi,
but only the second one exists in Arabic. So if you run the two letters
through the Farsi model they will work fine, but with the Arabic model I
think both will be recognized as the letter with one dot. Arabic has 28
letters and Farsi has 32, I guess.

That means that if the Ottoman alphabet contains letters from Farsi, the
Arabic model won't be enough, since Farsi contains the Arabic letters plus
some extra ones. If the Ottoman and Farsi alphabets are the same, the Farsi
model (I think it's fas.traineddata
<https://github.com/tesseract-ocr/tessdata/blob/master/fas.traineddata> )
will surely work fine. But if some letters in the Ottoman alphabet don't
exist in Farsi, those letters won't be recognized, or will be recognized
wrongly.
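
To check whether a model's alphabet covers a given text, something like the
sketch below could help. It is only an illustration under assumptions: the
28 Arabic letters and the four extra Farsi letters (پ چ ژ گ) are typed by
hand, not read from any traineddata file, and the name `missing_letters` is
made up here. Real Farsi text also uses slightly different code points for
kaf and yeh (ک and ی), which this sketch ignores.

```python
# Hand-typed alphabets for illustration only (assumption: not taken from
# the Tesseract models themselves).
ARABIC = set("ابتثجحخدذرزسشصضطظعغفقكلمنهوي")  # the 28 Arabic letters
FARSI_EXTRA = set("پچژگ")                      # letters Farsi adds
FARSI = ARABIC | FARSI_EXTRA                   # 32 letters in total


def missing_letters(text, alphabet):
    """Return the letters of `text` that are not in `alphabet`,
    in order of first appearance and without repeats."""
    missing = []
    for ch in text:
        if ch.isalpha() and ch not in alphabet and ch not in missing:
            missing.append(ch)
    return missing


if __name__ == "__main__":
    sample = "چج"  # the two letters from the example above
    print("not covered by Arabic:", missing_letters(sample, ARABIC))
    print("not covered by Farsi: ", missing_letters(sample, FARSI))
```

Running the same check with a hand-typed Ottoman letter list against FARSI
would show which letters, if any, the Farsi model cannot be expected to
produce.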

About the font: I'm not sure which font is used in the two pictures, but
the one in the first picture is definitely covered by the Arabic model.
Tesseract 4 (at least when I last used it, almost a year ago) includes, I
think, about 5000 Arabic fonts, which covers almost all fonts, so I don't
think you would need any training on different fonts.

One last thing: when I used Tesseract it gave perfect results for Arabic,
and for Japanese as well, on formal documents, but on handwritten documents
the accuracy was really low. I don't know whether this is still the case,
but if it is, handwriting won't give good results. For example, the second
image that you shared ("sample01") won't be recognized, I assure you, even
if you have an Ottoman model. I'm not sure about the first one. I think it
would be recognized, but any word that contains a small gap because the
document is old will come out split into separate words.

To be honest, you won't know for sure until you try it in Tesseract. Since
version 4 Tesseract is easy to use, especially because it is no longer
necessary to train the model on new fonts. So, in my opinion, you should
open a question on this Google group or on GitHub asking whether there is
an Ottoman model. Or, since you seem to know this material, you can decide
for yourself whether the Farsi model will do, and try it.
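
A minimal sketch of such a trial, assuming the `tesseract` command-line
program and the `fas` traineddata are installed. The file names `odev.png`
and `odev_out` are placeholders, and the helper names are made up for this
example. It only wraps the usual `tesseract IMAGE OUTPUTBASE -l LANG` call.

```python
import shutil
import subprocess


def build_command(image_path, output_base, lang="fas"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_ocr(image_path, output_base, lang="fas"):
    """Run Tesseract and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"  # Tesseract appends .txt to the output base


if __name__ == "__main__":
    # Placeholder file names; replace with the real image before running.
    print(" ".join(build_command("odev.png", "odev_out")))
```

Passing `lang="ara"` instead runs the same image through the Arabic model,
so the two outputs can be compared side by side.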

I hope this was helpful. I went into a lot of detail, but only to give you
the full picture of what's going on, so that you can decide whether it fits
or not, since I don't know enough about the Ottoman writing system. If you
still have any questions, I'm here to help :)

teşekkürler (thanks) :)
Ibrahim


To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/1642e20a-1de4-4f83-aa1b-fbfbbae9fd7e%40googlegroups.com.
