Re: Ynt: [tesseract-ocr] Re: How to use Tesseract Arabic OCR.

Ibr Wed, 18 Dec 2019 02:09:53 -0800

>
> Hi Serkan,
>
>
** "*I wonder if the existing language models generated for Arabic and/or 
Farsi*" Yes, there is one for Arabic and one for Farsi. The files are named 
lang-name.traineddata, such as ara.traineddata and eng.traineddata, and you 
can download them from GitHub here 
<https://github.com/tesseract-ocr/tessdata/>. I tried the Arabic and 
Japanese models and they were really good, and the good thing is that the 
Tesseract guys keep enhancing the engine and the models. I expect the Farsi 
model is good as well: I once used it on a small Farsi script to answer a 
question on GitHub and it gave a good result.
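
A minimal sketch of how the downloaded models are selected, in Python. It 
only builds the command line and checks which model files are present; it 
does not run Tesseract. The helper names are my own, and "fas" is the code I 
understand the tessdata repository uses for Farsi, so check the file list 
there before relying on it.

```python
from pathlib import Path


def model_filename(lang_code):
    """Name of the model file for a language code, e.g. 'ara' -> 'ara.traineddata'."""
    return f"{lang_code}.traineddata"


def tesseract_command(image, lang_code, tessdata_dir=None):
    """Build (but do not run) a tesseract command that prints the OCR text to stdout."""
    cmd = ["tesseract", str(image), "stdout", "-l", lang_code]
    if tessdata_dir is not None:
        # Point tesseract at the folder holding the downloaded .traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def missing_models(lang_codes, tessdata_dir):
    """Language codes whose .traineddata file is not in tessdata_dir yet."""
    folder = Path(tessdata_dir)
    return [c for c in lang_codes if not (folder / model_filename(c)).is_file()]
```

Once Tesseract is installed, the list returned by tesseract_command("page.png", 
"fas") can be passed to subprocess.run(..., capture_output=True, text=True) 
to get the recognized text.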


** *"**Ottoman alphabet has 34 letters"* From the links that you shared, 
there are some letters in the Ottoman alphabet, such as the letter "Nef", 
that don't exist in Arabic, and I think "Nef" doesn't exist in Farsi 
either. I can confirm that several letters are not in Arabic. I don't know 
about Farsi, but I think it doesn't contain all the Ottoman letters. So 
your best bet would be the Farsi model, yet the extra letters will be 
recognized wrongly.
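
One rough way to see which letters fall outside a given alphabet is a set 
comparison on code points. The sketch below is only an illustration: the two 
sets are the standard 28-letter Arabic alphabet and the usual Farsi 
additions as I understand them, not the actual character list inside any 
Tesseract model, and the sample word is written with the code points I 
assume for the word "چنتلک" from your image.

```python
import unicodedata

# Standard Arabic alphabet (28 letters), listed by code point.
# U+0629 (teh marbuta) is skipped because it is not one of the 28.
ARABIC = {chr(c) for c in range(0x0627, 0x063B) if c != 0x0629}
ARABIC |= {chr(c) for c in (0x0641, 0x0642, 0x0643, 0x0644,
                            0x0645, 0x0646, 0x0647, 0x0648, 0x064A)}

# Letters Farsi adds or writes with a different code point (assumed list):
# peh, tcheh, jeh, gaf, keheh, farsi yeh.
FARSI_EXTRA = {chr(c) for c in (0x067E, 0x0686, 0x0698, 0x06AF, 0x06A9, 0x06CC)}


def uncovered(text, alphabet):
    """Sorted list of non-space characters in text that are not in alphabet."""
    return sorted({ch for ch in text if not ch.isspace() and ch not in alphabet})


# Assumed spelling: tcheh, noon, teh, lam, keheh.
sample = "\u0686\u0646\u062A\u0644\u06A9"

for ch in uncovered(sample, ARABIC):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

With the Arabic set alone two letters of the sample are reported as missing; 
with ARABIC | FARSI_EXTRA none are. Running the same check on a list of 
Ottoman letters would show which ones neither alphabet covers.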

** "Please find the picture  and tesseract result attached" The image is in 
good condition, and I would put the accuracy of the result text at 90% or 
above. But notice that the far right side of the image is slightly bent, 
like the inner edge of a book, which will affect the accuracy for words 
there such as "چنتلک" and "په ده"; I believe Tesseract would detect them 
easily if they were in the middle of the page, for example. Other words 
like "اومایان" should be solved in Arabic: refer to this issue 
<https://github.com/tesseract-ocr/tesseract/issues/840> that I opened 
earlier on GitHub. It's fixed in Tesseract version 4; I don't know whether 
it's fixed, or even exists, in the Farsi model.

** *"the ocr using Farsi model . I wonder if the training set for the model 
of any one from Arabic or Farsi may be modified and used  to create Ottoman 
language model or should I work to collect training data for the ottoman 
language model from scratch ?"*  Here I'm just spitballing. As far as I 
know, any text or alphabet on a computer is represented by character codes 
(Unicode these days, rather than ASCII), so an Ottoman letter that has no 
such code could not be written as output. Does the Ottoman writing system 
exist in Office Word, for example? Training a model in Tesseract is meant 
to enhance the engine's shape detection by introducing other fonts and 
other potential shapes; you can find this in the Tesseract articles under 
tuning or training. But adding letters to an existing model is something I 
don't think exists as an option on the user side. I recommend consulting 
the people who work on Tesseract about this, such as a guy called Shree, or 
Ray Smith (he is the one responsible for the Tesseract algorithm, I 
believe).
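
Whether a letter can be written out at all comes down to whether it has a 
code point, which in practice means Unicode rather than ASCII. A small 
Python check, assuming that U+06AD is the character used for the Ottoman 
"Nef" (that mapping is my assumption, so please verify it against your 
sources):

```python
import unicodedata


def describe(ch):
    """Return the code point, Unicode name and UTF-8 bytes of one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8")


def fits_in_ascii(ch):
    """True if the character can be encoded as ASCII, False otherwise."""
    try:
        ch.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


nef = "\u06AD"  # assumed code point for Ottoman "Nef"
print(describe(nef))       # code point, name, UTF-8 bytes
print(fits_in_ascii(nef))  # Arabic-script letters are outside ASCII
```

If a letter has a Unicode code point like this, the output side is not the 
obstacle; what remains is whether the model was trained to produce it.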

** *"**The fonts for the Arabic or Farsi model guess does not contains all 
the letters of 35 and this may be a problem"* In this paragraph I meant the 
font type, like "Calibri" or "Arial", not the alphabet, because in scripts 
such as Arabic and English some letters change their shape depending on the 
font. It's all explained in the GitHub issue that I shared above.

** *"**Afak, hand writing is separate phase of OCR which is also very hard 
but may be getting easier using ML technics like DL"* It's probably doable, 
but I'm not that well informed about machine learning, and I haven't used 
Tesseract with handwritten documents. I remember I even found a question 
about it in this group, where someone asked whether Tesseract is suitable 
for handwriting, and people answered "no, not really".

