Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

Ibr Thu, 04 May 2017 05:07:24 -0700

Hi shree,
Actually, I did see the section about lstmtraining, but what I described 
was the result of following the messages that tesseract printed. What 
happened from the beginning is this: I used to train .traineddata files for 
English, and that worked fine, but for Arabic it was failing. I then noticed 
the --oem argument of tesseract and used it, at which point tesseract asked 
for the lstm file. After that I came across the article about tesseract 
4.00alpha, which includes Arabic.
I then created the lstm file, but tesseract again failed to detect the text 
in the image. I suspected that the old .traineddata (created by tesseract 
3.03) was not compatible with the lstmf file, so I searched for the cause of 
the problem and found this issue 
<https://github.com/tesseract-ocr/tesseract/issues/487>. I then got the 
official traineddata, and the Arabic text in the image was recognized 
correctly except for the characters that I described in the issue I referred 
to earlier.


If I'm not mistaken, the lstmtraining section is about improving the 
accuracy, correct? 
It seems that if the لا case and the الم case are solved in ara.traineddata, 
the accuracy of Arabic detection will be as good as English detection.
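
For anyone following along, the recognition command from this thread (tesseract text_image result_text -l ara --oem 1) could be wrapped in Python roughly as below. This is only a sketch: build_ocr_command and run_ocr are hypothetical helper names, and it assumes the tesseract binary is on PATH and the official ara.traineddata is in the tessdata folder. It only runs recognition with an existing traineddata file; it does not do any LSTM training.

```python
import shutil
import subprocess


def build_ocr_command(image_path, output_base, lang="ara", oem=1):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG --oem N."""
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


def run_ocr(image_path, output_base, lang="ara", oem=1):
    """Run tesseract on one image and return the path of the text file it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    # check=True raises CalledProcessError if tesseract exits with an error
    subprocess.run(build_ocr_command(image_path, output_base, lang, oem), check=True)
    # tesseract appends ".txt" to the output base for plain-text output
    return output_base + ".txt"
```

Calling run_ocr("text_image", "result_text") would run the same command as above. Keeping the command construction in its own function makes it easy to compare --oem values on the same image without touching the rest of the script.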

On Thursday, May 4, 2017 at 12:52:42 PM UTC+3, shree wrote:

> Ibr,
>
> You are incorrect in your description of LSTM training.
>
> What you are doing will use the ara.traineddata provided in the repo, 
> there will be no change in output.
>
> Once lstmf files are created, you have to run lstmtraining, which will run 
> for days or weeks to give you a good result.
>
> Please read about LSTM training on the wiki.
>
> On May 4, 2017 2:58 PM, "Ibr" <[email protected]> wrote:
>
>> If you are referring to tesseract 4.00alpha with Leptonica 1.74.1, and if 
>> you compiled them correctly and got the binaries that you need for 
>> training lstmf files, then I recommend following the suggestion made by 
>> the tesseract devs: once you create an .lstmf file for a certain font 
>> (one that can be used for Arabic writing), get the official 
>> ara.traineddata file from GitHub, paste it in the tessdata folder, put the 
>> lstmf file in the tesseract folder, and run the command
>>     tesseract text_image result_text -l ara --oem 1
>> Which Arabic characters exactly are you trying to improve the accuracy 
>> for?
>>
>> On Saturday, April 8, 2017 at 11:52:25 AM UTC+3, Ahmad Moawad wrote:
>>
>>> Hello All,
>>>
>>>
>>> I want to do training for the Arabic language in Tesseract 4.0. The 
>>> results of this version are great but still need some tuning, so I got 
>>> jTessBoxEditor 2.0 beta.
>>> I tried to correct the wrong characters and build ara.traineddata. 
>>> After copying the ara.traineddata to 
>>> /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran 
>>> tesseract on the image.
>>> So, any suggestions on how to do training for version 4.0? I already know 
>>> that the cube engine from the last 3.0x version isn't included in the 
>>> 4.0 LSTM. Or should I wait until Ray makes another updated 
>>> ara.traineddata?
>>>
>>> Thanks.
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/a344115b-ab55-4bcd-a689-fcd40bce61a2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
