Hi shree, I did see the section about lstmtraining, but what I described was the result of following Tesseract's own messages. Here is what happened from the beginning: I used to train .traineddata files for English, and that worked fine, but for Arabic it was failing. I then noticed the --oem argument of tesseract and used it, and Tesseract asked for the lstm file. After that I came across the article about Tesseract 4.00alpha, which includes Arabic. I created the lstm file, but Tesseract again failed to detect the text in the image. I suspected that the old .traineddata (created by Tesseract 3.03) was not compatible with the lstmf file, searched for the cause of the problem, and found this issue <https://github.com/tesseract-ocr/tesseract/issues/487>. I then got the official traineddata, and the recognition of Arabic text images was correct, except for the characters that I described in the issue I referred to earlier.
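For anyone who wants to script the recognition step described above (the LSTM engine selected with --oem 1 and the official ara.traineddata installed), here is a minimal Python sketch. It is only an illustration: the function names build_tesseract_cmd and run_ocr are hypothetical, the file names are the placeholders from the command quoted later in this thread, and it assumes the tesseract binary is on PATH.

```python
import shutil
import subprocess


def build_tesseract_cmd(image_path, output_base, lang="ara", oem=1):
    """Build the argv list for: tesseract IMAGE OUTPUT_BASE -l LANG --oem OEM."""
    return ["tesseract", image_path, output_base, "-l", lang, "--oem", str(oem)]


def run_ocr(image_path, output_base, lang="ara", oem=1):
    """Run tesseract; it writes the recognized text to OUTPUT_BASE.txt."""
    if shutil.which("tesseract") is None:
        # Assumption: the binary is expected on PATH; fail clearly if it is not.
        raise RuntimeError("tesseract binary not found on PATH")
    cmd = build_tesseract_cmd(image_path, output_base, lang, oem)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return output_base + ".txt"


if __name__ == "__main__":
    # Only prints the command; call run_ocr() to actually execute it.
    print(" ".join(build_tesseract_cmd("text_image", "result_text")))
```

This only runs recognition with whatever ara.traineddata is installed. It does not train or change the model.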
If I'm not mistaken, the lstmtraining section is about improving the accuracy, correct? It seems that if the لا case and the الم case are solved in ara.traineddata, the accuracy of Arabic recognition will be as good as English recognition.

On Thursday, May 4, 2017 at 12:52:42 PM UTC+3, shree wrote:
> Ibr,
>
> You are incorrect in your description of LSTM training.
>
> What you are doing will use the ara.traineddata provided in the repo,
> there will be no change in output.
>
> Once lstmf files are created, you have to run lstmtraining which will run
> for days/weeks to give you a good result.
>
> Please read about LSTM training on wiki.
>
> On May 4, 2017 2:58 PM, "Ibr" <[email protected]> wrote:
>
>> If you are referring to tesseract 4.00alpha with leptonica 1.74.1, and if
>> you compiled them in the correct way and got the binaries that you need for
>> training lstmf files, then I recommend following the suggestions made by
>> the tesseract devs, which are: once you create an .lstmf file for a
>> certain font (that can be used for Arabic writing), get the official
>> ara.traineddata file from GitHub, paste it in the tessdata folder, put the
>> lstmf file in the tesseract folder, and run the command
>> tesseract text_image result_text -l ara --oem 1
>> Which Arabic characters exactly are you trying to improve the accuracy
>> for?
>>
>> On Saturday, April 8, 2017 at 11:52:25 AM UTC+3, Ahmad Moawad wrote:
>>
>>> Hello All,
>>>
>>> I want to do training for the Arabic language in Tesseract 4.0. The
>>> result of this version is great but still needs some tuning, so I got
>>> jTessBoxEditor 2.0 beta.
>>> I tried to modify the incorrect characters and build ara.traineddata.
>>> After copying the ara.traineddata to
>>> /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran
>>> tesseract on the image.
>>> So, any suggestions on how to do training for version 4.0? I already know
>>> that the cube engine from the last 3.0x version isn't included in 4.0 LSTM.
>>> Or should I wait until Ray makes another updated ara.traineddata?
>>>
>>> Thanks.
>>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/1c842b1e-1dc1-418b-a5b7-368c11e7dfa5%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

