Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-02-01 Thread manu pranay
Thank you, Shree.
I have finished retraining the top layer, with a good accuracy rate.
How can I find the accuracy as a percentage?
Also, can you please help me with training on handwritten PDFs?
Thank you very much for your help.
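
On the percentage question: one common way to express OCR accuracy as a percentage is character accuracy, 100 * (1 - edit distance / length of the ground truth), computed between the ground-truth text and the OCR output. Below is a minimal Python sketch of that calculation. The function names and the sample strings are illustrative assumptions, not part of Tesseract or of this thread.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def char_accuracy(ref: str, hyp: str) -> float:
    """Character accuracy in percent, relative to the ground-truth length."""
    if not ref:
        return 100.0 if not hyp else 0.0
    errors = edit_distance(ref, hyp)
    # Floor at 0 because a very long hypothesis can give more errors than len(ref).
    return max(0.0, 100.0 * (1 - errors / len(ref)))


if __name__ == "__main__":
    ground_truth = "tesseract"   # hypothetical ground-truth line
    ocr_output = "tesserect"     # hypothetical OCR result with one wrong character
    print(f"{char_accuracy(ground_truth, ocr_output):.2f}%")
```

In practice this would be run on the recognized text of each evaluation image against its ground-truth transcription, pooling the error counts and lengths over all lines before computing the final percentage.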


On Sat, Feb 1, 2020 at 12:33 PM Shree Devi Kumar 
wrote:

> lstmtraining \
>   --debug_interval -1 \
>   --traineddata data/modi/modi.traineddata \
>   --append_index 5 --net_spec "[Lfx128 O1c1]" \
>   --continue_from data/mar/modi.lstm \
>   --model_output data/modi/checkpoints/modiLayer \
>   --train_listfile data/modi/list.train \
>   --eval_listfile data/modi/list.eval \
>   --max_iterations 99
>
> On Sat, Feb 1, 2020 at 11:33 AM manu pranay 
> wrote:
>
>> Thank you so much for your help, Shree.
>> The links you provided were very helpful.
>>
>> Now I am trying to do LSTM training by retraining the top layer.
>> Can you please provide the commands for retraining the top layer?
>>
>> Thank you very much.
>>
>>
>> On Tue, Jan 28, 2020 at 12:36 PM Shree Devi Kumar 
>> wrote:
>>
>>> Please see https://github.com/Shreeshrii/tesstrain-ckb. It uses a
>>> modified training text based on what you sent and earlier text that I had
>>> from Pewan and other corpora.
>>>
>>> Currently the training data includes
>>> * AWN 0-9
>>> * AEN - Arabic numbers
>>> * No Persian numbers since some shapes are similar to Arabic Numbers
>>>
>>> Fonts do not include those which convert 0-9 to either Arabic or Persian
>>> numbers.
>>>
>>> The replace layer training is still ongoing. The eval results look much
>>> better than the official ara or script/Arabic, however I do not have any
>>> real world images for testing.
>>>
>>> Model                        Metric           Arial  Arial Bold  Tahoma  Tahoma Bold
>>> tessdata_fast/ara            Accuracy         62.74  63.49       61.56   61.71
>>> tessdata_fast/ara            Basic Arabic     95.68  95.22       95.76   94.10
>>> tessdata_fast/ara            Arabic Extended  0.31   1.13        0.41    1.32
>>> tessdata_fast/script/Arabic  Accuracy         80.99  80.83       83.02   77.17
>>> tessdata_fast/script/Arabic  Basic Arabic     96.68  96.34       96.05   93.87
>>> tessdata_fast/script/Arabic  Arabic Extended  57.20  58.23       63.76   54.72
>>> ckbLayer_1.661_152089_296500
>>> ckbLayer_fast                Accuracy         98.20  97.78       98.06   96.13
>>> ckbLayer_fast                Basic Arabic     99.10  99.15       98.54   98.44
>>> ckbLayer_fast                Arabic Extended  98.30  98.70       99.10   96.27
>>>
>>>
>>> On Mon, Jan 13, 2020 at 7:17 PM Ayub Rauf wrote:
>>>
>>>> Hi,
>>>> I have attached the full training text, with forbidden_characters in it.
>>>> Both number types are really in use, and I see both types of numbers
>>>> written in books, but the Kurdish institute verified that Arabic numbers
>>>> will be used from now on. Persian numbers are written by Iranian Kurds and
>>>> Arabic numbers by Iraqi Kurds. As I said, numbers in ckb should be
>>>> written in the Arabic type, but the OCR has to recognize both types.
>>>> It is just like the two types "ك" and "ک": both are written in books, but
>>>> now we only use "ک".
>>>> I think these similarities won't be a problem, because afterwards we can
>>>> correct the letters with a spell checker.
>>>> As I said before, Arial and Tahoma are the fonts most used in books.
>>>>
>>>>
>>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-ocr+unsubscr...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWm%3DXQaxBergf5-OUE-C8jB3u12dSOPUPchRZT4w21Z-g%40mail.gmail.com
>>>

Re: [tesseract-ocr] Training for Kurdish in Arabic script

2020-01-31 Thread manu pranay
Thank you so much for your help, Shree.
The links you provided were very helpful.

Now I am trying to do LSTM training by retraining the top layer.
Can you please provide the commands for retraining the top layer?

Thank you very much.


On Tue, Jan 28, 2020 at 12:36 PM Shree Devi Kumar 
wrote:

> Please see https://github.com/Shreeshrii/tesstrain-ckb. It uses a modified
> training text based on what you sent and earlier text that I had from
> Pewan and other corpora.
>
> Currently the training data includes
> * AWN 0-9
> * AEN - Arabic numbers
> * No Persian numbers since some shapes are similar to Arabic Numbers
>
> Fonts do not include those which convert 0-9 to either Arabic or Persian
> numbers.
>
> The replace layer training is still ongoing. The eval results look much
> better than the official ara or script/Arabic, however I do not have any
> real world images for testing.
>
> Model                        Metric           Arial  Arial Bold  Tahoma  Tahoma Bold
> tessdata_fast/ara            Accuracy         62.74  63.49       61.56   61.71
> tessdata_fast/ara            Basic Arabic     95.68  95.22       95.76   94.10
> tessdata_fast/ara            Arabic Extended  0.31   1.13        0.41    1.32
> tessdata_fast/script/Arabic  Accuracy         80.99  80.83       83.02   77.17
> tessdata_fast/script/Arabic  Basic Arabic     96.68  96.34       96.05   93.87
> tessdata_fast/script/Arabic  Arabic Extended  57.20  58.23       63.76   54.72
> ckbLayer_1.661_152089_296500
> ckbLayer_fast                Accuracy         98.20  97.78       98.06   96.13
> ckbLayer_fast                Basic Arabic     99.10  99.15       98.54   98.44
> ckbLayer_fast                Arabic Extended  98.30  98.70       99.10   96.27
>
>
> On Mon, Jan 13, 2020 at 7:17 PM Ayub Rauf wrote:
>
>> Hi,
>> I have attached the full training text, with forbidden_characters in it.
>> Both number types are really in use, and I see both types of numbers
>> written in books, but the Kurdish institute verified that Arabic numbers
>> will be used from now on. Persian numbers are written by Iranian Kurds and
>> Arabic numbers by Iraqi Kurds. As I said, numbers in ckb should be
>> written in the Arabic type, but the OCR has to recognize both types.
>> It is just like the two types "ك" and "ک": both are written in books, but
>> now we only use "ک".
>> I think these similarities won't be a problem, because afterwards we can
>> correct the letters with a spell checker.
>> As I said before, Arial and Tahoma are the fonts most used in books.
>>
>>
>
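
Ayub's suggestion above, recognizing both variants in the OCR and correcting them afterwards, could be sketched as a post-OCR normalization pass. This is only a sketch under stated assumptions: that "Arabic numbers" means the Arabic-Indic digits U+0660 to U+0669, that "Persian numbers" means the Extended Arabic-Indic digits U+06F0 to U+06F9, and that "ك" (U+0643) should always become "ک" (U+06A9) in ckb text. The names CKB_NORMALIZATION and normalize_ckb are hypothetical.

```python
# Translation table for post-OCR cleanup of Central Kurdish (ckb) text.
# Assumption: Arabic kaf (U+0643) is always replaced by keheh (U+06A9).
CKB_NORMALIZATION = {0x0643: 0x06A9}

# Assumption: Persian digits (U+06F0..U+06F9) are mapped to the
# Arabic-Indic digits (U+0660..U+0669) with the same numeric value.
CKB_NORMALIZATION.update({0x06F0 + d: 0x0660 + d for d in range(10)})


def normalize_ckb(text: str) -> str:
    """Return text with the kaf and digit variants unified; other characters are unchanged."""
    return text.translate(CKB_NORMALIZATION)


if __name__ == "__main__":
    sample = "\u0643\u06F1\u06F2"  # kaf followed by Persian digits 1 and 2
    print(normalize_ckb(sample))
```

A fixed character mapping like this only covers the one-to-one cases mentioned in the thread; anything that depends on context would still need the spell checker.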



[tesseract-ocr] how to train arabic language using tesseract 4.

2020-01-27 Thread manu pranay
Hello,

Can anyone help me with the steps
to train the Arabic language using Tesseract 4?
If I have generated the tif file as well as the box file, what is next? Please help me.

Thank you.



Re: [tesseract-ocr] Re: How to make training for Arabic in Tesseract 4.0

2020-01-27 Thread manu pranay
Shree,
Can you please help me perform Arabic training with Tesseract 4?

Thank you.


On Thursday, May 4, 2017 at 3:22:42 PM UTC+5:30, shree wrote:
>
> Ibr,
>
> You are incorrect in your description of LSTM training.
>
> What you are doing will use the ara.traineddata provided in the repo, 
> there will be no change in output.
>
> Once lstmf files are created, you have to run lstmtraining which will run 
> for days/weeks  to give you a good result.
>
> Please read about LSTM training on wiki.
>
> On May 4, 2017 2:58 PM, "Ibr" wrote:
>
>> If you are referring to Tesseract 4.00alpha with Leptonica 1.74.1, and if 
>> you compiled them correctly and got the binaries that you need for 
>> training lstmf files, then I recommend following the suggestions made 
>> by the Tesseract devs: once you create an .lstmf file for a 
>> certain font (one that can be used for Arabic writing), get the official 
>> ara.traineddata file from GitHub, paste it in the tessdata folder, put the lstmf 
>> file in the tesseract folder, and run the command  tesseract text_image 
>> result_text -l ara --oem 1 
>> Which Arabic characters exactly are you trying to improve the accuracy 
>> for?
>>
>> On Saturday, April 8, 2017 at 11:52:25 AM UTC+3, Ahmad Moawad wrote:
>>
>>> Hello All,
>>>
>>>
>>> I want to train Tesseract 4.0 for the Arabic language. The 
>>> results of this version are great but still need some tuning, so I got 
>>> jTessBoxEditor 2.0 beta.
>>> I tried to correct the wrong characters and build ara.traineddata. 
>>> After copying the ara.traineddata to 
>>> /usr/share/tesseract-ocr/4.00/tessdata, I got random characters when I ran 
>>> tesseract on the image.
>>> Any suggestions on how to do training for version 4.0? I already know 
>>> that the cube engine of the last version, 3.0x, is not included in the 4.0 
>>> LSTM. Or should I wait until Ray makes another updated ara.traineddata?
>>>
>>> Thanks.
>>>
>>
>



[tesseract-ocr] Re: speed of tesseract OCR

2017-03-15 Thread Manu
Probably this is what you are looking for.

https://groups.google.com/forum/#!topic/tesseract-dev/LErriuT-sck

On Thursday, 9 March 2017 09:06:20 UTC+1, vngo...@mail.ru wrote:
>
> Hi all!
>
> I would be very interested to know whether anyone has run tests with 
> different CPUs, memory, etc.
> Is it realistic to make tesseract recognize a difficult image (small font, 
> tables, dirty image) in less than 1 second?
> If so, what do I need to use for the best result?
>
> Thank you.
>
>
>
