Re: [tesseract-ocr] Tesseract Performance

Soumik Ranjan Dasgupta Thu, 07 Jan 2021 02:15:12 -0800

Hi Shreeshrii,

I ran your command exactly as given (after making sure the tessdata_best 
directory is present in $HOME with the best ben.traineddata) and ran into 
a strange error.
Here is the log:


find data/ben-ground-truth -name '*.gt.txt' | xargs cat | sort | uniq > "data/ben/all-gt"
combine_tessdata -u /root/tessdata_best/ben.traineddata  data/ben/ben
Version string:4.00.00alpha:ben:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx64Lrx64Lfx512O1c1]
0:config:size=377, offset=192
17:lstm:size=10605707, offset=569
18:lstm-punc-dawg:size=3154, offset=10606276
19:lstm-word-dawg:size=427618, offset=10609430
20:lstm-number-dawg:size=426, offset=11037048
21:lstm-unicharset:size=6866, offset=11037474
22:lstm-recoder:size=1003, offset=11044340
23:version:size=80, offset=11045343
Extracting tessdata components from /root/tessdata_best/ben.traineddata
Wrote data/ben/ben.config
Wrote data/ben/ben.lstm
Wrote data/ben/ben.lstm-punc-dawg
Wrote data/ben/ben.lstm-word-dawg
Wrote data/ben/ben.lstm-number-dawg
Wrote data/ben/ben.lstm-unicharset
Wrote data/ben/ben.lstm-recoder
Wrote data/ben/ben.version
unicharset_extractor --output_unicharset "data/ben/my.unicharset" --norm_mode 2 "data/ben/all-gt"
Bad box coordinates in boxfile string!  কি জানি কেন প্রদ্যুম্নের বার বার 
মনে আসছিল সেই জীর্ণ পরিচ্ছদপরা 
Extracting unicharset from plain text file data/ben/all-gt
Wrote unicharset file data/ben/my.unicharset
merge_unicharsets data/ben/ben.lstm-unicharset data/ben/my.unicharset  "data/ben/unicharset"
Loaded unicharset of size 111 from file data/ben/ben.lstm-unicharset
Loaded unicharset of size 76 from file data/ben/my.unicharset
Wrote unicharset file data/ben/unicharset.
PYTHONIOENCODING=utf-8 python3 generate_wordstr_box.py -i "data/ben-ground-truth/24-022.tif" -t "data/ben-ground-truth/24-022.gt.txt" > "data/ben-ground-truth/24-022.box"
Traceback (most recent call last):
  File "generate_wordstr_box.py", line 7, in <module>
    import bidi.algorithm
ModuleNotFoundError: No module named 'bidi'
Makefile:207: recipe for target 'data/ben-ground-truth/24-022.box' failed
make: *** [data/ben-ground-truth/24-022.box] Error 1

I should mention I double checked the 24-022.gt.txt and 24-022.tif files 
and both of them are valid. Any reason why this might be happening? How can 
I fix this?
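
For reference, the traceback in the log reports a missing Python module (`bidi`), not a problem with the .tif or .gt.txt files. The following is a minimal sketch for checking whether the modules a script imports are available before rerunning make. The helper name `missing_modules` is hypothetical, and the suggestion that `bidi` comes from the PyPI package `python-bidi` is an assumption to verify against the tesstrain requirements.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "bidi" is the module named in the traceback above.
    missing = missing_modules(["bidi"])
    if missing:
        print("Missing modules:", ", ".join(missing))
        # Assumption: the bidi module is provided by the python-bidi package.
        print("Possible fix: python3 -m pip install python-bidi")
    else:
        print("bidi is importable; rerun the make command.")
```

Run the check with the same `python3` interpreter that the Makefile invokes, since a package installed for a different interpreter will still show up as missing.
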
On Saturday, January 2, 2021 at 11:01:27 AM UTC+5:30 shree wrote:

> Soumik,
>
> I have uploaded the bash scripts and the generated reports and graphs to 
> the `ben` branch in my fork of the tesstrain repo. See
>
> https://github.com/Shreeshrii/tesstrain/tree/ben
> and
>
> https://github.com/Shreeshrii/tesstrain/commit/a6474ef2dbbac47803d13b6f92fdcf8c9dc3107b
>
> Results for the validation data (not seen by lstmtraining for either 
> training or eval) show an improvement over both ben and script/Bengali.
>
> To improve results further, check groundtruth transcription for any 
> missing words, normalize the text and try with some more training data.
>
>
> On Fri, Jan 1, 2021 at 6:41 PM Shree Devi Kumar <[email protected]> 
> wrote:
>
>>
>> nohup make MODEL_NAME=ben START_MODEL=ben LANG_TYPE=Indic GROUND_TRUTH_DIR=data/ben-ground-truth TESSDATA=$HOME/tessdata_best DEBUG_INTERVAL=-1 training MAX_ITERATIONS=50000 >> data/ben.log &
>>
>> Graphs are created using the training log file as well as the validation 
>> log files. Some of these require PRs that have not yet been merged into 
>> the tesstrain repo.
>>
>> See
>> https://github.com/tesseract-ocr/tesstrain/pulls
>>
>> For Evaluation reports, I used 
>> https://github.com/eddieantonio/ocreval
>>
>>
>>
>> On Fri, Jan 1, 2021 at 12:09 PM Soumik Ranjan Dasgupta <
>> [email protected]> wrote:
>>
>>> Hi Shreeshrii,
>>>
>>> Can you please tell me the training command you used? Also, how can I 
>>> create the graphs and the other documents?
>>>
>>> On Sat, 26 Dec 2020, 18:37 Shree Devi Kumar, <[email protected]> wrote:
>>>
>>>> Soumik,
>>>>
>>>> I used your groundtruth and trained using ben as the START_MODEL. I 
>>>> got the best results on the validation set of images at around 5000 
>>>> iterations. 
>>>> See the attached accuracy report and CER graph.
>>>>
>>>>
>>>>
>>>> On Thu, Dec 24, 2020 at 8:36 PM Soumik Ranjan Dasgupta <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>> I wanted to fine-tune the ben.traineddata model using some 
>>>>> ancient texts that were supposedly typeset. I have roughly 1k lines 
>>>>> of text and tried the normal fine-tuning approach with around 25k 
>>>>> iterations. 
>>>>> What surprised me most was that even after packing the traineddata 
>>>>> (character error was around 4%) and testing on an unseen image, the 
>>>>> performance was exactly the same. Not a single character was 
>>>>> different!
>>>>> You can find the traineddata, training data, the logs and the source 
>>>>> code at this link:
>>>>> https://github.com/srdg/unarchived_ben_tess/releases/tag/v0.0.4-alpha
>>>>>
>>>>> Can anyone tell me exactly what I am doing wrong here? Do I need to 
>>>>> change any training parameters, increase my training data, or do 
>>>>> something else entirely?
>>>>>
>>>>> Best regards,
>>>>> Soumik
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1fc044d1-b0ae-45d5-9041-e6fbf8ec5089n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>
>>
>
>
