Today's lesson: it is possible to disable TARGET_ERROR_RATE. 
If you find your training stopping prematurely because it is hitting the 
target error rate, you can disable it and train by epochs (or iterations) 
only. 
https://github.com/tesseract-ocr/tesseract/issues/4157
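
To see why the target can cut a run short, here is a minimal Python sketch of a generic two-condition stopping rule. It is not tesseract's code: the function name `run_training`, the made-up error curve, and the use of a negative target to switch the check off are illustrative assumptions (the negative value is the idea floated earlier in this thread; the linked issue has the actual details).

```python
def run_training(error_curve, target_error_rate, max_iterations):
    """Return (iterations_run, best_error) for a toy training loop.

    error_curve: per-iteration error rates (hypothetical numbers).
    The loop stops when the best error so far is no longer above the
    target, or when max_iterations is reached, whichever comes first.
    """
    best_error = float("inf")
    iterations = 0
    for error in error_curve:
        if iterations >= max_iterations:
            break
        iterations += 1
        best_error = min(best_error, error)
        if best_error <= target_error_rate:
            break  # target reached: training ends early
    return iterations, best_error


# Hypothetical error curve that reaches 0.0 at the third step.
curve = [0.20, 0.05, 0.0, 0.0, 0.0, 0.0]

# With a positive target the loop ends as soon as the target is met.
early = run_training(curve, target_error_rate=0.01, max_iterations=6)
# A target that can never be met leaves only the iteration limit.
full = run_training(curve, target_error_rate=-1.0, max_iterations=6)
print(early)
print(full)
```

In this toy model an error rate can never drop below zero, so a negative target leaves the iteration (or epoch) limit as the only stopping condition.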

On Monday, October 30, 2023 at 5:46:10 PM UTC+3 Des Bw wrote:

> Another lesson I learned today: starting from a small number of 
> iterations and slowly increasing it is bad. We should all train by epochs. 
> Every interruption causes tesseract to restart from the beginning of the 
> data, so the data that appears in the later parts might never be used for 
> training. 
>
> Look at what Stefan said here: 
> https://github.com/tesseract-ocr/tesseract/issues/3954 
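
A rough back-of-the-envelope sketch of the point above, in Python. It assumes one training line is consumed per iteration and that every restart begins again at the first line, as described in this message; the line counts and the helper names are made up for illustration.

```python
def distinct_lines_seen(num_lines, run_lengths):
    """Distinct training lines visited when every run starts at line 0.

    run_lengths: iterations performed by each separate (interrupted) run.
    Because each run restarts from the top, only the longest run decides
    how far into the data the training ever gets.
    """
    return min(num_lines, max(run_lengths))


def iterations_for_epochs(num_lines, epochs):
    """Iterations needed to pass over every line `epochs` times."""
    return num_lines * epochs


num_lines = 100_000                 # hypothetical size of the training set
runs = [10_000, 20_000, 30_000]     # slowly increasing iteration limits

seen = distinct_lines_seen(num_lines, runs)
print(f"{sum(runs)} iterations spent, {seen} of {num_lines} lines ever seen")
print(f"3 full epochs need {iterations_for_epochs(num_lines, 3)} iterations")
```

Under these assumptions the three short runs cost 60,000 iterations yet never reach the last 70,000 lines, which is why specifying epochs is the safer way to make sure all of the data is used.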
>
>
> On Sunday, October 29, 2023 at 3:18:32 PM UTC+3 Des Bw wrote:
>
>>
>> *BCER is a lie:*
>>
>> (B)CER is an unrealistic measure of accuracy. *It is a lie.* I have said 
>> it a couple of times already. The BCER we get during training is 
>> nowhere close to the real accuracy of our model. On many occasions my 
>> training reached a 0 error rate and stopped. But when I tested the output 
>> using independent evaluation tools, the best I could get was 95-97% 
>> accuracy on the synthetic data and 90-92% accuracy on actual scanned 
>> documents. 
>> - So, we need to find a way to turn off the target_error_rate parameter, 
>> which stops the training when the model thinks it has achieved 0% error. 
>> Maybe we can assign a negative value to it. I am going to try that to see 
>> whether it turns it off. 
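
For an independent check like the one described above, character accuracy can be computed outside the training loop by comparing OCR output with ground truth. A minimal Python sketch, assuming accuracy is defined as 1 minus the character error rate (edit distance divided by ground-truth length); the sample strings are invented, and real evaluation tools may normalize text differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """Edit distance normalized by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


truth = "tesseract training"    # invented ground-truth line
output = "tesseract trainlng"   # invented OCR result with one wrong character
cer = character_error_rate(truth, output)
print(f"CER = {cer:.3f}, accuracy = {1 - cer:.1%}")
```

Running this over held-out scanned lines, rather than over the training data, gives a number that can be compared with the BCER reported during training.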
>> On Tuesday, October 24, 2023 at 3:55:54 PM UTC+3 Des Bw wrote:
>>
>>> You can add *training >> data/lang.log &* to the end of your training 
>>> script (shell) to get a log saved inside your data folder. You can also 
>>> add *DEBUG_INTERVAL=-1 training >> data/lang.log &*. This one prints more 
>>> detailed information on the console and saves a short log inside the 
>>> data folder. If you want to save everything displayed in the console to a 
>>> log file, you can check out the methods listed here: 
>>>
>>> https://unix.stackexchange.com/questions/200637/save-all-the-terminal-output-to-a-file
>>>
>>>
>>> On Tuesday, October 24, 2023 at 3:45:23 PM UTC+3 renec...@gmail.com 
>>> wrote:
>>>
>>>> I made a first attempt at fine-tuning; the script ran for a second and 
>>>> ended without any error message. Where can I find a log file? 
>>>>
>>>> On Mon, Oct 23, 2023 at 14:01, Keith Smith <keith...@discover.com> 
>>>> wrote:
>>>>
>>>>> Rene, the name “foo” is simply an example (or fictitious) font or 
>>>>> language name.  When training a new language or font, you should replace 
>>>>> “foo” with the name of your language or font.  The standard is to choose 
>>>>> 3 letters, but that is not required.  In fact, I have been training a 
>>>>> font named “micr_e13b” and it is working technically for me (though the 
>>>>> accuracy isn’t good enough yet).  Note the underscore character between 
>>>>> sections of the name.
>>>>>
>>>>> *From: *tesser...@googlegroups.com <tesser...@googlegroups.com> on 
>>>>> behalf of René JM Clais <renec...@gmail.com>
>>>>> *Date: *Sunday, October 22, 2023 at 12:41 PM
>>>>> *To: *tesser...@googlegroups.com <tesser...@googlegroups.com>
>>>>> *Subject: *[EXTERNAL] Re: [tesseract-ocr] Lessons, best practices, 
>>>>> recommendations, strategies, hacks
>>>>>
>>>>>
>>>>> Hi Keith,
>>>>>
>>>>> foo.traineddata does not exist, but do you mean the traineddata I 
>>>>> want to train, e.g. hye.traineddata?
>>>>>
>>>>> In my case I need to add a new character to hye.traineddata.
>>>>>
>>>>> It seems that I can do this using option 2!
>>>>>
>>>>> But how? Which command should I use to run it, and what does this 
>>>>> process mean?
>>>>>
>>>>>  
>>>>>
>>>>> Thank you for your help
>>>>>
>>>>> Regards
>>>>>
>>>>> René
>>>>>
>>>>>  
>>>>>
>>>>> On Sat, Oct 21, 2023 at 17:18, Keith Smith <keiths...@gmail.com> 
>>>>> wrote:
>>>>>
>>>>> Thank you Des for your help in this community.  It is greatly 
>>>>> appreciated!
>>>>>
>>>>> As one who is struggling, may I make a suggestion.
>>>>>
>>>>> I have started a google doc here 
>>>>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>>>>> with a suggested format for a tutorial which would be very helpful to me 
>>>>> and I think to others. It is editable by anyone with the link.
>>>>>
>>>>> I'm glad to put in any work myself, but my guess is that there are 
>>>>> things in the doc that could be filled without much effort by you or 
>>>>> others.
>>>>>
>>>>> If this is true, once the doc is filled out, the contents of the 
>>>>> google doc could be submitted as a PR to the tesstrain repo.
>>>>>
>>>>> Again, just a suggestion that I hope would be helpful to all.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Keith
>>>>>
>>>>>  
>>>>>
>>>>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <desal...@gmail.com> wrote:
>>>>>
>>>>> There is no exhaustive user manual for training tesseract. We all 
>>>>> start in the dark and accumulate bits of information from different 
>>>>> places to learn the ins and outs of tesseract. 
>>>>>
>>>>>  
>>>>>
>>>>> It would be great if we could collectively write a better manual. 
>>>>> Until then, we can collect here the observations, best practices, hacks, 
>>>>> and lessons we have accumulated in our adventures with tesseract. 
>>>>>
>>>>>  
>>>>>
>>>>> I will start with some of my observations. I collected them by reading 
>>>>> between the lines and from my own failed experiments: 
>>>>>
>>>>> 1. Training from scratch is very difficult because tesseract requires 
>>>>> an extensive data set. It looks like it requires over 300,000 text lines 
>>>>> (a text file of around 26 MB).
>>>>>
>>>>> https://github.com/tesseract-ocr/tesseract/issues/3909 
>>>>>
>>>>>  
>>>>>
>>>>> Multiply that by the number of fonts you want to train, and the data 
>>>>> grows many times over. That requires very powerful computers running 
>>>>> for weeks or months. 
>>>>>
>>>>> So, for regular users, training from a network layer or fine-tuning 
>>>>> is the most plausible option. 
>>>>>
>>>>>  
>>>>>
>>>>> 2. Best practice: keep your text lines from getting too long. The 
>>>>> recommended number of words in a line is 10-12. Again, this is from the 
>>>>> above link. 
>>>>>
>>>>>  
>>>>>
>>>>> ( ...to be continued)
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com
>>>>>  
>>>>> <https://urldefense.com/v3/__https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n*40googlegroups.com?utm_medium=email&utm_source=footer__;JQ!!MjXRb4uW6x5k!HFOAD-quUbb2dHADKsKiyk_BK3xW49ZAh87HZ3mPU9myi2Zk2t-bdP3ptvhcsV64KhX43EgYbPFZJ5M8KHJKCVc$>
>>>>> .
>>>>>
>>>>>
>>>>

