Another lesson I learned today: starting with a small number of iterations 
and slowly increasing it is a bad idea. We should all train by epochs instead. 
Every interruption causes tesseract to restart from the beginning of the 
training data, so the data that appears in the later parts might never be 
used for training. 

Look at what Stefan said here: 
https://github.com/tesseract-ocr/tesseract/issues/3954 
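
To make the iterations-versus-epochs point concrete, here is a rough sketch of the arithmetic in Python. It assumes that one iteration consumes one training line, so that one epoch is one full pass over all training lines. The function names and the example numbers are hypothetical and not taken from tesseract itself.

```python
def iterations_for_epochs(num_lines: int, epochs: int) -> int:
    """Iterations needed to pass over every training line `epochs` times.

    Assumption: one iteration processes exactly one training line.
    """
    return num_lines * epochs


def epochs_covered(num_lines: int, max_iterations: int) -> float:
    """Fraction of the data (in epochs) seen before training stops."""
    return max_iterations / num_lines


# Hypothetical example: 50,000 training lines.
# A run capped at 10,000 iterations sees only the first fifth of the data,
# and if every restart begins again at line 1, the rest is never reached.
print(epochs_covered(50_000, 10_000))
print(iterations_for_epochs(50_000, 3))
```

If the assumption holds, a small iteration cap that is raised step by step keeps replaying the same early lines, which is the problem described above.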


On Sunday, October 29, 2023 at 3:18:32 PM UTC+3 Des Bw wrote:

>
> *BCER is a lie:*
>
> (B)CER is an unrealistic measure of accuracy. *It is a lie.* I have said it 
> a couple of times already. The BCER we get during training is nowhere 
> close to the real accuracy of our model. On many occasions my training 
> reached a 0 error rate and stopped. But when I tested the output using 
> independent evaluation tools, the best I could get was 95-97% accuracy on 
> the synthetic data and 90-92% accuracy on actual scanned documents (data). 
> - So, we need to find a way to turn off the target_error_rate parameter, 
> which stops the training when the model thinks it has achieved 0% error. 
> Maybe we can assign a negative value to it. I am going to try that and see 
> whether it turns it off. 
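
For anyone who wants to check accuracy independently of the training log, here is a minimal sketch of a character error rate calculation in Python. It uses plain Levenshtein edit distance divided by the length of the ground truth. This is an assumption about what a reasonable independent measure looks like, not a description of how tesseract computes BCER, and the function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of OCR output against the ground truth."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical ground truth and OCR output for one line.
print(cer("tesseract", "tesserect"))
```

Running this over held-out scanned lines (not the synthetic training lines) gives a number that can be compared with the error rate reported during training.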
> On Tuesday, October 24, 2023 at 3:55:54 PM UTC+3 Des Bw wrote:
>
>> You can add *training >> data/lang.log &* to the end of your training 
>> script (shell) to get a log saved inside your data folder. You can also add 
>> *DEBUG_INTERVAL=-1 training >> data/lang.log &*. This one prints more 
>> detailed information on the console and saves a short log inside the data 
>> folder. If you want everything displayed in the console saved to a log 
>> file, you can check out the methods listed here: 
>>
>> https://unix.stackexchange.com/questions/200637/save-all-the-terminal-output-to-a-file
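
As one more way to keep everything that appears on the console, here is a small Python sketch that runs a command, echoes its output, and appends the same lines to a log file. The helper name `run_and_log` is hypothetical, and the command and log path you pass in are up to you; nothing here is specific to tesseract.

```python
import subprocess
import sys
from typing import Sequence


def run_and_log(cmd: Sequence[str], log_path: str) -> int:
    """Run `cmd`, show its output on the console, and append it to `log_path`.

    stderr is merged into stdout so error messages end up in the log too.
    Returns the exit code of the command.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        proc = subprocess.Popen(
            list(cmd),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        assert proc.stdout is not None
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on the console
            log.write(line)         # and saved for later
        return proc.wait()
```

Because stderr is captured as well, a script that exits after one second without a visible message should still leave its exit code and any error text behind.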
>>
>>
>> On Tuesday, October 24, 2023 at 3:45:23 PM UTC+3 [email protected] 
>> wrote:
>>
>>> I made a first attempt at fine-tuning; the script ran for a second and 
>>> ended without any error message. Where can I find a log file? 
>>>
>>> On Mon, Oct 23, 2023 at 14:01, Keith Smith <[email protected]> wrote:
>>>
>>>> Rene, the name “foo” is simply an example (or fictitious) font or 
>>>> language name.  When training a new language or font, you should replace 
>>>> “foo” with the name of your language or font.  The standard is to choose 3 
>>>> letters, but that is not required.  In fact, I have been training a font 
>>>> named “micr_e13b” and it is working technically for me (though the 
>>>> accuracy isn’t good enough yet).  Note the underscore character between 
>>>> sections of the name.
>>>>
>>>>  
>>>>
>>>> *From: *[email protected] <[email protected]> on 
>>>> behalf of René JM Clais <[email protected]>
>>>> *Date: *Sunday, October 22, 2023 at 12:41 PM
>>>> *To: *[email protected] <[email protected]>
>>>> *Subject: *[EXTERNAL] Re: [tesseract-ocr] Lessons, best practices, 
>>>> recommendations, strategies, hacks
>>>>
>>>>  
>>>>
>>>> Hi Keith,
>>>>
>>>> The foo.traineddata file does not exist, but do you mean the traineddata 
>>>> I want to train, e.g. hye.traineddata?
>>>>
>>>> In my case I need to add a new character to hye.traineddata.
>>>>
>>>> It seems that I can do this using option 2!
>>>>
>>>> But how? Which command should I use to run it, and what does this 
>>>> process involve?
>>>>
>>>>  
>>>>
>>>> Thank you for your help
>>>>
>>>> Regards
>>>>
>>>> René
>>>>
>>>>  
>>>>
>>>> On Sat, Oct 21, 2023 at 17:18, Keith Smith <[email protected]> wrote:
>>>>
>>>> Thank you Des for your help in this community.  It is greatly 
>>>> appreciated!
>>>>
>>>> As one who is struggling, may I make a suggestion.
>>>>
>>>> I have started a google doc here 
>>>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing> 
>>>> with a suggested format for a tutorial which would be very helpful to me 
>>>> and I think to others. It is editable by anyone with the link.
>>>>
>>>> I'm glad to put in any work myself, but my guess is that there are 
>>>> things in the doc that could be filled without much effort by you or 
>>>> others.
>>>>
>>>> If this is true, once the doc is filled out, the contents of the google 
>>>> doc could be submitted as a PR to the tesstrain repo.
>>>>
>>>> Again, just a suggestion that I hope would be helpful to all.
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Keith
>>>>
>>>>  
>>>>
>>>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>>>
>>>> There is no exhaustive user manual for training tesseract. We all start 
>>>> in the dark and accumulate bits of information from different places to 
>>>> learn the ins and outs of tesseract. 
>>>>
>>>> It would be great if we could collectively write a better manual. Until 
>>>> then, we can collect here the observations, best practices, hacks, and 
>>>> lessons we have accumulated in our adventures with tesseract. 
>>>>
>>>>  
>>>>
>>>> I will start with some of my observations. I collected them by reading 
>>>> between the lines and from my own failed experiments: 
>>>>
>>>> 1. Training from scratch is very difficult because tesseract requires an 
>>>> extensive data set. It looks like it requires over 300,000 text lines 
>>>> (a text file of around 26 MB).
>>>>
>>>> https://github.com/tesseract-ocr/tesseract/issues/3909 
>>>>
>>>>  
>>>>
>>>> Multiply that by the number of fonts you want to train, and the data 
>>>> grows very quickly. That requires very powerful computers running for 
>>>> weeks or months. 
>>>>
>>>> So, for regular users, training from a network layer or fine-tuning 
>>>> are the most plausible options. 
>>>>
>>>>  
>>>>
>>>> 2. Best practice: keep your text lines from getting too long. The 
>>>> recommended number of words in a line is 10-12, again according to the 
>>>> above link. 
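
The line-length advice in point 2 can be applied mechanically. Here is a minimal Python sketch that re-splits a block of training text into lines of at most 12 words. The function name and the default of 12 are my own choices for illustration, and the naive chunking can leave a short last line.

```python
from typing import List


def split_into_short_lines(text: str, max_words: int = 12) -> List[str]:
    """Re-split `text` into lines holding at most `max_words` words each.

    Whitespace (including existing line breaks) is treated as a word
    separator, so long source lines are simply re-wrapped.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Hypothetical input: one overly long line of 25 words.
long_line = " ".join(f"word{i}" for i in range(25))
for line in split_into_short_lines(long_line):
    print(len(line.split()), line)
```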
>>>>
>>>>  
>>>>
>>>> ( ...to be continued)
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com.
>>>>
>>>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/19383c97-68fd-48cb-a7a8-25cfc296c660n%40googlegroups.com.
