Today's lesson: it is possible to disable TARGET_ERROR_RATE. If your training stops prematurely because it reaches the target error rate, you can disable it and train by epochs (iterations) only.
https://github.com/tesseract-ocr/tesseract/issues/4157
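For the logging tip quoted further down (appending `>> data/lang.log &` to the training command), here is a minimal sketch of the redirection pattern. The `train_cmd` function is a hypothetical stand-in for your real training command, and the `2>&1` part is my addition (not in the original tip) so that error messages also land in the log; adjust the path to your own data folder.

```shell
#!/bin/sh
# Hypothetical stand-in for the real training command; replace the body
# with your own invocation (for example, your tesstrain "training" call).
train_cmd() {
    echo "iteration 100: training output on stdout"
    echo "warning: something printed on stderr" >&2
}

LOG_FILE="data/lang.log"   # assumed location, as in the quoted tip
mkdir -p data

# ">>" appends to the log instead of overwriting it.
# "2>&1" sends stderr to the same file as stdout (an addition to the tip).
# "&" runs the command in the background so the shell stays usable.
train_cmd >> "$LOG_FILE" 2>&1 &
TRAIN_PID=$!

# Optional: wait for the background job to finish before continuing.
wait "$TRAIN_PID"
```

While a long training run is in progress, `tail -f data/lang.log` in another terminal lets you follow the log as it grows.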
On Monday, October 30, 2023 at 5:46:10 PM UTC+3 Des Bw wrote:

> Another lesson I learned today: starting with a small number of
> iterations and slowly increasing it is a bad idea. We should all train by
> epochs. Every interruption causes tesseract to restart from the beginning.
> Basically, the data in the later parts might not be used for training.
>
> Look at what Stefan said here:
> https://github.com/tesseract-ocr/tesseract/issues/3954
>
> On Sunday, October 29, 2023 at 3:18:32 PM UTC+3 Des Bw wrote:
>
>> *BCER is a lie:*
>>
>> (B)CER is an unrealistic measure of accuracy. *It is a lie.* I have said
>> it a couple of times already. The BCER we get during training is nowhere
>> close to the real accuracy of our model. On many occasions my training
>> reached a 0 error rate and stopped. But when I tested the output with
>> independent evaluation tools, the best I could get was 95-97% accuracy on
>> the synthetic data and 90-92% accuracy on actual scanned documents.
>>
>> - So, we need to find a way to turn off the target error rate parameter
>>   (TARGET_ERROR_RATE), which stops the training when the model thinks it
>>   has achieved 0% error. Maybe we can assign a negative value to it. I am
>>   going to try that and see whether it turns it off.
>>
>> On Tuesday, October 24, 2023 at 3:55:54 PM UTC+3 Des Bw wrote:
>>
>>> You can add *training >> data/lang.log &* to the end of your training
>>> script (shell) to get a log saved inside your data folder. You can also
>>> add *DEBUG_INTERVAL=-1 training >> data/lang.log &*. This one prints
>>> more detailed information on the console and saves a short log inside
>>> the data folder.
>>> If you want everything displayed in the console saved to a log file,
>>> you can check out the methods listed here:
>>>
>>> https://unix.stackexchange.com/questions/200637/save-all-the-terminal-output-to-a-file
>>>
>>> On Tuesday, October 24, 2023 at 3:45:23 PM UTC+3 renec...@gmail.com wrote:
>>>
>>>> I made a first attempt at fine-tuning; the script ran for a second and
>>>> ended without any error message. Where can I find a log file?
>>>>
>>>> On Mon, Oct 23, 2023 at 14:01, Keith Smith <keith...@discover.com> wrote:
>>>>
>>>>> Rene, the name “foo” is simply an example (or fictitious) font or
>>>>> language name. When training a new language or font, you should
>>>>> replace “foo” with the name of your language or font. The standard is
>>>>> to choose 3 letters, but that is not required. In fact, I have been
>>>>> training a font named “micr_e13b” and it technically works for me
>>>>> (though the accuracy isn’t good enough yet). Note the underscore
>>>>> character between sections of the name.
>>>>>
>>>>> *From:* tesser...@googlegroups.com <tesser...@googlegroups.com> on
>>>>> behalf of René JM Clais <renec...@gmail.com>
>>>>> *Date:* Sunday, October 22, 2023 at 12:41 PM
>>>>> *To:* tesser...@googlegroups.com <tesser...@googlegroups.com>
>>>>> *Subject:* [EXTERNAL] Re: [tesseract-ocr] Lessons, best practices,
>>>>> recommendations, strategies, hacks
>>>>>
>>>>> Hi Keith,
>>>>>
>>>>> foo.traineddata does not exist, but do you mean the traineddata I
>>>>> want to train, e.g. hye.traineddata?
>>>>>
>>>>> In my case, I need to add a new character to hye.traineddata.
>>>>>
>>>>> It seems that I can do this using option 2!
>>>>>
>>>>> But how? Which command should I use to run this function, and what
>>>>> does this process mean?
>>>>>
>>>>> Thank you for your help
>>>>>
>>>>> Regards
>>>>>
>>>>> René
>>>>>
>>>>> On Sat, Oct 21, 2023 at 17:18, Keith Smith <keiths...@gmail.com> wrote:
>>>>>
>>>>> Thank you Des for your help in this community. It is greatly
>>>>> appreciated!
>>>>>
>>>>> As one who is struggling, may I make a suggestion.
>>>>>
>>>>> I have started a Google Doc here:
>>>>> https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing
>>>>> It has a suggested format for a tutorial, which would be very helpful
>>>>> to me and, I think, to others. It is editable by anyone with the link.
>>>>>
>>>>> I'm glad to put in any work myself, but my guess is that there are
>>>>> things in the doc that could be filled in without much effort by you
>>>>> or others.
>>>>>
>>>>> If this is true, once the doc is filled out, its contents could be
>>>>> submitted as a PR to the tesstrain repo.
>>>>>
>>>>> Again, just a suggestion that I hope would be helpful to all.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Keith
>>>>>
>>>>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <desal...@gmail.com> wrote:
>>>>>
>>>>> There is no exhaustive user manual for training tesseract. We all
>>>>> start in the dark and accumulate bits of information from different
>>>>> places to learn the ins and outs of tesseract.
>>>>>
>>>>> It would be great if we could collectively write a better manual.
>>>>> Until then, we can drop/collect here the observations, best
>>>>> practices, hacks and lessons we have accumulated in our adventures
>>>>> with tesseract.
>>>>>
>>>>> I will start with some of my observations. I collected them by
>>>>> reading between the lines and from my own failed experiments:
>>>>>
>>>>> 1. Training from scratch is very difficult because tesseract requires
>>>>>    an extensive data set. It looks like it requires over 300,000 text
>>>>>    lines (a text file of around 26 MB).
>>>>>
>>>>>    https://github.com/tesseract-ocr/tesseract/issues/3909
>>>>>
>>>>>    Multiply that by the number of fonts you want to train, and the
>>>>>    data set grows in proportion. That requires very powerful computers
>>>>>    running for weeks or months.
>>>>>
>>>>>    So, for regular users, training from a network layer or
>>>>>    fine-tuning are the most plausible options.
>>>>>
>>>>> 2. Best practice: do not make your text lines too long. The
>>>>>    recommended number of words in a line is 10-12. Again, from the
>>>>>    above link.
>>>>>
>>>>> ( ...to be continued)
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/6544f165-efb3-43ca-aaac-cff7ef443003n%40googlegroups.com.