I have updated the guide to explain how to train by cutting off the top layer. You can check it out; I hope it is helpful.
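
On the best practice quoted below (keeping training text lines to roughly 10-12 words), here is a minimal Python sketch of how a text could be re-wrapped before generating line images. It is not part of tesseract or tesstrain: the `rewrap` helper and the `MAX_WORDS_PER_LINE` constant are hypothetical names I made up for illustration, and it assumes words are separated by whitespace.

```python
from typing import List

# Upper end of the 10-12 words per line suggested in the thread (assumption:
# treat 12 as a hard cap; adjust to taste).
MAX_WORDS_PER_LINE = 12


def rewrap(text: str, max_words: int = MAX_WORDS_PER_LINE) -> List[str]:
    """Split text into lines of at most max_words whitespace-separated words.

    Hypothetical helper for illustration only. It ignores sentence and
    paragraph boundaries and simply cuts after every max_words words.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    sample = (
        "There is no exhaustive user manual for training tesseract so we "
        "all start in the darkness and accumulate bits of information in "
        "different places to learn the ins and outs of it"
    )
    for line in rewrap(sample):
        print(line)
```

Each returned string can then be written out as one line of the training text. Cutting purely by word count is the simplest option; splitting at sentence boundaries first may give more natural lines.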

On Sunday, October 22, 2023 at 7:41:15 PM UTC+3 [email protected] wrote:

> Hi Keith,
> foo.traineddata does not exist, but do you mean the traineddata I
> want to train, e.g. hye.traineddata?
> In my case I need to add a new character to hye.traineddata.
> It seems that I can do this using option 2!
> But how? Which command should I use to run it, and what does this
> process mean?
>
> Thank you for your help
> Regards
> René
>
> On Sat, Oct 21, 2023 at 17:18, Keith Smith <[email protected]> wrote:
>
>> Thank you Des for your help in this community. It is greatly appreciated!
>> As one who is struggling, may I make a suggestion?
>> I have started a Google doc here
>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>> with a suggested format for a tutorial which would be very helpful to me
>> and, I think, to others. It is editable by anyone with the link.
>> I'm glad to put in any work myself, but my guess is that there are things
>> in the doc that could be filled in without much effort by you or others.
>> If this is true, once the doc is filled out, its contents could be
>> submitted as a PR to the tesstrain repo.
>> Again, just a suggestion that I hope would be helpful to all.
>>
>> Thanks,
>> Keith
>>
>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>
>>> There is no exhaustive user manual for training tesseract. We all start
>>> in the darkness and accumulate bits of information in different places
>>> to learn the ins and outs of tesseract.
>>>
>>> It would be great if we could collectively write a better manual. Until
>>> then, we can drop/collect here the observations, best practices, hacks
>>> and lessons we have accumulated in our adventures with tesseract.
>>>
>>> I will start with some of my observations. I collected them by reading
>>> between the lines and from my own failed experiments:
>>>
>>> 1. Training from scratch is very difficult because tesseract requires
>>> an extensive data set. It looks like it requires over 300,000 text lines
>>> (a text file of around 26 MB).
>>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>>
>>> Multiply that by the number of fonts you want to train, and the data
>>> set grows many times over. That requires very powerful computers running
>>> for weeks or months.
>>> So, for regular users, training from an existing network layer or
>>> fine-tuning are the most plausible options.
>>>
>>> 2. Best practice: make your text lines not too long. The recommended
>>> number of words in a line is 10-12. Again, from the above link.
>>>
>>> (...to be continued)

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/81b35697-8a44-43e0-b1a9-6b6360d6890en%40googlegroups.com.

