That is a good start, dear Keith. Very good idea. We can contribute texts and ideas, and develop it into a booklet or "getting started guide", adding explanatory comments, practical examples and elaborations on the official guide (which is very dense and incomplete).
- The tips and best practices can then be distributed across the tutorial/guide, as you have already started doing.

On Saturday, October 21, 2023 at 6:18:06 PM UTC+3 Keith Smith wrote:
> Thank you Des for your help in this community. It is greatly appreciated!
> As one who is struggling, may I make a suggestion.
> I have started a google doc here
> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
> with a suggested format for a tutorial which would be very helpful to me
> and I think to others. It is editable by anyone with the link.
> I'm glad to put in any work myself, but my guess is that there are things
> in the doc that could be filled without much effort by you or others.
> If this is true, once the doc is filled out, the contents of the google
> doc could be submitted as a PR to the tesstrain repo.
> Again, just a suggestion that I hope would be helpful to all.
>
> Thanks,
> Keith
>
> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>
>> There is no exhaustive user manual for training tesseract. We all start
>> in the dark, and accumulate bits of information in different places to
>> learn the ins and outs of tesseract.
>>
>> It would be great if we could collectively write a better manual. Until
>> then, we can drop/collect the observations, best practices, hacks and
>> lessons we have accumulated in our adventures with tesseract.
>>
>> I will start with some of my observations. I collected them by reading
>> between the lines and from my own failed experiments:
>> 1. Training from scratch is very difficult because tesseract requires
>> an extensive data set. It looks like it requires over 300,000 text lines
>> (around a 26 MB text file).
>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>
>> Multiply that by the number of fonts you want to train, and the data
>> grows in proportion. That requires very powerful computers running for
>> weeks or months.
>> So, for regular users, training from a network layer or fine-tuning
>> are the most plausible options.
>>
>> 2. Best practice: make your text lines not too long. The recommended
>> number of words in a line is 10-12.
>> Again from the above link.
>>
>> ( ...to be continued)
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com
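
On tip 2 in the quoted message (10-12 words per line): below is a minimal sketch of how one might rewrap a training text so that no line has more than 12 words. It is plain Python and does not call tesseract or tesstrain. The function name rewrap_lines and the default of 12 are my own assumptions for illustration, with the limit taken from the range recommended in the thread.

```python
from typing import List


def rewrap_lines(text: str, max_words: int = 12) -> List[str]:
    """Split text into lines of at most max_words words.

    Hypothetical helper, not part of tesseract or tesstrain.
    Words are whatever str.split() yields, i.e. runs of non-whitespace.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    # Take consecutive chunks of max_words words; the last chunk may be shorter.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


if __name__ == "__main__":
    sample = (
        "There is no exhaustive user manual for training tesseract. "
        "We all start in the dark, and accumulate bits of information "
        "in different places to learn the ins and outs of tesseract."
    )
    lines = rewrap_lines(sample)
    for line in lines:
        print(line)
    # Line count, for comparison with the figure quoted in the thread.
    print(f"{len(lines)} lines")
```

Note that this simple chunking ignores sentence and paragraph boundaries, and the last line can come out much shorter than the rest. Whether that matters for training is something I have not verified.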

