Another useful parameter to turn ON would be *perspective*. But that one is not working for me.
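Putting the flags discussed in this thread together, here is a rough sketch of a text2image invocation. The font name, file paths, and output base are illustrative placeholders, not values anyone in the thread actually used, so adjust them before running:

```shell
# Dry-run sketch of a text2image call using the flags discussed in this
# thread. Font, paths, and outputbase are made-up placeholders.
CMD="text2image \
 --text=training_text.txt \
 --outputbase=output/eng.sample_font.exp0 \
 --font='Sample Font' \
 --fonts_dir=./fonts \
 --strip_unrenderable_words=false \
 --distort_image=true \
 --invert=false \
 --char_spacing=0.0"
# Print the command instead of executing it, so it can be reviewed first.
echo "$CMD"
```

The `echo` keeps this a dry run; drop the quotes and the variable and run `text2image` directly once the paths point at a real training text and font.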
On Saturday, October 21, 2023 at 10:45:31 PM UTC+3 Des Bw wrote:
> I have been experimenting with the text2image script. Here are some of my observations so far:
>
> *'--strip_unrenderable_words=false':* The idea of this parameter seems to be to remove characters that are not covered by a given font. But I am getting better results with the false value. Turning this to true removes more characters; keeping it false flushes a warning that 1 character has been dropped, but the overall number of characters removed is lower (closer to the truth value).
>
> *'--distort_image=true':* For those of us who would like to apply tesseract to OCRing scanned documents, distortion is unavoidable. Turning the feature ON trains the model to get used to the distortion. It is turned OFF by default.
>
> *'--invert=false':* Inverting the image to black is uncommon. So, of the distortion parameters, inversion is the least relevant (least common) for scanned documents. Keep this one false.
>
> Another big mistake I made when I was training was setting *'--char_spacing=1.0'*. This puts space between the characters, which creates a perfect environment and gives great results during training, but the final model will be less fit to recognize dense text.
>
> On Saturday, October 21, 2023 at 8:58:01 PM UTC+3 Des Bw wrote:
>
>> That is a good starter, dear Keith. Very good idea. We can contribute texts and ideas and develop it into a booklet or "getting started guide", making additional explanatory comments, practical examples, and elaborations on the official guide (which is very dense and incomplete).
>> The tips and best practices can then be distributed across the tutorial/guide, as you already started.
>>
>> On Saturday, October 21, 2023 at 6:18:06 PM UTC+3 Keith Smith wrote:
>>
>>> Thank you Des for your help in this community. It is greatly appreciated!
>>> As one who is struggling, may I make a suggestion.
>>> I have started a Google Doc here
>>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>>> with a suggested format for a tutorial which would be very helpful to me, and I think to others. It is editable by anyone with the link.
>>> I'm glad to put in any work myself, but my guess is that there are things in the doc that could be filled in without much effort by you or others.
>>> If this is true, once the doc is filled out, its contents could be submitted as a PR to the tesstrain repo.
>>> Again, just a suggestion that I hope will be helpful to all.
>>>
>>> Thanks,
>>> Keith
>>>
>>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>>
>>>> There is no exhaustive user manual for training tesseract. We all start in the dark and accumulate bits of information from different places to learn the ins and outs of tesseract.
>>>>
>>>> It would be great if we could collectively write a better manual. Until then, we can drop/collect the observations, best practices, hacks, and lessons we have accumulated in our adventures with tesseract.
>>>>
>>>> I will start with some of my observations, collected by reading between the lines and from my own failed experiments:
>>>>
>>>> 1. Training from scratch is very difficult because tesseract requires an extensive data set. It looks like it requires over 300,000 text lines (around a 26 MB text file):
>>>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>>> Multiply that by the number of fonts you want to train on, and the data grows quickly. That requires very powerful computers running for weeks or months. So, for regular users, training from a network layer or fine-tuning are the most plausible options.
>>>>
>>>> 2. Best practice: do not make your text lines too long. The recommended number of words in a line is 10-12, again from the link above.
>>>>
>>>> ( ...to be continued)

-- 
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/f493e4eb-1797-4581-9d33-2f90e8a769f9n%40googlegroups.com.
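The words-per-line tip in point 2 above can be sketched as a quick shell rewrap: `xargs -n 12` regroups whitespace-separated words into lines of at most 12. The 12-word cap follows the 10-12 words-per-line suggestion, and the file names are placeholders:

```shell
# Sketch: rewrap running text so each training line holds at most 12 words,
# following the 10-12 words-per-line tip above. File names are placeholders.
printf '%s ' $(seq 1 30) > long_line.txt   # sample input: 30 "words" on one line
xargs -n 12 < long_line.txt > wrapped.txt  # group words, 12 per output line
wc -l < wrapped.txt                        # 30 words / 12 per line -> 3 lines
```

Note that `xargs` splits on any whitespace and ignores the original line breaks, so it works equally well on text that is one long line or already partially wrapped.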

