Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

Des Bw Sat, 21 Oct 2023 12:47:58 -0700

Another useful parameter to turn ON would be *perspective*, but 
that one is not working for me.
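
For anyone who wants to try the settings from my earlier message (quoted below), here is a minimal sketch of how the text2image arguments could be assembled from Python. The flag values are the ones discussed in this thread; the file names, the font name and the helper function itself are hypothetical placeholders, and the sketch only builds and prints the command, it does not run text2image.

```python
import shlex


def build_text2image_args(text_path, output_base, font, fonts_dir):
    """Return an argument list for text2image with the settings from this thread.

    The helper name and all paths are illustrative assumptions, not part of
    tesseract or tesstrain.
    """
    return [
        "text2image",
        f"--text={text_path}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        # Gave me fewer dropped characters than the 'true' setting.
        "--strip_unrenderable_words=false",
        # Scanned documents are distorted, so let the model see distortion.
        "--distort_image=true",
        # Inverted (white on black) pages are uncommon in scans.
        "--invert=false",
        # --char_spacing is deliberately NOT set: 1.0 made training look
        # great but hurt recognition of dense text.
    ]


args = build_text2image_args(
    "train.txt", "out/train", "Example Font", "/usr/share/fonts"
)
print(shlex.join(args))
```

The list form can be handed to subprocess.run(args) once text2image is on the PATH; that step is left out here so the sketch stays runnable without tesseract installed.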


On Saturday, October 21, 2023 at 10:45:31 PM UTC+3 Des Bw wrote:

> I have been experimenting with the text2image script: 
> Here are some of my observations so far: 
> *'--strip_unrenderable_words=false':* The idea of this 
> parameter seems to be to remove characters that are not covered by a given 
> font. But I am getting better results with the false value. Turning this 
> to true removes more characters. Keeping it false prints a warning that 1 
> character has been dropped, but the overall number of characters getting 
> removed is lower (closer to the ground truth). 
> *'--distort_image=true':* For those of us who would like to 
> apply tesseract to OCRing scanned documents, distortion is unavoidable. 
> Turning the feature ON trains the model to get used to the distortion. It 
> is turned OFF by default. 
> *'--invert=false':* Inverting the image (light text on a black 
> background) is uncommon. So, among the distortion parameters, inversion is 
> the least relevant for scanned documents. Keep this one set to false. 
>
> Another big mistake I made when training was setting the following: 
> *'--char_spacing=1.0',*
> This one puts extra space between the characters. That creates an ideal 
> environment and gives great results during training. But the final model 
> will be less able to recognize dense text. 
> On Saturday, October 21, 2023 at 8:58:01 PM UTC+3 Des Bw wrote:
>
>> That is a good start, dear Keith. Very good idea. We can contribute texts 
>> and ideas, and develop it into a booklet or "getting started guide", adding 
>> explanatory comments, practical examples and elaborations on the 
>> official guide (which is very dense, and incomplete). 
>> - The tips and best practices can then be distributed across the 
>> tutorial/guide, as you have already started doing. 
>> On Saturday, October 21, 2023 at 6:18:06 PM UTC+3 Keith Smith wrote:
>>
>>> Thank you Des for your help in this community.  It is greatly 
>>> appreciated!
>>> As one who is struggling, may I make a suggestion?
>>> I have started a google doc here 
>>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>>>  
>>> with a suggested format for a tutorial which would be very helpful to me 
>>> and I think to others. It is editable by anyone with the link.
>>> I'm glad to put in any work myself, but my guess is that there are 
>>> things in the doc that could be filled without much effort by you or others.
>>> If this is true, once the doc is filled out, the contents of the google 
>>> doc could be submitted as a PR to the tesstrain repo.
>>> Again, just a suggestion that I hope would be helpful to all.
>>>
>>> Thanks,
>>> Keith
>>>
>>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>>
>>>> There is no exhaustive user manual for training tesseract. We all start 
>>>> in the dark, and accumulate bits of information from different places to 
>>>> learn the ins and outs of tesseract. 
>>>>
>>>> It would be great if we could collectively write a better manual. Until 
>>>> then, we can drop/collect the observations, best practices, hacks 
>>>> and lessons we have accumulated in our adventures with tesseract. 
>>>>
>>>> I will start with some of my observations. I collected them by reading 
>>>> between the lines and from my own failed experiments: 
>>>> 1. Training from scratch is very difficult because tesseract requires 
>>>> an extensive data set. It looks like it requires over 300,000 text lines 
>>>> (a text file of around 26 MB).
>>>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>>>
>>>>  Multiply that by the number of fonts you want to train, and the data 
>>>> grows several-fold. That requires very powerful computers running for 
>>>> weeks or months. 
>>>> So, for regular users, training from a network layer or fine 
>>>> tuning are the most plausible options. 
>>>>
>>>> 2. Best practice: make your text lines not too long. The recommended 
>>>> number of words in a line is 10-12. Again, this is from the above link. 
>>>>
>>>> ( ...to be continued)
>>>>
>>>>
>>>
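
Points 1 and 2 in the first message quoted above can be sketched in a few lines of Python: re-wrap the training text so that no line exceeds 12 words, then estimate how many line images have to be rendered once every line is multiplied by the number of fonts. The 12-word limit and the 300,000-line figure come from the thread; the function names and the font count of 5 are hypothetical, chosen only for illustration.

```python
def rewrap(text, max_words=12):
    """Split text into lines of at most max_words whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def rendered_line_count(num_text_lines, num_fonts):
    """Each text line is rendered once per font, so the totals multiply."""
    return num_text_lines * num_fonts


# A 30-word sample becomes lines of 12, 12 and 6 words.
sample = " ".join(f"word{i}" for i in range(30))
lines = rewrap(sample, max_words=12)
for line in lines:
    print(len(line.split()), line)

# 300,000 text lines (figure from the thread) with 5 fonts (assumed).
total = rendered_line_count(300_000, 5)
print(total)
```

As a sanity check on the figures, 26 MB over 300,000 lines is roughly 87 bytes per line, which is in the right range for lines of 10-12 words.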

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/f493e4eb-1797-4581-9d33-2f90e8a769f9n%40googlegroups.com.
