I have been experimenting with the text2image script: 
Here are some of my observations so far: 
               *'--strip_unrenderable_words=false':* This parameter is meant 
to strip characters that the chosen font cannot render. However, I am getting 
better results with the false value: turning it to true removes more 
characters, while keeping it false merely flushes a warning that one 
character has been dropped, and the overall number of characters removed is 
smaller (closer to the ground truth). 
               *'--distort_image=true':* For those of us who would like to 
apply Tesseract to OCRing scanned documents, distortion is unavoidable. 
Turning this feature on trains the model to cope with that distortion. It 
is turned off by default. 
               *'--invert=false':* Inverted images (light text on a dark 
background) are uncommon in scanned documents, so of the distortion 
parameters, inversion is the least relevant. Keep this one set to false. 
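Putting those three flags together, a minimal text2image invocation might 
look like the sketch below. The training text file, font name, fonts 
directory and output base are placeholder values of my own, not something 
from the official docs, so adjust them to your setup: 

```shell
# Sketch only: generate distorted (but not inverted) training images.
# training_text.txt, "Arial", the fonts dir and the output base are placeholders.
text2image \
  --text=training_text.txt \
  --outputbase=output/eng.myfont.exp0 \
  --font="Arial" \
  --fonts_dir=/usr/share/fonts \
  --strip_unrenderable_words=false \
  --distort_image=true \
  --invert=false
```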

Another big mistake I made when I was training was setting the following: 
*'--char_spacing=1.0'*
This puts extra space between the characters. That creates a perfect 
environment and gives great results during training, but the final model 
will be less fit to recognize dense text. 
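For comparison, the flag's default adds no extra inter-character space, so 
if in doubt you can simply leave it at its default or set it explicitly to 0. 
Again, the file names and font below are placeholders of my own: 

```shell
# char_spacing is the inter-character space (in ems); the default adds none.
# Setting it to 1.0 produces artificially sparse text that inflates training
# accuracy but hurts recognition of dense documents.
text2image \
  --text=training_text.txt \
  --outputbase=output/eng.myfont.exp1 \
  --font="Arial" \
  --char_spacing=0.0
```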
On Saturday, October 21, 2023 at 8:58:01 PM UTC+3 Des Bw wrote:

> That is a good starter, dear Keith. Very good idea. We can contribute texts 
> and ideas and develop it into a booklet or "getting started guide", making 
> additional explanatory comments, practical examples and elaborations on the 
> official guide (which is very dense, and incomplete). 
> - The tips and best practices can then be distributed across the 
> tutorial/guide, as you have already started. 
> On Saturday, October 21, 2023 at 6:18:06 PM UTC+3 Keith Smith wrote:
>
>> Thank you Des for your help in this community.  It is greatly appreciated!
>> As one who is struggling, may I make a suggestion?
>> I have started a google doc here 
>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>>  
>> with a suggested format for a tutorial which would be very helpful to me 
>> and I think to others. It is editable by anyone with the link.
>> I'm glad to put in any work myself, but my guess is that there are things 
>> in the doc that could be filled without much effort by you or others.
>> If this is true, once the doc is filled out, the contents of the google 
>> doc could be submitted as a PR to the tesstrain repo.
>> Again, just a suggestion that I hope would be helpful to all.
>>
>> Thanks,
>> Keith
>>
>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>
>>> There is no exhaustive user manual for training tesseract. We all start 
>>> in the dark and accumulate bits of information from different places to 
>>> learn the ins and outs of tesseract. 
>>>
>>> It would be great if we could collectively write a better manual. Until 
>>> then, we can drop/collect the observations, best practices, hacks and 
>>> lessons we have accumulated in our adventures with tesseract. 
>>>
>>> I will start with some of my observations, collected by reading between 
>>> the lines and from my own failed experiments: 
>>> 1. Training from scratch is very difficult because tesseract requires an 
>>> extensive data set. It looks like it needs over 300,000 text lines 
>>> (around a 26 MB text file). 
>>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>>
>>> Multiply that by the number of fonts you want to train and the data 
>>> grows rapidly. That requires very powerful computers running for weeks 
>>> or months. 
>>> So, for regular users, retraining the top network layer or fine-tuning 
>>> are the most plausible options. 
>>>
>>> 2. Best practice: do not make your text lines too long. The recommended 
>>> number of words per line is 10-12. Again, from the above link. 
>>>
>>> ( ...to be continued)
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>>
