Re: [tesseract-ocr] Lessons, best practices, recommendations, strategies, hacks

Des Bw Sat, 21 Oct 2023 10:58:05 -0700

That is a good start, dear Keith. Very good idea. We can contribute texts 
and ideas, and develop it into a booklet or "getting started guide", adding 
explanatory comments, practical examples and elaborations on the 
official guide (which is very dense and incomplete). 
- The tips and best practices can then be distributed across the 
tutorial/guide, as you have already started doing. 
On Saturday, October 21, 2023 at 6:18:06 PM UTC+3 Keith Smith wrote:


> Thank you Des for your help in this community.  It is greatly appreciated!
> As one who is struggling, may I make a suggestion?
> I have started a google doc here 
> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>  
> with a suggested format for a tutorial which would be very helpful to me 
> and I think to others. It is editable by anyone with the link.
> I'm glad to put in any work myself, but my guess is that there are things 
> in the doc that could be filled without much effort by you or others.
> If this is true, once the doc is filled out, the contents of the google 
> doc could be submitted as a PR to the tesstrain repo.
> Again, just a suggestion that I hope would be helpful to all.
>
> Thanks,
> Keith
>
> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>
>> There is no exhaustive user manual for training tesseract. We all start 
>> in the dark and piece together bits of information from different places 
>> to learn the ins and outs of tesseract. 
>>
>> It would be great if we could collectively write a better manual. Until 
>> then, we can drop/collect the observations, best practices, hacks and 
>> lessons we have accumulated in our adventures with tesseract. 
>>
>> I will start with some of my observations. I collected them by reading 
>> between the lines and from my own failed experiments: 
>> 1. Training from scratch is very difficult because tesseract requires 
>> an extensive data set. It looks like it requires over 300,000 text lines 
>> (a text file of around 26 MB).
>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>
>> Multiply that by the number of fonts you want to train, and the data 
>> grows in proportion. That requires very powerful computers running for 
>> weeks or months. 
>> So, for regular users, training from a network layer or fine-tuning 
>> are the most plausible options. 
>>
>> 2. Best practice: make your text lines not too long. The recommended 
>> number of words in a line is 10-12, again according to the above link. 
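
A minimal sketch of tip 2, assuming the training text is plain text and
that splitting on whitespace is good enough (it ignores sentence
boundaries). The function name `rewrap` and the default of 12 words are
illustrative choices only, not part of tesseract or tesstrain:

```python
from typing import List


def rewrap(text: str, max_words: int = 12) -> List[str]:
    """Re-split text into lines of at most max_words whitespace-separated words.

    Hypothetical helper: the 12-word default follows the 10-12 words per
    line recommendation quoted above.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


# Example: a long paragraph becomes several short training lines.
sample = " ".join("word%d" % n for n in range(30))
for line in rewrap(sample):
    print(line)
```

The resulting lines could then be written one per line to the training
text file. Whether to break at sentence boundaries instead is a judgment
call this sketch does not make.
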
>>
>> ( ...to be continued)
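
To put the numbers from point 1 in perspective, here is a rough
back-of-the-envelope sketch. It assumes each text line is rendered once
per font, so the number of line images is lines times fonts, and it takes
1 MB as 1,000,000 bytes. The font counts are hypothetical; only the
300,000 lines and ~26 MB figures come from the thread:

```python
TEXT_LINES = 300_000   # figure quoted in the thread
TEXT_SIZE_MB = 26      # approximate text file size quoted in the thread

# Average line length in bytes, assuming 1 MB = 1,000,000 bytes.
AVG_BYTES_PER_LINE = TEXT_SIZE_MB * 1_000_000 / TEXT_LINES


def line_images(num_lines: int, num_fonts: int) -> int:
    """Number of rendered line images, assuming one image per (line, font) pair."""
    return num_lines * num_fonts


print("average bytes per text line: %.1f" % AVG_BYTES_PER_LINE)
for fonts in (1, 5, 10):  # hypothetical font counts
    print("%d font(s): %d line images" % (fonts, line_images(TEXT_LINES, fonts)))
```

Under this assumption the amount of training data scales in proportion
to the number of fonts, which is why fine-tuning an existing model is
usually the more realistic route on ordinary hardware.
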
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com
>>  
>>
>

