I have updated the guide to explain how to train by cutting off the top 
layer. You can check it out; I hope it is helpful. 

On Sunday, October 22, 2023 at 7:41:15 PM UTC+3 [email protected] wrote:

> Hi Keith,
> foo.traineddata does not exist, but do you mean the traineddata I 
> want to train, e.g. hye.traineddata?
> In my case I need to add a new character to hye.traineddata.
> It seems that I can do this using option 2!
> But how? Which command should I use to run it, and what 
> does this process involve?
>
> Thank you for your help
> Regards
> René
>
> On Sat, Oct 21, 2023 at 5:18 PM, Keith Smith <[email protected]> wrote:
>
>> Thank you Des for your help in this community.  It is greatly appreciated!
>> As someone who is struggling, may I make a suggestion?
>> I have started a Google Doc here 
>> <https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing>
>>  
>> with a suggested format for a tutorial which would be very helpful to me 
>> and I think to others. It is editable by anyone with the link.
>> I'm glad to put in any work myself, but my guess is that there are things 
>> in the doc that could be filled without much effort by you or others.
>> If so, once the doc is filled out, its contents could be submitted 
>> as a PR to the tesstrain repo.
>> Again, just a suggestion that I hope would be helpful to all.
>>
>> Thanks,
>> Keith
>>
>> On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
>>
>>> There is no exhaustive user manual for training Tesseract. We all start 
>>> in the dark and accumulate bits of information from different places to 
>>> learn the ins and outs of Tesseract. 
>>>
>>> It would be great if we could collectively write a better manual. Until 
>>> then, we can collect here the observations, best practices, hacks, and 
>>> lessons we have accumulated in our adventures with Tesseract.  
>>>
>>> I will start with some of my observations. I gathered them by reading 
>>> between the lines and from my own failed experiments: 
>>> 1. Training from scratch is very difficult because Tesseract requires 
>>> an extensive data set. It looks like it requires over 300,000 text lines 
>>> (a text file of around 26 MB).
>>> https://github.com/tesseract-ocr/tesseract/issues/3909
>>>
>>>  Multiply that by the number of fonts you want to train, and the data 
>>> grows many times over. That requires very powerful computers running for 
>>> weeks or months. 
>>> So, for regular users, training from a network layer or fine-tuning 
>>> is the most plausible option. 
>>>
>>> 2. Best practice: do not make your text lines too long. The recommended 
>>> number of words per line is 10-12, again according to the link above. 
>>>
>>> ( ...to be continued)
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>>
>

To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/81b35697-8a44-43e0-b1a9-6b6360d6890en%40googlegroups.com.
