Rene, the name “foo” is simply a placeholder for a font or language name.
When training a new language or font, replace “foo” with the name of your own
language or font.  The convention is a three-letter name, but that is not
required.  In fact, I have been training a font named “micr_e13b”, and it
works technically for me (though the accuracy isn’t good enough yet).
Note the underscore character between sections of the name.
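
To make the naming concrete, here is a minimal Python sketch. It only builds
strings and runs nothing. It assumes training is driven through the tesstrain
Makefile, where (as far as I know) MODEL_NAME and START_MODEL are the variable
names; the helper names traineddata_filename and tesstrain_command, and the
character check, are my own inventions for illustration.

```python
import re
from typing import List, Optional


def traineddata_filename(model_name: str) -> str:
    """Return the file name used for a model called `model_name`."""
    # Assumption: restrict names to letters, digits and underscores so they
    # are safe as file names. Tesseract itself may accept more than this.
    if not re.fullmatch(r"[A-Za-z0-9_]+", model_name):
        raise ValueError(f"unexpected character in model name: {model_name!r}")
    return f"{model_name}.traineddata"


def tesstrain_command(model_name: str, start_model: Optional[str] = None) -> List[str]:
    """Build (but do not run) a tesstrain `make training` argument list."""
    traineddata_filename(model_name)  # validate the name first
    args = ["make", "training", f"MODEL_NAME={model_name}"]
    if start_model is not None:
        # Assumption: START_MODEL names the existing model to continue from.
        args.append(f"START_MODEL={start_model}")
    return args


print(traineddata_filename("micr_e13b"))
print(tesstrain_command("foo", start_model="hye"))
```

Here “foo” is still just the placeholder; you would put your own model name
in its place.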



From: [email protected] <[email protected]> on behalf 
of René JM Clais <[email protected]>
Date: Sunday, October 22, 2023 at 12:41 PM
To: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: [tesseract-ocr] Lessons, best practices, 
recommendations, strategies, hacks

Hi Keith,
The foo.traineddata file does not exist, but do you mean the traineddata I
want to train, e.g. hye.traineddata?
In my case I need to add a new character to hye.traineddata.
It seems that I can do this using option 2!
But how? Which command should I use to run it, and what does this process
involve?

Thank you for your help
Regards
René

On Sat, Oct 21, 2023 at 17:18, Keith Smith <[email protected]> wrote:
Thank you Des for your help in this community.  It is greatly appreciated!
As one who is struggling, may I make a suggestion?
I have started a Google Doc at
https://docs.google.com/document/d/1Vz6y4LcqczAAE2yKc_xYecy1eChjHZbsxb13_7ntUh0/edit?usp=sharing
with a suggested format for a tutorial, which would be very helpful to me and,
I think, to others. It is editable by anyone with the link.
I'm glad to put in any work myself, but my guess is that there are things in 
the doc that could be filled without much effort by you or others.
If this is true, once the doc is filled out, the contents of the google doc 
could be submitted as a PR to the tesstrain repo.
Again, just a suggestion that I hope would be helpful to all.


Thanks,
Keith

On Sat, Oct 21, 2023 at 8:28 AM Des Bw <[email protected]> wrote:
There is no exhaustive user manual for training tesseract. We all start in the 
darkness; and accumulate bits of information in different places to learn the 
ins and outs of tesseract.

It would be great if we could collectively write a better manual. Until then,
we can collect here the observations, best practices, hacks, and lessons we
have accumulated in our adventures with tesseract.

I will start with some of my observations. I collected them by reading between
the lines and from my own failed experiments:
1. Training from scratch is very difficult because tesseract requires an
extensive data set. It looks like it requires over 300,000 text lines (a text
file of around 26 MB).
https://github.com/tesseract-ocr/tesseract/issues/3909

Multiply that by the number of fonts you want to train, and the data grows
accordingly. That requires very powerful computers running for weeks or months.
So, for regular users, training from a network layer or fine-tuning are
the most plausible options.
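
The arithmetic behind item 1 can be sketched roughly as follows. This is only
a back-of-the-envelope estimate using the two figures quoted above. It assumes
1 MB = 1,000,000 bytes and that every text line is rendered once per font; the
function names are made up for illustration.

```python
LINES_FROM_SCRATCH = 300_000  # text lines, figure quoted above
TEXT_SIZE_MB = 26             # size of the text file, figure quoted above


def bytes_per_line(lines: int = LINES_FROM_SCRATCH, size_mb: float = TEXT_SIZE_MB) -> float:
    """Average length of one text line in bytes (assuming 1 MB = 1,000,000 bytes)."""
    return size_mb * 1_000_000 / lines


def line_images_needed(num_fonts: int, lines: int = LINES_FROM_SCRATCH) -> int:
    """Line images to render, assuming each text line is rendered once per font."""
    return lines * num_fonts


print(round(bytes_per_line()))   # average bytes per text line
print(line_images_needed(5))     # line images for five fonts
```

With five fonts, for example, that is 1.5 million line images to render and
train on.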

2. Best practice: keep your text lines short. The recommended number of
words per line is 10-12, again according to the link above.
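
A minimal sketch of how one might cut a training text into lines of at most 12
words. It assumes plain whitespace-separated text; the function name
split_into_lines and the default limit are my own choices, not something
tesseract provides.

```python
from typing import List


def split_into_lines(text: str, max_words: int = 12) -> List[str]:
    """Split `text` into lines holding at most `max_words` words each."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()  # splits on any run of whitespace
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


sample = " ".join(f"word{i}" for i in range(25))
for line in split_into_lines(sample):
    print(line)
```

The last line can end up much shorter than the rest; whether that matters for
training is something I have not tested.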

( ...to be continued)
--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/bf0cd568-9b5b-4e42-be6e-6225ed6a3892n%40googlegroups.com.
