Hi all,

I am working on a project with Tesseract v4.00, and the traineddata file I 
get after training with my own data always has the same size.
I suspect that I did not carry out the steps correctly.

The only data that I provided were:
1. training_text
2. puncs (I just reduced the general puncs file provided in the Tesseract 
GitHub repository)
3. numbers
4. wordlists (I made several wordlists of different sizes for separate 
training runs, ranging from 100,000 to 2,000,000 entries)
5. font names (I also varied the number of fonts between training runs, 
from 1 to 20)

The steps that I did were:
1. Made the tiff files, unicharset and other supporting data using 
tesstrain.sh
2. Did the same with tesstrain.sh for the evaluation data
3. Combined the unicharset, wordlists, puncs, numbers and version_str to 
create the starter traineddata using combine_lang_model (I am still not sure 
what value version_str should have, though)
4. Trained the model using lstmtraining
5. Combined all the output files using lstmtraining --continue_from ...
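
To make the five steps concrete, here is a rough sketch of the kind of command lines they correspond to, written as a small Python dry run that only builds and prints the commands. It is not my exact invocation: the language code "foo", every path, the version string and the iteration count are made-up placeholders, and the flags are the ones I know from the Tesseract 4.00 training documentation (where the combining tool is called combine_lang_model). Adding --stop_training in step 5 is also an assumption about how the final traineddata is produced.

```python
from pathlib import Path
from typing import List


def build_commands(lang: str, work: Path) -> List[List[str]]:
    """Return argv lists for the five steps. Nothing is executed (dry run)."""
    train_dir = work / "train"  # hypothetical directory layout under `work`
    eval_dir = work / "eval"
    langdata = work / "langdata"
    starter = train_dir / lang / f"{lang}.traineddata"
    checkpoint = work / "output" / "base_checkpoint"

    def tesstrain(output_dir: Path) -> List[str]:
        # Steps 1 and 2: render the training text into line data.
        return [
            "tesstrain.sh", "--lang", lang, "--linedata_only",
            "--fonts_dir", "/usr/share/fonts",  # placeholder font directory
            "--langdata_dir", str(langdata),
            "--tessdata_dir", str(work / "tessdata"),
            "--output_dir", str(output_dir),
        ]

    combine = [
        # Step 3: build the starter traineddata from unicharset and wordlists.
        "combine_lang_model",
        "--input_unicharset", str(train_dir / f"{lang}.unicharset"),
        "--script_dir", str(langdata),
        "--words", str(langdata / lang / f"{lang}.wordlist"),
        "--puncs", str(langdata / lang / f"{lang}.punc"),
        "--numbers", str(langdata / lang / f"{lang}.numbers"),
        "--output_dir", str(train_dir),
        "--lang", lang,
        "--version_str", "placeholder-version",  # made-up value
    ]
    train = [
        # Step 4: train. Other options (for example --net_spec) are omitted.
        "lstmtraining",
        "--traineddata", str(starter),
        "--model_output", str(work / "output" / "base"),
        "--train_listfile", str(train_dir / f"{lang}.training_files.txt"),
        "--eval_listfile", str(eval_dir / f"{lang}.training_files.txt"),
        "--max_iterations", "5000",  # arbitrary example value
    ]
    finalize = [
        # Step 5: convert a checkpoint into the final traineddata file.
        "lstmtraining", "--stop_training",
        "--continue_from", str(checkpoint),
        "--traineddata", str(starter),
        "--model_output", str(work / "output" / f"{lang}.traineddata"),
    ]
    return [tesstrain(train_dir), tesstrain(eval_dir), combine, train, finalize]


if __name__ == "__main__":
    for argv in build_commands("foo", Path("/tmp/work")):
        print(" ".join(argv))
```

Printing the commands instead of running them makes it easy to compare the sequence against what was actually typed, without needing Tesseract installed.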

Yet all of my training runs ended with a traineddata file of the same size, 
10.5 MB.
Did I do all the steps correctly?
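
One check that might help here is whether the output files are byte-for-byte identical or only happen to share a size. The sketch below groups files by size and SHA-256 checksum using only the Python standard library. The function names and the example file names in the usage note are hypothetical.

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for one file."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def group_identical(paths):
    """Group file paths by fingerprint.

    Files that land in the same group are byte-identical. Files with the
    same size but different digests end up in different groups.
    """
    groups = {}
    for p in paths:
        groups.setdefault(fingerprint(p), []).append(str(p))
    return groups
```

For example, calling group_identical(["run1/foo.traineddata", "run2/foo.traineddata"]) and getting two groups with the same size would mean the models differ in content even though the file size is the same.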

I also once trained after changing WORD_DAWG_FACTOR in language-specific.sh 
to 0 and to 1, because I want the recognized text to match my wordlists 
100%. But the result was still not satisfactory: the output contains some 
words that are not in my wordlists, such as "USISUSISU".
Do you know what the cause is?
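
To show what I mean by matching the wordlists, here is a minimal sketch of the check: it splits the recognized text into tokens and reports the ones that are missing from the wordlist. The function name and the sample words "USI" and "SUSI" are made up for illustration, and the comparison is case-sensitive by assumption.

```python
import re


def words_not_in_list(ocr_text, wordlist):
    """Return the OCR tokens that do not appear in the wordlist, in order."""
    # Ignore blank lines and surrounding whitespace in the wordlist.
    allowed = {w.strip() for w in wordlist if w.strip()}
    # A token is a run of letters, digits or underscores.
    tokens = re.findall(r"\w+", ocr_text)
    return [t for t in tokens if t not in allowed]


if __name__ == "__main__":
    print(words_not_in_list("USI SUSI USISUSISU", ["USI", "SUSI"]))
    # prints ['USISUSISU']
```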

I would really appreciate it if anyone could help or suggest a solution.
Thank you!
