[tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

easymavinmind Mon, 08 Jan 2018 04:25:14 -0800

Hi all,

I am working on a project with Tesseract v4.00, and the traineddata output
always comes out the same size after training with my own data.
I suppose I did not do the steps correctly.

The only data that I provided were:
1. training_text
2. puncs (I just reduced the general punc file provided in the tesseract GitHub repository)
3. numbers
4. wordlists (I made various wordlists for several training runs, ranging
between 100,000 and 2,000,000)
5. font names (I also used various fonts for several training runs, ranging
between 1 and 20 fonts)

The steps that I did were:
1. Made tiff files, unicharset and other supplementary data using tesstrain.sh
for training
2. Made tiff files, unicharset and other supplementary data using tesstrain.sh
for evaluation
3. Combined unicharset, wordlists, puncs, numbers and version_str to create
the starter traineddata using combine_lang_data (I am still not confident about
the value of version_str though)
4. Trained the model using lstmtraining
5. Combined all output files using lstmtraining --continue_from ...

Yet all of my training runs ended with the same output size, 10.5MB.
Did I do all my steps correctly?
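
One way to tell whether the outputs are byte-for-byte identical or merely the
same size is to compare checksums. Below is a minimal Python sketch of that
check. The function names are hypothetical helpers, not part of Tesseract, and
it assumes the traineddata files from the different runs are available as
local paths.

```python
import hashlib
from pathlib import Path


def file_fingerprint(path):
    """Return (size_in_bytes, sha256_hex) for one file."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def group_identical(paths):
    """Group paths by fingerprint.

    Files that land in the same group have the same size and the same
    SHA-256 digest, so they are almost certainly the same bytes.
    """
    groups = {}
    for path in paths:
        groups.setdefault(file_fingerprint(path), []).append(str(path))
    return groups
```

Calling group_identical() on the outputs of several runs returns one group per
distinct file content. If every run falls into a single group, the runs
produced the same file; if the sizes match but the digests differ, only the
size is the same.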

Once, I also trained after changing WORD_DAWG_FACTOR in
language-specific.sh to 0 and 1, because I want the recognized text to match
my wordlists 100%. But the result still did not satisfy me: some output words,
such as "USISUSISU", are not in my wordlists.
Do you know what the cause is?
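
To measure how often this happens, the recognized text can be compared against
the wordlist. Below is a minimal Python sketch of such a check. The helper
names are hypothetical, the sample words other than "USISUSISU" are made up for
illustration, and it assumes one word per line in the wordlist and
whitespace-separated tokens in the OCR output.

```python
import string


def load_wordlist(lines):
    """Build a set of words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def out_of_wordlist(ocr_text, wordlist):
    """Return recognized tokens that are not in the wordlist, in order.

    Leading and trailing ASCII punctuation is stripped from each token
    before the lookup, so "word." is treated as "word".
    """
    missing = []
    for token in ocr_text.split():
        word = token.strip(string.punctuation)
        if word and word not in wordlist:
            missing.append(word)
    return missing


# Hypothetical example data, not taken from a real run.
words = load_wordlist(["USUS\n", "SISU\n"])
print(out_of_wordlist("USUS USISUSISU SISU.", words))  # prints ['USISUSISU']
```

With a real wordlist, load_wordlist() can be given an open file object
directly, since iterating over a file yields its lines.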

I would really appreciate it if anyone can help or suggest a solution.
Thank you!!

