Hi all, I am working on a project with Tesseract v4.00, and the traineddata file I get after training with my own data is always the same size. I suspect that I did not carry out the steps correctly.
The only data that I provided were:
1. training_text
2. puncs (a reduced version of the general punc file provided in the Tesseract GitHub repository)
3. numbers
4. wordlists (I made several wordlists for different training runs, ranging from 100,000 to 2,000,000)
5. font names (I also used different sets of fonts for different training runs, ranging from 1 to 20 fonts)

The steps that I followed were:
1. Created the tiff files, unicharset and other supplementary data using tesstrain.sh
2. Created the tiff files, unicharset and other supplementary data for evaluation using tesstrain.sh
3. Combined the unicharset, wordlists, puncs, numbers and version_str to create the starter traineddata using combine_lang_data (I am still not confident about the value of version_str, though)
4. Trained the model using lstmtraining
5. Combined all output files using lstmtraining --continue_from ...

Yet every training run ended with a traineddata file of the same size, 10.5 MB. Did I do all the steps correctly?

I also tried training once with WORD_DAWG_FACTOR in language-specific.sh set to 0 and then to 1, because I want the recognized text to match my wordlists 100%. The result was still not satisfactory: some of the recognized words are not in my wordlists, such as "USISUSISU". Do you know what causes this?

I would really appreciate it if anyone could help or suggest a solution.

Thank you!
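
To make the wordlist mismatch concrete, here is a minimal sketch of how the recognized text could be checked against a wordlist outside of Tesseract. It is not part of the Tesseract tooling: the function name find_out_of_wordlist and the sample wordlist entries are hypothetical, and it assumes words are separated by whitespace with no punctuation attached.

```python
def find_out_of_wordlist(ocr_text, wordlist):
    """Return the recognized tokens that do not appear in the wordlist.

    Assumption: tokens are whitespace-separated and compared exactly
    (case-sensitive, no punctuation stripping).
    """
    allowed = {w.strip() for w in wordlist if w.strip()}
    return [tok for tok in ocr_text.split() if tok not in allowed]


# Hypothetical sample data; only "USISUSISU" comes from the actual output.
wordlist = ["USA", "SIS", "USIS"]
ocr_text = "USA USISUSISU SIS"

print(find_out_of_wordlist(ocr_text, wordlist))  # ['USISUSISU']
```

Running this over the OCR output of the evaluation images would list every word that falls outside the wordlist, which is the kind of result described above.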

