Wow, thank you for your time and response! I really appreciate it. My reason for using combine_lang_data was so that my punc, wordlist, and numbers files would affect the trained data. Or doesn't it work like that?
Now I will try your shell script for training, and will share the result when it's done.

On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>
> 1. If you use tesstrain.sh, it will create the starter traineddata, you do
> NOT need to run combine_lang_data. If you want to change version string,
> look at tesstrain_utils.sh and modify the command in it.
>
> 2. If you are always getting the same size file, it looks like you
> are probably copying some old file as traineddata as part of your script.
> It could be copying from a wrong folder or some such thing.
>
> I am attaching a bash script, you can modify it for your setup and try if
> that helps.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:
>
>> Yes, I ran the following command in the tesseract/training directory:
>>
>> lstmtraining --stop_training --continue_from
>> ../result/mylangoutput/base_checkpoint --traineddata
>> ../result/mylangcombine/mylang/mylang.traineddata --model_output
>> ../result/mylangoutput/mylang.traineddata
>>
>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>
>>> Did you use --stop_training flag at the end?
>>>
>>> ShreeDevi
>>>
>>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am doing my project using Tesseract v4.00, and I always get
>>>> traineddata output of the same size after training with my own data.
>>>> I suppose that I did not do the steps correctly.
>>>>
>>>> The only data that I provided were:
>>>> 1. training_text
>>>> 2. puncs (I just reduced the general punc as provided in tesseract
>>>> github)
>>>> 3. numbers
>>>> 4. wordlists (I made various wordlists for several trainings, ranging
>>>> between 100,000 - 2,000,000)
>>>> 5. font name (I also used various fonts for several trainings, ranging
>>>> between 1 - 20 fonts)
>>>>
>>>> The steps that I did were:
>>>> 1. Made tiff file, unicharset and other complementary data using
>>>> tesstrain.sh
>>>> 2. Made tiff file, unicharset and other complementary data using
>>>> tesstrain.sh for evaluation
>>>> 3. Combined unicharset, wordlists, puncs, numbers and version_str to
>>>> create the starter traineddata using combine_lang_data (I am still not
>>>> confident about the value of version_str though)
>>>> 4. Trained data using lstmtraining
>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>
>>>> Yet all of my trainings ended with the same size, which is 10.5MB.
>>>> Did I do all my steps correctly?
>>>>
>>>> Once, I also trained after modifying WORD_DAWG_FACTOR in
>>>> language_spesific.sh to 0 and 1, because I want to read the text and match
>>>> 100% with my wordlists. But the result also did not satisfy me: some
>>>> words in the output are not in my wordlists, such as "USISUSISU".
>>>> Do you know what the cause is?
>>>>
>>>> I would really appreciate it if anyone can help or suggest a solution.
>>>> Thank you!!
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.
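Regarding point 2 in the quoted reply (an old file possibly being copied as the output), here is a minimal sketch of one way to check it: compare the size and SHA-256 hash of two traineddata files. If two runs with different wordlists produce byte-identical files, the output is probably a stale copy. The function names are my own and the paths are whatever you pass on the command line; nothing here is part of Tesseract itself.

```python
import hashlib
import sys
from pathlib import Path


def file_fingerprint(path):
    """Return (size in bytes, SHA-256 hex digest) for the file at `path`."""
    data = Path(path).read_bytes()
    return len(data), hashlib.sha256(data).hexdigest()


def is_same_file_content(path_a, path_b):
    """True if both files have identical size and identical contents."""
    return file_fingerprint(path_a) == file_fingerprint(path_b)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (paths are placeholders):
    #   python compare.py run1/mylang.traineddata run2/mylang.traineddata
    first, second = sys.argv[1], sys.argv[2]
    for p in (first, second):
        size, digest = file_fingerprint(p)
        print(f"{p}: {size} bytes, sha256={digest}")
    if is_same_file_content(first, second):
        print("Files are byte-identical: the output may be a stale copy.")
    else:
        print("Files differ: the training output did change.")
```

Note that an equal size alone does not prove the files are identical, which is why the hash is compared as well.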

