> My reason for using combine_lang_data is to make my punc, wordlist, and
> numbers files affect the trained data. Or doesn't it work like that?

If you update the files in the langdata folder and then run tesstrain.sh, it will automatically use your files.

> Now, I will try your shell script for training, and will share the result
> once it's done.

You will need to modify it according to the location of your files. Also, update the fonts list as per your requirements.

> On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>>
>> 1. If you use tesstrain.sh, it will create the starter traineddata; you
>> do NOT need to run combine_lang_data. If you want to change the version
>> string, look at tesstrain_utils.sh and modify the command in it.
>>
>> 2. If you are always getting the same size file, it looks like you
>> are probably copying some old file as traineddata as part of your script.
>> It could be copying from a wrong folder or some such thing.
>>
>> I am attaching a bash script; you can modify it for your setup and see if
>> that helps.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:
>>
>>> Yes, I ran the following command in the tesseract/training directory:
>>>
>>> lstmtraining --stop_training --continue_from
>>> ../result/mylangoutput/base_checkpoint --traineddata
>>> ../result/mylangcombine/mylang/mylang.traineddata --model_output
>>> ../result/mylangoutput/mylang.traineddata
>>>
>>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>>
>>>> Did you use the --stop_training flag at the end?
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am doing my project using Tesseract v4.00, and I always get a
>>>>> traineddata output of the same size after training with my own data.
>>>>> I suppose that I did not do the steps correctly.
>>>>>
>>>>> The only data that I provided were:
>>>>> 1. training_text
>>>>> 2. puncs (I just reduced the general punc file provided in the
>>>>> tesseract GitHub repository)
>>>>> 3. numbers
>>>>> 4. wordlists (I made various wordlists for several training runs,
>>>>> ranging between 100,000 and 2,000,000)
>>>>> 5. font name (I also used various fonts for several training runs,
>>>>> ranging between 1 and 20 fonts)
>>>>>
>>>>> The steps that I did were:
>>>>> 1. Made the tiff file, unicharset, and other complementary data using
>>>>> tesstrain.sh
>>>>> 2. Made the tiff file, unicharset, and other complementary data using
>>>>> tesstrain.sh for evaluation
>>>>> 3. Combined unicharset, wordlists, puncs, numbers, and version_str to
>>>>> create the starter traineddata using combine_lang_data (I am still not
>>>>> confident about the value of version_str, though)
>>>>> 4. Trained the data using lstmtraining
>>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>>
>>>>> Yet all of my training runs ended with the same size, which is 10.5MB.
>>>>> Did I do all my steps correctly?
>>>>>
>>>>> Once, I also trained after changing WORD_DAWG_FACTOR in
>>>>> language-specific.sh to 0 and 1, because I want the recognized text to
>>>>> match my wordlists 100%. But the result did not satisfy me either: some
>>>>> output words, such as "USISUSISU", are not in my wordlists.
>>>>> Do you know what the cause is?
>>>>>
>>>>> I would really appreciate it if anyone can help or suggest a solution.
>>>>> Thank you!

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWAfMDAJeT2N_DknMdjAgwV5KT-zDhaneXzR6sdTQDrXQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
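
For anyone following this thread later, here is a minimal shell sketch of the two suggestions made above: let tesstrain.sh build the starter traineddata from your langdata files, then check that the final file is not just a copy of an old one. It is not the attached script. The language code, the directories, the "Arial" font, and the helper function names are all hypothetical placeholders, and it assumes tesstrain.sh and lstmtraining are on your PATH and that tesstrain.sh writes the starter file to <output_dir>/<lang>/<lang>.traineddata. The lstmtraining training run itself is left out.

```shell
#!/bin/sh
# Sketch only: LANG_CODE and every directory below are hypothetical placeholders.
LANG_CODE="mylang"
LANGDATA_DIR="../langdata"
TESSDATA_DIR="../tessdata"
TRAIN_DIR="../result/mylangtrain"
OUTPUT_DIR="../result/mylangoutput"

# Step 1: let tesstrain.sh render the training text and build the starter
# traineddata from the files under LANGDATA_DIR, so no separate
# combine_lang_data run is needed. "Arial" is a placeholder font name.
make_training_data() {
  tesstrain.sh --lang "$LANG_CODE" --linedata_only \
    --noextract_font_properties \
    --langdata_dir "$LANGDATA_DIR" \
    --tessdata_dir "$TESSDATA_DIR" \
    --fontlist "Arial" \
    --output_dir "$TRAIN_DIR"
}

# Step 2 (after training): turn the checkpoint into a traineddata file,
# using the same flags as the command quoted in this thread. The starter
# traineddata location is an assumption about the tesstrain.sh output layout.
finalize_model() {
  lstmtraining --stop_training \
    --continue_from "$OUTPUT_DIR/base_checkpoint" \
    --traineddata "$TRAIN_DIR/$LANG_CODE/$LANG_CODE.traineddata" \
    --model_output "$OUTPUT_DIR/$LANG_CODE.traineddata"
}

# Step 3: succeed (exit status 0) only when the two files are byte-identical,
# which would suggest an old file is being copied instead of a new model.
same_traineddata() {
  cmp -s "$1" "$2"
}

# Print a file's size in bytes, for a quick side-by-side comparison.
traineddata_size() {
  wc -c < "$1" | tr -d ' '
}
```

Nothing runs when the file is sourced. Call make_training_data, train, call finalize_model, and then run something like `same_traineddata old.traineddata "$OUTPUT_DIR/$LANG_CODE.traineddata" && echo "identical: check your copy step"`. Two different models can still have the same size, so comparing contents tells you more than comparing sizes.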

