Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

ShreeDevi Kumar Tue, 09 Jan 2018 04:36:40 -0800

>
>
> My reason for using combine_lang_data is to make my punc, wordlist, and
> numbers files affect the trained data. Or does it not work like that?
>


If you update the files in langdata folder and then run tesstrain.sh, it
will automatically use your files.
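
A minimal sketch of that workflow, in case it helps. It assumes the usual
langdata layout (langdata/<lang>/<lang>.wordlist and so on); the language code
"mylang", the helper name check_langdata, the font name, and every path are
placeholders rather than anything from your setup.

```shell
#!/bin/bash
# Sketch: confirm the customised langdata files are in place before a
# tesstrain.sh run. "mylang" and all paths below are placeholders.

# Report each expected langdata file that is missing or empty.
# Returns 0 when all four are present, 1 otherwise.
check_langdata() {
  local langdata_dir="$1"
  local lang="$2"
  local missing=0
  local ext
  for ext in training_text wordlist punc numbers; do
    if [ ! -s "$langdata_dir/$lang/$lang.$ext" ]; then
      echo "missing or empty: $langdata_dir/$lang/$lang.$ext" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use (paths and font are assumptions, adjust to your checkout):
#   check_langdata ../langdata mylang && \
#   src/training/tesstrain.sh --lang mylang --linedata_only \
#     --noextract_font_properties \
#     --langdata_dir ../langdata --tessdata_dir ./tessdata \
#     --fonts_dir /usr/share/fonts --fontlist "Arial" \
#     --output_dir ../result/mylangtrain
```

The check only looks at the four files you mentioned; it does not validate
their contents.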


>
> Now I will try your shell script for training, and will share the result
> once it is done.
>

You will need to modify it according to the location of your files.

Also, update the fonts list as per your requirements.
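
On the "same size" question from my earlier reply (point 2 in the quote
below), a quick way to check whether two outputs are really the same file and
not just the same size is to compare their contents. This is only a sketch: the
function names is_stale_copy and describe are made up for illustration, and
the paths in the usage comment are placeholders.

```shell
#!/bin/bash
# Sketch: tell "same size" apart from "same file". Paths are placeholders.

# Succeeds when both files are byte-for-byte identical, which would suggest
# an old traineddata is being copied instead of a new one being written.
is_stale_copy() {
  cmp -s "$1" "$2"
}

# Print the size in bytes and the CRC checksum of each file given.
describe() {
  local f
  for f in "$@"; do
    printf '%s  %s bytes  cksum %s\n' "$f" \
      "$(wc -c < "$f" | tr -d ' ')" \
      "$(cksum < "$f" | cut -d' ' -f1)"
  done
}

# Example use (placeholder paths):
#   describe run1/mylang.traineddata run2/mylang.traineddata
#   is_stale_copy run1/mylang.traineddata run2/mylang.traineddata \
#     && echo "identical content: check where the script copies from"
#   combine_tessdata -d run2/mylang.traineddata   # list the components inside
```

If the checksums differ while the sizes match, the files are different and the
copy step is probably not the problem.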


>
>
> On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>>
>> 1. If you use tesstrain.sh, it will create the starter traineddata, you
>> do NOT need to run combine_lang_data. If you want to change version string,
>> look at tesstrain_utils.sh and modify the command in it.
>>
>> 2. If you are always getting the same size file, your script is probably
>> copying some old file as the traineddata. It could be copying from the
>> wrong folder, or something similar.
>>
>> I am attaching a bash script; you can modify it for your setup and see if
>> that helps.
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:
>>
>>> Yes, I ran the following command in the tesseract/training directory:
>>>
>>> lstmtraining --stop_training --continue_from
>>> ../result/mylangoutput/base_checkpoint --traineddata
>>> ../result/mylangcombine/mylang/mylang.traineddata --model_output
>>> ../result/mylangoutput/mylang.traineddata
>>>
>>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>>
>>>> Did you use the --stop_training flag at the end?
>>>>
>>>> ShreeDevi
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am working on a project using Tesseract v4.00, and the traineddata
>>>>> output always comes out the same size after training with my own data.
>>>>> I suspect that I did not do the steps correctly.
>>>>>
>>>>> The only data that I provided were:
>>>>> 1. training_text
>>>>> 2. puncs (a reduced version of the general punc file provided in the
>>>>> tesseract GitHub repository)
>>>>> 3. numbers
>>>>> 4. wordlists (I made various wordlists for several training runs, ranging
>>>>> from 100,000 to 2,000,000 entries)
>>>>> 5. font names (I also used various font sets for several training runs,
>>>>> ranging from 1 to 20 fonts)
>>>>>
>>>>> The steps that I followed were:
>>>>> 1. Made the tiff files, unicharset and other supporting data using
>>>>> tesstrain.sh
>>>>> 2. Made the tiff files, unicharset and other supporting data using
>>>>> tesstrain.sh for evaluation
>>>>> 3. Combined unicharset, wordlists, puncs, numbers and version_str to
>>>>> create the starter traineddata using combine_lang_data (I am still not
>>>>> confident about the value of version_str, though)
>>>>> 4. Trained using lstmtraining
>>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>>
>>>>> Yet every training run ended with the same file size, 10.5MB.
>>>>> Did I do all my steps correctly?
>>>>>
>>>>> I also tried training with WORD_DAWG_FACTOR in language-specific.sh set
>>>>> to 0 and to 1, because I want the recognized text to match my wordlists
>>>>> 100%. The result was still not satisfactory: the output contained words
>>>>> that are not in my wordlists, such as "USISUSISU".
>>>>> Do you know what the cause is?
>>>>>
>>>>> I would really appreciate it if anyone could help or suggest a solution.
>>>>> Thank you!
>>>>>
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com.
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>

