Re: [tesseract-ocr] Traineddata always ended in same size and did not match with wordlist

easymavinmind Tue, 09 Jan 2018 03:45:40 -0800

Wow, thank you for your time and response!
I really appreciate that.

My reason for using combine_lang_data was to make my punc, wordlist, and 
numbers files affect the trained data. Or does it not work like that?
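
One rough way to check whether the wordlist is constraining the output is to 
list the recognized words that do not appear in it. The sketch below assumes 
a plain-text wordlist with one word per line and a text file holding the OCR 
output; the function name words_not_in_wordlist and the file names in the 
usage comment are hypothetical and not part of Tesseract.

```shell
# Print each distinct whitespace-separated token of the OCR output that is
# not an exact, whole-line match for any entry in the wordlist.
# Usage (hypothetical file names):
#   words_not_in_wordlist mylang.wordlist ocr_output.txt
words_not_in_wordlist() {
    wordlist=$1
    ocr_text=$2
    # Split on whitespace into one token per line, drop empty lines,
    # de-duplicate, then keep only tokens absent from the wordlist.
    # -F: fixed strings, -x: whole-line match, -v: invert, -f: pattern file.
    tr -s '[:space:]' '\n' < "$ocr_text" \
        | sed '/^$/d' \
        | sort -u \
        | grep -F -x -v -f "$wordlist"
}
```

The function exits with a non-zero status when every token is found in the 
wordlist, because grep then prints nothing. Tokens with attached punctuation 
are reported as missing, so the output is only a first approximation.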


I will now try your shell script for training and will share the result 
when it is done.
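
When comparing the output of several runs, a byte-level comparison shows 
whether two traineddata files are really different or whether one is just a 
stale copy of the other. This is a minimal sketch using only standard tools 
(cmp, cksum, wc); the function name compare_traineddata and the paths in the 
usage comment are hypothetical.

```shell
# Report whether two files are byte-for-byte identical.
# Returns 0 if identical, 1 if they differ, 2 if either file is missing.
# Usage (hypothetical paths):
#   compare_traineddata run1/mylang.traineddata run2/mylang.traineddata
compare_traineddata() {
    a=$1
    b=$2
    if [ ! -f "$a" ] || [ ! -f "$b" ]; then
        echo "missing file: $a or $b" >&2
        return 2
    fi
    # Show the size in bytes and the CRC checksum of each file.
    for f in "$a" "$b"; do
        printf '%s bytes, cksum %s  %s\n' \
            "$(wc -c < "$f" | tr -d ' ')" \
            "$(cksum < "$f" | cut -d ' ' -f 1)" \
            "$f"
    done
    # cmp -s is silent and exits 0 only when the contents are identical.
    if cmp -s "$a" "$b"; then
        echo "identical contents"
        return 0
    fi
    echo "different contents"
    return 1
}
```

Identical contents across runs with different inputs would point to an old 
file being copied; equal sizes with different checksums would mean that new 
contents were written.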


On Tuesday, January 9, 2018 at 6:17:40 PM UTC+7, shree wrote:
>
> 1. If you use tesstrain.sh, it will create the starter traineddata; you do 
> NOT need to run combine_lang_data. If you want to change the version string, 
> look at tesstrain_utils.sh and modify the command in it.
>
> 2. If you are always getting a file of the same size, you are probably 
> copying some old file as the traineddata as part of your script. 
> It could be copying from the wrong folder or something similar.
>
> I am attaching a bash script; you can modify it for your setup and see if 
> that helps.
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Tue, Jan 9, 2018 at 9:39 AM, <[email protected]> wrote:
>
>> Yes, I ran the following command in the tesseract/training directory:
>>
>> lstmtraining --stop_training \
>>   --continue_from ../result/mylangoutput/base_checkpoint \
>>   --traineddata ../result/mylangcombine/mylang/mylang.traineddata \
>>   --model_output ../result/mylangoutput/mylang.traineddata
>>
>> On Monday, January 8, 2018 at 7:36:50 PM UTC+7, shree wrote:
>>>
>>> Did you use --stop_training flag at the end?
>>>
>>> ShreeDevi
>>>
>>> On Mon, Jan 8, 2018 at 5:51 PM, <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am working on a project using Tesseract v4.00, and the traineddata 
>>>> output always ends up the same size after training with my own data.
>>>> I suspect that I did not do the steps correctly.
>>>>
>>>> The only data that I provided were:
>>>> 1. training_text
>>>> 2. puncs (I just reduced the general punc file provided in the 
>>>> Tesseract GitHub repository)
>>>> 3. numbers
>>>> 4. wordlists (I made various wordlists for several training runs, 
>>>> ranging between 100,000 and 2,000,000)
>>>> 5. font names (I also used various sets of fonts for several training 
>>>> runs, ranging from 1 to 20 fonts)
>>>>
>>>> The steps that I followed were:
>>>> 1. Generated the tiff files, unicharset, and other supporting data 
>>>> using tesstrain.sh
>>>> 2. Generated the tiff files, unicharset, and other supporting data 
>>>> using tesstrain.sh for evaluation
>>>> 3. Combined the unicharset, wordlists, puncs, numbers, and version_str 
>>>> to create the starter traineddata using combine_lang_data (I am still 
>>>> not confident about the value of version_str, though)
>>>> 4. Trained the data using lstmtraining
>>>> 5. Combined all output files using lstmtraining --continue_from ...
>>>>
>>>> Yet all of my training runs ended with the same file size, 10.5 MB.
>>>> Did I do all of the steps correctly?
>>>>
>>>> I also once trained with WORD_DAWG_FACTOR in language-specific.sh set 
>>>> to 0 and to 1, because I want the recognized text to match my wordlists 
>>>> 100%. But the result still did not satisfy me: some output words, such 
>>>> as "USISUSISU", are not in my wordlists.
>>>> Do you know what the cause is?
>>>>
>>>> I would really appreciate it if anyone could help or suggest a solution.
>>>> Thank you!
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/b6ca74b2-1e50-44cb-93f6-586fcd26cec5%40googlegroups.com
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

