Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

Ahmad Moawad Wed, 12 Apr 2017 02:06:07 -0700

Thanks for your reply, Shree, I appreciate it. My question is whether the 
steps quoted below are the right path for training Tesseract 4.0 LSTM or not.
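
For comparison with the 3.0x-style commands quoted below, here is a rough 
sketch of the LSTM path the wiki describes: image plus box file to .lstmf, 
then a single lstmtraining run, then combining into a traineddata file. The 
file names, the out/ directory, the iteration count, an existing 
ara.traineddata in the working directory, and the helper functions (run, 
lstmf_list, lstm_pipeline) are all assumptions for illustration. The flag 
names follow later 4.0 documentation and may differ in the alpha build, so 
check each tool's help output first. By default the script only prints the 
commands instead of running them.

```shell
#!/bin/sh
# Sketch only: LSTM training from existing images that already have box files.
# Assumed, not taken from the thread: file layout, out/ directory, iteration
# count, and the exact flag names (they changed between 4.0 builds).

DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them

# Print a command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Print one .lstmf path per input image; the trainer reads such a list file.
lstmf_list() {
  for img in "$@"; do
    echo "${img%.tif}.lstmf"
  done
}

# Usage: lstm_pipeline LANG IMAGE.tif [IMAGE.tif ...]
lstm_pipeline() {
  lang="$1"; shift
  list="$lang.training_files.txt"

  # Step 5: image + box file -> .lstmf (these replace the .tr files).
  for img in "$@"; do
    run tesseract "$img" "${img%.tif}" lstm.train
  done

  # Collect the generated .lstmf files into the list file.
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ write $list"
  else
    lstmf_list "$@" > "$list"
  fi

  # Step 6: one lstmtraining run replaces mftraining/cntraining/shapeclustering.
  # Here it fine-tunes the LSTM component extracted from an existing model.
  run mkdir -p out
  run combine_tessdata -e "$lang.traineddata" "$lang.lstm"
  run lstmtraining --continue_from "$lang.lstm" \
      --traineddata "$lang.traineddata" \
      --train_listfile "$list" \
      --model_output "out/$lang" \
      --max_iterations 400

  # Step 7: turn the checkpoint into a usable traineddata file.
  run lstmtraining --stop_training \
      --continue_from "out/${lang}_checkpoint" \
      --traineddata "$lang.traineddata" \
      --model_output "out/$lang.traineddata"
}

lstm_pipeline ara ara.arial.exp4.tif
```

Setting DRY_RUN=0 would execute the same commands, which only makes sense once 
the printed command lines have been checked against the installed tools.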


On Wednesday, April 12, 2017 at 10:49:24 AM UTC+2, shree wrote:
>
> Read the bash scripts
>
> tesstrain.sh
> tesstrain_utils.sh
> language_specific.sh
>
> in the training directory to understand LSTM training in more detail.
>
> - excuse the brevity, sent from mobile
>
> On 12-Apr-2017 10:47 AM, "Ahmad Moawad" <[email protected]> wrote:
>
>> This is the part from 
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> My question is about training from existing images, not about generating 
>> training data from text.
>>
>>
>> The overall training process is similar to training 3.04 
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>. 
>> Conceptually the same:
>>
>>    1. Prepare training text 
>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>.
>>    2. Render text to image + box file. (Or create hand-made box files 
>>    for existing image data.)
>>    3. Make unicharset file.
>>    4. Optionally make dictionary data.
>>    5. Run tesseract to process image + box file to make training data 
>>    set.
>>    6. Run training on training data set.
>>    7. Combine data files.
>>
>> Are the above steps similar to: 
>>
>> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
>> unicharset_extractor ara.arial.exp4.box
>> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font
>> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
>> shapeclustering -F font_properties -U unicharset ara.arial.exp4.tr
>> cntraining ara.arial.exp4.tr
>>
>> mv inttemp ara.inttemp
>> mv normproto ara.normproto
>> mv pffmtable ara.pffmtable
>> mv shapetable ara.shapetable
>> combine_tessdata ara.
>>
>>
>> Should I use these steps or not?
>>
>>
>> The key differences are:
>>
>>    - The boxes only need to be at the *textline level.* It is thus *far 
>>    easier* to make training data from existing image data.
>>    - The .tr files are replaced by .lstmf data files.
>>    - Fonts *can and should be mixed freely* instead of being separate.
>>    - The clustering steps (mftraining, cntraining, shapeclustering) are 
>>    replaced with a single slow lstmtraining step.
>>
>> I don't know a lot about this part.
>>
>>
>> Thanks!
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/60029df8-2149-4f5a-8c3a-32e96c27ce79%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
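
Since the wiki text quoted above allows hand-made box files for existing image 
data, a quick sanity check before generating training data is that every image 
has a box file next to it. This is a minimal sketch, assuming .tif images and 
box files that share the image's base name (as in ara.arial.exp4.tif and 
ara.arial.exp4.box); the function name missing_boxes is made up for this 
example.

```shell
#!/bin/sh
# Report images that have no matching box file.
# Assumption: images are .tif files and each box file shares the image's
# base name, e.g. ara.arial.exp4.tif / ara.arial.exp4.box.

# Usage: missing_boxes [DIRECTORY]   (defaults to the current directory)
# Returns 0 if every image has a box file, 1 otherwise.
missing_boxes() {
  dir="${1:-.}"
  status=0
  for img in "$dir"/*.tif; do
    [ -e "$img" ] || continue        # no .tif files: the glob stayed literal
    box="${img%.tif}.box"
    if [ ! -f "$box" ]; then
      echo "missing box file: $box"
      status=1
    fi
  done
  return "$status"
}
```

The non-zero return value makes it easy to stop a training script early, for 
example with `missing_boxes ./images || exit 1`.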

