Re: [tesseract-ocr] Train Tesseract 4.0 LSTM based on images

ShreeDevi Kumar Wed, 12 Apr 2017 01:50:01 -0700

Read the bash scripts

tesstrain.sh
tesstrain_utils.sh
language_specific.sh

in the training directory to understand LSTM training in more detail.
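
As a rough illustration of where those scripts end up when you start from existing images, here is a minimal dry-run sketch. It is not taken from tesstrain.sh itself. It assumes you already have image + box file pairs with textline-level boxes, that you fine-tune from an existing ara.traineddata rather than train from scratch, and that the lstmtraining flag names match the 4.0 release (check lstmtraining --help on your build, since they changed during the 4.0 alpha). DATA_DIR, OUT_DIR, BASE_TRAINEDDATA, the iteration count and the file names are placeholders I made up.

```shell
#!/bin/sh
# Sketch only: prints each command unless DRY_RUN=0 is set.
# DATA_DIR, OUT_DIR, BASE_TRAINEDDATA and the "ara" names are hypothetical.
DRY_RUN="${DRY_RUN:-1}"
DATA_DIR="${DATA_DIR:-./train_data}"   # NAME.tif + NAME.box pairs (textline-level boxes)
OUT_DIR="${OUT_DIR:-./train_out}"
BASE_TRAINEDDATA="${BASE_TRAINEDDATA:-./tessdata/ara.traineddata}"
LIST_FILE="$OUT_DIR/ara.training_files.txt"

# Echo the command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

mkdir -p "$OUT_DIR"
: > "$LIST_FILE"

# Step 5: image + box file -> .lstmf (these replace the old .tr files).
for tif in "$DATA_DIR"/*.tif; do
    [ -e "$tif" ] || continue
    base="${tif%.tif}"
    run tesseract "$tif" "$base" lstm.train
    echo "$base.lstmf" >> "$LIST_FILE"
done

# Step 6: a single lstmtraining run replaces mftraining, cntraining
# and shapeclustering. Here it continues from an extracted LSTM model.
run combine_tessdata -e "$BASE_TRAINEDDATA" "$OUT_DIR/ara.lstm"
run lstmtraining \
    --continue_from "$OUT_DIR/ara.lstm" \
    --traineddata "$BASE_TRAINEDDATA" \
    --train_listfile "$LIST_FILE" \
    --model_output "$OUT_DIR/ara" \
    --max_iterations 400

# Step 7: convert the training checkpoint into a usable .traineddata file.
run lstmtraining --stop_training \
    --continue_from "$OUT_DIR/ara_checkpoint" \
    --traineddata "$BASE_TRAINEDDATA" \
    --model_output "$OUT_DIR/ara.traineddata"
```

Run it as is to see the commands, then set DRY_RUN=0 once the paths point at real data.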

- excuse the brevity, sent from mobile

On 12-Apr-2017 10:47 AM, "Ahmad Moawad" <ahmadmoaw...@gmail.com> wrote:

> This is the part from
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> My question is about training from existing images, not from rendered text.
>
>
> The overall training process is similar to training 3.04
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
> Conceptually the same:
>
>    1. Prepare training text.
>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>    2. Render text to image + box file. (Or create hand-made box files for
>    existing image data.)
>    3. Make unicharset file.
>    4. Optionally make dictionary data.
>    5. Run tesseract to process image + box file to make training data set.
>    6. Run training on training data set.
>    7. Combine data files.
>
> Are the above steps similar to:
>
> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
> unicharset_extractor ara.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
> shapeclustering -F font_properties -U unicharset ara.arial.exp4.tr
> cntraining ara.arial.exp4.tr
>
> mv inttemp ara.inttemp
> mv normproto ara.normproto
> mv pffmtable ara.pffmtable
> mv shapetable ara.shapetable
> combine_tessdata ara.
>
>
> Should I use these steps or not?
>
>
> The key differences are:
>
>    - The boxes only need to be at the *textline level.* It is thus *far
>    easier* to make training data from existing image data.
>    - The .tr files are replaced by .lstmf data files.
>    - Fonts *can and should be mixed freely* instead of being separate.
>    - The clustering steps (mftraining, cntraining, shapeclustering) are
>    replaced with a single slow lstmtraining step.
>
> I don't know a lot about this part.
>
>
> Thanks!
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/60029df8-2149-4f5a-8c3a-32e96c27ce79%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

