[tesseract-ocr] Train Tesseract 4.0 LSTM based on images

Ahmad Moawad Tue, 11 Apr 2017 22:17:53 -0700


The following is quoted from
https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

My question is about training from existing images, not about training
from rendered text.

The overall training process is similar to training 3.04
<https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>.
Conceptually, the steps are the same:

1. Prepare training text
<https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>.
2. Render text to image + box file. (Or create hand-made box files for
existing image data.)
3. Make unicharset file.
4. Optionally make dictionary data.
5. Run tesseract to process image + box file to make training data set.
6. Run training on training data set.
7. Combine data files.
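
For the image-based route (hand-made box files for existing images), steps 5
to 7 might look roughly like the sketch below. It assumes you fine-tune an
existing model, so steps 3 and 4 reuse the unicharset and dictionaries already
inside that model. The paths, the lstm_out directory, the iteration count and
the run helper are hypothetical, and the flag names follow the released 4.0
training documentation; alpha builds may spell some of them differently, so
check your own build. By default the script only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: by default the commands are only printed.
# Set DRY_RUN=0 to execute them (needs the Tesseract 4 training tools).
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical names and paths; adjust to your own setup.
LANG_CODE=ara
BASE_MODEL="tessdata/$LANG_CODE.traineddata"   # existing model to fine-tune
IMAGES="ara.arial.exp4"                        # each needs NAME.tif + NAME.box
OUT_DIR=lstm_out
LIST_FILE="$OUT_DIR/$LANG_CODE.training_files.txt"

lstm_pipeline() {
    run mkdir -p "$OUT_DIR"

    # Step 5: image + textline-level box file -> NAME.lstmf
    for name in $IMAGES; do
        run tesseract "$name.tif" "$name" lstm.train
    done

    # lstmtraining takes a text file listing the .lstmf files, one per line.
    if [ "$DRY_RUN" = "1" ]; then
        echo "# would write the .lstmf file names to $LIST_FILE"
    else
        for name in $IMAGES; do echo "$name.lstmf"; done > "$LIST_FILE"
    fi

    # Step 6: extract the LSTM network from the base model and keep training it.
    run combine_tessdata -e "$BASE_MODEL" "$OUT_DIR/$LANG_CODE.lstm"
    run lstmtraining \
        --model_output "$OUT_DIR/$LANG_CODE" \
        --continue_from "$OUT_DIR/$LANG_CODE.lstm" \
        --traineddata "$BASE_MODEL" \
        --train_listfile "$LIST_FILE" \
        --max_iterations 400

    # Step 7: turn the last checkpoint into a usable .traineddata file.
    run lstmtraining --stop_training \
        --continue_from "$OUT_DIR/${LANG_CODE}_checkpoint" \
        --traineddata "$BASE_MODEL" \
        --model_output "$OUT_DIR/$LANG_CODE.traineddata"
}

lstm_pipeline
```

Training from scratch instead of fine-tuning would additionally need a
starter traineddata built from your own unicharset and a network
specification, which this sketch leaves out.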

Are the seven steps above similar to the following 3.0x-style commands?

tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
unicharset_extractor ara.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract about the font's properties
mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
shapeclustering -F font_properties -U unicharset ara.arial.exp4.tr
cntraining ara.arial.exp4.tr

mv inttemp ara.inttemp
mv normproto ara.normproto
mv pffmtable ara.pffmtable
mv shapetable ara.shapetable
combine_tessdata ara.

Should I use these steps or not?

The key differences are:

- The boxes only need to be at the *textline level.* It is thus *far
easier* to make training data from existing image data.
- The .tr files are replaced by .lstmf data files.
- Fonts *can and should be mixed freely* instead of being separate.
- The clustering steps (mftraining, cntraining, shapeclustering) are
replaced with a single slow lstmtraining step.
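
To illustrate the first point, here is a minimal sketch of writing a
textline-level box file for an image that contains exactly one line of text.
It assumes the usual box line layout (character, left, bottom, right, top,
page, with the origin at the bottom left), that every character simply
repeats the box of the whole line, and that a tab "character" marks the end
of the line. The function name make_line_box and the example sizes are
hypothetical.

```shell
#!/bin/sh
# make_line_box TEXT WIDTH HEIGHT
# Prints one box line per character of TEXT. Every character gets the box
# of the whole textline (0 0 WIDTH HEIGHT), and a final line starting with
# a tab marks the end of the textline.
# Note: plain sh has no local variables, so these names are global.
make_line_box() {
    text=$1
    width=$2
    height=$3
    while [ -n "$text" ]; do
        rest=${text#?}          # text without its first character
        ch=${text%"$rest"}      # the first character itself
        printf '%s 0 0 %d %d 0\n' "$ch" "$width" "$height"
        text=$rest
    done
    printf '\t 0 0 %d %d 0\n' "$width" "$height"
}

# Example: a 100x30 pixel image containing the line "ab".
make_line_box "ab" 100 30
```

Splitting Arabic text into characters this way needs a shell that handles
multibyte characters (for example bash in a UTF-8 locale), and the sketch
does not keep combining marks together with their base letter, which real
training data may require.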

I don't know much about this part.

Thanks!

