Read the bash scripts tesstrain.sh, tesstrain_utils.sh and language_specific.sh in the training directory to understand LSTM training in more detail.

- excuse the brevity, sent from mobile

On 12-Apr-2017 10:47 AM, "Ahmad Moawad" <ahmadmoaw...@gmail.com> wrote:

> This is the part from
> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>
> My question relates to the image part, not to making training data from text.
>
> The overall training process is similar to training 3.04
> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>.
> Conceptually the same:
>
> 1. Prepare training text.
>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
> 2. Render text to image + box file. (Or create hand-made box files for
>    existing image data.)
> 3. Make unicharset file.
> 4. Optionally make dictionary data.
> 5. Run tesseract to process image + box file to make training data set.
> 6. Run training on training data set.
> 7. Combine data files.
>
> Are the above steps similar to:
>
> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
> unicharset_extractor ara.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
> shapeclustering -F unicharset ara.arial.exp4.tr
> cntraining ara.arial.exp4.tr
>
> mv inttemp ara.inttemp
> mv normproto ara.normproto
> mv pffmtable ara.pffmtable
> mv shapetable ara.shapetable
> combine_tessdata ara.
>
> Should I use these steps or not?
>
> The key differences are:
>
> - The boxes only need to be at the *textline level.* It is thus *far
>   easier* to make training data from existing image data.
> - The .tr files are replaced by .lstmf data files.
> - Fonts *can and should be mixed freely* instead of being separate.
> - The clustering steps (mftraining, cntraining, shapeclustering) are
>   replaced with a single slow lstmtraining step.
>
> I don't know a lot about this part.
>
> Thanks!
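
As a starting point for reading those three scripts, here is a rough sketch that lists every line in them that mentions LSTM. The function name find_lstm_lines and the TRAINING_DIR variable are made up for illustration; point TRAINING_DIR at the training directory of your own tesseract checkout. It only searches the files, it does not run any training.

```shell
#!/usr/bin/env bash
# Sketch: show which lines of the three training scripts mention LSTM.
# find_lstm_lines and TRAINING_DIR are illustrative names, not part of tesseract.

find_lstm_lines() {
  local dir="$1"
  local script
  for script in tesstrain.sh tesstrain_utils.sh language_specific.sh; do
    if [ -f "$dir/$script" ]; then
      # -n prints line numbers, -i ignores case, -H prefixes the file name.
      # "|| true" keeps going when a script has no matching line.
      grep -n -i -H 'lstm' "$dir/$script" || true
    else
      echo "missing: $dir/$script" >&2
    fi
  done
}

# Run the search only when TRAINING_DIR has been set by the caller.
if [ -n "${TRAINING_DIR:-}" ]; then
  find_lstm_lines "$TRAINING_DIR"
fi
```

For example, save it as a file and run it with TRAINING_DIR set to the path of your training directory. The output is file:line:text, so you can jump straight to the relevant parts of each script.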