Thanks, Shree, for your reply; I appreciate it. My question is: is that the right path for training Tesseract 4.0 LSTM or not?
On Wednesday, April 12, 2017 at 10:49:24 AM UTC+2, shree wrote:
>
> Read the bash scripts in
>
> tesstrain.sh
> tesstrain_utils.sh
> language_specific.sh
>
> In training directory
>
> To understand more detail about lstm training
>
> - excuse the brevity, sent from mobile
>
> On 12-Apr-2017 10:47 AM, "Ahmad Moawad" <[email protected]> wrote:
>
>> This is the part from
>> https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00
>>
>> My question relates to the image part, not to making training data from text.
>>
>> The overall training process is similar to training 3.04
>> <https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract>
>> Conceptually the same:
>>
>> 1. Prepare training text.
>>    <https://github.com/tesseract-ocr/tesseract/issues/654#issuecomment-274574951>
>> 2. Render text to image + box file. (Or create hand-made box files
>>    for existing image data.)
>> 3. Make unicharset file.
>> 4. Optionally make dictionary data.
>> 5. Run tesseract to process image + box file to make training data set.
>> 6. Run training on training data set.
>> 7. Combine data files.
>>
>> Are the above steps similar to:
>>
>> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
>> unicharset_extractor ara.arial.exp4.box
>> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
>> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
>> shapeclustering -F unicharset ara.arial.exp4.tr
>> cntraining ara.arial.exp4.tr
>>
>> mv inttemp ara.inttemp
>> mv normproto ara.normproto
>> mv pffmtable ara.pffmtable
>> mv shapetable ara.shapetable
>> combine_tessdata ara.
>>
>> Should I use these steps or not?
>>
>> The key differences are:
>>
>> - The boxes only need to be at the *textline level.* It is thus *far
>>   easier* to make training data from existing image data.
>> - The .tr files are replaced by .lstmf data files.
>> - Fonts *can and should be mixed freely* instead of being separate.
>> - The clustering steps (mftraining, cntraining, shapeclustering) are
>>   replaced with a single slow lstmtraining step.
>>
>> For this part, I don't know a lot about it.
>>
>> Thanks!
>>
>
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/c698286e-f9d5-4d7c-85ae-22a763a0d05b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

