LSTM training is not like legacy training. Please read the wiki pages regarding 4.0 training; I have given all sample commands there. There are 3 different ways of training.
Read the bash scripts regarding training to know more. tesstrain.sh with --linedata_only creates the box/tiff pairs, but only the lstmf files are saved in the output dir. Without --linedata_only you will get 3.0 traineddata. There are multiple steps to be done using the lstmf files to create the final 4.0 traineddata.

Since you want to write a tutorial, please do your own reading and trials first.

- excuse the brevity, sent from mobile

On 12-Apr-2017 4:08 PM, <srns...@gmail.com> wrote:

> Sorry, I have given wrong commands for Arabic. Actually I was referring to
> English.
>
> tesseract eng.arial.exp4.tif eng.arial.exp4 nobatch box.train
> unicharset_extractor eng.arial.exp4.box
> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
> mftraining -F font_properties -U unicharset -O eng.unicharset eng.arial.exp4.tr
> shapeclustering -F unicharset eng.arial.exp4.tr
> cntraining eng.arial.exp4.tr
>
> mv inttemp eng.inttemp
> mv normproto eng.normproto
> mv pffmtable eng.pffmtable
> mv shapetable eng.shapetable
> combine_tessdata eng.
>
> The commands above are for Tesseract 3.0. I request you to suggest the
> changes needed for Tesseract 4.0. I plan to publish a rigorous
> explanatory tutorial after getting an overview of all the steps.
> Thank you.
>
> On Wednesday, April 12, 2017 at 4:04:42 PM UTC+5:30, shree wrote:
>>
>> Arabic was never trained with the legacy tesseract engine and I doubt you
>> will get any improvement over existing traineddata using cube or lstm.
>>
>> You are free to experiment and see what you come up with.
>>
>> I have pointed to the bash scripts for training. Please refer to them for
>> the correct process.
>>
>> - excuse the brevity, sent from mobile
>>
>> On 12-Apr-2017 4:00 PM, <srn...@gmail.com> wrote:
>>
>>> Hello shree, thank you for your valuable reply. Are there any changes I
>>> need to follow for the steps below? I request you to suggest the changes
>>> for the below commands; these are for Tesseract 3.0.
>>>
>>> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
>>> unicharset_extractor ara.arial.exp4.box
>>> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
>>> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
>>> shapeclustering -F unicharset ara.arial.exp4.tr
>>> cntraining ara.arial.exp4.tr
>>>
>>> mv inttemp ara.inttemp
>>> mv normproto ara.normproto
>>> mv pffmtable ara.pffmtable
>>> mv shapetable ara.shapetable
>>> combine_tessdata ara.
>>>
>>> Please suggest changes for the above steps. I plan to publish a rigorous
>>> explanatory tutorial after getting an overview of all the steps.
>>> Thank you.
>>>
>>> On Wednesday, April 12, 2017 at 3:38:11 PM UTC+5:30, shree wrote:
>>>>
>>>> see https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>>>
>>>> if ((LINEDATA)); then
>>>>   phase_E_extract_features "lstm.train" 8 "lstmf"
>>>>   make__lstmdata
>>>> else
>>>>   phase_E_extract_features "box.train" 8 "tr"
>>>>   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
>>>>   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
>>>>     phase_S_cluster_shapes
>>>>   fi
>>>>   phase_M_cluster_microfeatures
>>>>   phase_B_generate_ambiguities
>>>>   make__traineddata
>>>> fi
>>>>
>>>> --------------------
>>>>
>>>> lstm.train is for LSTM training
>>>>
>>>> box.train is for 3.0 Tesseract legacy engine training
>>>>
>>>> Please note that current master code is for alpha testing for 4.0 LSTM
>>>> and will most probably drop support for legacy engine.
>>>>
>>>> If you want the legacy tesseract engine and train for it, please use
>>>> the 3.05 branch of the github repo.

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduXOm1xgt697X%2By87W-vyygXzLuL%2BwN2yL55Ud28qgYB3g%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
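
For anyone following the thread, the tesstrain.sh branch quoted earlier can be read as a simple two-way switch. Below is a rough dry-run sketch of that switch, not the real script: plan_phases is a hypothetical helper that only prints the phase names from the quoted snippet in the order they would run, so nothing here calls Tesseract. The two arguments (1 or 0 for the line-data mode, y or n for shape clustering) are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the phases named in the quoted tesstrain.sh branch.
# plan_phases is a hypothetical helper and is NOT part of Tesseract.
#   $1: 1 for the LSTM (line data) path, 0 for the legacy path
#   $2: "y" to include shape clustering on the legacy path (default "n")
plan_phases() {
  local linedata="$1"
  local shape_clustering="${2:-n}"

  if (( linedata )); then
    # LSTM path: features are extracted with lstm.train into .lstmf files.
    echo 'phase_E_extract_features lstm.train 8 lstmf'
    echo 'make__lstmdata'
  else
    # Legacy path: features are extracted with box.train into .tr files.
    echo 'phase_E_extract_features box.train 8 tr'
    echo 'phase_C_cluster_prototypes'
    if [[ "${shape_clustering}" == "y" ]]; then
      echo 'phase_S_cluster_shapes'
    fi
    echo 'phase_M_cluster_microfeatures'
    echo 'phase_B_generate_ambiguities'
    echo 'make__traineddata'
  fi
}

echo '--- LSTM path ---'
plan_phases 1
echo '--- legacy path with shape clustering ---'
plan_phases 0 y
```

The LSTM path stops after writing the lstmf data, which matches the point made above that further steps are still needed to get the final 4.0 traineddata.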