Sorry, I gave the wrong commands: those were for Arabic. I was actually referring to English.

tesseract eng.arial.exp4.tif eng.arial.exp4 nobatch box.train
unicharset_extractor eng.arial.exp4.box
echo "arial 0 0 1 0 0" > font_properties # tell Tesseract information about the font
mftraining -F font_properties -U unicharset -O eng.unicharset eng.arial.exp4.tr
shapeclustering -F unicharset eng.arial.exp4.tr
cntraining eng.arial.exp4.tr
mv inttemp eng.inttemp
mv normproto eng.normproto
mv pffmtable eng.pffmtable
mv shapetable eng.shapetable
combine_tessdata eng.

Please suggest what changes these commands need for Tesseract 4.0; they are written for Tesseract 3.0. I plan to publish a rigorous, explanatory tutorial once I have an overview of all the steps.

Thank you.

On Wednesday, April 12, 2017 at 4:04:42 PM UTC+5:30, shree wrote:
>
> Arabic was never trained with the legacy tesseract engine and I doubt you
> will get any improvement over existing traineddata using cube or lstm.
>
> You are free to experiment and see what you come up with.
>
> I have pointed to the bash scripts for training. Please refer to them for
> the correct process.
>
> - excuse the brevity, sent from mobile
>
> On 12-Apr-2017 4:00 PM, <[email protected]> wrote:
>
>> Hello shree, Thank you for your valuable reply.. Are there any changes i
>> need to follow for the steps below.. I request you to suggest the changes
>> for the below commands, these are for tess 3.0
>>
>> tesseract ara.arial.exp4.tif ara.arial.exp4 nobatch box.train
>> unicharset_extractor ara.arial.exp4.box
>> echo "arial 0 0 1 0 0" > font_properties # tell Tesseract informations about the font
>> mftraining -F font_properties -U unicharset -O ara.unicharset ara.arial.exp4.tr
>> shapeclustering -F unicharset ara.arial.exp4.tr
>> cntraining ara.arial.exp4.tr
>>
>> mv inttemp ara.inttemp
>> mv normproto ara.normproto
>> mv pffmtable ara.pffmtable
>> mv shapetable ara.shapetable
>> combine_tessdata ara.
>>
>> Please suggest changes for the above steps. I plan to publish a rigorous
>> explanative tutorial after getting overview of all the steps.
>> Thank you.
>>
>> On Wednesday, April 12, 2017 at 3:38:11 PM UTC+5:30, shree wrote:
>>>
>>> see
>>> https://github.com/tesseract-ocr/tesseract/blob/master/training/tesstrain.sh
>>>
>>> if ((LINEDATA)); then
>>>   phase_E_extract_features "lstm.train" 8 "lstmf"
>>>   make__lstmdata
>>> else
>>>   phase_E_extract_features "box.train" 8 "tr"
>>>   phase_C_cluster_prototypes "${TRAINING_DIR}/${LANG_CODE}.normproto"
>>>   if [[ "${ENABLE_SHAPE_CLUSTERING}" == "y" ]]; then
>>>     phase_S_cluster_shapes
>>>   fi
>>>   phase_M_cluster_microfeatures
>>>   phase_B_generate_ambiguities
>>>   make__traineddata
>>> fi
>>>
>>> --------------------
>>>
>>> lstm.train is for LSTM training
>>>
>>> box.train is for 3.0 Tesseract legacy engine training
>>>
>>> Please note that current master code is for alpha testing for 4.0 LSTM
>>> and will most probably drop support for legacy engine.
>>>
>>> If you want the legacy tesseract engine and train for it, please use the
>>> 3.05 branch of the github repo.
>>>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. To post to this group, send email to [email protected]. Visit this group at https://groups.google.com/group/tesseract-ocr. To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e4a2c775-6e31-4a48-9e37-f981f862d37f%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
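
For anyone reading this thread later, here is a minimal sketch of the branch shree quotes from tesstrain.sh: the LSTM path extracts features with the "lstm.train" config into "lstmf" files, while the legacy path uses "box.train" and produces "tr" files. The function name `feature_config` and the engine labels `lstm` and `legacy` are made up for illustration and are not part of tesstrain.sh; the script only prints names and does not call tesseract.

```shell
#!/bin/sh
# Hypothetical helper, not part of tesstrain.sh: map an engine label to the
# tesseract config name and feature-file extension quoted in the thread.
feature_config() {
    case "$1" in
        lstm)   echo "lstm.train lstmf" ;;   # LSTM (4.0) training
        legacy) echo "box.train tr" ;;       # legacy (3.0x) engine training
        *)      echo "unknown engine: $1" >&2; return 1 ;;
    esac
}

# Print the config and feature file each mode implies for one training page.
base="eng.arial.exp4"
for engine in lstm legacy; do
    # Split the "config extension" pair into $1 and $2.
    set -- $(feature_config "$engine")
    echo "$engine: config=$1 features=$base.$2"
done
```

The clustering steps in the original commands (mftraining, shapeclustering, cntraining) belong only to the legacy branch, which is why they have no counterpart on the LSTM side of the quoted `if`.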

