>> I installed Tesseract 4.00alpha on Linux.

How did you install it?
Did you use the latest code from github?

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jun 25, 2017 at 8:18 PM, Clement <[email protected]> wrote:

> Thanks for your reply. I have another question related to the oem option
> you mentioned. Is it for the training command (tesstrain.sh) or the
> recognition command (tesseract)?
>
> I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an
> image, I got the old format (3.x version) that's without the extra spaces
> but the recognition quality was poor. I've no other version of Tesseract
> installed on the same box.
>
> I tried to specify the "--oem 1" option but it didn't work:
> $ tesseract 001a3.png 001a3 -l chi_sim --oem 1
> read_params_file: Can't open 1
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
>
>
> On Wednesday, June 21, 2017 at 11:40:22 PM UTC-7, Clement wrote:
>>
>> I am new to Tesseract-OCR and need help in training the engine to
>> recognize Simplified Chinese texts.
>>
>> I just installed Tesseract 4.00Alpha on Windows 10:
>>
>> $ tesseract --version
>> tesseract 4.00.00alpha
>> leptonica-1.74.1
>> libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>
>> I have 3 images containing a Simplified Chinese sentence of different
>> sizes:
>>
>> chi_sim.Microsoft_Yahei.exp1.tif (small)
>> chi_sim.Microsoft_Yahei.exp2.tif (medium)
>> chi_sim.Microsoft_Yahei.exp3.tif (large)
>>
>> I ran Tesseract to recognize the texts in the images using the commands
>> below:
>>
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3a
>>
>> Tesseract was able to recognize the texts in the large image perfectly.
>> It missed the last "period" symbol in the medium image, and failed to
>> recognize a number of characters in the small image.
>>
>> I'd like to train Tesseract to be able to recognize
>> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
>> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box
>> and chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>>
>> The Windows version of Tesseract 4.0 I installed didn't come with
>> tesstrain.sh. I downloaded the source and was able to extract the training
>> commands. The documentation mentioned about LSTM but I couldn't find any
>> LSTM call within the tesstrain.sh script. Anyway, I ran the extracted
>> commands as below ($TESS_LANG is the path of the langdata folder.):
>>
>> = Phase I: Generating training images =
>> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box chi_sim.Microsoft_Yahei.exp2.box
>>
>> = Phase UP: Generating unicharset and unichar properties files =
>> $ set_unicharset_properties -U ./chi_sim/unicharset -O ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights --script_dir=$TESS_LANG
>>
>> = Phase D: Generating Dawg files =
>> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>>
>> = Phase E: Extracting features =
>>
>> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
>> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
>>
>> = Phase C: Clustering feature prototypes (cnTraining) =
>> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Phase M : Clustering microfeatures (mfTraining) =
>> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Making final traineddata file =
>> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>>
>> Add "chi_sim." to files "inttemp", "normproto", "pffmtable", and
>> "shapetable"
>>
>> $ combine_tessdata ./chi_sim/chi_sim.
>>
>> $ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>>
>> ===================================
>>
>> I reran Tesseract on the 3 images using the commands below:
>>
>> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1b
>>
>> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2b
>>
>> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3b
>>
>> The large image still produces perfect result. The medium image gives the
>> same result as before missing a "period" symbol. The small image actually
>> returns worse result detecting wrong number of words from the image.
>>
>> I am attaching a zip files containing the images, the box files, and the
>> results (.txt) returned from the initial runs and the runs after the
>> training.
>>
>> Are my training steps incorrect? What can I do to improve the quality of
>> the OCR engine? Any suggestion will be much appreciated!
>>
>>
>>
>> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/14d6afaa-f220-4b03-b12f-330f1c98501a%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
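
For the install questions at the top of this message, a small shell helper along these lines could collect the relevant details in one go. This is only a sketch: the function name report_tesseract_install is made up for illustration, it assumes a POSIX shell on the Linux box, and it uses nothing beyond the standard --version and --list-langs options. It does not tell you whether the build came from the latest GitHub code, only which binary, version, and language data are actually being picked up.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of Tesseract):
# prints which tesseract binary is on the PATH, its version string,
# the TESSDATA_PREFIX in effect, and the languages it can see.
report_tesseract_install() {
    # Bail out early if no tesseract binary can be found at all.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH"
        return 1
    fi

    echo "binary: $(command -v tesseract)"

    # Some builds print the version to stderr, so merge both streams.
    tesseract --version 2>&1

    # The traineddata directory is resolved relative to this variable
    # when it is set; show it so a stale tessdata folder is easy to spot.
    echo "TESSDATA_PREFIX: ${TESSDATA_PREFIX:-<unset>}"

    # Lists the traineddata files (e.g. chi_sim) this binary can load.
    tesseract --list-langs 2>&1
}
```

Calling report_tesseract_install and pasting its output into a reply would show whether the binary being run is the 4.00alpha build and which chi_sim traineddata it is loading.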

