Thanks for your reply. I have another question related to the oem option you mentioned. Is it for the training command (tesstrain.sh) or the recognition command (tesseract)?
I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an image, I got output in the old (3.x) format, that is, without the extra spaces, but the recognition quality was poor. I have no other version of Tesseract installed on the same box. I tried to specify the "--oem 1" option but it didn't work:

$ tesseract 001a3.png 001a3 -l chi_sim --oem 1
read_params_file: Can't open 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica

On Wednesday, June 21, 2017 at 11:40:22 PM UTC-7, Clement wrote:
>
> I am new to Tesseract-OCR and need help in training the engine to
> recognize Simplified Chinese texts.
>
> I just installed Tesseract 4.00Alpha on Windows 10:
>
> $ tesseract --version
> tesseract 4.00.00alpha
> leptonica-1.74.1
> libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> I have 3 images of different sizes, each containing a Simplified Chinese
> sentence:
>
> chi_sim.Microsoft_Yahei.exp1.tif (small)
> chi_sim.Microsoft_Yahei.exp2.tif (medium)
> chi_sim.Microsoft_Yahei.exp3.tif (large)
>
> I ran Tesseract to recognize the texts in the images using the commands
> below:
>
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3a
>
> Tesseract was able to recognize the texts in the large image perfectly. It
> missed the last "period" symbol in the medium image, and failed to
> recognize a number of characters in the small image.
>
> I'd like to train Tesseract to be able to recognize
> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and
> chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>
> The Windows version of Tesseract 4.0 I installed didn't come with
> tesstrain.sh. I downloaded the source and was able to extract the training
> commands. The documentation mentions LSTM, but I couldn't find any
> LSTM call within the tesstrain.sh script. Anyway, I ran the extracted
> commands as below ($TESS_LANG is the path of the langdata folder):
>
> = Phase I: Generating training images =
> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box chi_sim.Microsoft_Yahei.exp2.box
>
> = Phase UP: Generating unicharset and unichar properties files =
> $ set_unicharset_properties -U ./chi_sim/unicharset -O ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights --script_dir=$TESS_LANG
>
> = Phase D: Generating Dawg files =
> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>
> = Phase E: Extracting features =
> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
>
> = Phase C: Clustering feature prototypes (cnTraining) =
> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>
> = Phase M: Clustering microfeatures (mfTraining) =
> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>
> = Making final traineddata file =
> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>
> Add the prefix "chi_sim." to the files "inttemp", "normproto", "pffmtable", and
> "shapetable"
>
> $ combine_tessdata ./chi_sim/chi_sim.
>
> $ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>
> ===================================
>
> I reran Tesseract on the 3 images using the commands below:
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1b
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2b
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3b
>
> The large image still produces a perfect result. The medium image gives the
> same result as before, missing a "period" symbol. The small image actually
> returns a worse result, detecting the wrong number of words in the image.
>
> I am attaching a zip file containing the images, the box files, and the
> results (.txt) returned from the initial runs and the runs after the
> training.
>
> Are my training steps incorrect? What can I do to improve the quality of
> the OCR engine? Any suggestion will be much appreciated!
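
The before-training and after-training recognition runs quoted above differ only in the -l value and the output suffix, so they could be produced by one small helper. This is only a minimal sketch, assuming a POSIX shell and the three image names from my message; recognize_all and the RUN variable are hypothetical names I made up for illustration and are not part of Tesseract. By default it only prints the commands; with RUN=1 it executes them.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): print, or with RUN=1 execute,
# the recognition command for each of the three sample images.
recognize_all() {
  lang="$1"      # language argument, e.g. chi_sim or chi_sim_1+chi_sim
  suffix="$2"    # output suffix, e.g. a (before training) or b (after training)
  for n in 1 2 3; do
    base="chi_sim.Microsoft_Yahei.exp$n"
    if [ "${RUN:-0}" = "1" ]; then
      # Really run the OCR; the text output goes to "$base$suffix.txt".
      tesseract -l "$lang" "$base.tif" "$base$suffix"
    else
      # Dry run: only show the command that would be executed.
      echo "tesseract -l $lang $base.tif $base$suffix"
    fi
  done
}
```

For example, "recognize_all chi_sim a" lists the three initial runs, and "RUN=1 recognize_all chi_sim_1+chi_sim b" would repeat the runs after training.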

