[tesseract-ocr] Re: Need help training Simplified Chinese.

Clement Sun, 25 Jun 2017 07:48:39 -0700

Thanks for your reply. I have another question related to the oem option 
you mentioned. Is it for the training command (tesstrain.sh) or the 
recognition command (tesseract)?


I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an image, 
the output was in the old (3.x) format, without the extra spaces, and the 
recognition quality was poor. No other version of Tesseract is installed 
on that machine.

I tried to specify the "--oem 1" option but it didn't work:
$ tesseract 001a3.png 001a3 -l chi_sim --oem 1
read_params_file: Can't open 1
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
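
The error suggests the binary tried to open "1" as a config file, which 
would happen if this build does not parse "--oem". A rough way to check is 
sketched below. It assumes that a build supporting the option lists it in 
its usage text, which I have not confirmed for 4.00alpha; supports_oem is a 
hypothetical helper name, not part of Tesseract.

```shell
# Hypothetical helper: succeeds if the given binary mentions --oem in its
# usage text. Assumption: builds that support the flag list it under --help.
supports_oem() {
  bin="${1:-tesseract}"   # command name or path; defaults to "tesseract"
  "$bin" --help 2>&1 | grep -q -- '--oem'
}

if supports_oem tesseract; then
  echo "--oem is listed in the usage text"
else
  echo "--oem is not listed (or tesseract is not on PATH)"
fi
```

If the option is not listed, the installed binary is probably older than 
the build the option was mentioned for.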
 

On Wednesday, June 21, 2017 at 11:40:22 PM UTC-7, Clement wrote:
>
> I am new to Tesseract-OCR and need help in training the engine to 
> recognize Simplified Chinese texts.
>
> I just installed Tesseract 4.00alpha on Windows 10:
>
> $ tesseract --version
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 : 
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> I have 3 images of different sizes, each containing a Simplified Chinese 
> sentence:
>
> chi_sim.Microsoft_Yahei.exp1.tif (small)
> chi_sim.Microsoft_Yahei.exp2.tif (medium)
> chi_sim.Microsoft_Yahei.exp3.tif (large)
>
> I ran Tesseract to recognize the texts in the images using the commands 
> below:
>
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3a
>
> Tesseract was able to recognize the texts in the large image perfectly. It 
> missed the last "period" symbol in the medium image, and failed to 
> recognize a number of characters in the small image.
>
> I'd like to train Tesseract to be able to recognize 
> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I 
> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and 
> chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>
> The Windows version of Tesseract 4.0 I installed didn't come with 
> tesstrain.sh. I downloaded the source and extracted the training 
> commands from the script. The documentation mentions LSTM, but I couldn't 
> find any LSTM call in tesstrain.sh. Anyway, I ran the extracted 
> commands as below ($TESS_LANG is the path of the langdata folder):
>
> = Phase I: Generating training images =
> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box chi_sim.Microsoft_Yahei.exp2.box
>
> = Phase UP: Generating unicharset and unichar properties files =
> $ set_unicharset_properties -U ./chi_sim/unicharset -O ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights --script_dir=$TESS_LANG
>
> = Phase D: Generating Dawg files =
> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>
> = Phase E: Extracting features =
>
> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
>
> = Phase C: Clustering feature prototypes (cnTraining) =
> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>
> = Phase M: Clustering microfeatures (mfTraining) =
> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr chi_sim.Microsoft_Yahei.exp2.tr
>
> = Making final traineddata file =
> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>
> Add the prefix "chi_sim." to the names of the files "inttemp", 
> "normproto", "pffmtable", and "shapetable".
>
> $ combine_tessdata ./chi_sim/chi_sim.
>
> $ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>
> ===================================
>
> I reran Tesseract on the 3 images using the commands below:
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1b
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2b
>
> $ tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif chi_sim.Microsoft_Yahei.exp3b
>
> The large image still produces a perfect result. The medium image gives the 
> same result as before, still missing the "period" symbol. The small image 
> actually returns a worse result, detecting the wrong number of words.
>
> I am attaching a zip file containing the images, the box files, and the 
> results (.txt) returned from the initial runs and from the runs after 
> training.
>
> Are my training steps incorrect? What can I do to improve the quality of 
> the OCR results? Any suggestions would be much appreciated!
>
>
>
>
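
The renaming step in the quoted message (adding "chi_sim." to "inttemp", 
"normproto", "pffmtable", and "shapetable") can be sketched as a small 
loop. This assumes the four files were written to ./chi_sim, the -D 
directory used in the cntraining and mftraining commands above; 
add_lang_prefix is a hypothetical helper name.

```shell
# Sketch of the renaming step before combine_tessdata.
# Assumption: the four files are in the directory passed as $1.
add_lang_prefix() {
  dir="$1"    # output directory, e.g. ./chi_sim
  lang="$2"   # language code used as the prefix, e.g. chi_sim
  for f in inttemp normproto pffmtable shapetable; do
    # Skip files that are missing or were already renamed.
    if [ -f "$dir/$f" ]; then
      mv "$dir/$f" "$dir/$lang.$f"
    fi
  done
}

add_lang_prefix ./chi_sim chi_sim
```

Files that are absent are skipped, so running the loop a second time is 
harmless.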
