Re: [tesseract-ocr] Need help training Simplified Chinese.

ShreeDevi Kumar Thu, 22 Jun 2017 01:27:49 -0700

Your best bet for improving recognition is to upscale the small and
medium images before running OCR on them.
Please see https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
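
As a rough sketch of that preprocessing step, the helpers below work out a
resize percentage and hand it to ImageMagick. ImageMagick is my assumption
here (any image tool will do), the function names are made up for
illustration, and the 96 dpi figure in the example is a guess, since I do not
know the real resolution of your images. The 300 dpi target follows the
wiki's general advice. On Windows you may need to set IM_CONVERT=magick,
because `convert` can resolve to an unrelated system tool.

```shell
# ImageMagick command to use; an assumption, override with IM_CONVERT=magick.
IM_CONVERT=${IM_CONVERT:-convert}

# Hypothetical helper: percentage needed to bring an image from its current
# resolution up to a target resolution, rounded to the nearest integer.
scale_percent() {
    current_dpi=$1
    target_dpi=$2
    echo $(( (target_dpi * 100 + current_dpi / 2) / current_dpi ))
}

# Hypothetical helper: upscale one image by the given percentage.
upscale_for_ocr() {
    input=$1
    output=$2
    percent=$3
    "$IM_CONVERT" "$input" -resize "${percent}%" "$output"
}

# Example (not run here); 96 dpi is an assumed value for the small image:
#   upscale_for_ocr chi_sim.Microsoft_Yahei.exp1.tif exp1_big.tif "$(scale_percent 96 300)"
#   tesseract exp1_big.tif exp1_big -l chi_sim
```

Since your large image already recognizes perfectly, its pixel size is
probably a better guide for the target than any fixed dpi number.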


Tesseract 4.00.00alpha currently has two different OCR engines in it. The
legacy Tesseract engine is selected with --oem 0 and the new LSTM engine
with --oem 1.
The option --oem 2 uses both together, and --oem 3 uses whichever engine
has been defined as the default.
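
To see how the two engines differ on your images, something like the sketch
below runs the same file through --oem 0 and --oem 1 and writes separate
output files. The function name and the .oem0/.oem1 output naming are my own
invention, and it assumes your installed chi_sim.traineddata contains both
the legacy and the LSTM components.

```shell
# Command to run; override (e.g. TESSERACT=echo) to print the commands
# instead of executing them.
TESSERACT=${TESSERACT:-tesseract}

# Hypothetical helper: run one image through the legacy (--oem 0) and the
# LSTM (--oem 1) engine, writing <base>.oem0.txt and <base>.oem1.txt.
compare_engines() {
    image=$1
    lang=$2
    base=${image%.*}          # strip the final extension, e.g. ".tif"
    for oem in 0 1; do
        "$TESSERACT" "$image" "${base}.oem${oem}" -l "$lang" --oem "$oem"
    done
}

# Example (not run here):
#   compare_engines chi_sim.Microsoft_Yahei.exp1.tif chi_sim
```

Comparing the two .txt files side by side should make it clear which engine
produced the results you saw.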

The training process that you followed builds a new model for the legacy
engine, not LSTM.

If you look at the output of your first test, you will see spaces after
each character in the OCRed text; this has been reported as an issue with
the LSTM model. The legacy model does not add the extra spaces, but its
accuracy is lower.

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Jun 22, 2017 at 11:15 AM, Clement <[email protected]> wrote:

> I am new to Tesseract-OCR and need help in training the engine to
> recognize Simplified Chinese texts.
>
> I just installed Tesseract 4.00Alpha on Windows 10:
>
> $ tesseract --version
> tesseract 4.00.00alpha
>  leptonica-1.74.1
>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>
> I have 3 images containing a Simplified Chinese sentence of different
> sizes:
>
> chi_sim.Microsoft_Yahei.exp1.tif (small)
> chi_sim.Microsoft_Yahei.exp2.tif (medium)
> chi_sim.Microsoft_Yahei.exp3.tif (large)
>
> I ran Tesseract to recognize the texts in the images using the commands
> below:
>
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif
> chi_sim.Microsoft_Yahei.exp1a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif
> chi_sim.Microsoft_Yahei.exp2a
> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif
> chi_sim.Microsoft_Yahei.exp3a
>
> Tesseract was able to recognize the texts in the large image perfectly. It
> missed the last "period" symbol in the medium image, and failed to
> recognize a number of characters in the small image.
>
> I'd like to train Tesseract to be able to recognize
> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box and
> chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>
> The Windows version of Tesseract 4.0 I installed didn't come with
> tesstrain.sh. I downloaded the source and was able to extract the training
> commands. The documentation mentions LSTM, but I couldn't find any
> LSTM call within the tesstrain.sh script. Anyway, I ran the extracted
> commands as below ($TESS_LANG is the path of the langdata folder.):
>
> = Phase I: Generating training images =
> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box
> chi_sim.Microsoft_Yahei.exp2.box
>
> = Phase UP: Generating unicharset and unichar properties files =
> $ set_unicharset_properties -U ./chi_sim/unicharset -O
> ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights
> --script_dir=$TESS_LANG
>
> = Phase D: Generating Dawg files =
> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist
> ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>
> = Phase E: Extracting features =
>
> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif chi_sim.Microsoft_Yahei.exp2
> box.train $TESS_LANG/chi_sim/chi_sim.config
> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif chi_sim.Microsoft_Yahei.exp1
> box.train $TESS_LANG/chi_sim/chi_sim.config
>
> = Phase C: Clustering feature prototypes (cnTraining) =
> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr
> chi_sim.Microsoft_Yahei.exp2.tr
>
> = Phase M : Clustering microfeatures (mfTraining) =
> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O
> ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X
> ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr
> chi_sim.Microsoft_Yahei.exp2.tr
>
> = Making final traineddata file =
> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>
> Add the prefix "chi_sim." to the files "inttemp", "normproto",
> "pffmtable", and "shapetable"
>
> $ combine_tessdata ./chi_sim/chi_sim.
>
> $ cp ./chi_sim/chi_sim.traineddata $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>
> ===================================
>
> I reran Tesseract on the 3 images using the commands below:
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif
> chi_sim.Microsoft_Yahei.exp1b
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif
> chi_sim.Microsoft_Yahei.exp2b
>
> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif
> chi_sim.Microsoft_Yahei.exp3b
>
> The large image still produces a perfect result. The medium image gives the
> same result as before, missing a "period" symbol. The small image actually
> returns a worse result, detecting the wrong number of words in the image.
>
> I am attaching a zip file containing the images, the box files, and the
> results (.txt) returned from the initial runs and the runs after the
> training.
>
> Are my training steps incorrect? What can I do to improve the quality of
> the OCR engine? Any suggestion will be much appreciated!
>
>
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/94caecfe-698d-4724-bf28-a46579d1e21f%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/94caecfe-698d-4724-bf28-a46579d1e21f%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>

