Re: [tesseract-ocr] Re: Need help training Simplified Chinese.

ShreeDevi Kumar Sun, 25 Jun 2017 07:53:42 -0700

>> I installed Tesseract 4.00alpha on Linux.

How did you install it? Did you use the latest code from GitHub?
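
To help answer that, here is a rough sketch for reporting which binary is
actually being picked up. describe_install is a hypothetical helper name of my
own, not something that ships with Tesseract; it only relies on "command -v"
and on the --version option already shown further down in this thread.

```shell
# Hypothetical helper (not part of Tesseract): report which binary is found
# on PATH and what version string it prints.
describe_install() {
    bin=${1:-tesseract}
    if ! path=$(command -v "$bin"); then
        echo "$bin: not found on PATH"
        return 1
    fi
    echo "binary: $path"
    # Some builds print version information on stderr, so capture both streams.
    "$bin" --version 2>&1
}
```

Calling describe_install with no argument checks "tesseract". A path under
/usr/local usually suggests a source build, and one under /usr/bin usually
suggests a distribution package, but that is a convention and not a guarantee.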

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Jun 25, 2017 at 8:18 PM, Clement <[email protected]> wrote:

> Thanks for your reply. I have another question related to the oem option
> you mentioned. Is it for the training command (tesstrain.sh) or the
> recognition command (tesseract)?
>
> I installed Tesseract 4.00alpha on Linux. When I ran tesseract on an
> image, I got output in the old (3.x) format, without the extra spaces,
> but the recognition quality was poor. No other version of Tesseract is
> installed on the same box.
>
> I tried to specify the "--oem 1" option but it didn't work:
> $ tesseract 001a3.png 001a3 -l chi_sim --oem 1
> read_params_file: Can't open 1
> Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
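
One possible explanation, not confirmed in this thread, is that this build does
not recognize the "--oem" spelling and therefore treats "1" as a config file
name. Below is a small sketch for checking which spelling a given binary
documents. detect_oem_flag is a hypothetical helper, and it assumes the
binary's help output lists the engine-mode option at all.

```shell
# Hypothetical helper: read help text on stdin and report which spelling of
# the engine-mode option it mentions ("--oem", "-oem", or "none").
detect_oem_flag() {
    help_text=$(cat)
    if printf '%s\n' "$help_text" | grep -qF -e '--oem'; then
        echo "--oem"
    elif printf '%s\n' "$help_text" | grep -qF -e '-oem'; then
        echo "-oem"
    else
        echo "none"
    fi
}
```

It could be fed with something like "tesseract --help 2>&1 | detect_oem_flag".
If it reports "none", the help text simply does not mention the option, which
by itself does not prove the engine mode is unsupported.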
>
>
> On Wednesday, June 21, 2017 at 11:40:22 PM UTC-7, Clement wrote:
>>
>> I am new to Tesseract-OCR and need help in training the engine to
>> recognize Simplified Chinese texts.
>>
>> I just installed Tesseract 4.00Alpha on Windows 10:
>>
>> $ tesseract --version
>> tesseract 4.00.00alpha
>>  leptonica-1.74.1
>>   libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20 :
>> libtiff 4.0.6 : zlib 1.2.8 : libwebp 0.4.3 : libopenjp2 2.1.0
>>
>> I have 3 images of the same Simplified Chinese sentence at different
>> sizes:
>>
>> chi_sim.Microsoft_Yahei.exp1.tif (small)
>> chi_sim.Microsoft_Yahei.exp2.tif (medium)
>> chi_sim.Microsoft_Yahei.exp3.tif (large)
>>
>> I ran Tesseract to recognize the texts in the images using the commands
>> below:
>>
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2a
>> $ tesseract -l chi_sim chi_sim.Microsoft_Yahei.exp3.tif
>> chi_sim.Microsoft_Yahei.exp3a
>>
>> Tesseract was able to recognize the texts in the large image perfectly.
>> It missed the last "period" symbol in the medium image, and failed to
>> recognize a number of characters in the small image.
>>
>> I'd like to train Tesseract to be able to recognize
>> chi_sim.Microsoft_Yahei.exp1.tif and chi_sim.Microsoft_Yahei.exp2.tif. I
>> created box files for both images as chi_sim.Microsoft_Yahei.exp1.box
>> and chi_sim.Microsoft_Yahei.exp2.box using jTessBoxEditor.
>>
>> The Windows version of Tesseract 4.0 I installed didn't come with
>> tesstrain.sh. I downloaded the source and extracted the training
>> commands from it. The documentation mentions LSTM, but I couldn't find any
>> LSTM call in the tesstrain.sh script. Anyway, I ran the extracted
>> commands as below ($TESS_LANG is the path of the langdata folder):
>>
>> = Phase I: Generating training images =
>> $ unicharset_extractor -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.box
>> chi_sim.Microsoft_Yahei.exp2.box
>>
>> = Phase UP: Generating unicharset and unichar properties files =
>> $ set_unicharset_properties -U ./chi_sim/unicharset -O
>> ./chi_sim/chi_sim.unicharset -X ./chi_sim/chi_sim.xheights
>> --script_dir=$TESS_LANG
>>
>> = Phase D: Generating Dawg files =
>> $ wordlist2dawg -r 1 $TESS_LANG/chi_sim/chi_sim.wordlist
>> ./chi_sim/chi_sim.word-dawg ./chi_sim/chi_sim.unicharset
>>
>> = Phase E: Extracting features =
>>
>> $ tesseract chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2 box.train $TESS_LANG/chi_sim/chi_sim.config
>> $ tesseract chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1 box.train $TESS_LANG/chi_sim/chi_sim.config
>>
>> = Phase C: Clustering feature prototypes (cnTraining) =
>> $ cntraining -D ./chi_sim chi_sim.Microsoft_Yahei.exp1.tr
>> chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Phase M : Clustering microfeatures (mfTraining) =
>> $ mftraining -D ./chi_sim/ -U ./chi_sim/chi_sim.unicharset -O
>> ./chi_sim/chi_sim.mfunicharset -F $TESS_LANG/font_properties -X
>> ./chi_sim/chi_sim.xheights chi_sim.Microsoft_Yahei.exp1.tr
>> chi_sim.Microsoft_Yahei.exp2.tr
>>
>> = Making final traineddata file =
>> $ cp $TESS_LANG/chi_sim/chi_sim.config ./chi_sim/.
>>
>> Add the "chi_sim." prefix to the files "inttemp", "normproto",
>> "pffmtable", and "shapetable"
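
The renaming step above can be scripted. This is only a sketch:
add_lang_prefix is a hypothetical helper name, and it assumes the four files
were written to the directory passed as the first argument (./chi_sim in the
commands quoted here).

```shell
# Hypothetical helper: prefix the classifier files in a directory with
# "<lang>." so that combine_tessdata can find them.
add_lang_prefix() {
    dir=$1
    lang=$2
    for name in inttemp normproto pffmtable shapetable; do
        # Skip files that were not produced, so the loop never fails midway.
        if [ -f "$dir/$name" ]; then
            mv "$dir/$name" "$dir/$lang.$name"
        fi
    done
}
```

For the layout in this message the call would be
"add_lang_prefix ./chi_sim chi_sim", run before combine_tessdata.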
>>
>> $ combine_tessdata ./chi_sim/chi_sim.
>>
>> $ cp ./chi_sim/chi_sim.traineddata
>> $TESSDATA_PREFIX/tessdata/chi_sim_1.traineddata
>>
>> ===================================
>>
>> I reran Tesseract on the 3 images using the commands below:
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp1.tif
>> chi_sim.Microsoft_Yahei.exp1b
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp2.tif
>> chi_sim.Microsoft_Yahei.exp2b
>>
>> $  tesseract -l chi_sim_1+chi_sim chi_sim.Microsoft_Yahei.exp3.tif
>> chi_sim.Microsoft_Yahei.exp3b
>>
>> The large image still produces a perfect result. The medium image gives the
>> same result as before, missing the "period" symbol. The small image actually
>> returns a worse result, detecting the wrong number of words in the image.
>>
>> I am attaching a zip file containing the images, the box files, and the
>> results (.txt) returned from the initial runs and the runs after the
>> training.
>>
>> Are my training steps incorrect? What can I do to improve the quality of
>> the OCR engine? Any suggestion will be much appreciated!
>>
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit https://groups.google.com/d/
> msgid/tesseract-ocr/14d6afaa-f220-4b03-b12f-330f1c98501a%
> 40googlegroups.com
> <https://groups.google.com/d/msgid/tesseract-ocr/14d6afaa-f220-4b03-b12f-330f1c98501a%40googlegroups.com?utm_medium=email&utm_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
