Re: [tesseract-ocr] Incorrect segmentation of Chinese characters even after training a new model

wei ren Mon, 25 Sep 2017 12:31:59 -0700

Thank you for the suggestion. I will give tesseract 4.0 a try. I hear that 
tesseract 4.0 uses an LSTM neural network, so its accuracy should be much 
better, especially for Chinese, but it may be much slower. Is that true?


By the way, I have also tried tweaking the parameters of tesseract 3.05, 
and have significantly improved the results with the following parameters:

assume_fixed_pitch_char_segment  1
textord_use_cjk_fp_model         1
textord_old_xheight              1
textord_min_xheight             60
textord_noise_hfract           0.1
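
For anyone who wants to try these, one way to apply them is to put them in a 
config file and pass its path as the last argument on the command line. This 
is only a minimal sketch: the file name cjk_title is arbitrary, and running 
with -l title on the merged image is an assumption for illustration, not 
necessarily the exact setup used here.

```shell
# Write the five parameters to a config file in the current directory.
# The file name "cjk_title" is a placeholder chosen for this example.
cat > cjk_title <<'EOF'
assume_fixed_pitch_char_segment  1
textord_use_cjk_fp_model         1
textord_old_xheight              1
textord_min_xheight             60
textord_noise_hfract           0.1
EOF

# Run OCR with the config file only if tesseract and the image are present.
# "-l title" assumes the custom traineddata from the quoted message is installed.
if command -v tesseract >/dev/null 2>&1 && [ -f yueyue.title.exp0.tif ]; then
    tesseract yueyue.title.exp0.tif stdout -l title ./cjk_title
fi
```

The values above are tuned for one particular pair of images, so they would 
likely need adjusting for other fonts and sizes.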



On Thursday, September 21, 2017 at 4:01:26 AM UTC-7, shree wrote:
>
> You will have much better results if you use the new version of tesseract 
> from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
> and the traineddata files from 
> https://github.com/tesseract-ocr/tessdata_best
>
> ShreeDevi
>
> On Thu, Sep 21, 2017 at 2:44 PM, wei ren <[email protected]> wrote:
>
>> I am new to OCR and tesseract. Please forgive me if I ask some "stupid" 
>> questions.
>>
>> I tried using tesseract 3.04.01 to recognize the Chinese characters in the 
>> attached two images and got absurd results, so I merged the two images into 
>> one and used the merged image yueyue.title.exp0.tif to train a new model. 
>> Below are the steps:
>>
>> 1. Create the box file.
>>
>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox
>>
>> 2. Correct the errors in the box file in jTessBoxEditor.
>>
>> I fixed the segmentation errors and assigned the correct Chinese character 
>> to each box.
>>
>> 3. Train the new model.
>>
>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
>> $ unicharset_extractor yueyue.title.exp0.box
>>
>> 4. Define a font_properties file with the following content.
>>
>> title 0 0 0 0 0
>>
>> 5. Clustering.
>>
>> $ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
>> $ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
>> $ cntraining yueyue.title.exp0.tr
>>
>> 6. Prefix all the files with "title.".
>>
>> $ mv unicharset title.unicharset 
>> $ mv inttemp title.inttemp
>> $ mv pffmtable title.pffmtable
>> $ mv shapetable title.shapetable
>> $ mv normproto title.normproto
>>
>> 7. Put all the files together.
>>
>> $ combine_tessdata title.
>>
>> 8. Copy the new model to the tesseract-ocr tessdata directory.
>>
>> $ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
>>
>> Then I run the following command to recognize the Chinese characters in 
>> the merged training image again.
>>
>> $ tesseract yueyue.title.exp0.tif stdout -l title
>>
>> The expected result for both pages is "老妇人和母鸡", but the actual result 
>> for the first page is "老 老老老妇 人老妇母老鸡老" and for the second page is 
>> "老老妇人和母老鸡". I generated a box file using the new model, which is also 
>> attached:
>>
>> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox
>>
>> Although tesseract only assigns characters from the new model to the 
>> segments, it still doesn't segment them correctly. As you can see, three 
>> characters are each split into two segments, even though I had merged 
>> those pairs into single boxes when I corrected the training box file.
>>
>>
>>
>> <https://lh3.googleusercontent.com/-r8UG3Svsbpo/WcN_98MjS7I/AAAAAAAAU8M/4ZMvHYfgOQ8OVp_fHIw__uZmTA6rFhyEgCLcBGAs/s1600/box2.png>
>>
>> <https://lh3.googleusercontent.com/-Wga1p7T579U/WcN_18CI2VI/AAAAAAAAU8I/Yvm9IB5zOGIXsbdcugPYTgiMRxbC02TTQCLcBGAs/s1600/box1.png>
>>
>> I have tried specifying the font as bold and/or fixed in font_properties, 
>> but it doesn't help. I have also tried various page segmentation modes, 
>> and that doesn't help either. 
>>
>>
>> I have also attached the trained tessdata so you can easily reproduce the 
>> problem. Any hint or suggestion would be highly appreciated.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> To post to this group, send email to [email protected].
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/18590868-ba1e-457d-8953-b002987d497d%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/4a702893-da3f-4b26-998e-aba4f04271cb%40googlegroups.com.
