Re: [tesseract-ocr] Incorrect segmentation of Chinese characters even after training a new model

ShreeDevi Kumar Thu, 21 Sep 2017 04:02:18 -0700

You will get much better results if you use the new version of Tesseract
(4.0, with the LSTM engine) from
https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr
together with the traineddata files from
https://github.com/tesseract-ocr/tessdata_best
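
As a rough sketch of what that could look like once Tesseract 4 is installed: the helper below is hypothetical, and it assumes chi_sim.traineddata from tessdata_best has already been saved into a local directory (./tessdata_best is only a placeholder name). The --tessdata-dir and -l options are standard tesseract options; whether chi_sim alone reads these particular titles correctly is not something this sketch can promise.

```shell
# ocr_with_best IMAGE [TESSDATA_DIR]
# Hypothetical helper: run tesseract on IMAGE with the chi_sim model found in
# TESSDATA_DIR (assumed to hold chi_sim.traineddata from tessdata_best).
ocr_with_best() {
    image="$1"
    tessdata_dir="${2:-./tessdata_best}"   # placeholder default directory

    # Refuse to run if the tesseract binary is not installed.
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH" >&2
        return 1
    fi
    # Refuse to run if the model file is not where we assumed it would be.
    if [ ! -f "$tessdata_dir/chi_sim.traineddata" ]; then
        echo "missing $tessdata_dir/chi_sim.traineddata" >&2
        return 1
    fi

    # Print the recognized text to stdout.
    tesseract "$image" stdout --tessdata-dir "$tessdata_dir" -l chi_sim
}
```

It would be called as: ocr_with_best yueyue.title.exp0.tif ./tessdata_best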


ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Sep 21, 2017 at 2:44 PM, wei ren <[email protected]> wrote:

> I am new to OCR and tesseract. Please forgive me if I ask some "stupid"
> questions.
>
> I tried using tesseract 3.04.01 to recognize the Chinese characters in the
> two attached images and got absurd results, so I merged the two images into
> one and used the merged image, yueyue.title.exp0.tif, to train a new model.
> The steps were:
>
> 1. Create the box file.
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox
>
> 2. Correct the errors in the box file in jTessBoxEditor.
>
> I fixed the segmentation errors and assigned the correct Chinese characters
> to the boxes.
>
> 3. Train the new model.
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
> $ unicharset_extractor yueyue.title.exp0.box
>
> 4. Define a font_properties file with the following content:
>
> title 0 0 0 0 0
>
> 5. Clustering.
>
> $ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
> $ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
> $ cntraining yueyue.title.exp0.tr
>
> 6. Prefix all the files with "title.".
>
> $ mv unicharset title.unicharset
> $ mv inttemp title.inttemp
> $ mv pffmtable title.pffmtable
> $ mv shapetable title.shapetable
> $ mv normproto title.normproto
>
> 7. Put all the files together.
>
> $ combine_tessdata title.
>
> 8. Copy the new model to the tesseract-ocr tessdata directory.
>
> $ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
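
The renaming in step 6 can also be written as a loop. This is only a convenience sketch: prefix_training_files is a made-up name, and the file list is simply the five files named in that step.

```shell
# prefix_training_files LANG [DIR]
# Hypothetical helper: rename the five training outputs listed in step 6
# from NAME to LANG.NAME inside DIR (default: current directory).
prefix_training_files() {
    lang="$1"
    dir="${2:-.}"
    for f in unicharset inttemp pffmtable shapetable normproto; do
        if [ -f "$dir/$f" ]; then
            mv "$dir/$f" "$dir/$lang.$f"
        else
            # Report, but do not abort, when an expected file is absent.
            echo "skipping missing file: $dir/$f" >&2
        fi
    done
}
```

Running prefix_training_files title in the training directory would have the same effect as the five mv commands above.
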
>
> Then I ran the following command to recognize the Chinese characters in
> the merged training image again.
>
> $ tesseract yueyue.title.exp0.tif stdout -l title
>
> The expected result for both pages is "老妇人和母鸡", but the actual result for
> the first page is "老 老老老妇 人老妇母老鸡老" and for the second page
> "老老妇人和母老鸡". I generated a box file with the new model (also attached):
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox
>
> Although tesseract assigns only characters from the new model to the
> segments, it still does not segment the text correctly. As you can see,
> three characters are each split into two segments, even though I merged
> each of those pairs into a single box when I corrected the training box file.
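
For reference, merging two boxes as described above amounts to taking the union of their rectangles. Each line of a Tesseract 3 box file has the form "symbol left bottom right top page". The sketch below is a hypothetical awk helper, and the coordinates in the demo file are invented for illustration, not taken from the attached box file. It only edits the training box file; it does not change how tesseract segments the image at recognition time.

```shell
# merge_boxes FILE N GLYPH
# Replace box lines N and N+1 of FILE with a single box labelled GLYPH whose
# rectangle is the union of the two. The result is written to stdout.
merge_boxes() {
    awk -v n="$2" -v glyph="$3" '
        NR == n {
            # Remember the first box; "+ 0" forces numeric values.
            l = $2 + 0; b = $3 + 0; r = $4 + 0; t = $5 + 0; page = $6
            next
        }
        NR == n + 1 {
            # Grow the remembered rectangle to cover the second box.
            if ($2 + 0 < l) l = $2 + 0
            if ($3 + 0 < b) b = $3 + 0
            if ($4 + 0 > r) r = $4 + 0
            if ($5 + 0 > t) t = $5 + 0
            print glyph, l, b, r, t, page
            next
        }
        { print }
    ' "$1"
}

# Demo with invented coordinates: lines 2 and 3 are two halves of one character.
printf '%s\n' '老 10 20 40 60 0' '女 45 20 60 60 0' \
              '彐 62 22 80 58 0' '人 85 20 110 60 0' > demo.box
merge_boxes demo.box 2 妇
# Prints:
#   老 10 20 40 60 0
#   妇 45 20 80 60 0
#   人 85 20 110 60 0
```
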
>
>
>
> <https://lh3.googleusercontent.com/-r8UG3Svsbpo/WcN_98MjS7I/AAAAAAAAU8M/4ZMvHYfgOQ8OVp_fHIw__uZmTA6rFhyEgCLcBGAs/s1600/box2.png>
>
> <https://lh3.googleusercontent.com/-Wga1p7T579U/WcN_18CI2VI/AAAAAAAAU8I/Yvm9IB5zOGIXsbdcugPYTgiMRxbC02TTQCLcBGAs/s1600/box1.png>
>
> I have tried specifying the font as bold and/or fixed in font_properties,
> but it doesn't help. I have also tried various page segmentation modes,
> and that doesn't help either.
>
>
> I have also attached the trained tessdata so that you can easily reproduce
> the problem. Any hints or suggestions would be highly appreciated.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/18590868-ba1e-457d-8953-b002987d497d%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduUrjUhbXyp_Cghyy%2BTeLu19xyPa48vi%3DEOSrhotYGfDVQ%40mail.gmail.com.