You will have much better results if you use the new version of tesseract from https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr and the traineddata files from https://github.com/tesseract-ocr/tessdata_best
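A minimal sketch of that suggestion, for Ubuntu. The download directory, the `MODEL_LANG` variable, and the function names are assumptions made for illustration, and the `main` branch in the download URL is also an assumption about the current repository layout. The PPA serves whatever version its maintainer currently publishes, so check `tesseract --version` after installing.

```shell
#!/bin/sh
# Sketch only: directory, variable, and function names are hypothetical.

TESSDATA_DIR="${TESSDATA_DIR:-$HOME/tessdata_best}"  # assumed download location
MODEL_LANG="${MODEL_LANG:-chi_sim}"                  # Simplified Chinese, as in the question

# Print the download URL of a traineddata file in the tessdata_best repository.
# Assumption: the repository's default branch is "main".
model_url() {
    printf 'https://github.com/tesseract-ocr/tessdata_best/raw/main/%s.traineddata\n' "$1"
}

# Install tesseract from the PPA mentioned above (needs root).
install_tesseract() {
    sudo add-apt-repository -y ppa:alex-p/tesseract-ocr &&
    sudo apt-get update &&
    sudo apt-get install -y tesseract-ocr
}

# Download the "best" model for MODEL_LANG into TESSDATA_DIR.
fetch_model() {
    mkdir -p "$TESSDATA_DIR" &&
    wget -O "$TESSDATA_DIR/$MODEL_LANG.traineddata" "$(model_url "$MODEL_LANG")"
}

# Recognize one image with the LSTM engine (--oem 1) and print the text.
recognize() {
    tesseract "$1" stdout --tessdata-dir "$TESSDATA_DIR" -l "$MODEL_LANG" --oem 1
}
```

With the functions loaded, something like `install_tesseract && fetch_model && recognize yueyue.title.exp0.tif` would run the whole sequence. For a single line of text such as a title, adding a page segmentation mode (for example `--psm 7`) may help, but that is a guess to try rather than a known fix.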

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Sep 21, 2017 at 2:44 PM, wei ren <[email protected]> wrote:

> I am new to OCR and tesseract. Please forgive me if I ask some "stupid"
> questions.
>
> I tried using tesseract 3.04.01 to recognize the Chinese characters in the
> two attached images and got absurd results, so I merged the two images into
> one and used the merged image yueyue.title.exp0.tif to train a new model.
> Below are the steps:
>
> 1. Create the box file.
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l chi_sim batch.nochop makebox
>
> 2. Correct the errors in the box file in jTessBoxEditor.
>
> I fixed the segmentation errors and assigned the correct Chinese character
> to each box.
>
> 3. Train the new model.
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 nobatch box.train
> $ unicharset_extractor yueyue.title.exp0.box
>
> 4. Define a font_properties file with the following content.
>
> title 0 0 0 0 0
>
> 5. Clustering.
>
> $ shapeclustering -F font_properties -U unicharset yueyue.title.exp0.tr
> $ mftraining -F font_properties -U unicharset -O unicharset yueyue.title.exp0.tr
> $ cntraining yueyue.title.exp0.tr
>
> 6. Prefix all the files with "title.".
>
> $ mv unicharset title.unicharset
> $ mv inttemp title.inttemp
> $ mv pffmtable title.pffmtable
> $ mv shapetable title.shapetable
> $ mv normproto title.normproto
>
> 7. Put all the files together.
>
> $ combine_tessdata title.
>
> 8. Copy the new model to the tesseract-ocr tessdata directory.
>
> $ sudo cp title.traineddata /usr/share/tesseract-ocr/tessdata/
>
> Then I ran the following command to recognize the Chinese characters in
> the merged training image again.
>
> $ tesseract yueyue.title.exp0.tif stdout -l title
>
> The expected result for both pages is "老妇人和母鸡", but the actual result
> for the first page is "老 老老老妇 人老妇母老鸡老" and the actual result for
> the second page is "老老妇人和母老鸡".
>
> I also generated a box file using the new model (attached as well):
>
> $ tesseract yueyue.title.exp0.tif yueyue.title.exp0 -l title batch.nochop makebox
>
> I found that although tesseract assigns only characters from the new model
> to the boxes, it does not get the segmentation right. As you can see, three
> characters are each split into two boxes, even though I had merged each of
> those pairs into a single box when I corrected the training box file.
>
> <https://lh3.googleusercontent.com/-r8UG3Svsbpo/WcN_98MjS7I/AAAAAAAAU8M/4ZMvHYfgOQ8OVp_fHIw__uZmTA6rFhyEgCLcBGAs/s1600/box2.png>
>
> <https://lh3.googleusercontent.com/-Wga1p7T579U/WcN_18CI2VI/AAAAAAAAU8I/Yvm9IB5zOGIXsbdcugPYTgiMRxbC02TTQCLcBGAs/s1600/box1.png>
>
> I have tried specifying the font as bold and/or fixed in font_properties,
> and it doesn't help. I have also tried various page segmentation modes,
> and that doesn't help either.
>
> I have also attached the trained tessdata so you can easily reproduce the
> problem. Any hint or suggestion will be highly appreciated.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/18590868-ba1e-457d-8953-b002987d497d%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

