Recently I modified tesstrain_utils.sh and set the --max_pages=3 option for the text2image command. Normal Japanese now seems to be recognized well, but accuracy on the half-width characters is still poor. Now I wonder how many characters I should add to jpn.training_text. The wiki section [Fine Tuning for ± a few characters] says about 20 repetitions of the ± should be enough, but I tried about 20 repetitions of every half-width character and it seemed to make no difference. At 30 repetitions the results got somewhat better, but still not good enough. Then I tried 150 repetitions and the results got worse.
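For reference, here is a minimal sketch of how such repeated samples could be generated and counted before appending them to jpn.training_text. It is only an illustration under assumptions: the character range (U+FF66 to U+FF9D, the half-width katakana letters), the shuffling into random pseudo-words, and all function names are my own choices, not something the wiki prescribes.

```python
import random
from collections import Counter

# Assumption: "half-width characters" means the half-width katakana letters
# U+FF66 (ｦ) through U+FF9D (ﾝ); the voicing marks U+FF9E/U+FF9F are left out.
HALFWIDTH_KATAKANA = [chr(cp) for cp in range(0xFF66, 0xFF9E)]


def make_lines(chars, repeats, words_per_line=8, min_len=2, max_len=6, seed=0):
    """Build text lines in which every character occurs exactly `repeats` times.

    The characters are shuffled and cut into random-length pseudo-words so
    that each one is seen next to varying neighbours (an illustrative choice).
    """
    rng = random.Random(seed)
    pool = [c for c in chars for _ in range(repeats)]
    rng.shuffle(pool)

    words = []
    i = 0
    while i < len(pool):
        n = rng.randint(min_len, max_len)
        words.append("".join(pool[i:i + n]))
        i += n

    return [
        " ".join(words[j:j + words_per_line])
        for j in range(0, len(words), words_per_line)
    ]


def count_chars(lines, chars):
    """Return how often each character of `chars` occurs in `lines`."""
    counts = Counter("".join(lines))
    return {c: counts[c] for c in chars}


def append_to_training_text(path, lines):
    """Append the generated lines to a training text file (UTF-8)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
```

Calling `make_lines(HALFWIDTH_KATAKANA, 30)` gives the 30-repetition variant, and `count_chars` can be run on the final training text to confirm how many times each character really appears before text2image renders it.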
On Thursday, November 9, 2017 at 8:35:50 AM UTC+8, Li Xianglei wrote:
>
> Yes, I added half-width characters to the given jpn.training_text and
> used the result as the new jpn.training_text.
>
> On Thursday, November 9, 2017 at 1:21:45 AM UTC+8, shree wrote:
>>
>> Does your training text include both half-width and normal Japanese?
>>
>> ShreeDevi
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Nov 8, 2017 at 4:01 PM, Li Xianglei <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to use tesseract to recognize Japanese text in images.
>>> I found that it gets poor accuracy on half-width Japanese
>>> (Katakana).
>>> I'm trying to improve the accuracy by fine-tuning.
>>> I have tried both [Fine Tuning for ± a few characters] and
>>> [Training Just a Few Layers].
>>> They seem to improve the accuracy on half-width Japanese but do a
>>> lot of harm to normal Japanese recognition.
>>> Here is how I do the fine-tuning.
>>>
>>> 1 Add half-width Japanese to lang/jpn/jpn.training_text (cloned
>>> from tesseract-ocr/langdata; it seems to be training data for v3)
>>> 2 Create training data with tesstrain.sh
>>> 3 combine_tessdata -e /usr/local/tesseract/share/tessdata/jpn.traineddata
>>> (which is best/jpn.traineddata) trainhalfwidth/jpn.lstm
>>> 4 lstmtraining --model_output trainhalfwidth/jpnhw \
>>> --continue_from trainhalfwidth/jpn.lstm \
>>> --traineddata trainhalfwidth/jpn/jpn.traineddata \
>>> --old_traineddata /usr/local/tesseract/share/tessdata/jpn.traineddata \
>>> --train_listfile trainhalfwidth/jpn.training_files.txt \
>>> --max_iterations 3600 &> trainhalfwidth/basetrain.log
>>>
>>> Any advice? Thank you
>>>
>>> #It seems Ray is working on the training data for LSTM. Any news so far?
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/604e4981-9ca4-48be-980d-999df93f73ed%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.

