Re: [tesseract-ocr] train tesseract to improve the half-width Japanese(Katakana) recognition.

Li Xianglei Thu, 09 Nov 2017 18:37:09 -0800

>
> Recently I modified the tesstrain_utils.sh and --max_pages=3 option 
> for text2image command,


Correction: I meant that I modified tesstrain_utils.sh and *removed* the
--max_pages=3 option.


On Friday, November 10, 2017 at 10:29:21 AM UTC+8, Li Xianglei wrote:
>
> Recently I modified the tesstrain_utils.sh and --max_pages=3 option 
> for text2image command,
> it seems that normal Japanese is now recognized well, but the 
> half-width characters still have poor accuracy.
> Now I wonder how many characters I should add to jpn.training_text. 
> The wiki section [Fine Tuning for ± a few characters] says there should be 
> about 20 repetitions of the ±, but I tried about 20 repetitions of every 
> half-width character and it did not seem to help.
> With 30 repetitions the results got better, but still not good enough; 
> then I tried 150 repetitions and the results got worse.
>
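
To make the repetition experiments above (20, 30, 150) easier to reproduce, here is a minimal Python sketch that generates a fixed number of occurrences of every half-width katakana letter and appends them to an existing training text. The function names `make_halfwidth_lines` and `augment_training_text` are hypothetical, and shuffling the characters instead of writing runs of the same glyph is my own assumption; nothing in this thread or the wiki confirms that it helps.

```python
import random

# Half-width katakana letters occupy U+FF66 through U+FF9D in Unicode.
# The voiced marks U+FF9E and U+FF9F are left out of this sketch.
HALFWIDTH_KATAKANA = [chr(cp) for cp in range(0xFF66, 0xFF9E)]


def make_halfwidth_lines(repeats, chars_per_line=20, seed=0):
    """Return lines in which every half-width katakana appears `repeats` times.

    The pool is shuffled (assumption, not from the thread) so that each
    character shows up next to varied neighbours.
    """
    rng = random.Random(seed)
    pool = HALFWIDTH_KATAKANA * repeats
    rng.shuffle(pool)
    return ["".join(pool[i:i + chars_per_line])
            for i in range(0, len(pool), chars_per_line)]


def augment_training_text(base_text, repeats):
    """Append the generated half-width lines to an existing training text."""
    extra = make_halfwidth_lines(repeats)
    return base_text.rstrip("\n") + "\n" + "\n".join(extra) + "\n"
```

The result of `augment_training_text(open("jpn.training_text", encoding="utf-8").read(), 30)` could then be written back as the new jpn.training_text before running tesstrain.sh. Keeping the repeat count as a parameter makes it simple to compare several levels under otherwise identical settings.
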
> On Thursday, November 9, 2017 at 8:35:50 AM UTC+8, Li Xianglei wrote:
>>
>> Yes, I added half-width characters to the given jpn.training_text and 
>> used it as the new jpn.training_text.
>>
>> On Thursday, November 9, 2017 at 1:21:45 AM UTC+8, shree wrote:
>>>
>>> Does your training text include both half-width and normal Japanese?
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Wed, Nov 8, 2017 at 4:01 PM, Li Xianglei <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>     
>>>>       I'm trying to use tesseract to recognize Japanese in images.
>>>>       I found that it gets poor accuracy on half-width 
>>>> Japanese (Katakana).
>>>>       I'm trying to improve the accuracy by fine-tuning; 
>>>>       both [Fine Tuning for ± a few characters] and [Training Just a 
>>>> Few Layers] have been tried.
>>>>       They seem to improve the accuracy on half-width Japanese but do 
>>>> a lot of harm to normal Japanese recognition.
>>>>       Here is how I do the fine-tuning.
>>>>
>>>>    1 Add half-width Japanese to lang/jpn/jpn.training_text (cloned 
>>>> from tesseract-ocr/langdata, which seems to be training data for v3)
>>>>    2 Create the training data with tesstrain.sh
>>>>    3 combine_tessdata -e /usr/local/tesseract/share/tessdata/jpn.traineddata \
>>>>                   trainhalfwidth/jpn.lstm
>>>>      (this jpn.traineddata is best/jpn.traineddata)
>>>>    4 lstmtraining --model_output trainhalfwidth/jpnhw \
>>>>                   --continue_from trainhalfwidth/jpn.lstm \
>>>>                   --traineddata trainhalfwidth/jpn/jpn.traineddata \
>>>>                   --old_traineddata /usr/local/tesseract/share/tessdata/jpn.traineddata \
>>>>                   --train_listfile trainhalfwidth/jpn.training_files.txt \
>>>>                   --max_iterations 3600 &> trainhalfwidth/basetrain.log
>>>>
>>>>   Any advice? Thank you
>>>>
>>>>    # It seems Ray is working on the training data for LSTM; any news so far?
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/604e4981-9ca4-48be-980d-999df93f73ed%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/604e4981-9ca4-48be-980d-999df93f73ed%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
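
The lstmtraining call in step 4 quoted above is easy to break when a mail client re-wraps the long paths. As a rough sketch, the same invocation can be assembled as a Python argument list, where wrapping cannot split a path. The helper name `build_lstmtraining_cmd` and its parameters are hypothetical; the flags and paths are copied from the quoted message, not checked against any particular Tesseract build.

```python
import shlex


def build_lstmtraining_cmd(work_dir, base_traineddata, max_iterations=3600):
    """Assemble the lstmtraining argument list from step 4 of the thread.

    `work_dir` is the directory holding the extracted jpn.lstm and the
    output of tesstrain.sh; `base_traineddata` is the installed
    jpn.traineddata that the LSTM model was extracted from.
    """
    return [
        "lstmtraining",
        "--model_output", f"{work_dir}/jpnhw",
        "--continue_from", f"{work_dir}/jpn.lstm",
        "--traineddata", f"{work_dir}/jpn/jpn.traineddata",
        "--old_traineddata", base_traineddata,
        "--train_listfile", f"{work_dir}/jpn.training_files.txt",
        "--max_iterations", str(max_iterations),
    ]


cmd = build_lstmtraining_cmd(
    "trainhalfwidth",
    "/usr/local/tesseract/share/tessdata/jpn.traineddata",
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would launch the training, assuming lstmtraining is on the PATH and the files from steps 1 to 3 already exist.
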

