Update:

1. When a smaller training_text is used for plus training of chi_sim, the
unicharset is restricted to the characters in that text. So merge the
lstm-unicharset of the existing model with it.

2. The unicharset for chi_sim built from langdata is different from the one
extracted from tessdata_best, so using the training_text from langdata will
add more characters.

3. The fonts used for LSTM training are listed in okfonts.txt in
langdata_lstm. The same fonts should be used for plus training; otherwise
the model will also have to learn new typefaces.

4. Another user was trying to fine-tune chi_sim (check old forum posts) to
add the theta sign. If I remember correctly, plus training did not work
for it. Replacing the top layer was the better option.

5. I am training with the following fonts:
"Adobe Heiti Std" \
"Adobe Kaiti Std" \
"Arial Unicode MS" \
"Bitstream CyberCJK" \
"Microsoft YaHei UI" \
"Microsoft YaHei" \
"NSimSun" \
"Noto Sans CJK SC" \
"Noto Sans Mono CJK SC" \
"STXihei" \
"SimSun" \
"WenQuanYi Zen Hei Medium" \
"WenQuanYi Zen Hei Mono Medium" \
"WenQuanYi Zen Hei Sharp Medium" \

At iteration 1046/1100/1100, Mean rms=0.704%, delta=1.445%, char train=4.888%, word train=46.842%, skip ratio=0%, New best char error = 4.888 wrote best model:/home/ubuntu/tesstutorial/chi_sim_plus/chi_sim_plus4.888_1046.checkpoint wrote checkpoint.
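
As a quick check related to points 1 and 2, the sketch below tests whether a
given character has an entry in an unpacked unicharset file, such as the
new.lstm-unicharset produced by combine_tessdata -u further down this thread.
The function name has_unichar is made up for illustration. It assumes the
usual text layout of a unicharset file: an entry count on the first line, then
one line per character, starting with the character itself.

```shell
# Hypothetical helper (not part of the tesseract tools): succeed if the given
# character has an entry in an unpacked unicharset file.
# Assumption: line 1 holds the entry count; every later line begins with the
# character, followed by a space and its properties.
has_unichar() {
  # $1 = path to the unicharset file, $2 = character to look for
  awk -v ch="$2" 'NR > 1 && $1 == ch { found = 1 } END { exit !found }' "$1"
}

# Example use (file name taken from the combine_tessdata step in this thread):
#   if has_unichar new.lstm-unicharset "±"; then echo present; else echo missing; fi
```

If the character is missing here, no amount of further training will make the
model output it, so this is worth checking before looking at the error rates.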


On Wed, Jun 19, 2019 at 12:36 AM Jingjing Lin <[email protected]> wrote:

> Can you please test on arrows (↑ or ↓) instead of ±, if it's not
> inconvenient for you?
>
> On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote:
>>
>> I will test tomorrow and let you know
>>
>> On Tue, 18 Jun 2019, 23:47 Jingjing Lin, <[email protected]> wrote:
>>
>>> It still didn't work after I increased the number of ± to about 100.
>>> And the error rate after 2000 iterations is about 11. This is a pretty high
>>> error rate compared to what we get when adding a few characters to eng. With
>>> such a high error rate, I would not be surprised that it couldn't recognize
>>> some special characters like ±. Is this as good as it gets for chi_sim? Or can
>>> I increase the iterations to make the error rate smaller?
>>> Thanks for your help.
>>>
>>> On Tuesday, June 18, 2019 at 10:32:37 AM UTC-4, shree wrote:
>>>>
>>>> Increase the number of ± to about 100.
>>>>
>>>> On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin <[email protected]>
>>>> wrote:
>>>>
>>>>> Sorry to bother you again and again.
>>>>> I reduced the training text to about 450 lines, with about 30 ± in it.
>>>>> I used two fonts and 1000 iterations. But it looks like ± is still not
>>>>> picked up in the BEST OCR TEXT at all; it is always recognized as something
>>>>> else. What is happening here? Should I increase the number of ±? Or do I
>>>>> need to increase the number of fonts? I'm now trying more iterations.
>>>>>
>>>>> On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote:
>>>>>>
>>>>>> If you increase the iterations, the plus type of training will not
>>>>>> give good results, i.e. the other letters will lose accuracy.
>>>>>>
>>>>>> You can try reducing the size of the training text while still keeping
>>>>>> all the characters that you need in it.
>>>>>>
>>>>>> On Tue, Jun 18, 2019 at 2:24 AM Jingjing Lin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I was only using two different fonts, and the lowest error rate
>>>>>>> reached during training was 11.271. Does this mean I really need to
>>>>>>> increase the iterations?
>>>>>>>
>>>>>>> On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote:
>>>>>>>>
>>>>>>>> How big was your training text? How many iterations? Did the fonts
>>>>>>>> you used for training support the plus-minus sign?
>>>>>>>>
>>>>>>>> You can run training with --debug_interval -1 so that you can see in
>>>>>>>> the console messages whether the plus-minus sign is being picked up
>>>>>>>> for training.
>>>>>>>>
>>>>>>>> On Mon, 17 Jun 2019, 23:29 Jingjing Lin, <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks. It works. The new character I added was there.
>>>>>>>>>
>>>>>>>>> Do you have any idea why, after fine-tuning, tesseract still
>>>>>>>>> couldn't recognize the new character I added? When I tried to add '±'
>>>>>>>>> to eng it worked, but when I tried to add '±' to chi_sim, it didn't
>>>>>>>>> work (explained below). Is there anything we need to pay attention to
>>>>>>>>> when fine-tuning langs other than eng?
>>>>>>>>>
>>>>>>>>> I used
>>>>>>>>>
>>>>>>>>> lstmeval --model ~/tesstutorial/trainplusminus/plusminus_checkpoint \
>>>>>>>>>   --traineddata ~/tesstutorial/trainplusminus/chi_sim/chi_sim.traineddata \
>>>>>>>>>   --eval_listfile ~/tesstutorial/evalplusminus/chi_sim.training_files.txt 2>&1 |
>>>>>>>>>   grep ±
>>>>>>>>>
>>>>>>>>> to check, and ± only shows up in the Truth lines, not in the OCR lines.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Monday, June 17, 2019 at 11:31:24 AM UTC-4, shree wrote:
>>>>>>>>>>
>>>>>>>>>> combine_tessdata -u new.traineddata new.
>>>>>>>>>>
>>>>>>>>>> will unpack the traineddata file. Check new.lstm-unicharset in the output.
>>>>>>>>>>
>>>>>>>>>> On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I tried to fine-tune the model and add a new character via
>>>>>>>>>>> training, but it seems the newly generated traineddata still cannot
>>>>>>>>>>> recognize this character. To debug, I want to check whether the new
>>>>>>>>>>> character is in the .unicharset in the new traineddata. Is there
>>>>>>>>>>> a way to do this?
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>> You received this message because you are subscribed to the Google
>>>>>>>>> Groups "tesseract-ocr" group.
>>>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>>>> send an email to [email protected].
>>>>>>>>> To post to this group, send email to [email protected].
>>>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>>>> To view this discussion on the web visit
>>>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com
>>>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/d251e677-5f9d-4f8f-b41a-aa015538ca47%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>>>> .
>>>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>

