Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Jingjing Lin
Thanks for your comments. So did you mean that the method we used to add a special character to eng cannot be used to add a special character to chi_sim? Will we have to retrain the top layer to achieve this? Another question: when we use a smaller .training_text, the .unicharset only contains a limited

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Old thread https://groups.google.com/forum/#!searchin/tesseract-ocr/layer$20chi_sim%7Csort:date/tesseract-ocr/iFMg7Gjczq4/f7_XRop2BAAJ On Wed, Jun 19, 2019 at 9:13 PM Shree Devi Kumar wrote: > Update: > > 1. When using a smaller training_text for chi_sim for plus training, the > unicharset

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-19 Thread Shree Devi Kumar
Update: 1. When using a smaller training_text for chi_sim for plus training, the unicharset gets restricted. So, merge the lstm-unicharset with it. 2. The unicharset for chi_sim using langdata is different from the one extracted from tessdata_best. So using training_text from langdata will add

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Can you please test on arrows (↑ or ↓) instead of ±, if it's not inconvenient for you? On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote: > > I will test tomorrow and let you know > > On

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks a lot! On Tuesday, June 18, 2019 at 2:21:18 PM UTC-4, shree wrote: > > I will test tomorrow and let you know > > On Tue, 18 Jun 2019, 23:47 Jingjing Lin, > > wrote: > >> It still didn't work after I increased the number of ± to about 100. >> And the error rate after 2000 iterations is about 11. This is a

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
I will test tomorrow and let you know. On Tue, 18 Jun 2019, 23:47 Jingjing Lin, wrote: > It still didn't work after I increased the number of ± to about 100. And > the error rate after 2000 iterations is about 11. This is a pretty high > error rate compared to what we have for adding a few

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
It still didn't work after I increased the number of ± to about 100. And the error rate after 2000 iterations is about 11. This is a pretty high error rate compared to what we have for adding a few characters to eng. With such a high error rate, I would not be surprised that it couldn't recognize

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Shree Devi Kumar
Increase the number of ± to about 100. On Tue, Jun 18, 2019 at 7:39 PM Jingjing Lin wrote: > Sorry to bother you again and again. > I reduced the training text to about 450 lines, with about 30 ± in it. I > used two fonts and 1000 iterations. But it looks like ± is still not > picked up by the

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Sorry to bother you again and again. I reduced the training text to about 450 lines, with about 30 ± in it. I used two fonts and 1000 iterations. But it looks like ± is still not picked up in the BEST OCR TEXT at all; it is always recognized as something else. What is happening here? Should I
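
A quick way to confirm the figures quoted in this message (number of lines, number of ± occurrences) is to count them directly. The following is a minimal Python sketch, not part of the Tesseract tooling; the function name and the sample text are hypothetical and only illustrate the count.

```python
def count_symbol(text, symbol):
    """Return (line count, total occurrences, lines containing symbol)."""
    lines = text.splitlines()
    total = text.count(symbol)
    lines_with = sum(1 for line in lines if symbol in line)
    return len(lines), total, lines_with


# Hypothetical three-line training text, for illustration only.
sample = "tolerance ± 5\nno symbol here\n± 1 and ± 2\n"
n_lines, n_total, n_lines_with = count_symbol(sample, "±")
print(n_lines, n_total, n_lines_with)
```

For a real training text, the string could be read with open(path, encoding="utf-8").read(), where path is whatever .training_text file is being used.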

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-18 Thread Jingjing Lin
Thanks for your advice. I'll try reducing the training text size. On Tuesday, June 18, 2019 at 12:28:25 AM UTC-4, shree wrote: > > If you increase the iterations then the plus type of training will not > give good results, i.e. the other letters will lose accuracy. > > You can try to reduce the training text size

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
If you increase the iterations then the plus type of training will not give good results, i.e. the other letters will lose accuracy. You can try to reduce the training text size while still keeping all the characters that you need as part of the training text. On Tue, Jun 18, 2019 at 2:24 AM

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
Yes, each iteration is one line. For eng, the langdata training text is about 80 lines and you add 15 symbols for plus minus. With 30 fonts, you will have about 2400 lines. So in 3600 iterations, all samples will be seen and trained. For chi_sim with larger training text it will be different.
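
The arithmetic in this message can be sketched in a few lines of Python. This assumes, as the message states, that each iteration consumes one rendered line and that every training-text line is rendered once per font; the function names are hypothetical.

```python
def rendered_lines(text_lines, fonts):
    """Total training samples: one rendered line per (text line, font) pair."""
    return text_lines * fonts


def average_passes(iterations, total_lines):
    """Average number of times each sample is seen, at one line per iteration."""
    return iterations / total_lines


total = rendered_lines(80, 30)       # eng figures quoted in the message
passes = average_passes(3600, total)
print(total, passes)                 # 2400 1.5
```

With a larger training text, as for chi_sim, the same number of iterations covers a smaller fraction of the samples, which is why the message says the situation will be different there.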

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
When I checked with --debug_interval -1, I found that although ± is in the GROUND TRUTH, it always showed up as + or something else, but not ±, in the BEST OCR TEXT. What can I do in this situation? On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote: > > How big was your training text? How many iterations? Did

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
I was only using two different fonts, and it only achieved a lowest error rate of 11.271 after the training. Does this mean I really need to increase the iterations? On Monday, June 17, 2019 at 2:16:31 PM UTC-4, shree wrote: > > How big was your training text? How many iterations? Did the fonts you use > for

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
The training text was only about 2200 lines (200 kB) and I used 3600 iterations. The fonts I used support ±. What do you mean by 'whether ± is being picked for training'? When I set --debug_interval -1, I found that it outputs only one line per iteration. Does that mean in every iteration

Re: [tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Shree Devi Kumar
How big was your training text? How many iterations? Did the fonts you use for training support the plus minus sign? You can run training with --debug_interval -1 so that you can see in the console messages whether the plus minus is being picked for training. On Mon, 17 Jun 2019, 23:29 Jingjing

[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread Jingjing Lin
Thanks. It works. The new character I added was there. Do you have any idea why, after fine tuning, tesseract still couldn't recognize the new character I added? When I tried to add '±' to eng it worked, but when I tried to add '±' to chi_sim, it didn't work (explained below). Is there anything

[tesseract-ocr] Re: how to check .unicharset in a .traineddata file

2019-06-17 Thread shree
The command "combine_tessdata -u new.traineddata new." (the trailing dot is part of the output prefix) will unpack the traineddata file. Check new.lstm-unicharset in it. On Monday, June 17, 2019 at 8:20:24 PM UTC+5:30, Jingjing Lin wrote: > > I tried to fine tune the model and add a new character via training, but > it seems it still couldn't recognize
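
Once the file is unpacked, the check for a given character can be scripted. The following is a minimal Python sketch, assuming the unpacked new.lstm-unicharset is a plain-text unicharset: an entry count on the first line, then one entry per line whose first space-separated field is the character. The function name and the simplified sample entries are hypothetical, not real Tesseract output.

```python
def unicharset_contains(lines, char):
    """Return True if char is the first field of any unicharset entry.

    lines is an iterable of text lines; the first line (the entry count)
    is skipped. Assumes the plain-text layout described above.
    """
    entries = iter(lines)
    next(entries, None)  # skip the count line
    for line in entries:
        first_field = line.rstrip("\n").split(" ", 1)[0]
        if first_field == char:
            return True
    return False


# Hypothetical, simplified sample entries for illustration only.
sample = [
    "3\n",
    "NULL 0 Common 0\n",
    "a 3 Latin 1\n",
    "± 0 Common 2\n",
]
print(unicharset_contains(sample, "±"))
print(unicharset_contains(sample, "↑"))
```

For a real file, the lines could come from open("new.lstm-unicharset", encoding="utf-8"), using the file name produced by the unpack step above.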