Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-26 Thread Piyush Chandra
Hi Shree, could you please help me with this issue: https://groups.google.com/forum/#!topic/tesseract-ocr/DvuCBEKoVOo Sorry for tagging you in this post. Thanks in advance, ma'am! On Thursday, 16 April 2020 20:08:52 UTC+5:30, shree wrote: > > You are training from scratch. It will take thousands

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
You are training from scratch. It will take thousands of iterations. Try fine-tuning. On Thu, Apr 16, 2020, 19:51 Piyush Chandra wrote: > Hi Shree, > > Thanks for replying. > > So shall I remove them from the text file and create a unicharset file after > that, or do I have to do something while

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Piyush Chandra
Hi Shree, Thanks for replying. So shall I remove them from the text file and create a unicharset file after that, or do I have to do something while creating the lstmf files? Also, will this affect the training if I don't remove them? I saw that training was continuing but the best char error was

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
U+0965 ॥ e0 a5 a5 DEVANAGARI DOUBLE DANDA On Thu, Apr 16, 2020, 19:25 Shree Devi Kumar wrote: > U+200D ‍ e2 80 8d ZERO WIDTH JOINER > -- You received this message because you are subscribed to the Google Groups "tesseract-ocr" group. To unsubscribe from this group and stop receiving emails

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-16 Thread Shree Devi Kumar
U+200D ‍ e2 80 8d ZERO WIDTH JOINER
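A minimal Python sketch of how a line in the "codepoint, character, UTF-8 bytes, name" layout used above can be produced for any character in a training text. The function name `describe` is hypothetical and is not part of any Tesseract tool; only the standard `unicodedata` module is used.

```python
import unicodedata


def describe(ch):
    """Return 'U+XXXX <char> <utf-8 bytes in hex> <Unicode name>' for one character."""
    utf8_hex = " ".join(f"{byte:02x}" for byte in ch.encode("utf-8"))
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {ch} {utf8_hex} {name}"


# The two characters mentioned in the thread.
for ch in ("\u200d", "\u0965"):
    print(describe(ch))
```

Running `describe` over every distinct character of a ground-truth text file is one way to spot invisible characters such as the zero width joiner before building a unicharset.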

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
hin.des0.txt These are the files I used. For the box file, I used the command below: tesseract hin.des0.PNG hin.des0 -l hin lstmbox On Wednesday, 15 April 2020 06:52:48 UTC+5:30, shree wrote: > > How are you creating the

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Shree Devi Kumar
How are you creating the box files? On Wed, Apr 15, 2020, 01:52 Piyush Chandra wrote: > For other files, when I try on Linux, it comes out like this: > > unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box > Extracting unicharset from box file hin.desk0.box > Invalid start of grapheme

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
For other files, when I try on Linux, it comes out like this:
unicharset_extractor --norm_mode 2 hin.desk0.box hin.desk1.box
Extracting unicharset from box file hin.desk0.box
Invalid start of grapheme sequence:H=0x94d
Normalization failed for string '्'
Invalid start of grapheme sequence:M=0x93e
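The errors above name 0x94d (the virama) and 0x93e (the vowel sign AA) as the start of a box entry. A rough way to locate such entries is to scan the box file for lines whose text begins with a combining mark. The sketch below is only an approximation based on Unicode general categories (`Mn`, `Mc`); it is not Tesseract's own validation logic. It assumes the usual box line layout of text followed by five space-separated numbers, and the helper names are hypothetical.

```python
import unicodedata


def starts_with_mark(text):
    """True if the first character is a nonspacing or spacing combining mark."""
    return bool(text) and unicodedata.category(text[0]) in ("Mn", "Mc")


def find_leading_marks(box_lines):
    """Return (line number, text, codepoints) for box entries that start with a mark."""
    hits = []
    for number, line in enumerate(box_lines, start=1):
        # Assumed layout: <text> <left> <bottom> <right> <top> <page>
        parts = line.rstrip("\n").rsplit(" ", 5)
        if len(parts) != 6:
            continue  # not a well-formed box line, skip it
        text = parts[0]
        if starts_with_mark(text):
            hits.append((number, text, [f"U+{ord(c):04X}" for c in text]))
    return hits


# Made-up coordinates, for illustration only.
sample = [
    "\u0926 10 20 30 40 0",
    "\u094d 31 20 35 40 0",
    "\u093e 36 20 45 40 0",
]
for hit in find_leading_marks(sample):
    print(hit)
```

With a real file, the lines could be read with `open(path, encoding="utf-8")` and the reported entries merged by hand into the preceding consonant's box.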

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-14 Thread Piyush Chandra
Hi Shree, When I use the unicharset_extractor command, I get these errors:
unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box
Extracting unicharset from box file hin.exp1.box
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Piyush Chandra
Thanks for the help! :) On Thursday, 9 April 2020 12:34:38 UTC+5:30, shree wrote: > > # Normalization mode - 2, 1 - for unicharset_extractor and Pass through > Recoder for combine_lang_model > ifeq ($(LANG_TYPE),Indic) > NORM_MODE =2 > RECODER =--pass_through_recoder > > > On Thu, Apr 9, 2020 at

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Shree Devi Kumar
# Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model
ifeq ($(LANG_TYPE),Indic)
NORM_MODE =2
RECODER =--pass_through_recoder
On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar wrote: > Unicharset will look like the following: > > द 1
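As a rough illustration of what the Makefile fragment above configures, the following Python sketch assembles the two command lines with `--norm_mode 2` and `--pass_through_recoder`. Those two settings and `--output_unicharset` come from the thread; the remaining `combine_lang_model` flags (`--input_unicharset`, `--script_dir`, `--output_dir`, `--lang`) are written from memory of the tool's options and should be checked against its `--help` output. The function name and all paths are hypothetical.

```python
import shlex


def indic_training_commands(lang, box_files, script_dir, output_dir):
    """Build argument lists for unicharset_extractor and combine_lang_model (Indic settings)."""
    unicharset = f"{output_dir}/unicharset"
    extract = [
        "unicharset_extractor",
        "--output_unicharset", unicharset,
        "--norm_mode", "2",          # NORM_MODE =2 in the Makefile fragment
        *box_files,
    ]
    combine = [
        "combine_lang_model",
        "--input_unicharset", unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--pass_through_recoder",    # RECODER =--pass_through_recoder
        "--lang", lang,
    ]
    return extract, combine


# Hypothetical file and directory names, for illustration only.
extract, combine = indic_training_commands(
    "hin", ["hin.exp1.box"], "langdata", "output"
)
for command in (extract, combine):
    print(" ".join(shlex.quote(arg) for arg in command))
```

The lists are only printed here; they could be handed to `subprocess.run` once the paths point at real files.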

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Shree Devi Kumar
Unicharset will look like the following:
द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14
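The bracketed values at the end of each unicharset line above ([926 ], [930 ], [94d ]) match the hexadecimal Unicode codepoints of the entry. The short Python sketch below shows how an akshara breaks down into such codepoints. The function `codepoint_tag` is hypothetical and only imitates the bracketed layout seen in the listing; it is not a Tesseract API.

```python
def codepoint_tag(text):
    """Format the codepoints of text in the bracketed hex style seen in the listing."""
    return "[" + "".join(f"{ord(ch):x} " for ch in text) + "]"


single = "\u0926"               # da, one codepoint
akshara = "\u0926\u094d\u0930"  # da + virama + ra, rendered as one akshara

print(single, codepoint_tag(single))
print(akshara, codepoint_tag(akshara))
print(len(akshara), "codepoints in the akshara")
```

An akshara that looks like a single glyph on screen therefore maps to several entries when the unicharset is built from individual codepoints.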

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-09 Thread Piyush Chandra
Thank you, Shree, for giving the overview. Could you please help me understand your last point? "Your unicharset should have Unicode codepoints." What does that mean? Any example would be helpful. I was actually using akshara (attached box file image). On Thursday, 9 April 2020 09:02:43

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Shree Devi Kumar
Devanagari.unicharset, Latin.unicharset and radical-stroke.txt The script unicharset files are useful in setting character properties. For most scripts they are already available in langdata_lstm. I don't think they are mandatory for LSTM training, but by copying them once you can avoid the warning

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
Hi Shree, I am learning how to create traineddata for new languages, because I would also like to contribute to Tesseract. I have followed all your posts as well as your projects on GitHub. (Wanted to thank you for helping and contributing so many

Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Shree Devi Kumar
Why do you want to fine-tune eng to get to Hindi traineddata? You can fine-tune hin.traineddata or script/Devanagari.traineddata. On Wed, Apr 8, 2020, 21:00 Piyush Chandra wrote: > When I downloaded the Devanagari.unicharset, Latin.unicharset and > radical-stroke.txt > , it worked. What are

[tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
When I downloaded Devanagari.unicharset, Latin.unicharset and radical-stroke.txt, it worked. What are these files and why do we need them? Do we need to use them every time we work on a new language, or do we need to create our own? On Wednesday, 8 April 2020 20:42:44 UTC+5:30, Piyush Chandra

[tesseract-ocr] Re: Tesseract error while combine_lang_model

2020-04-08 Thread Piyush Chandra
On Wednesday, 8 April 2020 20:42:44 UTC+5:30, Piyush Chandra wrote: > > Hi, > > I am trying to create a Hindi traineddata from scratch using > eng.traineddata. > > I used some PNG and TXT files to create box files using lstmbox and edited > those box files to correct the words. > > Then, I used