Hi Shree,

When I ran the unicharset_extractor command, I got these errors:

unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box
Extracting unicharset from box file hin.exp1.box
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid start of grapheme sequence:D=0x901
Normalization failed for string 'ँ'
Invalid start of grapheme sequence:M=0x941
Normalization failed for string 'ु'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'े'
Invalid start of grapheme sequence:M=0x940
Normalization failed for string 'ी'
Invalid start of grapheme sequence:M=0x948
Normalization failed for string 'ै'
Mirror ] of [ is not in unicharset
Wrote unicharset file min.unicharset

The box file used was:

ह 28 33 261 74 0
ा 28 33 261 74 0
ँ 28 33 261 74 0
, 28 33 261 74 0
28 33 261 74 0
म 28 33 261 74 0
ु 28 33 261 74 0
झ 28 33 261 74 0
े 28 33 261 74 0
28 33 261 74 0
[ 28 33 261 74 0
ख 28 33 261 74 0
28 33 261 74 0
ल 28 33 261 74 0
ग 28 33 261 74 0
ी 28 33 261 74 0
28 33 261 74 0
ह 28 33 261 74 0
ै 28 33 261 74 0
। 28 33 261 74 0
28 33 261 74 0

Do I need to just ignore them or what am I missing here?

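For reference, here is a small Python sketch that looks up the code points reported in the "Invalid start of grapheme sequence" lines. It only uses the standard unicodedata module; the helper names (describe, is_combining_mark) are made up for this illustration. All six values come out as combining marks (dependent vowel signs and the candrabindu). That seems consistent with the message, since each of them sits alone at the start of a box line, but that reading is an assumption on my part and not something confirmed in this thread.

```python
import unicodedata

# Code points copied from the error lines in the log above.
REPORTED = [0x93E, 0x901, 0x941, 0x947, 0x940, 0x948]


def describe(cp):
    """Return (Unicode name, general category) for a code point."""
    ch = chr(cp)
    return unicodedata.name(ch), unicodedata.category(ch)


def is_combining_mark(cp):
    """True when the general category is one of Mn, Mc or Me."""
    return unicodedata.category(chr(cp)).startswith("M")


if __name__ == "__main__":
    for cp in REPORTED:
        name, category = describe(cp)
        print(f"U+{cp:04X}  {category}  {name}  mark={is_combining_mark(cp)}")
```

A base consonant such as ह (U+0939) is not a combining mark, which would explain why the ह lines in the box file do not trigger the message.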
On Thursday, 9 April 2020 12:34:38 UTC+5:30, shree wrote:
>
> # Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model
> ifeq ($(LANG_TYPE),Indic)
> NORM_MODE =2
> RECODER =--pass_through_recoder
>
>
> On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar <shree...@gmail.com> wrote:
>
>> Unicharset will look like the following:
>>
>> द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
>> र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
>> ् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
>> श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
>> य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
>> त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
>> ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
>> प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
>> ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
>> ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x
>>
>> You can unpack any of the existing traineddatas from tessdata_best or
>> tessdata_fast and check.
>>
>> combine_tessdata -u
>>
>> and look at the lstm-unicharset in the components.
>>
>> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra <piyus...@gmail.com> wrote:
>>
>>> Thank you, Shree, for giving the overview.
>>>
>>> Could you please help me understand your last point? "Your unicharset
>>> should have Unicode codepoints." What does that mean? Any example would
>>> be helpful. I was actually using akshara (attached box file image).
>>>
>>>
>>> On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>>>>
>>>> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>>>>
>>>> The script unicharsets are useful in setting character properties. For
>>>> most scripts they are already available in langdata_lstm. I don't think
>>>> they are mandatory for LSTM training, but by copying them once you can
>>>> avoid the warning messages.
>>>>
>>>> radical-stroke.txt is used only for CJK languages, but tesseract checks
>>>> for it during the training process, so you need to make it available.
>>>>
>>>> For Chhattisgarhi, if you are training for text as written in
>>>> Devanagari, I would suggest training from script/Devanagari.traineddata
>>>> rather than English.
>>>>
>>>> Please note that if you are starting from scratch, you don't need a
>>>> starting traineddata. If you use one, then you are fine-tuning.
>>>>
>>>> Finally, you need to use the correct mode for Indic languages with
>>>> unicharset_extractor. Your unicharset should have Unicode codepoints,
>>>> not akshara (consonant vowel sign combination).
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesser...@googlegroups.com.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com
>>> .
>>>
>>
>> --
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>
--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tesseract-ocr+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/e25655fe-4793-40c0-bb95-eabda187a252%40googlegroups.com.
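
The advice quoted earlier in the thread is that the unicharset should hold Unicode codepoints rather than whole akshara. As a rough illustration of that split, the Python sketch below breaks the word कीर्तन into its individual code points. The helper name is hypothetical, and this only shows what "one entry per codepoint" looks like; it is not a description of how unicharset_extractor works internally.

```python
def to_codepoint_labels(text):
    """Return one 'U+XXXX' label per Unicode code point in text."""
    # Iterating over a Python str yields code points, not akshara or grapheme clusters.
    return [f"U+{ord(ch):04X}" for ch in text]


# The word कीर्तन written with escapes so the exact code points are unambiguous.
WORD = "\u0915\u0940\u0930\u094d\u0924\u0928"

if __name__ == "__main__":
    for ch, label in zip(WORD, to_codepoint_labels(WORD)):
        print(label, ch)
```

Read as akshara, the word has three units (की, र्त, न), while at code point level it has six entries, including the vowel sign ी and the virama ्.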