Re: [tesseract-ocr] Re: Tesseract error while combine_lang_model

Piyush Chandra Tue, 14 Apr 2020 04:32:04 -0700

Hi Shree, 

When I ran the unicharset_extractor command, I got these errors:


unicharset_extractor --norm_mode 2 --output_unicharset min.unicharset hin.exp1.box
Extracting unicharset from box file hin.exp1.box
Invalid start of grapheme sequence:M=0x93e
Normalization failed for string 'ा'
Invalid start of grapheme sequence:D=0x901
Normalization failed for string 'ँ'
Invalid start of grapheme sequence:M=0x941
Normalization failed for string 'ु'
Invalid start of grapheme sequence:M=0x947
Normalization failed for string 'े'
Invalid start of grapheme sequence:M=0x940
Normalization failed for string 'ी'
Invalid start of grapheme sequence:M=0x948
Normalization failed for string 'ै'
Mirror ] of [ is not in unicharset
Wrote unicharset file min.unicharset
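
For reference, the hex values in the messages above are Unicode code points. Below is a minimal Python sketch (not part of Tesseract; the helper name describe_code_points is made up for illustration) that pulls them out of the log text and looks up each one's Unicode name and general category:

```python
import re
import unicodedata

# The "Invalid start" lines copied from the log above.
LOG = """\
Invalid start of grapheme sequence:M=0x93e
Invalid start of grapheme sequence:D=0x901
Invalid start of grapheme sequence:M=0x941
Invalid start of grapheme sequence:M=0x947
Invalid start of grapheme sequence:M=0x940
Invalid start of grapheme sequence:M=0x948
"""

def describe_code_points(log_text):
    """Return (hex digits, Unicode name, general category) for each '=0x...' value."""
    rows = []
    for match in re.finditer(r"=0x([0-9a-fA-F]+)", log_text):
        ch = chr(int(match.group(1), 16))
        rows.append((match.group(1), unicodedata.name(ch), unicodedata.category(ch)))
    return rows

for hex_digits, name, category in describe_code_points(LOG):
    print(f"U+{hex_digits.upper():0>4} {category} {name}")
```

Every value resolves to a Devanagari vowel sign or the candrabindu, with general category Mn or Mc (combining marks). That seems consistent with the "Invalid start" wording: a mark sitting alone on a box line has no base consonant in front of it.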

The box file used was:

ह 28 33 261 74 0
ा 28 33 261 74 0
ँ 28 33 261 74 0
, 28 33 261 74 0
  28 33 261 74 0
म 28 33 261 74 0
ु 28 33 261 74 0
झ 28 33 261 74 0
े 28 33 261 74 0
  28 33 261 74 0
[ 28 33 261 74 0
ख 28 33 261 74 0
  28 33 261 74 0
ल 28 33 261 74 0
ग 28 33 261 74 0
ी 28 33 261 74 0
  28 33 261 74 0
ह 28 33 261 74 0
ै 28 33 261 74 0
। 28 33 261 74 0
28 33 261 74 0
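
To make the structure of this box file concrete, here is a rough Python sketch, assuming the "symbol left bottom right top page" layout shown above. It lists the box entries that consist of a combining mark on its own and shows how the same code points group into clusters when marks are attached to the preceding character. The helpers box_symbol, lone_marks and to_clusters are hypothetical names, and the grouping rule is a simplification (it does not join virama conjuncts), not what Tesseract does internally.

```python
import unicodedata

# First four lines of the box file above, written with escapes for clarity.
BOX_LINES = [
    "\u0939 28 33 261 74 0",  # HA
    "\u093e 28 33 261 74 0",  # VOWEL SIGN AA
    "\u0901 28 33 261 74 0",  # CANDRABINDU
    ", 28 33 261 74 0",
]

def box_symbol(line):
    """Return the symbol field of a box line, or None if the symbol is missing."""
    parts = line.rsplit(" ", 5)  # split off the five numeric fields from the right
    return parts[0] if len(parts) == 6 and parts[0] else None

def is_mark(ch):
    """True for nonspacing (Mn) and spacing (Mc) combining marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")

def lone_marks(lines):
    """Symbols that begin with a combining mark, i.e. have no base character."""
    symbols = [box_symbol(line) for line in lines]
    return [s for s in symbols if s and is_mark(s[0])]

def to_clusters(text):
    """Simplified grouping: attach each combining mark to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and is_mark(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

symbols = [s for s in map(box_symbol, BOX_LINES) if s]
print("lone marks:", [f"U+{ord(s[0]):04X}" for s in lone_marks(BOX_LINES)])
print("clusters:  ", to_clusters("".join(symbols)))
```

With the four sample lines, the two marks are reported as lone entries, while the cluster view keeps them together with the preceding consonant.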

Can I simply ignore these messages, or am I missing something here?

On Thursday, 9 April 2020 12:34:38 UTC+5:30, shree wrote:
>
> # Normalization mode - 2, 1 - for unicharset_extractor and Pass through Recoder for combine_lang_model
> ifeq ($(LANG_TYPE),Indic)
> NORM_MODE =2
> RECODER =--pass_through_recoder
>
>
> On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar <[email protected]> wrote:
>
>> Unicharset will look like the following:
>>
>> द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
>> र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
>> ् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
>> श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
>> य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
>> त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
>> ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
>> प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
>> ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
>> ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x
>>
>> You can unpack any of the existing traineddatas from tessdata_best or 
>> tessdata_fast and check.
>>
>> combine_tessdata -u 
>>
>> and look at the lstm-unicharset in the components
>>
>> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra <[email protected]> wrote:
>>
>>> Thank you Shree for giving the overview.
>>>
>>> Could you please help me understand your last point, "Your unicharset 
>>> should have Unicode codepoints"? What does that mean? An example would be 
>>> helpful. I was actually using aksharas (see the attached box file image).
>>>
>>>
>>>
>>> On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>>>>
>>>> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>>>>
>>>> The script unicharsets are useful for setting character properties. For 
>>>> most scripts they are already available in langdata_lstm. I don't think 
>>>> they are mandatory for LSTM training, but by copying them once you can 
>>>> avoid the warning messages.
>>>>
>>>> radical-stroke.txt is used only for CJK languages, but tesseract checks 
>>>> for it during the training process, so you need to make it available.
>>>>
>>>> For Chhattisgarhi, if you are training on text written in Devanagari, I 
>>>> suggest training from script/Devanagari.traineddata rather than English.
>>>>
>>>> Please note that if you are starting from scratch, you don't need a 
>>>> starting traineddata. If you use one, you are fine-tuning.
>>>>
>>>> Finally, you need to use the correct mode for Indic languages with 
>>>> unicharset_extractor. Your unicharset should have Unicode codepoints, not 
>>>> aksharas (consonant + vowel sign combinations).
>>>>
>>>>
>>>>
>>>>
>>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com.
>>>
>>>
>>
>>
>> -- 
>>
>> ____________________________________________________________
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>
>
>
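
The unicharset lines quoted above each describe a single code point (for example [926 ] or [94d ]), not a consonant plus vowel sign combination. As a rough illustration, the Python sketch below splits one of those lines into fields. The field labels (properties, metrics, script, other_case, direction, mirror) are assumptions about the layout, not taken from this thread, and parse_unicharset_line is a hypothetical helper that does not handle entries whose symbol is itself "#" or "[".

```python
from typing import List, NamedTuple

class UnicharsetEntry(NamedTuple):
    unichar: str
    properties: int          # assumed to be a hex bit mask
    metrics: List[int]       # the ten comma-separated numbers
    script: str
    other_case: int
    direction: int
    mirror: int
    code_points: List[int]   # hex values from the trailing "[...]" comment

# First entry from the listing quoted above, written with escapes.
SAMPLE = ("\u0926 1 34,72,192,192,100,122,0,0,99,114 "
          "Devanagari 11 0 11 \u0926 # \u0926 [926 ]x")

def parse_unicharset_line(line):
    """Split one unicharset entry into labelled fields (layout is an assumption)."""
    data, _, comment = line.partition("#")
    unichar, props, metrics, script, other_case, direction, mirror = data.split()[:7]
    inside = comment[comment.index("[") + 1 : comment.index("]")]
    return UnicharsetEntry(
        unichar=unichar,
        properties=int(props, 16),
        metrics=[int(m) for m in metrics.split(",")],
        script=script,
        other_case=int(other_case),
        direction=int(direction),
        mirror=int(mirror),
        code_points=[int(h, 16) for h in inside.split()],
    )

entry = parse_unicharset_line(SAMPLE)
print(entry.script, [f"U+{cp:04X}" for cp in entry.code_points])
```

Running the same helper over an unpacked lstm-unicharset would be one way to check that every entry maps to individual code points rather than to multi-character aksharas.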

