# Normalization mode - 2, 1 - for unicharset_extractor and Pass through
# Recoder for combine_lang_model
ifeq ($(LANG_TYPE),Indic)
NORM_MODE =2
RECODER =--pass_through_recoder
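
Outside the Makefile, these two settings end up as command-line options. Here is a minimal Python sketch of that mapping, assuming the `--norm_mode` option of unicharset_extractor and the `--pass_through_recoder` option of combine_lang_model. The function names and the file names `my.unicharset` and `sample.box` are placeholders for illustration only, and only the Indic branch shown above is covered.

```python
def training_options(lang_type):
    """Mirror the Makefile branch above; only LANG_TYPE=Indic is covered."""
    if lang_type == "Indic":
        return {"norm_mode": 2, "recoder": ["--pass_through_recoder"]}
    # The excerpt above does not show the settings for other language types.
    raise ValueError(f"no settings known for LANG_TYPE={lang_type!r}")


def build_commands(lang_type, box_files, unicharset="my.unicharset"):
    """Build (but do not run) the two command lines as argument lists."""
    opts = training_options(lang_type)
    extract = [
        "unicharset_extractor",
        "--output_unicharset", unicharset,
        "--norm_mode", str(opts["norm_mode"]),
        *box_files,
    ]
    # combine_lang_model needs further options (language, directories, ...)
    # that are left out here; only the recoder flag is illustrated.
    combine = [
        "combine_lang_model",
        "--input_unicharset", unicharset,
        *opts["recoder"],
    ]
    return extract, combine


extract_cmd, combine_cmd = build_commands("Indic", ["sample.box"])
print(" ".join(extract_cmd))
print(" ".join(combine_cmd))
```

The commands are only printed, so the sketch can be tried without Tesseract installed.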


On Thu, Apr 9, 2020 at 12:29 PM Shree Devi Kumar <[email protected]>
wrote:

> Unicharset will look like the following:
>
> द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x
> र 1 58,64,192,192,84,119,0,0,81,110 Devanagari 12 0 12 र # र [930 ]x
> ् 0 3,32,61,197,12,181,0,0,0,1 Devanagari 13 17 13 ् # ् [94d ]
> श 1 61,64,192,195,128,148,0,12,130,147 Devanagari 14 0 14 श # श [936 ]x
> य 1 63,64,192,192,114,142,0,0,111,133 Devanagari 15 0 15 य # य [92f ]x
> त 1 61,64,192,192,112,135,0,0,110,126 Devanagari 16 0 16 त # त [924 ]x
> ि 0 62,65,228,253,132,279,0,0,40,65 Devanagari 17 0 17 ि # ि [93f ]
> प 1 63,64,192,192,98,126,0,0,97,119 Devanagari 18 0 18 प # प [92a ]x
> ू 0 1,35,67,197,33,193,0,0,0,1 Devanagari 19 17 19 ू # ू [942 ]
> ज 1 63,64,192,192,138,165,0,0,128,157 Devanagari 20 0 20 ज # ज [91c ]x
>
> You can unpack any of the existing traineddata files from tessdata_best or
> tessdata_fast and check:
>
> combine_tessdata -u
>
> and look at the lstm-unicharset among the unpacked components.
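
A quick way to check an unpacked lstm-unicharset is to confirm that the first field of each line is a single codepoint. The Python sketch below assumes that the first space-separated field is the character itself and that the bracketed part of the trailing comment lists its codepoints in hex, as in the lines quoted above. `check_entry` is a hypothetical helper name, not part of Tesseract.

```python
import re


def check_entry(line):
    """Return (unichar, actual codepoints, codepoints listed in the comment)."""
    unichar = line.split(" ", 1)[0]
    actual = [format(ord(ch), "x") for ch in unichar]
    # The trailing comment carries the codepoints in hex, e.g. "[926 ]".
    match = re.search(r"\[([0-9a-fA-F ]+)\]", line)
    listed = match.group(1).split() if match else []
    return unichar, actual, listed


sample = "द 1 34,72,192,192,100,122,0,0,99,114 Devanagari 11 0 11 द # द [926 ]x"
unichar, actual, listed = check_entry(sample)
print(unichar, actual, listed)
print("single codepoint:", len(unichar) == 1)
```

An akshara-based unicharset would show entries whose first field is several codepoints long.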
>
> On Thu, Apr 9, 2020 at 12:15 PM Piyush Chandra <[email protected]>
> wrote:
>
>> Thank you Shree for giving the overview.
>>
>> Could you please help me understand your last point: "Your unicharset
>> should have Unicode codepoints"? What does that mean? An example would be
>> helpful. I was actually using akshara (see the attached box file image).
>>
>>
>>
>> On Thursday, 9 April 2020 09:02:43 UTC+5:30, shree wrote:
>>>
>>> Devanagari.unicharset, Latin.unicharset and radical-stroke.txt
>>>
>>> The script unicharsets are useful for setting character properties. For
>>> most scripts they are already available in langdata_lstm. I don't think
>>> they are mandatory for LSTM training, but copying them once avoids
>>> the warning messages.
>>>
>>> radical-stroke.txt is used only for CJK languages, but tesseract checks
>>> for it during the training process, so you need to make it available.
>>>
>>> For Chhattisgarhi, if you are training for text written in Devanagari, I
>>> would suggest training from script/Devanagari.traineddata rather than English.
>>>
>>> Please note that if you are starting from scratch, you don't need a
>>> starting traineddata. If you use one, you are fine-tuning.
>>>
>>> Finally, you need to use the correct mode for Indic languages with
>>> unicharset_extractor. Your unicharset should have Unicode codepoints, not
>>> akshara (consonant + vowel sign combinations).
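
To make the codepoint versus akshara distinction concrete, here is a small Python sketch. The `aksharas` function is a deliberately simplified grouping rule written for this illustration: combining marks attach to the preceding character, and a consonant following a virama joins the same cluster. It is not Tesseract's own normalization logic.

```python
import unicodedata

VIRAMA = "\u094d"  # Devanagari sign virama (halant)


def aksharas(text):
    """Group codepoints into rough akshara clusters (simplified rule)."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        after_virama = bool(clusters) and clusters[-1].endswith(VIRAMA)
        if clusters and (is_mark or after_virama):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# The word "kirtan" in Devanagari, spelled out codepoint by codepoint:
# KA, vowel sign II, RA, virama, TA, NA
word = "\u0915\u0940\u0930\u094d\u0924\u0928"

print("codepoints:", [f"U+{ord(ch):04X}" for ch in word])
print("aksharas:  ", aksharas(word))
```

The unicharset should contain entries like those in the first list (one per distinct codepoint), not the multi-codepoint clusters in the second.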
>>>
>>>
>>>
>>>
>>> --
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com
>> <https://groups.google.com/d/msgid/tesseract-ocr/338f0a8e-d998-4411-bcb6-8d49dfbb4ab6%40googlegroups.com?utm_medium=email&utm_source=footer>
>> .
>>
>
>
> --
>
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>


-- 

____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

