[tesseract-ocr] Re: Unrecognized characters in the traineddata model

robertyoung0511 Thu, 17 Aug 2017 18:54:54 -0700

Any other information about these special characters would also help me. 
If you know anything about them, please leave a reply.


Thanks.

On Friday, August 18, 2017 at 9:45:11 AM UTC+8, [email protected] wrote:
>
> I have debugged the code and found that the special characters 'Joined' 
> and '|Broken|0|1' are added while generating the unicharset file. 
>
> But what is the function of these characters? Can anyone tell me at which 
> stage of the training process these characters play a role? I can't 
> find it. Thanks a lot.
>
> What about the other special characters, such as 'cl', '|d|0|2', and 
> '|d|1|2'? What is their function? Are they added in the combine_lang_model 
> stage? 
>
> Can you help me?
>
>
> Thanks sincerely.
>
> On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, [email protected] wrote:
>>
>> Hello,
>>
>> I have extracted all the characters and ID numbers from 
>> chi_sim.traineddata and stored them in a txt file, which looks like 
>> this:
>>
>> 0     
>> 1    Joined
>> 2    |Broken|0|1
>> 3    S
>> 4    D
>> 5    F
>> 6    8
>> 7    7
>> 8    0
>> 9    K
>> 10    O
>> 11    U
>> 12    H
>> 13    E
>> 14    I
>> 15    4
>> 16    5
>> 17    1
>> 18    9
>> 19    &
>> 20    C
>> 21    W
>> 22    N
>> 23    _
>> 24    P
>> 25    M
>> 26    T
>> 27    V
>> 28    R
>> 29    L
>> 30    A
>> 31    Y
>> 32    2
>> 33    J
>> 34    B
>> 35    G
>> 36    3
>> 37    6
>> 38    Z
>> 39    X
>> 40    Q
>> 41    '
>> 42    +
>> 43    -
>> 44    .
>> 45    #
>> 46    e
>> 47    v
>> 48    a
>> 49    m
>> 50    i
>> 51    z
>> 52    o
>> 53    l
>> 54    s
>> 55    h
>> 56    n
>> 57    d
>> 58    g
>> 59    y
>> 60    u
>> 61    王
>> 62    汝
>> 63    敏
>> 64    邹
>> 65    立
>> 66    健
>> 67    熊
>> ...
>> ...
>> 4013    扔
>> 4014    嗨
>> 4015    髋
>> 4016    「
>> 4017    [
>> 4018    』
>> 4019    瀵
>> 4020    〕
>> 4021    掺
>> 4022    |"|0|2
>> 4023    |"|1|2
>> 4024    rn
>> 4025    |m|0|2
>> 4026    |m|1|2
>> 4027    in
>> 4028    cl
>> 4029    |d|0|2
>> 4030    |d|1|2
>> 4031    rm
>> 4032    |rm|0|2
>> 4033    |rm|1|2
>> 4034    nn
>> 4035    |nn|0|2
>> 4036    |nn|1|2
>> 4037    ri
>> 4038    |n|0|2
>> 4039    |n|1|2
>> 4040    |h|0|2
>> 4041    |h|1|2
>> 4042    |u|0|2
>> 4043    |u|1|2
>> 4044    |m|0|3
>> 4045    |m|1|3
>> 4046    |m|2|3
>> 4047    |H|0|2
>> 4048    |H|1|2
>> 4049    |H|0|3
>> 4050    |H|1|3
>> 4051    |H|2|3
>> 4052    |w|0|2
>> 4053    |w|1|2
>> 4054    |W|0|2
>> 4055    |W|1|2
>> 4056    fi
>> 4057    |k|0|2
>> 4058    |k|1|2
>> 4059    ki
>> 4060    |ki|0|2
>> 4061    |ki|1|2
>> 4062    |in|0|2
>> 4063    |in|1|2
>> 4064    tl
>> 4065    th
>> ...
>>
>>
>> I can recognize most of the characters, such as the Han characters and 
>> the Latin alphabet. But I cannot make sense of some entries, such as 
>> 'Joined' and '|Broken|0|1' at the top of the file, and |"|0|2 and |m|0|2 
>> near the end of the file.
>>
>> Can you explain what these characters mean?
>> 4059    ki
>> 4060    |ki|0|2
>> 4061    |ki|1|2
>> 4062    |in|0|2
>> 4063    |in|1|2
>>  and so on
>>
>>
>> Thanks a lot.
>>
>
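
The pipe-delimited entries in the listing above all share one shape, `|text|i|n`, while entries such as 'rn', 'cl', and 'Joined' are plain multi-character strings. A minimal Python sketch for sorting a dumped listing into those groups follows. It assumes the `|text|i|n` shape means "piece i of n of text", which matches the character-fragment notation in Tesseract's unicharset code but is not confirmed anywhere in this thread. The function names `parse_listing_line` and `classify_entry` are hypothetical helpers, not part of Tesseract.

```python
import re

# Assumed shape of a fragment-style entry: |text|position|total
# (inferred from entries such as |m|0|2 in the listing; an assumption).
FRAGMENT_RE = re.compile(r"^\|(.+)\|(\d+)\|(\d+)$")


def parse_listing_line(line):
    """Split one 'id    entry' line of the dumped txt file into (id, entry)."""
    parts = line.strip().split(None, 1)
    uid = int(parts[0])
    entry = parts[1] if len(parts) > 1 else ""  # id 0 has no visible entry
    return uid, entry


def classify_entry(entry):
    """Classify an entry as blank, fragment-style, multi-char, or single-char."""
    if not entry.strip():
        return ("blank", entry)
    match = FRAGMENT_RE.match(entry)
    if match:
        text, pos, total = match.group(1), int(match.group(2)), int(match.group(3))
        return ("fragment", text, pos, total)
    if len(entry) > 1:
        return ("multi_char", entry)
    return ("single_char", entry)


if __name__ == "__main__":
    sample = ["0     ", "1    Joined", "2    |Broken|0|1", "61    王",
              "4024    rn", "4025    |m|0|2", "4046    |m|2|3"]
    for line in sample:
        uid, entry = parse_listing_line(line)
        print(uid, classify_entry(entry))
```

Note that under this pattern '|Broken|0|1' is reported as fragment-style too, even though it sits with 'Joined' at the top of the file; the sketch only groups entries by shape and does not explain their role in training.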

