[tesseract-ocr] Re: Unrecognized characters in the traineddata model

robertyoung0511 Thu, 17 Aug 2017 18:45:58 -0700

I have debugged the code and found that the special characters 'Joined' and 
'|Broken|0|1' are added while generating the unicharset file.


But what is the function of these characters? Can anyone tell me at which 
stage of the training process these characters play a role? I can't 
find it. Thanks a lot.

For other special characters, such as 'cl', '|d|0|2', '|d|1|2', what is the 
function of these characters? Are they added in the combine_lang_model 
stage? 

Can you help me?


Thanks sincerely.

On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, [email protected] wrote:
>
> Hello,
>
> I have extracted all the characters and their ID numbers from 
> chi_sim.traineddata. The characters are stored in a text file, an excerpt 
> of which is shown below:
>
> 0     
> 1    Joined
> 2    |Broken|0|1
> 3    S
> 4    D
> 5    F
> 6    8
> 7    7
> 8    0
> 9    K
> 10    O
> 11    U
> 12    H
> 13    E
> 14    I
> 15    4
> 16    5
> 17    1
> 18    9
> 19    &
> 20    C
> 21    W
> 22    N
> 23    _
> 24    P
> 25    M
> 26    T
> 27    V
> 28    R
> 29    L
> 30    A
> 31    Y
> 32    2
> 33    J
> 34    B
> 35    G
> 36    3
> 37    6
> 38    Z
> 39    X
> 40    Q
> 41    '
> 42    +
> 43    -
> 44    .
> 45    #
> 46    e
> 47    v
> 48    a
> 49    m
> 50    i
> 51    z
> 52    o
> 53    l
> 54    s
> 55    h
> 56    n
> 57    d
> 58    g
> 59    y
> 60    u
> 61    王
> 62    汝
> 63    敏
> 64    邹
> 65    立
> 66    健
> 67    熊
> ...
> ...
> 4013    扔
> 4014    嗨
> 4015    髋
> 4016    「
> 4017    [
> 4018    』
> 4019    瀵
> 4020    〕
> 4021    掺
> 4022    |"|0|2
> 4023    |"|1|2
> 4024    rn
> 4025    |m|0|2
> 4026    |m|1|2
> 4027    in
> 4028    cl
> 4029    |d|0|2
> 4030    |d|1|2
> 4031    rm
> 4032    |rm|0|2
> 4033    |rm|1|2
> 4034    nn
> 4035    |nn|0|2
> 4036    |nn|1|2
> 4037    ri
> 4038    |n|0|2
> 4039    |n|1|2
> 4040    |h|0|2
> 4041    |h|1|2
> 4042    |u|0|2
> 4043    |u|1|2
> 4044    |m|0|3
> 4045    |m|1|3
> 4046    |m|2|3
> 4047    |H|0|2
> 4048    |H|1|2
> 4049    |H|0|3
> 4050    |H|1|3
> 4051    |H|2|3
> 4052    |w|0|2
> 4053    |w|1|2
> 4054    |W|0|2
> 4055    |W|1|2
> 4056    fi
> 4057    |k|0|2
> 4058    |k|1|2
> 4059    ki
> 4060    |ki|0|2
> 4061    |ki|1|2
> 4062    |in|0|2
> 4063    |in|1|2
> 4064    tl
> 4065    th
> ...
>
>
> I can recognize most of the characters, such as the Han characters and the 
> Latin alphabet. But some entries, such as 'Joined' and '|Broken|0|1' at the 
> top of the file, and |"|0|2 and |m|0|2 at the end of the file, are unfamiliar to me.
>
> Can you explain what these characters mean?
> 4059    ki
> 4060    |ki|0|2
> 4061    |ki|1|2
> 4062    |in|0|2
> 4063    |in|1|2
>  and so on
>
>
> Thanks a lot.
>
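
To make the question concrete, the entries in the listing above can be grouped by their shape. The following is only a rough sketch: the helper names `parse_line` and `classify_entry` are hypothetical, and reading `|x|i|n` as "piece i of n of the character x" is an assumption drawn from the naming pattern in the listing, not something confirmed in this thread.

```python
import re

# Entries such as |m|0|2: a leading bar, the character(s), then two numbers.
# Assumption: the numbers are a piece index and a piece count.
FRAGMENT_RE = re.compile(r"^\|(.+)\|(\d+)\|(\d+)$")


def parse_line(line):
    """Split one 'id    text' line of the dump into (id, text)."""
    parts = line.strip().split(None, 1)
    ident = int(parts[0])
    text = parts[1] if len(parts) > 1 else ""
    return ident, text


def classify_entry(text):
    """Return (kind, detail) for one entry string from the dump."""
    match = FRAGMENT_RE.match(text)
    if match:
        base = match.group(1)
        index = int(match.group(2))
        total = int(match.group(3))
        return ("fragment", (base, index, total))
    if text == "":
        return ("empty", text)
    if text == "Joined":
        return ("special", text)
    if len(text) > 1:
        return ("multi-character", text)
    return ("single-character", text)


if __name__ == "__main__":
    sample = ["1    Joined", "2    |Broken|0|1", "61    王",
              "4024    rn", "4025    |m|0|2", "4032    |rm|0|2"]
    for line in sample:
        ident, text = parse_line(line)
        print(ident, text, classify_entry(text))
```

Grouping the dump this way shows which entries are ordinary characters, which are multi-character strings such as 'rn' or 'cl', and which follow the bar-delimited pattern, which may help when tracing where each group is added during training.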

