Any other information about these special characters would also help me. If you know anything about them, please leave a reply.
Thanks.

On Friday, August 18, 2017 at 9:45:11 AM UTC+8, [email protected] wrote:
>
> I have debugged the code and found that the special characters 'Joined'
> and '|Broken|0|1' are added while generating the unicharset file.
>
> But what is the function of these characters? Can anyone tell me at which
> stage of the training process these characters play a role? I can't
> find it. Thanks a lot.
>
> For the other special characters, such as 'cl', '|d|0|2', '|d|1|2', what is
> the function of these characters? Are they added in the combine_lang_model
> stage?
>
> Can you help me?
>
>
> Thanks sincerely.
>
> On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, [email protected] wrote:
>>
>> Hello,
>>
>> I have extracted all the characters and ID numbers from
>> chi_sim.traineddata. All the characters are stored in a txt file, which
>> looks like the following:
>>
>> 0
>> 1 Joined
>> 2 |Broken|0|1
>> 3 S
>> 4 D
>> 5 F
>> 6 8
>> 7 7
>> 8 0
>> 9 K
>> 10 O
>> 11 U
>> 12 H
>> 13 E
>> 14 I
>> 15 4
>> 16 5
>> 17 1
>> 18 9
>> 19 &
>> 20 C
>> 21 W
>> 22 N
>> 23 _
>> 24 P
>> 25 M
>> 26 T
>> 27 V
>> 28 R
>> 29 L
>> 30 A
>> 31 Y
>> 32 2
>> 33 J
>> 34 B
>> 35 G
>> 36 3
>> 37 6
>> 38 Z
>> 39 X
>> 40 Q
>> 41 '
>> 42 +
>> 43 -
>> 44 .
>> 45 #
>> 46 e
>> 47 v
>> 48 a
>> 49 m
>> 50 i
>> 51 z
>> 52 o
>> 53 l
>> 54 s
>> 55 h
>> 56 n
>> 57 d
>> 58 g
>> 59 y
>> 60 u
>> 61 王
>> 62 汝
>> 63 敏
>> 64 邹
>> 65 立
>> 66 健
>> 67 熊
>> ...
>> ...
>> 4013 扔
>> 4014 嗨
>> 4015 髋
>> 4016 「
>> 4017 [
>> 4018 』
>> 4019 瀵
>> 4020 〕
>> 4021 掺
>> 4022 |"|0|2
>> 4023 |"|1|2
>> 4024 rn
>> 4025 |m|0|2
>> 4026 |m|1|2
>> 4027 in
>> 4028 cl
>> 4029 |d|0|2
>> 4030 |d|1|2
>> 4031 rm
>> 4032 |rm|0|2
>> 4033 |rm|1|2
>> 4034 nn
>> 4035 |nn|0|2
>> 4036 |nn|1|2
>> 4037 ri
>> 4038 |n|0|2
>> 4039 |n|1|2
>> 4040 |h|0|2
>> 4041 |h|1|2
>> 4042 |u|0|2
>> 4043 |u|1|2
>> 4044 |m|0|3
>> 4045 |m|1|3
>> 4046 |m|2|3
>> 4047 |H|0|2
>> 4048 |H|1|2
>> 4049 |H|0|3
>> 4050 |H|1|3
>> 4051 |H|2|3
>> 4052 |w|0|2
>> 4053 |w|1|2
>> 4054 |W|0|2
>> 4055 |W|1|2
>> 4056 fi
>> 4057 |k|0|2
>> 4058 |k|1|2
>> 4059 ki
>> 4060 |ki|0|2
>> 4061 |ki|1|2
>> 4062 |in|0|2
>> 4063 |in|1|2
>> 4064 tl
>> 4065 th
>> ...
>>
>>
>> I can recognize most of the characters, such as the Han characters and
>> the Latin alphabet. But some entries, such as 'Joined' and '|Broken|0|1'
>> at the head of the file, and |"|0|2 and |m|0|2 at the end of the file,
>> I cannot recognize.
>>
>> Can you explain what these entries mean?
>> 4059 ki
>> 4060 |ki|0|2
>> 4061 |ki|1|2
>> 4062 |in|0|2
>> 4063 |in|1|2
>> and so on
>>
>>
>> Thanks a lot.
>>
>
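
In case it helps anyone look at the same entries, here is a minimal sketch of how the pipe-delimited entries in the quoted listing could be picked out and grouped. It assumes only that they have the shape `|text|index|total`, which is what the listing suggests; it does not explain what Tesseract does with them. The names `parse_entry` and `group_fragments` are hypothetical helpers of mine, not part of Tesseract.

```python
import re
from collections import defaultdict

# Assumed shape, inferred only from the dumped listing: |text|index|total
FRAGMENT_RE = re.compile(r"^\|(?P<text>.+)\|(?P<index>\d+)\|(?P<total>\d+)$")


def parse_entry(entry):
    """Return (text, index, total) for a pipe-delimited entry, else None."""
    match = FRAGMENT_RE.match(entry)
    if match is None:
        return None
    return match.group("text"), int(match.group("index")), int(match.group("total"))


def group_fragments(entries):
    """Group the index values of pipe-delimited entries by (text, total)."""
    groups = defaultdict(list)
    for entry in entries:
        parsed = parse_entry(entry)
        if parsed is not None:
            text, index, total = parsed
            groups[(text, total)].append(index)
    return dict(groups)


# A few entries copied from the listing quoted above.
sample = [
    "Joined", "|Broken|0|1", "rn", "|m|0|2", "|m|1|2",
    "cl", "|d|0|2", "|d|1|2", "|m|0|3", "|m|1|3", "|m|2|3", '|"|0|2',
]

for (text, total), indices in sorted(group_fragments(sample).items()):
    print(f"{text!r}: total={total}, indices seen={indices}")
```

Plain entries such as 'Joined', 'rn' and 'cl' do not match the pattern and are skipped, so the output lists only the pipe-delimited ones.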

