I have debugged the code and found that the special entries 'Joined' and '|Broken|0|1' are added while the unicharset file is being generated.
But what is the function of these characters? Can anyone tell me at which stage of the training process they play a role? I can't find it. Thanks a lot.

For the other special entries, such as 'cl', '|d|0|2', and '|d|1|2', what is their function? Are they added in the combine_lang_model stage? Can you help me? Thank you sincerely.

On Tuesday, August 15, 2017 at 1:47:10 PM UTC+8, [email protected] wrote:
>
> Hello,
>
> I have extracted all the characters and ID numbers from
> chi_sim.traineddata. All the characters are stored in a txt file,
> which looks like this:
>
> 0
> 1 Joined
> 2 |Broken|0|1
> 3 S
> 4 D
> 5 F
> 6 8
> 7 7
> 8 0
> 9 K
> 10 O
> 11 U
> 12 H
> 13 E
> 14 I
> 15 4
> 16 5
> 17 1
> 18 9
> 19 &
> 20 C
> 21 W
> 22 N
> 23 _
> 24 P
> 25 M
> 26 T
> 27 V
> 28 R
> 29 L
> 30 A
> 31 Y
> 32 2
> 33 J
> 34 B
> 35 G
> 36 3
> 37 6
> 38 Z
> 39 X
> 40 Q
> 41 '
> 42 +
> 43 -
> 44 .
> 45 #
> 46 e
> 47 v
> 48 a
> 49 m
> 50 i
> 51 z
> 52 o
> 53 l
> 54 s
> 55 h
> 56 n
> 57 d
> 58 g
> 59 y
> 60 u
> 61 王
> 62 汝
> 63 敏
> 64 邹
> 65 立
> 66 健
> 67 熊
> ...
> ...
> 4013 扔
> 4014 嗨
> 4015 髋
> 4016 「
> 4017 [
> 4018 』
> 4019 瀵
> 4020 〕
> 4021 掺
> 4022 |"|0|2
> 4023 |"|1|2
> 4024 rn
> 4025 |m|0|2
> 4026 |m|1|2
> 4027 in
> 4028 cl
> 4029 |d|0|2
> 4030 |d|1|2
> 4031 rm
> 4032 |rm|0|2
> 4033 |rm|1|2
> 4034 nn
> 4035 |nn|0|2
> 4036 |nn|1|2
> 4037 ri
> 4038 |n|0|2
> 4039 |n|1|2
> 4040 |h|0|2
> 4041 |h|1|2
> 4042 |u|0|2
> 4043 |u|1|2
> 4044 |m|0|3
> 4045 |m|1|3
> 4046 |m|2|3
> 4047 |H|0|2
> 4048 |H|1|2
> 4049 |H|0|3
> 4050 |H|1|3
> 4051 |H|2|3
> 4052 |w|0|2
> 4053 |w|1|2
> 4054 |W|0|2
> 4055 |W|1|2
> 4056 fi
> 4057 |k|0|2
> 4058 |k|1|2
> 4059 ki
> 4060 |ki|0|2
> 4061 |ki|1|2
> 4062 |in|0|2
> 4063 |in|1|2
> 4064 tl
> 4065 th
> ...
>
> I can recognize most of the characters, such as the Han characters and
> the Latin alphabet. But some entries, such as 'Joined' and '|Broken|0|1'
> at the top of the file, and |"|0|2 and |m|0|2 at the end of the file,
> I cannot recognize.
>
> Can you explain what these characters mean?
> 4059 ki
> 4060 |ki|0|2
> 4061 |ki|1|2
> 4062 |in|0|2
> 4063 |in|1|2
> and so on
>
> Thanks a lot.
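
In case it helps anyone looking at the same dump, here is a rough Python sketch for grouping the extracted entries by their shape, so the unusual ones are easier to list and count. It only looks at the text pattern. Reading the three pipe-separated fields as (text, index, total) is my own assumption from the list above, not something I have confirmed in the Tesseract sources, and `classify` is just a hypothetical helper name.

```python
import re

# Assumed shape of the odd entries: |text|index|total, e.g. "|m|0|2".
# The field names are guesses from the dump, not taken from Tesseract.
FRAGMENT_RE = re.compile(r"^\|(?P<text>.+)\|(?P<index>\d+)\|(?P<total>\d+)$")


def classify(entry):
    """Return (kind, details) for one entry string from the dump."""
    match = FRAGMENT_RE.match(entry)
    if match:
        details = (match["text"], int(match["index"]), int(match["total"]))
        return "pipe-delimited", details
    if len(entry) > 1:
        # Entries such as "Joined", "rn", "cl", "fi".
        return "multi-character", entry
    return "single-character", entry


if __name__ == "__main__":
    sample = ["Joined", "|Broken|0|1", "S", "王", "cl", "|d|0|2", "|d|1|2"]
    for entry in sample:
        print(entry, classify(entry))
```

Running it over the whole txt file (one entry per line, with the ID column stripped) would show how many entries of each shape there are, which may narrow down where in the training pipeline they come from.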

