Hello, I have extracted all the characters and their ID numbers from chi_sim.traineddata. The characters are stored in a text file, an excerpt of which follows:
0 1 Joined 2 |Broken|0|1 3 S 4 D 5 F 6 8 7 7 8 0 9 K 10 O 11 U 12 H 13 E 14 I 15 4 16 5 17 1 18 9 19 & 20 C 21 W 22 N 23 _ 24 P 25 M 26 T 27 V 28 R 29 L 30 A 31 Y 32 2 33 J 34 B 35 G 36 3 37 6 38 Z 39 X 40 Q 41 ' 42 + 43 - 44 . 45 # 46 e 47 v 48 a 49 m 50 i 51 z 52 o 53 l 54 s 55 h 56 n 57 d 58 g 59 y 60 u 61 王 62 汝 63 敏 64 邹 65 立 66 健 67 熊
... ...
4013 扔 4014 嗨 4015 髋 4016 「 4017 [ 4018 』 4019 瀵 4020 〕 4021 掺 4022 |"|0|2 4023 |"|1|2 4024 rn 4025 |m|0|2 4026 |m|1|2 4027 in 4028 cl 4029 |d|0|2 4030 |d|1|2 4031 rm 4032 |rm|0|2 4033 |rm|1|2 4034 nn 4035 |nn|0|2 4036 |nn|1|2 4037 ri 4038 |n|0|2 4039 |n|1|2 4040 |h|0|2 4041 |h|1|2 4042 |u|0|2 4043 |u|1|2 4044 |m|0|3 4045 |m|1|3 4046 |m|2|3 4047 |H|0|2 4048 |H|1|2 4049 |H|0|3 4050 |H|1|3 4051 |H|2|3 4052 |w|0|2 4053 |w|1|2 4054 |W|0|2 4055 |W|1|2 4056 fi 4057 |k|0|2 4058 |k|1|2 4059 ki 4060 |ki|0|2 4061 |ki|1|2 4062 |in|0|2 4063 |in|1|2 4064 tl 4065 th
...

I can recognize most of the entries, such as the Han characters and the Latin alphabet. But some entries, such as 'Joined' and '|Broken|0|1' at the top of the file, and |"|0|2 and |m|0|2 near the end of the file, I cannot make sense of. Can you explain what these entries mean? For example:

4059 ki 4060 |ki|0|2 4061 |ki|1|2 4062 |in|0|2 4063 |in|1|2

and so on.

Thanks a lot.
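To make the question easier to reproduce, here is a minimal Python sketch that sorts entries from the listing by their shape: single characters, multi-character strings such as ki or Joined, and the pipe-delimited entries such as |ki|0|2. The function names (classify, split_piped, group_entries) are hypothetical, and the sketch only assumes that a pipe-delimited entry looks like |text|number|number. It does not claim to know what the two numbers mean.

```python
import re
from collections import defaultdict

# Shape assumed from the listing only: |text|number|number (e.g. |ki|0|2).
PIPED = re.compile(r"^\|(.+)\|(\d+)\|(\d+)$")


def classify(entry):
    """Return a shape label for one entry of the listing."""
    if PIPED.match(entry):
        return "pipe-delimited"
    if len(entry) > 1:
        return "multi-character"
    return "single character"


def split_piped(entry):
    """Split a pipe-delimited entry into (text, first number, second number)."""
    m = PIPED.match(entry)
    if m is None:
        raise ValueError(f"not a pipe-delimited entry: {entry!r}")
    return m.group(1), int(m.group(2)), int(m.group(3))


def group_entries(pairs):
    """Group ids by the shape of their entry; pairs is an iterable of (id, entry)."""
    groups = defaultdict(list)
    for uid, entry in pairs:
        groups[classify(entry)].append(uid)
    return dict(groups)


# A few (id, entry) pairs copied from the listing.
sample = [
    (1, "Joined"),
    (2, "|Broken|0|1"),
    (61, "王"),
    (4059, "ki"),
    (4060, "|ki|0|2"),
    (4061, "|ki|1|2"),
]

if __name__ == "__main__":
    for label, ids in group_entries(sample).items():
        print(label, ids)
```

Grouping this way shows that each pipe-delimited entry in the excerpt carries a piece of text plus two small integers, which is the part I would like explained.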

