> Can you please provide explanation why do you think that "unicharset_extractor.exe produces wrong and uncomplete files"?
Because this is definitely wrong:

    90
    NULL 0 NULL 0
    A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A
    B 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # B [42 ]A
    C 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # C [43 ]A
    D 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # D [44 ]A
    E 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # E [45 ]A
    F 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # F [46 ]A
    G 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # G [47 ]A
    H 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # H [48 ]A
    I 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # I [49 ]A
    J 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # J [4a ]A
    K 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # K [4b ]A
    L 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # L [4c ]A
    M 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # M [4d ]A
    N 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # N [4e ]A
    O 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # O [4f ]A
    P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # P [50 ]A
    Q 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Q [51 ]A
    R 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # R [52 ]A
    S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # S [53 ]A
    T 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # T [54 ]A
    U 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # U [55 ]A
    V 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # V [56 ]A
    W 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # W [57 ]A
    X 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # X [58 ]A
    Y 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Y [59 ]A
    Z 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Z [5a ]A
    a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # a [61 ]a
    b 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 # b [62 ]a
    c 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # c [63 ]a
    d 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # d [64 ]a

1.) The "other_case" column should contain the ID of the other-case letter. The lowercase letters correctly point to their uppercase counterparts, but the uppercase letters all have the value -1, which is wrong. Each should hold the ID of the corresponding lowercase letter.

2.) The script name is always NULL. It should be Latin or Common.

3.) All the min/max values are completely missing. They are just 0, 255 or 32767. That makes 10 missing columns!

4.) The last column, "normed_form", is missing. The '#' starts a comment, but when this unicharset is read, the '#' is misinterpreted as the "normed_form". This column should mostly contain the same letter as the first column.

Here is a unicharset extracted from a traineddata file, with all columns filled in correctly:

    A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A # A [41 ]A
    B 5 62,68,216,255,91,227,0,27,106,227 Latin 23 0 102 B # B [42 ]A
    etc.
    a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a # a [61 ]a
    b 3 58,64,216,255,87,180,0,25,100,200 Latin 102 0 23 b # b [62 ]a

Result: the unicharset_extractor tool is very buggy, and I have to edit everything by hand.

So my question remains: where do I find detailed documentation of the unicharset file format?
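To make the four points above concrete, here is a minimal Python sketch of how such a line could be checked and partly repaired. It is not part of Tesseract. The names `Entry`, `parse_entry` and `fill_other_case` are made up for illustration, and the column order (unichar, properties, 10 metrics, script, other_case, direction, mirror, normed_form) is only assumed from the correct example in this post. It also assumes that IDs follow line order with the NULL line as ID 0, which matches the listing (a points to 1, b to 2, and so on), and that the comment is introduced by " # ", so an entry for the '#' character itself is not handled.

```python
from typing import List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    unichar: str
    properties: str
    metrics: Tuple[int, ...]       # the 10 min/max values
    script: str
    other_case: int
    direction: int
    mirror: int
    normed_form: Optional[str]     # None when the column is absent


def parse_entry(line: str) -> Entry:
    """Parse one character line (not the count line or 'NULL 0 NULL 0')."""
    # Everything after " # " is treated as a comment (assumption, see above).
    data = line.split(" # ", 1)[0].split()
    if len(data) not in (7, 8):
        raise ValueError(f"unexpected number of columns: {len(data)}")
    metrics = tuple(int(v) for v in data[2].split(","))
    if len(metrics) != 10:
        raise ValueError("expected 10 comma-separated metric values")
    normed = data[7] if len(data) == 8 else None
    return Entry(data[0], data[1], metrics, data[3],
                 int(data[4]), int(data[5]), int(data[6]), normed)


def fill_other_case(entries: List[Entry], first_id: int = 1) -> List[Entry]:
    """Replace other_case == -1 with the ID of the opposite-case letter.

    first_id is the ID of entries[0]; 1 assumes only the NULL line precedes.
    """
    ids = {e.unichar: i for i, e in enumerate(entries, start=first_id)}
    fixed = []
    for e in entries:
        partner = e.unichar.swapcase()
        if e.other_case == -1 and partner != e.unichar and partner in ids:
            e = e._replace(other_case=ids[partner])
        fixed.append(e)
    return fixed


if __name__ == "__main__":
    line = "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A"
    entry = parse_entry(line)
    # A missing normed_form and an other_case of -1 are points 4 and 1 above.
    print(entry.normed_form, entry.other_case)
```

This only covers points 1 and 4. The script name and the real min/max values cannot be reconstructed from the file alone, so they are left untouched here.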

