First of all, the source code is the documentation. Next, it is just your expectation that the output is wrong ;-) Did changing the values manually bring any improvement to OCR? I would not be surprised if those values are not used by the current version of tesseract.
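For anyone who wants to compare the two kinds of file column by column before editing anything by hand, here is a minimal Python sketch that splits one unicharset entry into fields. The column order is inferred from the sample lines in the quoted message below. The field names `other_case` and `normed_form` come from that message; `properties`, `metrics`, `direction` and `mirror` are labels assumed here for illustration, and `parse_unichar_line` is a hypothetical helper, not part of tesseract.

```python
from typing import NamedTuple, Optional, Tuple


class UnicharEntry(NamedTuple):
    char: str
    properties: str            # kept as raw text; its encoding is not assumed here
    metrics: Tuple[int, ...]   # the comma-separated min/max values
    script: str
    other_case: int
    direction: int             # assumed column name
    mirror: int                # assumed column name
    normed_form: Optional[str]


def parse_unichar_line(line: str) -> UnicharEntry:
    """Parse one full-length unicharset entry.

    The leading count line and short entries such as "NULL 0 NULL 0"
    are not handled and raise ValueError.
    """
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError(f"not a full unicharset entry: {line!r}")
    char, properties, metrics, script, other_case, direction, mirror = tokens[:7]
    # An eighth token is treated as normed_form unless it is the '#' that
    # starts the trailing comment. Limitation: an entry whose normed form
    # is literally '#' would be reported as missing.
    normed_form = None
    if len(tokens) > 7 and tokens[7] != "#":
        normed_form = tokens[7]
    return UnicharEntry(
        char,
        properties,
        tuple(int(v) for v in metrics.split(",")),
        script,
        int(other_case),
        int(direction),
        int(mirror),
        normed_form,
    )


if __name__ == "__main__":
    extracted = "A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A"
    unpacked = "a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a # a [61 ]a"
    for sample in (extracted, unpacked):
        print(parse_unichar_line(sample))
```

Running it on one line from each file makes the differences in `script`, `other_case` and `normed_form` explicit, which is a cheaper first check than editing every entry.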
Zdenko

On Fri, Jul 4, 2014 at 7:02 PM, Albrecht Hilker <[email protected]> wrote:

> Can you please provide explanation why do you think that
> "unicharset_extractor.exe produces wrong and incomplete files"?
>
> Because this is definitely wrong:
>
> 90
> NULL 0 NULL 0
> A 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # A [41 ]A
> B 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # B [42 ]A
> C 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # C [43 ]A
> D 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # D [44 ]A
> E 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # E [45 ]A
> F 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # F [46 ]A
> G 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # G [47 ]A
> H 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # H [48 ]A
> I 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # I [49 ]A
> J 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # J [4a ]A
> K 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # K [4b ]A
> L 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # L [4c ]A
> M 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # M [4d ]A
> N 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # N [4e ]A
> O 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # O [4f ]A
> P 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # P [50 ]A
> Q 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Q [51 ]A
> R 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # R [52 ]A
> S 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # S [53 ]A
> T 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # T [54 ]A
> U 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # U [55 ]A
> V 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # V [56 ]A
> W 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # W [57 ]A
> X 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # X [58 ]A
> Y 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Y [59 ]A
> Z 5 0,255,0,255,0,32767,0,32767,0,32767 NULL -1 0 0 # Z [5a ]A
> a 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 1 0 0 # a [61 ]a
> b 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 2 0 0 # b [62 ]a
> c 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 3 0 0 # c [63 ]a
> d 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 4 0 0 # d [64 ]a
>
> 1.)
> The column "other_case" should contain the ID of the other-case letter.
> For the lowercase letters it points correctly to the uppercase letters.
> But the uppercase letters all have a value of -1, which is wrong.
> The corresponding ID of the lowercase letter should be here.
>
> 2.)
> The script name is always NULL.
> It should be LATIN or COMMON.
>
> 3.)
> All the min / max values are completely missing.
> They are 0, 255 or 32767.
> 10 missing columns!
>
> 4.)
> The last column "normed_form" is missing.
> A comment starts at the '#'.
> But when reading this unicharset the '#' is misinterpreted as the
> "normed_form".
> This column should mostly contain the same letter as the first column.
>
> Here you see a unicharset extracted from a traineddata file with all
> columns filled correctly:
>
> A 5 52,68,216,255,100,216,0,17,98,231 Latin 2 0 15 A # A [41 ]A
> B 5 62,68,216,255,91,227,0,27,106,227 Latin 23 0 102 B # B [42 ]A
>
> etc..
>
> a 3 58,65,186,200,85,164,0,26,97,185 Latin 15 0 2 a # a [61 ]a
> b 3 58,64,216,255,87,180,0,25,100,200 Latin 102 0 23 b # b [62 ]a
>
> Result:
> The unicharset_extractor tool is very buggy.
> I have to edit everything by hand.
>
> So my question remains:
>
> Where do I find detailed documentation of the unicharset file?
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/40473df4-df33-45f1-a593-2348f15b6b0b%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

