In some recent posts, I've seen people with problems similar to mine, but no answer on how to fix them. I'm trying to train Tesseract to be more accurate with a new font. When I create the unicharset by running unicharset_extractor on my box file:
```
a 32 692 165 958 0
b 221 734 354 958 0
c 32 446 165 628 0
d 221 488 354 628 0
e 32 275 165 373 0
f 221 317 277 373 0
```

I get the following output:

```
9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]
```

When I then run shapeclustering, its output begins with:

```
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0
Bad properties for index 4, char b: 0,255 0,
```

It seems that unicharset_extractor isn't properly parsing the box file. Some obvious problems with the unicharset file: the "properties" bit mask is 0 for every character, the "glyph_metrics" field appears invalid (0,255,0,255,0,0,0,0,0,0), and the "script" field is NULL when it should be either "Latin" or "Common".

Does anyone have an idea why this is happening?

O/S: Ubuntu 15.10
Tesseract Ver: 3.04

Posts with no simple resolution: https://github.com/tesseract-ocr/tesseract/issues/139
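In case it helps anyone compare against their own files, here is a minimal Python sketch that scans a unicharset for the symptoms described above: properties of 0, the placeholder glyph metrics, and a NULL script. It is not part of Tesseract. The names `find_unset_entries`, `DEFAULT_METRICS`, and `SAMPLE` are made up for illustration, and the field order (character, properties, glyph_metrics, script) is assumed from the output shown in this post.

```python
# Placeholder glyph metrics as they appear in the unicharset output in this post.
DEFAULT_METRICS = "0,255,0,255,0,0,0,0,0,0"

# Special entries that are not real characters from the box file.
SPECIAL_ENTRIES = ("Joined", "|Broken|0|1")

# The unicharset output from this post, used as sample input.
SAMPLE = """9
NULL 0 NULL 0
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # Broken
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # b [62 ]
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # c [63 ]
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # d [64 ]
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # e [65 ]
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0 # f [66 ]
"""


def find_unset_entries(unicharset_text):
    """Return the characters whose properties, metrics, and script all look unset."""
    suspicious = []
    lines = unicharset_text.strip().splitlines()
    for line in lines[1:]:  # the first line is the entry count
        fields = line.split()
        # Skip short lines and lines without a comma-separated metrics field,
        # such as the leading "NULL 0 NULL 0" entry.
        if len(fields) < 4 or "," not in fields[2]:
            continue
        char, props, metrics, script = fields[:4]
        if char in SPECIAL_ENTRIES:
            continue
        if props == "0" and metrics == DEFAULT_METRICS and script == "NULL":
            suspicious.append(char)
    return suspicious


if __name__ == "__main__":
    print(find_unset_entries(SAMPLE))
```

Run against the output above, this flags all six characters from the box file, which matches the "Bad properties" complaints from shapeclustering starting at index 3.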

