[tesseract-ocr] Training Tesseract: unicharset extractor producing "Bad properties"

guspolledri Mon, 30 Nov 2015 14:42:46 -0800

In some recent posts, I've seen people with problems similar to mine, but 
no answer on how to fix them.  I'm trying to train Tesseract to be more 
accurate with a new font.  When I create the unicharset by running 
unicharset_extractor on my box file:


```
a 32 692 165 958 0 
b 221 734 354 958 0 
c 32 446 165 628 0 
d 221 488 354 628 0 
e 32 275 165 373 0 
f 221 317 277 373 0
```

I get the following output:

```
9 
NULL 0 NULL 0 
Joined 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0     # Joined [4a 6f 69 6e 65 64 ]
|Broken|0|1 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0        # Broken 
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ] 
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ] 
c 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # c [63 ] 
d 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # d [64 ] 
e 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # e [65 ] 
f 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # f [66 ]
```

and when I run shapeclustering, the first few lines of its output are:

```
Bad properties for index 3, char a: 0,255 0,255 0,0 0,0 0,0 
Bad properties for index 4, char b: 0,255 0,
```

It seems that unicharset_extractor isn't properly parsing the box file. 
Some obvious problems with the unicharset file: the "properties" bit 
mask is 0, the "glyph_metrics" field appears invalid 
(0,255,0,255,0,0,0,0,0,0), and the "script" field is NULL when it should 
be either "Latin" or "Common".

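For anyone who wants to check their own unicharset for the same symptom, 
here is a minimal Python sketch. It assumes the field order seen in the 
output above (unichar, properties, glyph_metrics, script, then the rest) 
and that the properties field is written in hexadecimal. The names 
`parse_entry` and `unset_entries` are hypothetical helpers, not part of 
Tesseract.

```python
# Sample entries copied from the unicharset_extractor output above.
SAMPLE = """\
9
NULL 0 NULL 0
a 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # a [61 ]
b 0 0,255,0,255,0,0,0,0,0,0 NULL 0 0 0  # b [62 ]
"""


def parse_entry(line):
    """Split one unicharset entry into named fields.

    Returns None for short entries such as "NULL 0 NULL 0" that carry
    no glyph_metrics field.
    """
    tokens = line.split()
    unichar, rest = tokens[0], tokens[1:]
    # Drop the trailing "# ..." comment. Only the tokens after the
    # unichar are scanned, so a "#" unichar itself is not affected.
    for i, tok in enumerate(rest):
        if tok.startswith("#"):
            rest = rest[:i]
            break
    if len(rest) < 3 or "," not in rest[1]:
        return None
    return {
        "unichar": unichar,
        # Assumption: the properties bit mask is stored as hexadecimal.
        "properties": int(rest[0], 16),
        "metrics": rest[1],
        "script": rest[2],
    }


def unset_entries(text):
    """Return the unichars whose properties are 0 and whose script is NULL."""
    flagged = []
    for line in text.splitlines()[1:]:  # the first line is the entry count
        if not line.strip():
            continue
        entry = parse_entry(line)
        if entry and entry["properties"] == 0 and entry["script"] == "NULL":
            flagged.append(entry["unichar"])
    return flagged


if __name__ == "__main__":
    print(unset_entries(SAMPLE))
```

Running this on the sample flags both `a` and `b`, which matches the 
characters named in the "Bad properties" messages.
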
Does anyone have an idea why this is happening?

O/S: Ubuntu 15.10
Tesseract Ver: 3.04

Posts with no simple resolution:
https://github.com/tesseract-ocr/tesseract/issues/139
