I also tried training with all the data, but I see the same problem:
accuracy is much lower than what you get with the default
eng.traineddata.

One thing that looks a bit off is that my unicharset file contains lots
of NULLs, and its contents don't seem to match the documentation on
training:

108
NULL 0 NULL 0
t 3 0,255,0,255 NULL 41 # t [74 ]a
h 3 0,255,0,255 NULL 81 # h [68 ]a
a 3 0,255,0,255 NULL 57 # a [61 ]a
n 3 0,255,0,255 NULL 14 # n [6e ]a
P 5 0,255,0,255 NULL 30 # P [50 ]A
o 3 0,255,0,255 NULL 25 # o [6f ]a
e 3 0,255,0,255 NULL 58 # e [65 ]a
: 10 0,255,0,255 NULL 8 # : [3a ]p
r 3 0,255,0,255 NULL 52 # r [72 ]a
etc...
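
To make the NULLs easier to spot, here is a minimal Python sketch that
parses entries laid out like the excerpt above and lists the ones whose
script field is NULL. The field layout (character, properties, optional
glyph metrics, script, other-case id, trailing comment) is inferred from
this excerpt, and reading the properties value as a hexadecimal bitmask
is my assumption; parse_entry and null_script_entries are hypothetical
helper names, not part of Tesseract.

```python
from typing import List, NamedTuple


class UnicharEntry(NamedTuple):
    unichar: str
    properties: int  # bitmask, assumed to be written in hexadecimal
    script: str
    other_case: int


def parse_entry(line: str) -> UnicharEntry:
    """Parse one unicharset entry line laid out like the excerpt above."""
    # Drop the trailing "# ..." comment if there is one.
    body = line.split(" #", 1)[0]
    fields = body.split()
    if len(fields) == 5:
        # char, properties, glyph metrics, script, other-case id
        unichar, props, _metrics, script, other_case = fields
    elif len(fields) == 4:
        # char, properties, script, other-case id (as in "NULL 0 NULL 0")
        unichar, props, script, other_case = fields
    else:
        raise ValueError("unexpected unicharset line: %r" % line)
    return UnicharEntry(unichar, int(props, 16), script, int(other_case))


def null_script_entries(lines: List[str]) -> List[UnicharEntry]:
    """Return the entries whose script field is the literal string NULL."""
    entries = [parse_entry(line) for line in lines if line.strip()]
    return [entry for entry in entries if entry.script == "NULL"]


# A few entry lines copied from the excerpt (the leading count is omitted).
SAMPLE = [
    "NULL 0 NULL 0",
    "t 3 0,255,0,255 NULL 41 # t [74 ]a",
    "P 5 0,255,0,255 NULL 30 # P [50 ]A",
    ": 10 0,255,0,255 NULL 8 # : [3a ]p",
]

if __name__ == "__main__":
    for entry in null_script_entries(SAMPLE):
        print(entry)
```

Every entry in my file comes out with a NULL script this way, which is
the part that doesn't match the examples in the training documentation.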

Also when combining the files I get this output:

Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 3961
Offset for type 4 is 701702
Offset for type 5 is 702267
Offset for type 6 is -1
Offset for type 7 is 716918
Offset for type 8 is -1
Offset for type 9 is 717216
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
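
A quick way to summarise this output is sketched below in Python. It
assumes that an offset of -1 means the corresponding component was not
included in the combined file; parse_offsets and missing_types are
hypothetical helper names. Mapping the type numbers to component names
would still need the TessdataType enum from the matching Tesseract
version, so the sketch only reports the numbers.

```python
import re

# Matches lines such as "Offset for type 3 is 3961" or "... is -1".
OFFSET_RE = re.compile(r"Offset for type\s+(\d+) is (-?\d+)")


def parse_offsets(log_text):
    """Return {type number: offset} for every offset line in the output."""
    return {int(t): int(off) for t, off in OFFSET_RE.findall(log_text)}


def missing_types(offsets):
    """Type numbers with a negative offset (assumed to mean 'absent')."""
    return sorted(t for t, off in offsets.items() if off < 0)


LOG = """\
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 3961
Offset for type 4 is 701702
Offset for type 5 is 702267
Offset for type 6 is -1
Offset for type 7 is 716918
Offset for type 8 is -1
Offset for type 9 is 717216
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

if __name__ == "__main__":
    print("missing component types:", missing_types(parse_offsets(LOG)))
```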

So I obviously don't have all the necessary files. Would this affect
accuracy when recognising single characters?


On Feb 11, 10:17 am, Chris <[email protected]> wrote:
> Hi All,
>
> I'm using tesseract quite successfully in my code. I have a
> preprocessing step that locates the characters I need to recognise and
> then I feed them into tesseract using the PSM_SINGLE_CHAR mode.
>
> This works great with the default eng.traineddata
>
> I'm also constraining the tessedit_char_whitelist to just have numbers
> and upper case letters as that is the only thing I have in my
> character set.
>
> I want to reduce the size of my app and the traineddata is by far the
> largest chunk of data at the moment.
>
> What I've tried to do is retrain tesseract so that it only has the
> characters I need in the training data. I've done this successfully,
> but when I use my newly created eng.traineddata the accuracy is much
> worse than if I use the default eng.traineddata.
>
> Any ideas why this should be? I thought if anything that accuracy
> would improve if I'd removed all the unnecessary characters from the
> data.
>
> I'm doing my training by taking the box files and stripping out all
> the characters I don't need and then running through the training
> instructions.
>
> I'm using Tesseract 3.01
>
> Any thoughts?
>
> Cheers
> Chris.
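
For reference, the box-file stripping step described in the quoted
message amounts to roughly the following Python sketch. It assumes each
box-file line starts with the character followed by whitespace-separated
coordinates, and that the wanted set is digits plus upper-case letters
(matching the whitelist mentioned above); filter_box_lines is a
hypothetical helper, not a Tesseract tool.

```python
import string

# Characters to keep: digits and upper-case ASCII letters.
ALLOWED = set(string.digits + string.ascii_uppercase)


def filter_box_lines(lines, allowed=ALLOWED):
    """Keep only box-file lines whose first field is an allowed character."""
    kept = []
    for line in lines:
        fields = line.split()
        # The first field is the character; the rest are coordinates.
        if fields and fields[0] in allowed:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Made-up example lines, only to show the filtering behaviour.
    example = [
        "A 10 20 30 40 0",
        "b 31 20 45 40 0",
        "7 50 20 60 40 0",
    ]
    for line in filter_box_lines(example):
        print(line)
```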
