Hi Chris,
I have the same experience - that leads me to the conclusion that it does
not make sense to train "common" fonts...
I think Google uses a different process (more detailed; more/other tools?)
compared to the information available on the wiki... IMHO the situation is
improving with each release, so I am waiting for additional information
regarding 3.02 training.
On the other hand, there is room for the community to train "non-standard"
fonts (e.g. in my case fraktur). I planned to write a blog post about my
experience helping the Slovak version of Project Gutenberg, but there is
always something more urgent... ;-)
Zdenko
On 11.02.2012 14:47, Chris wrote:
I also tried training with all the data. I seem to have the same
problem with accuracy being much lower than what you get with the
default data.
One thing that looks a bit off is that my unicharset file contains lots
of NULLs, and its contents don't seem to match the documentation on
training:
108
NULL 0 NULL 0
t 3 0,255,0,255 NULL 41 # t [74 ]a
h 3 0,255,0,255 NULL 81 # h [68 ]a
a 3 0,255,0,255 NULL 57 # a [61 ]a
n 3 0,255,0,255 NULL 14 # n [6e ]a
P 5 0,255,0,255 NULL 30 # P [50 ]A
o 3 0,255,0,255 NULL 25 # o [6f ]a
e 3 0,255,0,255 NULL 58 # e [65 ]a
: 10 0,255,0,255 NULL 8 # : [3a ]p
r 3 0,255,0,255 NULL 52 # r [72 ]a
etc...
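In case it helps others reading along: the second field of each unicharset entry is a property bitmask, and the field layout below is inferred from the dump above together with the bitmask values commonly given in the 3.0x training docs (1=alpha, 2=lower, 4=upper, 8=digit, 16=punctuation). A minimal sketch, under those assumptions, for decoding entries and spot-checking them:

```python
# Hypothetical sanity check for 3.0x unicharset lines; the bitmask values
# (1=alpha, 2=lower, 4=upper, 8=digit, 16=punct) are assumed from the
# training documentation, not confirmed against this particular file.
PROPS = {1: "alpha", 2: "lower", 4: "upper", 8: "digit", 16: "punct"}

def decode_props(value):
    """Return the set of property names encoded in the bitmask."""
    return {name for bit, name in PROPS.items() if value & bit}

def check_entry(line):
    """Parse one unicharset entry: glyph, then its property bitmask."""
    fields = line.split()
    return fields[0], decode_props(int(fields[1]))

char, props = check_entry("t 3 0,255,0,255 NULL 41 # t [74 ]a")
print(char, sorted(props))  # 't' with bitmask 3 decodes to alpha + lower
```

Under that reading, `t` (3) and `P` (5) decode sensibly as lowercase and uppercase letters, which at least suggests the per-character entries themselves are not corrupted.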
Also when combining the files I get this output:
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 3961
Offset for type 4 is 701702
Offset for type 5 is 702267
Offset for type 6 is -1
Offset for type 7 is 716918
Offset for type 8 is -1
Offset for type 9 is 717216
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
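For anyone decoding that output: an offset of -1 appears to mean the component is simply absent from the combined file. The type indices below follow the `TessdataType` enum order in the 3.0x source (`tessdatamanager.h`); this is a sketch under that assumption, not a definitive mapping.

```python
# Assumed component names, in TessdataType enum order from the 3.0x source.
TESSDATA_TYPES = [
    "lang_config", "unicharset", "ambigs", "inttemp", "pffmtable",
    "normproto", "punc_dawg", "system_dawg", "number_dawg", "freq_dawg",
    "fixed_length_dawgs", "cube_unicharset", "cube_system_dawg",
]

def present_components(offsets):
    """List component names whose offset is not -1 (i.e. present)."""
    return [TESSDATA_TYPES[t] for t, off in offsets.items() if off != -1]

# Offsets copied from the combine_tessdata output above.
offsets = {0: -1, 1: 108, 2: -1, 3: 3961, 4: 701702, 5: 702267,
           6: -1, 7: 716918, 8: -1, 9: 717216, 10: -1, 11: -1, 12: -1}
print(present_components(offsets))
```

Read that way, the traineddata here would contain the unicharset, the classifier files (inttemp/pffmtable/normproto) and two DAWGs, but no config, ambigs, punctuation or number DAWG.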
So I obviously don't have all the necessary files. Would this affect
accuracy when recognising single characters?
On Feb 11, 10:17 am, Chris<[email protected]> wrote:
Hi All,
I'm using tesseract quite successfully in my code. I have a
preprocessing step that locates the characters I need to recognise and
then feeds them into tesseract using the PSM_SINGLE_CHAR mode.
This works great with the default eng.traineddata.
I'm also constraining tessedit_char_whitelist to just numbers and
upper-case letters, as those are the only characters in my set.
I want to reduce the size of my app and the traineddata is by far the
largest chunk of data at the moment.
What I've tried to do is retrain tesseract so that it only has the
characters I need in the training data. I've done this successfully,
but when I use my newly created eng.traineddata the accuracy is much
worse than if I use the default eng.traineddata.
Any ideas why this should be? I thought that, if anything, accuracy
would improve once I'd removed all the unnecessary characters from the
data.
I'm doing my training by taking the box files and stripping out all
the characters I don't need and then running through the training
instructions.
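The box-file stripping step described above can be sketched as a small filter, assuming the usual 3.0x box format of one entry per line, `<glyph> <left> <bottom> <right> <top> <page>` (the sample coordinates below are made up for illustration):

```python
# Sketch of filtering box-file entries down to a whitelist of glyphs.
import string

# Matches the whitelist described above: digits and upper-case letters.
WHITELIST = set(string.digits + string.ascii_uppercase)

def filter_box_lines(lines, whitelist=WHITELIST):
    """Keep only box entries whose glyph (first field) is whitelisted."""
    return [ln for ln in lines if ln.split()[0] in whitelist]

boxes = ["A 12 30 40 60 0", "a 45 30 70 60 0", "7 80 30 95 60 0"]
print(filter_box_lines(boxes))  # the lowercase 'a' entry is dropped
```

One caveat worth noting: dropping box entries without also removing the corresponding glyphs from the page image means the remaining boxes still sit next to unlabelled marks, which may be one source of degraded training quality.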
I'm using Tesseract 3.01.
Any thoughts?
Cheers
Chris.
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en