I think you are right - I don't think the sample box data provided for
download can be the same data that Google uses to build the default
traineddata files.

On Feb 12, 12:42 pm, Zdenko Podobný <[email protected]> wrote:
> Hi Chris,
>
> I have the same experience - that leads me to the conclusion that it
> does not make sense to train "common" fonts...
> I think Google uses a different process (more detailed; more/other
> tools?) compared to the information available on the wiki... IMHO the
> situation is improving with each release, so I am waiting for
> additional information regarding 3.02 training.
>
> On the other hand, there is room for the community to train
> "non-standard" fonts (e.g. in my case Fraktur). I planned to write a
> blog post about my experience from helping the Slovak version of
> Project Gutenberg, but there is always something more urgent... ;-)
>
> Zdenko
>
> On 11.02.2012 14:47, Chris wrote:
>
> > I also tried training with all the data. I seem to have the same
> > problem, with accuracy much lower than what you get with the default
> > data.
>
> > One thing that looks a bit off is that my unicharset file contains
> > lots of NULLs and its contents don't seem to match the training
> > documentation:
>
> > 108
> > NULL 0 NULL 0
> > t 3 0,255,0,255 NULL 41 # t [74 ]a
> > h 3 0,255,0,255 NULL 81 # h [68 ]a
> > a 3 0,255,0,255 NULL 57 # a [61 ]a
> > n 3 0,255,0,255 NULL 14 # n [6e ]a
> > P 5 0,255,0,255 NULL 30 # P [50 ]A
> > o 3 0,255,0,255 NULL 25 # o [6f ]a
> > e 3 0,255,0,255 NULL 58 # e [65 ]a
> > : 10 0,255,0,255 NULL 8 # : [3a ]p
> > r 3 0,255,0,255 NULL 52 # r [72 ]a
> > etc...
>
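For reference, a unicharset like the one quoted above is normally what
unicharset_extractor produces from the box files before the other training
steps; a minimal invocation (the file names here are just placeholders)
would be something like:

    unicharset_extractor eng.myfont.exp0.box eng.myfont.exp1.box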
> > Also when combining the files I get this output:
>
> > Combining tessdata files
> > TessdataManager combined tesseract data files.
> > Offset for type 0 is -1
> > Offset for type 1 is 108
> > Offset for type 2 is -1
> > Offset for type 3 is 3961
> > Offset for type 4 is 701702
> > Offset for type 5 is 702267
> > Offset for type 6 is -1
> > Offset for type 7 is 716918
> > Offset for type 8 is -1
> > Offset for type 9 is 717216
> > Offset for type 10 is -1
> > Offset for type 11 is -1
> > Offset for type 12 is -1
>
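The combine step that prints those offsets is combine_tessdata run over
the intermediate files; assuming everything is prefixed "eng.", the
invocation is simply:

    combine_tessdata eng.

As far as I know, an offset of -1 just means that optional component (the
DAWG files, for example) was not present when combining, not that the
combine itself failed.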
> > So I obviously don't have all the necessary files. Would this affect
> > accuracy when recognising single characters?
>
> > On Feb 11, 10:17 am, Chris <[email protected]> wrote:
> >> Hi All,
>
> >> I'm using Tesseract quite successfully in my code. I have a
> >> preprocessing step that locates the characters I need to recognise,
> >> and then I feed them into Tesseract using the PSM_SINGLE_CHAR mode.
>
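For anyone reproducing this, here is a minimal sketch of that
single-character setup using the 3.x C++ API; the image name and error
handling are placeholders, not part of the original post:

    #include <cstdio>
    #include <leptonica/allheaders.h>
    #include <tesseract/baseapi.h>

    int main() {
      tesseract::TessBaseAPI api;
      if (api.Init(NULL, "eng") != 0) return 1;        // loads eng.traineddata
      api.SetPageSegMode(tesseract::PSM_SINGLE_CHAR);  // one glyph per image
      Pix *pix = pixRead("char.png");                  // a pre-segmented character
      api.SetImage(pix);
      char *text = api.GetUTF8Text();                  // recognised text
      printf("%s\n", text);
      delete [] text;
      pixDestroy(&pix);
      api.End();
      return 0;
    }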
> >> This works great with the default eng.traineddata.
>
> >> I'm also constraining tessedit_char_whitelist to just numbers and
> >> upper-case letters, as those are the only characters in my
> >> character set.
>
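The whitelist part is just a variable set on the same API object before
recognition; the exact character set below is only an example:

    api.SetVariable("tessedit_char_whitelist",
                    "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789");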
> >> I want to reduce the size of my app and the traineddata is by far the
> >> largest chunk of data at the moment.
>
> >> What I've tried to do is retrain Tesseract so that the training
> >> data only contains the characters I need. I've done this
> >> successfully, but when I use my newly created eng.traineddata the
> >> accuracy is much worse than with the default eng.traineddata.
>
> >> Any ideas why this would be? I thought that, if anything, accuracy
> >> would improve after removing all the unnecessary characters from
> >> the data.
>
> >> I'm doing my training by taking the box files, stripping out all
> >> the characters I don't need, and then running through the training
> >> instructions.
>
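The box-file filtering can be as simple as keeping the lines whose first
field is in the wanted set, e.g. with grep; the pattern below assumes an
upper-case-plus-digits whitelist and is only an illustration:

    grep -E '^[A-Z0-9] ' eng.myfont.exp0.box > eng.myfont.exp0.filtered.box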
> >> I'm using Tesseract 3.01.
>
> >> Any thoughts?
>
> >> Cheers
> >> Chris.
