Chris: thanks so much for the great tip! My problem was exactly the same as
yours. Training Tesseract is not the best way to spend your days, so I
followed this tip and saved myself a lot of heartburn.

On Monday, February 13, 2012 2:44:49 PM UTC+5:30, Chris wrote:
>
> I've given up on retraining tesseract. I can't get the same accuracy 
> as the default training data with the sample box data. 
>
> But I solved my problem of app size by unpacking the training data, 
> deleting the bits I don't need and then packaging it back up. 
>
> combine_tessdata -u eng.traineddata eng. 
>
> delete the bits you don't need - in my case I don't need any of the 
> dawg files as I'm just recognising single chars 
>
> then do: 
>
> combine_tessdata eng. 
>
>
>
> On Feb 12, 2:59 pm, Chris <[email protected]> wrote: 
> > I think you are right - I don't think the sample box data provided for 
> > download can be the same data that is used by google to create the 
> > trained data. 
> > 
> > On Feb 12, 12:42 pm, Zdenko Podobný <[email protected]> wrote: 
> > 
> > > Hi Chris, 
> > 
> > > I have the same experience - that leads me to the conclusion that it 
> > > does not make sense to train "common" fonts... 
> > > I think Google uses a different process (more detailed; more/other 
> > > tools?) compared to the information available on the wiki... IMHO the 
> > > situation is improving with each release, so I am waiting for additional 
> > > information regarding 3.02 training. 
> > 
> > > On the other hand, there is a place for the community to train 
> > > "non-standard" fonts (e.g. in my case Fraktur). I planned to write a 
> > > blog post about my experience when I helped with the Slovak version of 
> > > Project Gutenberg, but there is always something more urgent... ;-) 
> > 
> > > Zdenko 
> > 
> > > On 11.02.2012 14:47, Chris wrote: 
> > 
> > > > I also tried training with all the data. I seem to have the same 
> > > > problem with accuracy being much less than what you get with the 
> > > > default one. 
> > 
> > > > One thing that looks a bit off is that my unicharset file contains 
> > > > lots of NULLs, and its contents don't seem to match the documentation 
> > > > on training: 
> > 
> > > > 108 
> > > > NULL 0 NULL 0 
> > > > t 3 0,255,0,255 NULL 41 # t [74 ]a 
> > > > h 3 0,255,0,255 NULL 81 # h [68 ]a 
> > > > a 3 0,255,0,255 NULL 57 # a [61 ]a 
> > > > n 3 0,255,0,255 NULL 14 # n [6e ]a 
> > > > P 5 0,255,0,255 NULL 30 # P [50 ]A 
> > > > o 3 0,255,0,255 NULL 25 # o [6f ]a 
> > > > e 3 0,255,0,255 NULL 58 # e [65 ]a 
> > > > : 10 0,255,0,255 NULL 8 # : [3a ]p 
> > > > r 3 0,255,0,255 NULL 52 # r [72 ]a 
> > > > etc... 
> > 
> > > > Also when combining the files I get this output: 
> > 
> > > > Combining tessdata files 
> > > > TessdataManager combined tesseract data files. 
> > > > Offset for type 0 is -1 
> > > > Offset for type 1 is 108 
> > > > Offset for type 2 is -1 
> > > > Offset for type 3 is 3961 
> > > > Offset for type 4 is 701702 
> > > > Offset for type 5 is 702267 
> > > > Offset for type 6 is -1 
> > > > Offset for type 7 is 716918 
> > > > Offset for type 8 is -1 
> > > > Offset for type 9 is 717216 
> > > > Offset for type 10 is -1 
> > > > Offset for type 11 is -1 
> > > > Offset for type 12 is -1 
> > 
> > > > So I obviously don't have all the necessary files. Would this affect 
> > > > accuracy when recognising single characters? 
> > 
> > > > On Feb 11, 10:17 am, Chris <[email protected]> wrote: 
> > > >> Hi All, 
> > 
> > > >> I'm using tesseract quite successfully in my code. I have a 
> > > >> preprocessing step that locates the characters I need to recognise 
> > > >> and then I feed them into tesseract using the PSM_SINGLE_CHAR mode. 
> > 
> > > >> This works great with the default eng.traineddata 
> > 
> > > >> I'm also constraining the tessedit_char_whitelist to just numbers 
> > > >> and upper-case letters, as that is the only thing I have in my 
> > > >> character set. 
> > 
> > > >> I want to reduce the size of my app, and the traineddata is by far 
> > > >> the largest chunk of data at the moment. 
> > 
> > > >> What I've tried to do is retrain tesseract so that it only has the 
> > > >> characters I need in the training data. I've done this successfully, 
> > > >> but when I use my newly created eng.traineddata the accuracy is much 
> > > >> worse than if I use the default eng.traineddata. 
> > 
> > > >> Any ideas why this should be? I thought if anything that accuracy 
> > > >> would improve if I'd removed all the unnecessary characters from the 
> > > >> data. 
> > 
> > > >> I'm doing my training by taking the box files and stripping out all 
> > > >> the characters I don't need and then running through the training 
> > > >> instructions. 
> > 
> > > >> I'm using tesseract 3.01 
> > 
> > > >> Any thoughts? 
> > 
> > > >> Cheers 
> > > >> Chris.
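
The single-character setup described in the original question (PSM_SINGLE_CHAR plus tessedit_char_whitelist) can also be driven from the 3.x command line by putting the variable in a config file. A minimal sketch, assuming a hypothetical one-character image `char.png` and the `-psm` flag available since 3.01 (PSM mode 10 is single character):

```shell
# Config file restricting recognition to digits and upper-case letters
cat > uppernum.config <<'EOF'
tessedit_char_whitelist 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ
EOF

# -psm 10 selects single-character page segmentation (PSM_SINGLE_CHAR);
# the trailing argument loads the config file above
tesseract char.png out -psm 10 uppernum.config

# The recognised character lands in out.txt
cat out.txt
```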

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
