I also tried training with all the data. I see the same problem: accuracy is much lower than with the default eng.traineddata.
One thing that looks a bit off is that my unicharset file contains lots of NULLs, and its contents don't seem to match the training documentation:

    108
    NULL 0 NULL 0
    t 3 0,255,0,255 NULL 41 # t [74 ]a
    h 3 0,255,0,255 NULL 81 # h [68 ]a
    a 3 0,255,0,255 NULL 57 # a [61 ]a
    n 3 0,255,0,255 NULL 14 # n [6e ]a
    P 5 0,255,0,255 NULL 30 # P [50 ]A
    o 3 0,255,0,255 NULL 25 # o [6f ]a
    e 3 0,255,0,255 NULL 58 # e [65 ]a
    : 10 0,255,0,255 NULL 8 # : [3a ]p
    r 3 0,255,0,255 NULL 52 # r [72 ]a
    etc...

Also, when combining the files I get this output:

    Combining tessdata files
    TessdataManager combined tesseract data files.
    Offset for type 0 is -1
    Offset for type 1 is 108
    Offset for type 2 is -1
    Offset for type 3 is 3961
    Offset for type 4 is 701702
    Offset for type 5 is 702267
    Offset for type 6 is -1
    Offset for type 7 is 716918
    Offset for type 8 is -1
    Offset for type 9 is 717216
    Offset for type 10 is -1
    Offset for type 11 is -1
    Offset for type 12 is -1

So I obviously don't have all the necessary files. Would this affect accuracy when recognising single characters?

On Feb 11, 10:17 am, Chris wrote:
> Hi All,
>
> I'm using tesseract quite successfully in my code. I have a
> preprocessing step that locates the characters I need to recognise and
> then I feed them into tesseract using the PSM_SINGLE_CHAR mode.
>
> This works great with the default eng.traineddata
>
> I'm also constraining the tessedit_char_whitelist to just have numbers
> and upper case letters as that is the only thing I have in my
> character set.
>
> I want to reduce the size of my app and the traineddata is by far the
> largest chunk of data at the moment.
>
> What I've tried to do is retrain tesseract so that it only has the
> characters I need in the training data. I've done this successfully,
> but when I use my newly created eng.traineddata the accuracy is much
> worse than if I use the default eng.traineddata.
>
> Any ideas why this should be? I thought if anything that accuracy
> would improve if I'd removed all the unnecessary characters from the
> data.
>
> I'm doing my training by taking the box files and stripping out all
> the characters I don't need and then running through the training
> instructions.
>
> I'm using tesseract 3.01
>
> Any thoughts?
>
> Cheers
> Chris.
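
For anyone comparing their own combine output with the one above, here is a minimal Python sketch that lists which component types were left out. It assumes, as the output suggests, that an offset of -1 marks a component that is absent from the combined file. The helper names parse_offsets and missing_types are made up for this example and are not part of tesseract.

```python
import re

# Matches lines such as "Offset for type 3 is 3961" or "Offset for type 0 is -1".
OFFSET_RE = re.compile(r"Offset for type\s+(\d+)\s+is\s+(-?\d+)")


def parse_offsets(log_text):
    """Return {type_index: offset} for every offset line found in the log."""
    return {int(t): int(off) for t, off in OFFSET_RE.findall(log_text)}


def missing_types(offsets):
    """Return the sorted type indices whose offset is -1 (assumed to mean absent)."""
    return sorted(t for t, off in offsets.items() if off == -1)


# The combine output quoted in this post.
LOG = """\
Offset for type 0 is -1
Offset for type 1 is 108
Offset for type 2 is -1
Offset for type 3 is 3961
Offset for type 4 is 701702
Offset for type 5 is 702267
Offset for type 6 is -1
Offset for type 7 is 716918
Offset for type 8 is -1
Offset for type 9 is 717216
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
"""

offsets = parse_offsets(LOG)
present = sorted(t for t, off in offsets.items() if off != -1)
print("present:", present)                 # present: [1, 3, 4, 5, 7, 9]
print("missing:", missing_types(offsets))  # missing: [0, 2, 6, 8, 10, 11, 12]
```

The script only reports the numeric type indices. Which file each index corresponds to should be checked against the TessdataType enum in the tesseract source for the version in use, since I have not verified that mapping here.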

