For those who may have run into similar problems: increasing the font character spacing improved the accuracy considerably.
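In case it helps, here is roughly how the extra spacing could be applied on the image-generating side. This is only a sketch under assumptions: it supposes the images are rendered with Python and Pillow (the thread does not say what the generating app uses), and the names `glyph_positions` and `render_line`, the 8-pixel tracking value, and the font path are hypothetical. 12 pt at 300 dpi works out to 50 pixels, hence the default size.

```python
def glyph_positions(text, advance, tracking, left=0):
    """Return the x offset of each character when every glyph cell
    (of width `advance`) is widened by `tracking` extra pixels."""
    step = advance + tracking
    return [left + i * step for i in range(len(text))]


def render_line(text, font_path, out_path, size_px=50, tracking=8, margin=20):
    """Draw `text` one character at a time with extra spacing.

    Assumes a monospaced font (such as Consolas), so a single advance
    width is valid for every character. Requires Pillow (third-party).
    """
    from PIL import Image, ImageDraw, ImageFont

    font = ImageFont.truetype(font_path, size_px)
    advance = font.getlength("M")  # advance width of one monospaced cell
    xs = glyph_positions(text, advance, tracking, left=margin)

    width = int(xs[-1] + advance) + margin if xs else 2 * margin
    height = size_px * 2
    img = Image.new("L", (width, height), 255)  # white greyscale canvas
    draw = ImageDraw.Draw(img)
    for x, ch in zip(xs, text):
        draw.text((x, margin), ch, font=font, fill=0)
    img.save(out_path, dpi=(300, 300))


if __name__ == "__main__":
    # Hypothetical font path and output name; adjust for your system.
    render_line("O0cCwW", r"C:\Windows\Fonts\consola.ttf", "spaced.bmp")
```

The same spacing would presumably need to be used both for the training images and for the images to be recognized, so that the classifier sees consistent glyph separation.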
Marco

On Thursday, October 17, 2013 11:52:09 AM UTC+2, Marco wrote:
>
> Hi everybody,
>
> I am working on a project where I need to OCR images generated
> programmatically. IOW, I have one app that dumps (base64) text into images
> and another one that is supposed to recover the text from the images (long
> story...). As the default eng.traineddata failed to recognize some
> characters, I decided to train Tesseract with the Consolas font. The problem
> is that, no matter what I try, it keeps making the same errors, so I must
> be doing something wrong.
>
> For example: sometimes O is detected in place of 0, sometimes 0 in place
> of O. It also (sometimes!) reads c and w in place of C and W.
>
> Here is my script:
>
> tesseract.exe foo\foo.consolas.00000.bmp foo\foo.consolas.00000 nobatch box.train
> tesseract.exe foo\foo.consolas.00001.bmp foo\foo.consolas.00001 nobatch box.train
>
> unicharset_extractor.exe foo\foo.consolas.00000.box foo\foo.consolas.00001.box
>
> shapeclustering.exe -F font_properties -U unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> mftraining.exe -F font_properties -U unicharset -O foo.unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> cntraining.exe foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> copy shapetable foo.shapetable
> copy normproto foo.normproto
> copy inttemp foo.inttemp
> copy pffmtable foo.pffmtable
>
> combine_tessdata foo.
>
> Then I run tesseract using -lang foo.
>
> Notes:
>
> - I have checked the box files over and over and they *look* fine to me
>   (I use JBoxEdit).
> - All 64 characters were found.
> - Images are 300 dpi.
> - Font size is 12 (see image).
>
> What am I doing wrong?
>
> Thanks!!
>
> Marco
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.

