For those who may have run into similar problems: increasing the font's 
character spacing improved the accuracy considerably.
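
In case it helps anyone compare runs before and after a change like this, here is a rough sketch of how the accuracy could be measured when the original text is known (as it is when the images are generated programmatically). The function name and the sample strings are made up for illustration, and it assumes the OCR output has the same length as the expected text, which will not hold if characters are dropped or inserted.

```python
from collections import Counter


def confusion_report(expected, recognized):
    """Compare OCR output with the known text, position by position.

    Returns (accuracy, confusions), where confusions counts each
    (expected_char, recognized_char) pair that did not match.
    """
    if len(expected) != len(recognized):
        # Assumption: no dropped or inserted characters. Align first otherwise.
        raise ValueError("lengths differ; align the strings before comparing")
    confusions = Counter(
        (exp, got) for exp, got in zip(expected, recognized) if exp != got
    )
    correct = len(expected) - sum(confusions.values())
    accuracy = correct / len(expected) if expected else 1.0
    return accuracy, confusions


# Hypothetical sample showing the O/0 and C/c mix-ups described in the post.
accuracy, confusions = confusion_report("O0CWcw0O", "00cWcwOO")
for (exp, got), count in confusions.most_common():
    print(f"expected {exp!r}, got {got!r}: {count}")
print(f"accuracy: {accuracy:.3f}")
```

Running the same report on output from the old and the new spacing should show whether pairs such as O/0 really stop being confused, rather than relying on eyeballing the result.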

Marco

On Thursday, October 17, 2013 11:52:09 AM UTC+2, Marco wrote:
>
> Hi everybody,
>
> I am working on a project where I need to OCR images generated 
> programmatically. IOW, I have one app that dumps (base64) text into images 
> and another one that is supposed to recover the text from the images (long 
> story...). As the default eng.traineddata failed to recognize some 
> characters, I decided to train Tesseract with the Consolas font. The problem 
> is that, no matter what I try, it keeps making the same errors, so I must 
> be doing something wrong. 
>
> For example: sometimes O is detected in place of 0, sometimes 0 in place 
> of O. It also (sometimes!) reads c and w in place of C and W. 
>
> Here is my script:
>
> tesseract.exe foo\foo.consolas.00000.bmp foo\foo.consolas.00000 nobatch box.train
> tesseract.exe foo\foo.consolas.00001.bmp foo\foo.consolas.00001 nobatch box.train
>
> unicharset_extractor.exe foo\foo.consolas.00000.box foo\foo.consolas.00001.box
>
> shapeclustering.exe -F font_properties -U unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> mftraining.exe -F font_properties -U unicharset -O foo.unicharset foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> cntraining.exe foo\foo.consolas.00000.tr foo\foo.consolas.00001.tr
>
> copy shapetable foo.shapetable
> copy normproto foo.normproto
> copy inttemp foo.inttemp
> copy pffmtable foo.pffmtable
>
> combine_tessdata foo.
>
> Then I run tesseract using -lang foo.
>
> Notes: 
>
> I have checked the box files over and over and they *look* fine to me (I 
> use JBoxEdit).
> All 64 characters were found.
> Images are 300 dpi.
> Font size is 12 (see image).
>
> What am I doing wrong?
>
> Thanks!!
>
> Marco
>
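
The "all 64 characters were found" check on the box files can also be done mechanically. Below is a rough sketch, not part of the original workflow. It assumes the usual Tesseract box file layout (one symbol per line, followed by left, bottom, right, top and page number) and the standard base64 alphabet ending in "+" and "/"; a URL-safe alphabet would need a different set. The function names and the sample data are made up for illustration.

```python
import string

# Assumption: standard base64 alphabet (A-Z, a-z, 0-9, "+", "/"), no "=" padding.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/")


def box_symbols(box_text):
    """Collect the symbols that appear in the first column of a box file."""
    symbols = set()
    for line in box_text.splitlines():
        fields = line.split()
        # A box line is: symbol left bottom right top [page]
        if len(fields) >= 5:
            symbols.add(fields[0])
    return symbols


def missing_symbols(box_text, alphabet=BASE64_ALPHABET):
    """Return the alphabet symbols that never occur in the box file."""
    return sorted(alphabet - box_symbols(box_text))


# Hypothetical three-line box file, just to show the call.
sample = "A 10 10 20 30 0\n0 22 10 32 30 0\nO 34 10 44 30 0\n"
missing = missing_symbols(sample)
print(len(missing), "symbols missing")
```

For real use, the text of each .box file would be read from disk and concatenated before calling missing_symbols.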

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

