On Wed, 11 Sep 2013, Stuart wrote:
I'm trying to convert some old C code I only have
printouts of back to source. I expected to have to do a
little editing, but Tesseract is having serious problems.
I scanned the pages at 800 DPI and they look clean. I
tried some of the ImageMagick scripts to clean them up; the
result looks a bit cleaner on screen but did not help the
OCR accuracy.
Searches on this topic yield loads of references on how to
link the Tesseract libraries into your own C code, but
nothing about actually OCRing C source code.
I have tried adding user words for things like fprintf
and common variable names in the code, but it does not
help (although I'm not entirely convinced I did it
right).
Does anyone have any advice?
Should this work OK? Maybe the proportionally spaced Times
Roman font it's printed in is causing the problems.
Thanks,
Stuart
I suspect the problem is more with the dictionary checking
phase than the character recognition. Since most C
keywords and identifiers wouldn't show up as valid entries
in the default dictionary, they would end up being
'corrected' by Tesseract. I'm not sure if you can disable
that phase, but I think it would be worth looking into.
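As a starting point for that experiment, here is a minimal sketch in Python. It assumes your Tesseract build honours the load_system_dawg and load_freq_dawg parameters in a config file, and that a config file can be passed by path as a trailing command-line argument (on some installs it may need to be copied into tessdata/configs instead). The helper names and the file name "nodict" are made up for illustration, and I have not verified that this improves accuracy on source code.

```python
from pathlib import Path


def write_no_dict_config(directory):
    """Write a Tesseract config file that asks it not to load its word lists.

    Assumption: load_system_dawg and load_freq_dawg are the parameters that
    control the main and frequent-word dictionaries in your Tesseract version.
    """
    config_path = Path(directory) / "nodict"  # hypothetical file name
    config_path.write_text(
        "load_system_dawg F\n"
        "load_freq_dawg F\n"
    )
    return config_path


def build_command(image, output_base, config_path):
    """Build the argument list: tesseract <image> <outputbase> <configfile>."""
    return ["tesseract", str(image), str(output_base), str(config_path)]
```

The list returned by build_command can be handed to subprocess.run once Tesseract is installed. Comparing the output with and without the config file on the same page should show whether the dictionary phase is really the culprit.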
Since the font is proportionally spaced, perhaps you could
automatically subdivide each image into character cells
and try to OCR each character separately. I don't know
if it would work, but it might be worth a try.
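One way to try the cell idea is sketched below in Python. Because a proportional font has no fixed cell width, this sketch splits a text line at blank pixel columns instead of at a fixed pitch; that substitution is my assumption, not something Tesseract does for you. The input is a hypothetical binarized text line given as rows of 0 (background) and 1 (ink); loading and thresholding the scan is left out.

```python
def split_into_cells(bitmap):
    """Return (start, end) column ranges that contain ink; end is exclusive.

    bitmap is a list of equal-length rows of 0/1 pixels for one text line.
    Columns with no ink in any row are treated as gaps between characters.
    """
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column has ink if any row has a nonzero pixel at that position.
    has_ink = [any(row[x] for row in bitmap) for x in range(width)]

    cells = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a run of ink columns begins
        elif not ink and start is not None:
            cells.append((start, x))     # the run ended at a blank column
            start = None
    if start is not None:
        cells.append((start, width))     # run reaches the right edge
    return cells


# Tiny illustrative line: three ink runs separated by blank columns.
line = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
]
cells = split_into_cells(line)
```

Each range could then be cropped out and passed to Tesseract in single-character mode (page segmentation mode 10). Glyphs that touch or overlap horizontally, which is common in Times Roman, will come out as one cell, so expect to tune or post-process the splits.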
Rob Komar
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en