Re: OCR of C code

Robert Komar Thu, 12 Sep 2013 11:36:23 -0700

On Wed, 11 Sep 2013, Stuart wrote:

I'm trying to convert some old C code I only have
printouts of back to source. I expected to have to do a
little editing, but Tesseract is having serious problems.


I scanned the images in at 800 DPI and they look clean. I
tried some of the ImageMagick scripts to clean them up; the
result looks a bit cleaner on screen but did not help the
OCR accuracy.

Searches on this topic yield loads of references on how to
link the Tesseract libraries into your own C code, but
nothing about actually processing C code.

I have tried adding user words for things like fprintf
and for common variable names in the code, but it does
not help (although I'm not entirely convinced I did it
right).

Does anyone have any advice?

Should it work OK? Maybe it's the proportionally spaced
Times Roman font it's in that's causing problems.

Thanks,

Stuart


I suspect the problem is more with the dictionary checking
phase than the character recognition.  Since most of the
C code wouldn't show up as valid entries in the default
dictionary, it would end up being 'corrected' by
tesseract.  I'm not sure if you can disable that phase,
but I think it would be worth looking into.
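
As a starting point for that experiment, here is a minimal sketch
in Python that runs the tesseract command-line program with its
dictionary word lists switched off. It assumes a build that accepts
"-c name=value" and the config variables load_system_dawg and
load_freq_dawg; on some versions these may only take effect when
placed in a config file instead. The file names are hypothetical.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    """Build a tesseract command line with the dictionary word lists disabled."""
    return [
        "tesseract",
        image_path,
        output_base,
        # Assumption: do not load the main system dictionary.
        "-c", "load_system_dawg=0",
        # Assumption: do not load the frequent-words dictionary.
        "-c", "load_freq_dawg=0",
    ]


def run_ocr(image_path, output_base):
    """Run tesseract; the recognised text is written to <output_base>.txt."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
```

Calling run_ocr("page1.png", "page1") would then write page1.txt.
Comparing that output with a default run should show whether the
dictionary phase is really what is mangling the identifiers.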

Since the font is proportionally spaced, perhaps you could
automatically subdivide each image into character cells
and try to OCR each character separately.  I don't know
if it would work, but it might be worth a try.
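
One way to try the subdivision is sketched below in Python. Because
the cells in a proportional font are not all the same width, this
sketch does not use a fixed grid; it swaps in a simple column
projection, splitting a text line wherever a completely blank
column separates two runs of ink. The input format (a list of rows
of 0/1 pixels, 1 meaning dark) and the function names are
assumptions made for illustration, and touching or kerned letters
would not be separated by this method.

```python
def find_character_spans(bitmap):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    if not bitmap:
        return []
    width = len(bitmap[0])
    # A column counts as inked if any row has a dark pixel (1) in it.
    ink = [any(row[x] for row in bitmap) for x in range(width)]
    spans = []
    start = None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x                  # a new run of ink begins
        elif not has_ink and start is not None:
            spans.append((start, x))   # a blank column ends the run
            start = None
    if start is not None:
        spans.append((start, width))   # run reaches the right edge
    return spans


def crop(bitmap, span):
    """Cut one character cell out of the line bitmap."""
    start, end = span
    return [row[start:end] for row in bitmap]
```

Each cropped cell could then be saved as its own image and passed
to tesseract in its single-character page segmentation mode, if the
installed version provides one.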

Rob Komar

--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
