Re: OCR of C code

Stuart Thu, 12 Sep 2013 18:54:30 -0700

I experimented this evening and am now pretty sure my user word list is 
being used. I also reduced the scanning resolution and ran tests at 300 and 
400 dpi; there is a little improvement. Looking properly at the page I'm 
working on, the double-spaced text OCRs correctly, so Tesseract works 
fine there. However, a lot of the letters in the proportional font are 
touching each other, and I think that's what is causing the problems.


Automatically subdividing each image into character cells and OCR'ing each 
character separately seems like the only way out of this. I am experimenting 
with makebox to define the boxes first. 
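
For what it's worth, here is a rough sketch of the kind of cell splitting I
have in mind. It is not Tesseract's own segmentation and not what makebox
does; it is a plain vertical projection profile over one text line that has
already been binarised (1 = ink, 0 = paper). The function name split_columns
and the max_ink parameter are made up for illustration. Columns with at most
max_ink dark pixels count as gaps, so a small positive value may separate
glyphs that only touch by a thin bridge, at the risk of cutting through thin
strokes.

```python
def split_columns(rows, max_ink=0):
    """Return (start, end) column ranges, end exclusive, one per inked run.

    rows is one binarised text line: a list of equal-length lists of 0/1.
    max_ink is a hypothetical tuning knob: columns with this many dark
    pixels or fewer are treated as gaps between characters.
    """
    if not rows:
        return []
    width = len(rows[0])
    # Vertical projection: number of dark pixels in each column.
    profile = [sum(row[x] for row in rows) for x in range(width)]
    cells = []
    start = None
    for x, ink in enumerate(profile):
        if ink > max_ink and start is None:
            start = x                      # a character cell begins here
        elif ink <= max_ink and start is not None:
            cells.append((start, x))       # the cell ended at the gap
            start = None
    if start is not None:
        cells.append((start, width))       # cell runs to the right edge
    return cells


# Three well separated glyphs.
line = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0],
]
print(split_columns(line))

# Two glyphs joined by a one-pixel bridge in the middle row.
touching = [
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
]
print(split_columns(touching))             # one merged cell
print(split_columns(touching, max_ink=1))  # split at the bridge
```

Each (start, end) range could then be cropped out and fed to the OCR as a
single character. Glyphs that overlap heavily will still come out as one cell.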

Any better ideas?

Thanks,

Stuart

On Thursday, September 12, 2013 2:23:26 PM UTC-4, rkomar wrote:
>
> On Wed, 11 Sep 2013, Stuart wrote: 
>
> > I'm trying to convert some old C code I only have 
> > printouts of back to source. I expected to have to do a 
> > little editing, but Tesseract is having serious problems. 
> > 
> > I scanned the images in at 800 DPI and they look clean. I 
> > tried some of the ImageMagick scripts to clean them up; they 
> > look a bit cleaner on the screen, but it did not help the OCR 
> > accuracy. 
> > 
> > Searches on this topic yield loads of references on how to 
> > link tesseract libraries into your own C but nothing about 
> > actually processing C code. 
> > 
> > I have tried adding user words for things like fprintf 
> > etc... and common variable names in the code, but it does 
> > not help (although I'm not entirely convinced I did it 
> > right). 
> > 
> > Does anyone have any advice? 
> > 
> > Should it work OK? Maybe it's the proportionally spaced Times 
> > Roman font it's in that's causing problems. 
> > 
> > Thanks, 
> > 
> > Stuart 
>
> I suspect the problem is more with the dictionary checking 
> phase than the character recognition.  Since most of the 
> C code wouldn't show up as valid entries in the default 
> dictionary, it would end up being 'corrected' by 
> tesseract.  I'm not sure if you can disable that phase, 
> but I think it would be worth looking into. 
>
> Since the font is proportionally spaced, perhaps you could 
> automatically subdivide each image into character cells 
> and try to OCR each character separately.  I don't know 
> if it would work, but it might be worth a try. 
>
> Rob Komar 
>
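
On the dictionary phase mentioned above: Tesseract has two boolean parameters,
load_system_dawg and load_freq_dawg, that control whether the main and
frequent-word dictionaries are loaded. The sketch below only builds a command
line that turns both off; it does not run anything. The helper name
build_command and the file names are made up for illustration, and whether
the -c option takes effect for these parameters depends on the Tesseract
version (older builds may need them in a config file instead), so treat this
as something to try rather than a confirmed fix.

```python
def build_command(image_path, output_base, use_dictionary=False):
    """Build a tesseract command line as a list of arguments.

    build_command is a hypothetical helper. load_system_dawg and
    load_freq_dawg are Tesseract parameters; passing them via -c is an
    assumption that may not hold for every Tesseract version.
    """
    cmd = ["tesseract", image_path, output_base]
    if not use_dictionary:
        # Ask Tesseract not to load its word lists, so C identifiers
        # are less likely to be "corrected" into dictionary words.
        cmd += ["-c", "load_system_dawg=0", "-c", "load_freq_dawg=0"]
    return cmd


# Example file names, chosen for illustration only.
print(" ".join(build_command("page1.tif", "page1")))
```

The resulting list can be passed to subprocess.run once the options have been
checked against the installed version.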

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en