I experimented this evening and am now pretty sure my user word list is being used. I also reduced the scanning resolution and ran tests at 300 and 400 DPI; there is a little improvement. Looking properly at the page I'm working on, the double-spaced text OCRs correctly, so Tesseract itself works fine. However, a lot of the letters in the proportional font are touching each other, and I think that's what is causing the problems.
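To make the touching-letter problem concrete, here is a minimal sketch in Python of the simplest way to cut a line of text into character cells: split wherever a pixel column contains no ink. This is an illustration only, not how Tesseract segments internally. The bitmaps and the function names (`column_ink`, `split_cells`) are made up for the example.

```python
def column_ink(bitmap):
    """Count ink pixels ('#') in each column of a bitmap given as equal-length strings."""
    width = len(bitmap[0])
    return [sum(row[x] == "#" for row in bitmap) for x in range(width)]


def split_cells(bitmap):
    """Return (start, end) column ranges, end exclusive, separated by blank columns."""
    cells = []
    start = None
    for x, ink in enumerate(column_ink(bitmap)):
        if ink and start is None:
            start = x                      # first inked column of a new cell
        elif not ink and start is not None:
            cells.append((start, x))       # blank column closes the current cell
            start = None
    if start is not None:                  # cell that runs to the right edge
        cells.append((start, len(bitmap[0])))
    return cells


# Hypothetical bitmaps: two glyphs with a gap, then the same two glyphs touching.
separated = [
    "###..###",
    "#.#..#..",
    "###..###",
    "#.#....#",
    "#.#..###",
]
touching = [
    "######",
    "#.##..",
    "######",
    "#.#..#",
    "#.####",
]

print(split_cells(separated))  # [(0, 3), (5, 8)]
print(split_cells(touching))   # [(0, 6)]
```

With a gap between the glyphs the split finds two cells; when they touch, the whole run comes back as one cell. Any per-character scheme would therefore need some extra rule for cutting wide runs, which this sketch does not attempt.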
Automatically subdividing each image into character cells and OCR'ing each character separately seems like the only way out of this. I am experimenting with makebox to define the boxes first.

Any better ideas?

Thanks,

Stuart

On Thursday, September 12, 2013 2:23:26 PM UTC-4, rkomar wrote:
>
> On Wed, 11 Sep 2013, Stuart wrote:
>
> > I'm trying to convert some old C code I only have
> > printouts of back to source. I expected to have to do a
> > little editing, but Tesseract is having serious problems.
> >
> > I scanned the images in at 800 DPI, it looks clean and I
> > tried some of the ImageMagick scripts to clean up, it looks
> > a bit cleaner on the screen but did not help the OCR
> > accuracy.
> >
> > Searches on this topic yield loads of references on how to
> > link tesseract libraries into your own C but nothing about
> > actually processing C code.
> >
> > I have tried adding user words for things like fprintf
> > etc... and common variable names in the code, but it does
> > not help (although I'm not entirely convinced I did it
> > right).
> >
> > Does anyone have any advice?
> >
> > Should it work OK? Maybe it's the proportionally spaced Times
> > Roman font it's in that's causing problems.
> >
> > Thanks,
> >
> > Stuart
>
> I suspect the problem is more with the dictionary checking
> phase than the character recognition. Since most of the
> C code wouldn't show up as valid entries in the default
> dictionary, it would end up being 'corrected' by
> tesseract. I'm not sure if you can disable that phase,
> but I think it would be worth looking into.
>
> Since the font is proportionally spaced, perhaps you could
> automatically subdivide each image into character cells
> and try to OCR each character separately. I don't know
> if it would work, but it might be worth a try.
>
> Rob Komar
>

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [email protected]
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en

