Hello Patrick,

Have you considered selling your OCR post-processing program(s) that perform the spacing, character substitution, and other post-OCR enhancements? Which language and OS are they written for?

Jim

On Jul 6, 5:53 am, Patrick Questembert <[email protected]> wrote:
> It's really a long list of approaches, including:
> - spacing: we don't trust any spacing determination by Tesseract. We
>   reevaluate every space indicated by Tesseract for possible elimination
>   and consider every pair of adjacent letters for a possible space
>   insertion
> - obvious mistakes: this is by far the largest category of corrections we
>   make. For example VV is usually corrected back to W - but there are
>   hundreds more cases
> - ambiguous letters such as i versus l: surprisingly, Tesseract makes a ton
>   of incongruous mistakes that lead me to believe there is no feature
>   analysis whatsoever - for example a 'y' may get mapped to 'g', even
>   though there is 0% chance of that based on the wide open gap on top. For
>   these types of mistakes we go back to the source image to apply our own
>   OCR of sorts.
> - dictionaries: another big disappointment - from our testing we found that
>   Tesseract applies the dictionary in less than 5% of the cases where it
>   should (i.e. where the letter mistake is one listed in the ambigs files,
>   with the correct spelling in the user dictionary), so we implemented our
>   own dictionaries
> - pattern matching: the regular expressions we use include wide tolerance
>   for mistakes. Under the "protection" of a regular expression for a
>   specific pattern we have the flexibility to include hundreds of
>   ambiguities (because these trigger only when they help complete a match,
>   which makes it more likely to be a valid substitution)
>
> Patrick
>
> On Mon, Jul 4, 2011 at 12:56 AM, Andres <[email protected]> wrote:
> > Hello Patrick,
> >
> > Could you expand a little on what you mean by Tesseract heuristics?
> >
> > Thanks,
> >
> > Andres
> >
> > 2011/7/3 patrickq <[email protected]>
> >
> >> The answer is (of course) "it depends":
> >> 1. If you compare Tesseract and ABBYY on the same image, without
> >> applying preprocessing to it, ABBYY wins (because Tesseract's image
> >> processing is very rudimentary - at best). Of course if your test
> >> images are produced (for example) by a flatbed scanner, the lack of
> >> image processing is not an issue, so refer to case 2 below.
> >> 2. If you compare Tesseract and ABBYY on a clean (processed) image,
> >> without applying any post-Tesseract heuristic, ABBYY may have an
> >> advantage.
> >> 3. However, if you compare Tesseract + image processing + heuristics &
> >> corrections, Tesseract actually beats ABBYY hands down.
> >>
> >> ScanBizCards is case #3, built around Tesseract 3.01. If you want to
> >> test this combo please do this:
> >> - go to http://www.scanbizcards.com/webdemo
> >> - upload an image (under Batch Actions). Warning: ScanBizCards is
> >> geared towards recognizing text on business cards so it would be best
> >> if you tested on something *like* a business card (sparse text), not a
> >> full page with lots of text
> >> - click that image then "Image Editor" on top and OCR it
> >> - when done testing please delete the test images from this demo
> >> account (or get your own online account) ...
> >>
> >> You can also test instead on your Android or iPhone mobile device by
> >> installing the free version of ScanBizCards. ABBYY powers two iPhone
> >> apps made by German company - Business Card Reader (by Shape Services)
> >> and Card Reader (by xRoot Software) - and of course ABBYY's own
> >> iPhone / Android business card reader app.
> >>
> >> Patrick
> >>
> >> On Jul 3, 10:10 am, mw18888 <[email protected]> wrote:
> >> > Can anyone comment on the accuracy of Tesseract vs ABBYY?
> >> >
> >> > Regards,
> >> >
> >> > mw18888
> >>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "tesseract-ocr" group.
> >> To post to this group, send email to [email protected]
> >> To unsubscribe from this group, send email to
> >> [email protected]
> >> For more options, visit this group at
> >> http://groups.google.com/group/tesseract-ocr?hl=en
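
To make the "obvious mistakes" and "dictionaries" items in Patrick's list a bit more concrete, here is a minimal Python sketch of a dictionary-guarded ambiguity correction. It is not ScanBizCards code and does not use any Tesseract API. The ambiguity table, the function names (variants, correct) and the rule "accept a substitution only if exactly one variant is a dictionary word" are all assumptions made for illustration.

```python
# Hypothetical ambiguity table: fragment as emitted by OCR -> fragments it may
# really stand for. A real table would be far larger ("hundreds more cases").
AMBIGUITIES = {
    "VV": ["W"],
    "l": ["i", "1"],
    "1": ["l", "i"],
    "0": ["O"],
    "rn": ["m"],
}


def variants(word, max_edits=2):
    """Return word plus every string reachable by up to max_edits substitutions."""
    seen = {word}
    frontier = [word]
    for _ in range(max_edits):
        next_frontier = []
        for current in frontier:
            for wrong, rights in AMBIGUITIES.items():
                start = current.find(wrong)
                while start != -1:
                    for right in rights:
                        candidate = current[:start] + right + current[start + len(wrong):]
                        if candidate not in seen:
                            seen.add(candidate)
                            next_frontier.append(candidate)
                    start = current.find(wrong, start + 1)
        frontier = next_frontier
    return seen


def correct(word, dictionary, max_edits=2):
    """Replace word only when exactly one variant is a known dictionary word."""
    if word.lower() in dictionary:
        return word  # already a valid word, leave it alone
    matches = sorted(
        v for v in variants(word, max_edits)
        if v != word and v.lower() in dictionary
    )
    # Zero matches: nothing to fix. Several matches: too ambiguous to decide here.
    return matches[0] if len(matches) == 1 else word
```

With a lowercase word set such as {"west", "email"}, correct("VVest", words) and correct("emai1", words) are repaired, while a word with no dictionary variant, or with more than one, is returned unchanged. Deciding between several candidates would need extra evidence, for example going back to the source image as Patrick describes.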

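The "pattern matching" item can be sketched in the same spirit: a regular expression that tolerates letter-for-digit confusions, with the substitutions applied only inside text that completes a match. The phone-number format (3-3-4 digits), the lookalike table and the name fix_phone_numbers are assumptions for illustration, not the patterns ScanBizCards actually uses.

```python
import re

# Hypothetical lookalike table: letters OCR may emit in place of digits.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}

# One "digit" position that also accepts the lookalike letters.
_D = "[0-9" + "".join(DIGIT_LOOKALIKES) + "]"

# Assumed pattern: a 3-3-4 phone number with '-', '.' or ' ' as separators,
# not embedded in a longer word.
PHONE = re.compile(rf"(?<!\w){_D}{{3}}[-. ]{_D}{{3}}[-. ]{_D}{{4}}(?!\w)")


def fix_phone_numbers(text):
    """Map lookalike letters back to digits, but only inside a phone-like match."""

    def repair(match):
        return "".join(DIGIT_LOOKALIKES.get(ch, ch) for ch in match.group(0))

    return PHONE.sub(repair, text)
```

For example, fix_phone_numbers("Tel: 4l5-55S-O1B2") gives "Tel: 415-555-0182", while the same letters in ordinary words ("Bill Sol") are left untouched because they do not complete the pattern. A tolerant class like this can still match a run of plain letters that happens to have the right shape, so a real rule set would likely add further checks, such as requiring a minimum number of true digits.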
