Hello Patrick,

Have you considered selling your OCR post-processing program(s) that
perform the spacing, character substitution, and other post-OCR
enhancements?

Which language and OS are these written for?

Jim




On Jul 6, 5:53 am, Patrick Questembert <[email protected]>
wrote:
> It's really a long list of approaches, including:
> - spacing: we don't trust any spacing determination by Tesseract; we
> reevaluate every space indicated by Tesseract for possible elimination and
> consider every pair of adjacent letters for a possible space insertion
> - obvious mistakes: this is by far the largest category of corrections we
> make. For example VV is usually corrected back to W - but there are hundreds
> more cases
> - ambiguous letters such as i versus l: surprisingly, Tesseract makes a ton
> of incongruous mistakes that lead me to believe there is no feature analysis
> whatsoever - for example a 'y' may get mapped to 'g', even though there is
> 0% chance of that based on a wide open gap on top. For these types of
> mistakes we go back to the source image to apply our own OCR of sorts.
> - dictionaries: another big disappointment - from our testing we found that
> Tesseract applies the dictionary in less than 5% of the cases where it
> should (i.e. where the letter mistake is one listed in the ambigs files,
> with the correct spelling in the user dictionary) so we implemented our own
> dictionaries
> - pattern matching: the regular expressions we use include wide tolerance
> for mistakes. Under the "protection" of a regular expression for a specific
> pattern we have the flexibility to include hundreds of ambiguities (because
> these trigger only when they help complete a match, which makes it more
> likely to be a valid substitution)
>
> Patrick
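
To make the "obvious mistakes", "dictionaries" and "pattern matching" items above concrete, here is a minimal Python sketch. It is not the ScanBizCards code. The confusion table, the word list, the phone-number format, the five-digit guard and all function names are assumptions chosen only for illustration. The first part applies an ambiguity only when exactly one candidate is a dictionary word. The second lets a regular expression accept digit look-alikes and repairs them only inside a completed match.

```python
import re

# Hypothetical confusion table: OCR fragment -> plausible intended text.
AMBIGUITIES = {"VV": ["W"], "vv": ["w"], "rn": ["m"], "l": ["i"], "1": ["l"]}

# Hypothetical user dictionary, lowercase.
DICTIONARY = {"william", "email", "main", "mobile"}


def variants(word, max_edits=2):
    """Return word plus every string reachable by up to max_edits substitutions."""
    seen = {word}
    frontier = [word]
    for _ in range(max_edits):
        next_frontier = []
        for current in frontier:
            for wrong, replacements in AMBIGUITIES.items():
                start = current.find(wrong)
                while start != -1:
                    for right in replacements:
                        candidate = current[:start] + right + current[start + len(wrong):]
                        if candidate not in seen:
                            seen.add(candidate)
                            next_frontier.append(candidate)
                    start = current.find(wrong, start + 1)
        frontier = next_frontier
    return seen


def correct_word(word):
    """Substitute only when exactly one variant is a dictionary word."""
    if word.lower() in DICTIONARY:
        return word
    hits = [v for v in variants(word) if v.lower() in DICTIONARY]
    return hits[0] if len(hits) == 1 else word


# Hypothetical look-alike table for digits, used only inside a pattern match.
DIGIT_LOOKALIKES = {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
TO_DIGIT = str.maketrans(DIGIT_LOOKALIKES)
_D = "[0-9" + "".join(DIGIT_LOOKALIKES) + "]"

# Assumed format: three groups of 3, 3 and 4 characters separated by - . or space.
PHONE = re.compile(rf"\b{_D}{{3}}[-. ]{_D}{{3}}[-. ]{_D}{{4}}\b")


def fix_phone_numbers(text):
    """Replace digit look-alikes, but only inside text that matches PHONE."""
    def repair(match):
        raw = match.group(0)
        # Assumption: require mostly real digits so ordinary words are left alone.
        if sum(ch.isdigit() for ch in raw) < 5:
            return raw
        return raw.translate(TO_DIGIT)
    return PHONE.sub(repair, text)
```

With these tables, correct_word("VVilliam") returns "William" and fix_phone_numbers("Tel: 4l5-55S-O123") returns "Tel: 415-555-0123", while a word such as "hello" is left untouched because none of its variants is in the dictionary.
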
>
>
>
> On Mon, Jul 4, 2011 at 12:56 AM, Andres <[email protected]> wrote:
> > Hello Patrick,
>
> > Could you expand a little on what you mean by Tesseract heuristics?
>
> > Thanks,
>
> > Andres
>
> > 2011/7/3 patrickq <[email protected]>
>
> >> The answer is (of course) "it depends":
> >> 1. If you compare Tesseract and ABBYY on the same image, without applying
> >> preprocessing to it, ABBYY wins (because Tesseract's image processing
> >> is very rudimentary at best). Of course, if your test images are
> >> produced (for example) by a flatbed scanner, the lack of image
> >> processing is not an issue; refer to case 2 below.
> >> 2. If you compare Tesseract and ABBYY on a clean (processed) image,
> >> without applying any post-Tesseract heuristic, ABBYY may have an
> >> advantage.
> >> 3. However, if you compare Tesseract + image processing + heuristics &
> >> corrections, Tesseract actually beats ABBYY hands down.
>
> >> ScanBizCards is case #3, built around Tesseract 3.01. If you want to test
> >> this combo please do this:
> >> - go to http://www.scanbizcards.com/webdemo
>
> >> - upload an image (under Batch Actions). Warning: ScanBizCards is
> >> geared towards recognizing text on business cards so it would be best
> >> if you tested on something *like* a business card (sparse text), not a
> >> full page with lots of text
> >> - click that image then "Image Editor" on top and OCR it
> >> - when done testing please delete the test images from this demo
> >> account (or get your own online account) ...
>
> >> You can also test instead on your Android or iPhone mobile device by
> >> installing the free version of ScanBizCards. ABBYY powers two iPhone
> >> apps made by German company - Business Card Reader (by Shape Services)
> >> and Card Reader (by xRoot Software) - and of course ABBYY's own
> >> iPhone / Android business card reader app.
>
> >> Patrick
>
> >> On Jul 3, 10:10 am, mw18888 <[email protected]> wrote:
> >> > Can anyone comment on the accuracy of Tesseract vs Abbyy?
>
> >> > Regards,
>
> >> > mw18888
>
> >> --
> >> You received this message because you are subscribed to the Google
> >> Groups "tesseract-ocr" group.
> >> To post to this group, send email to [email protected]
> >> To unsubscribe from this group, send email to
> >> [email protected]
> >> For more options, visit this group at
> >> http://groups.google.com/group/tesseract-ocr?hl=en
>
