Keep in mind that accuracy depends heavily on the right fonts being included in the training set. I have no reason to believe that the 2.04 and 3.0 training sets are identical - perhaps someone could enlighten us. In any case, I routinely come across pages where recognition is terrible and where there is no doubt that the cause is a missing font.
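Until the training question is settled, one stopgap is to patch the systematic misreads in the Python post-processing step Phil already plans to use. What follows is only a rough sketch: it assumes the one consistent substitution is the "l\/l" for upper-case "M" reported in the quoted message, and the table and function names are made up for illustration. It does nothing for the missing page numbers, which never reach the output file in the first place.

```python
# Hypothetical table of (misread, correction) pairs. The single entry comes
# from the "l\/l" for "M" error reported in this thread; add your own as
# you find them.
KNOWN_MISREADS = [
    ("l\\/l", "M"),  # the four characters l \ / l, seen in place of "M"
]


def fix_known_misreads(text):
    """Return text with each known misread replaced by its correction."""
    for wrong, right in KNOWN_MISREADS:
        text = text.replace(wrong, right)
    return text


if __name__ == "__main__":
    # The backslash is doubled because this is an ordinary Python string.
    print(fix_known_misreads("89 l\\/lHz"))
```

A plain string replacement is used rather than a regular expression so that the backslash needs no pattern escaping. If "l\/l" can legitimately occur in your text, this will of course corrupt it.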
On Jul 26, 1:55 pm, Philip Pemberton <[email protected]> wrote:
> Hi,
> I'm working on cataloguing about 20 years of journals and magazines,
> down to article level where possible. My plan is to scan the Table of
> Contents pages from each issue, OCR with Tesseract, then use text
> processing software (a fancy way of saying "a Python script") to analyse
> the text, find the article titles, and add the data to a MySQL database.
>
> Tesseract 2.04 does pretty well for accuracy -- at worst, I get the
> occasional full-stop turning into a hyphen/dash. All pretty simple to
> fix. Problem is, Tess2.04 can't handle double-quotes -- instead it dies
> with this error:
>
> phil...@cheetah:~/$ tesseract elek0002.tif elek0002_tess2
> Tesseract Open Source OCR Engine
> tesseract: unicharset.cpp:76: const UNICHAR_ID
> UNICHARSET::unichar_to_id(const char*, int) const: Assertion
> `ids.contains(unichar_repr, length)' failed.
> Aborted
>
> If I use Tesseract 3 (the current SVN release), then I can OCR the page:
>
> phil...@cheetah:~/$ LD_LIBRARY_PATH=/tmp/tess/lib
> /tmp/tess/bin/tesseract elek0002.tif elek0002_tess3
> Tesseract Open Source OCR Engine with LibTiff
>
> But the error rate is FAR worse. The page numbers on the right-hand side
> of the page are completely gone, the first line is mush (random letters)
> and upper-case "M" gets OCR'd as "l\/l" (usually when the page contains
> a frequency, e.g. "89 MHz").
>
> The assertion failure seems to be a manifestation of Issue #265
> (http://code.google.com/p/tesseract-ocr/issues/detail?id=265), which is
> apparently "fixed in Tesseract 3". What I'd like is the recognition
> accuracy of 2.04, with the stability of 3.0 (or at least the bugfix for
> #265)...
>
> Is there any way to get the accuracy back where it was with 2.04 (or at
> least get the page numbers back)?
>
> I've uploaded my test images here:
> http://www.philpem.me.uk/temp/tesseract/
>
> Both are greyscale TIFFs.
>
> ELEK0001.TIF is a "works fine" example that OCRs almost perfectly in
> Tess2.04 but has significant errors in Tess3.0-svn.
>
> ELEK0002.TIF crashes Tess2.04, works in Tess3.0-svn, but has a lot of
> errors (especially on the first line).
>
> When processed with Tess3.0, the page numbers (right-hand column) are
> omitted from the output .TXT file.
>
> Thanks,
> Phil.

