Accuracy worse on 3.0-svn than 2.04?

Philip Pemberton Mon, 26 Jul 2010 19:49:16 -0700

Hi,

I'm working on cataloguing about 20 years of journals and magazines, down to article level where possible. My plan is to scan the Table of Contents pages from each issue, OCR with Tesseract, then use text processing software (a fancy way of saying "a Python script") to analyse the text, find the article titles, and add the data to a MySQL database.
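A minimal sketch of the text-processing step, for illustration only. It assumes each contents entry sits on one line as a title followed by dot leaders or spaces and then a page number. That layout is a guess, not taken from the scans. The name `parse_toc` is hypothetical, and the database step is left out.

```python
import re

# Assumed layout: "<title> <dots/spaces/dashes> <page number>" on one line.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.\-]+(?P<page>\d{1,4})$")

def parse_toc(text):
    """Return (title, page) tuples for lines that look like contents entries."""
    entries = []
    for raw in text.splitlines():
        match = TOC_LINE.match(raw.strip())
        if match:
            entries.append((match.group("title"), int(match.group("page"))))
    return entries
```

Lines that do not end in a page number (headings, stray OCR noise) are skipped, so entries whose page numbers are missing from the OCR output would be lost at this stage.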

Tesseract 2.04 does pretty well for accuracy -- at worst, I get the occasional full-stop turning into a hyphen/dash. All pretty simple to fix. Problem is, Tess2.04 can't handle double-quotes -- instead it dies with this error:


phil...@cheetah:~/$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine

tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.

Aborted

If I use Tesseract 3 (the current SVN release), then I can OCR the page:

phil...@cheetah:~/$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/tesseract elek0002.tif elek0002_tess3

Tesseract Open Source OCR Engine with LibTiff

But the error rate is FAR worse. The page numbers on the right-hand side of the page are completely gone, the first line is mush (random letters) and upper-case "M" gets OCR'd as "l\/l" (usually when the page contains a frequency, e.g. "89 MHz").
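As a stopgap for the "M" problem only, the misread sequence can be patched up after OCR. This is a sketch of a post-processing workaround, not a fix for the recognition itself, and it does nothing for the missing page numbers or the garbled first line. The function name `fix_m_glyph` is hypothetical.

```python
def fix_m_glyph(text):
    """Replace the literal sequence l\\/l (a misread capital M) with M."""
    return text.replace("l\\/l", "M")
```

For example, `fix_m_glyph("89 l\\/lHz")` gives `"89 MHz"`.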


The assertion failure seems to be a manifestation of Issue #265 (http://code.google.com/p/tesseract-ocr/issues/detail?id=265), which is apparently "fixed in Tesseract 3". What I'd like is the recognition accuracy of 2.04, with the stability of 3.0 (or at least the bugfix for #265)...

Is there any way to get the accuracy back where it was with 2.04 (or at least get the page numbers back)?


I've uploaded my test images here:
  http://www.philpem.me.uk/temp/tesseract/

Both are greyscale TIFFs.

ELEK0001.TIF is a "works fine" example that OCRs almost perfectly in Tess2.04 but has significant errors in Tess3.0-svn.

ELEK0002.TIF crashes Tess2.04, works in Tess3.0-svn, but has a lot of errors (especially on the first line).

When processed with Tess3.0, the page numbers (right-hand column) are omitted from the output .TXT file.


Thanks,
Phil.
