Hi,
I'm working on cataloguing about 20 years of journals and magazines, down to article level where possible. My plan is to scan the Table of Contents pages from each issue, OCR with Tesseract, then use text processing software (a fancy way of saying "a Python script") to analyse the text, find the article titles, and add the data to a MySQL database.
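
For illustration, the text-processing step could look something like the sketch below. It is only a rough outline under assumptions that are not taken from the real scans: each contents line is assumed to be a title, then dot leaders or spaces, then a page number at the right-hand edge. The names parse_toc_line and parse_toc are hypothetical. Hyphens are accepted as leader characters because full stops sometimes come out of the OCR as dashes.

```python
import re

# Assumed layout (hypothetical): "<title> <dot leaders or spaces> <page number>".
# Leaders may be full stops or, after OCR errors, hyphens.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.\-]*\s(?P<page>\d{1,4})\s*$")


def parse_toc_line(line):
    """Return (title, page) for one contents-page line, or None if it does not match."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


def parse_toc(text):
    """Parse OCR output text into a list of (title, page) tuples, skipping other lines."""
    entries = []
    for line in text.splitlines():
        entry = parse_toc_line(line)
        if entry is not None:
            entries.append(entry)
    return entries
```

The resulting (title, page) tuples could then be handed to whatever code inserts rows into the MySQL database. Lines without a trailing page number (headings, blank lines) are simply skipped.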

Tesseract 2.04 does pretty well for accuracy -- at worst, I get the occasional full-stop turning into a hyphen/dash. All pretty simple to fix. Problem is, Tess2.04 can't handle double-quotes -- instead it dies with this error:

phil...@cheetah:~/$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
Aborted

If I use Tesseract 3 (the current SVN release), then I can OCR the page:

phil...@cheetah:~/$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/tesseract elek0002.tif elek0002_tess3
Tesseract Open Source OCR Engine with LibTiff

But the error rate is FAR worse: the page numbers on the right-hand side of the page are completely gone, the first line is mush (random letters), and upper-case "M" gets OCR'd as "l\/l" (usually when the page contains a frequency, e.g. "89 MHz").
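
As a stopgap, the split-up "M" could be patched after OCR. The sketch below is a workaround rather than a fix for the recognition itself, and it does nothing for the missing page numbers. The function name fix_split_m is hypothetical, and the extra stroke characters it accepts (I, |, 1) are assumed variants; only "l\/l" has actually been seen in the output.

```python
import re

# Matches the misread described above: two upright strokes around "\/".
# The extra stroke characters (I, |, 1) are assumed variants, not observed ones.
SPLIT_M = re.compile(r"[lI|1]\\/[lI|1]")


def fix_split_m(text):
    """Replace the split-up M pattern with a real upper-case M."""
    return SPLIT_M.sub("M", text)
```

Running this over the Tesseract 3 output before parsing would turn "89 l\/lHz" back into "89 MHz" while leaving ordinary text untouched.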

The assertion failure seems to be a manifestation of Issue #265 (http://code.google.com/p/tesseract-ocr/issues/detail?id=265), which is apparently "fixed in Tesseract 3". What I'd like is the recognition accuracy of 2.04 with the stability of 3.0 (or at least the bugfix for #265).

Is there any way to get the accuracy back where it was with 2.04 (or at least get the page numbers back)?

I've uploaded my test images here:
  http://www.philpem.me.uk/temp/tesseract/

Both are greyscale TIFFs.

ELEK0001.TIF is a "works fine" example that OCRs almost perfectly in Tess2.04 but has significant errors in Tess3.0-svn.

ELEK0002.TIF crashes Tess2.04, works in Tess3.0-svn, but has a lot of errors (especially on the first line).

When processed with Tess3.0, the page numbers (right-hand column) are omitted from the output .TXT file.

Thanks,
Phil.
