Hi,
I'm working on cataloguing about 20 years of journals and magazines,
down to article level where possible. My plan is to scan the Table of
Contents pages from each issue, OCR with Tesseract, then use text
processing software (a fancy way of saying "a Python script") to analyse
the text, find the article titles, and add the data to a MySQL database.
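For what it's worth, here is a rough sketch of the sort of script I have in mind. It assumes each contents line is an article title followed by dots and/or spaces and then a page number; that layout is an assumption, and the names TOC_LINE and parse_toc are made up for illustration. The MySQL step is left out.

```python
import re

# Assumed layout: "<title> <dots/spaces> <page number>" on one line.
# Lines that do not end in a 1-4 digit number are skipped.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>\d{1,4})$")


def parse_toc(text):
    """Return a list of (title, page) tuples for lines that look like TOC entries."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            title = match.group("title").strip()
            entries.append((title, int(match.group("page"))))
    return entries
```

Anything that happens to end in a number (a "Volume 12" header, say) would also match, so the output still needs a sanity check before it goes anywhere near the database.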
Tesseract 2.04 does pretty well for accuracy -- at worst, I get the
occasional full-stop turning into a hyphen/dash. All pretty simple to
fix. Problem is, Tess2.04 can't handle double-quotes -- instead it dies
with this error:
phil...@cheetah:~/$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
Aborted
If I use Tesseract 3 (the current SVN release), then I can OCR the page:
phil...@cheetah:~/$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/tesseract elek0002.tif elek0002_tess3
Tesseract Open Source OCR Engine with LibTiff
But the error rate is FAR worse. The page numbers on the right-hand side
of the page are completely gone, the first line is mush (random letters)
and upper-case "M" gets OCR'd as "l\/l" (usually when the page contains
a frequency, e.g. "89 MHz").
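The "M" substitution could probably be patched up in post-processing as a stopgap, though that does nothing for the missing page numbers. A minimal sketch, assuming the bad sequence is always exactly l, backslash, slash, l; SUBSTITUTIONS and clean_ocr_text are hypothetical names:

```python
import re

# Known OCR confusions as (pattern, replacement) pairs.
# Only the "l\/l" -> "M" case comes from the output I've seen so far.
SUBSTITUTIONS = [
    (re.compile(r"l\\/l"), "M"),
]


def clean_ocr_text(text):
    """Apply each known substitution to the OCR output in turn."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    return text
```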
The assertion failure seems to be a manifestation of Issue #265
(http://code.google.com/p/tesseract-ocr/issues/detail?id=265), which is
apparently "fixed in Tesseract 3". What I'd like is the recognition
accuracy of 2.04, with the stability of 3.0 (or at least the bugfix for
#265)...
Is there any way to get the accuracy back where it was with 2.04 (or at
least get the page numbers back)?
I've uploaded my test images here:
http://www.philpem.me.uk/temp/tesseract/
Both are greyscale TIFFs.
ELEK0001.TIF is a "works fine" example that OCRs almost perfectly in
Tess2.04 but has significant errors in Tess3.0-svn.
ELEK0002.TIF crashes Tess2.04, works in Tess3.0-svn, but has a lot of
errors (especially on the first line).
When processed with Tess3.0, the page numbers (right-hand column) are
omitted from the output .TXT file.
Thanks,
Phil.