Hi,
I'm currently cataloguing about 20 years' worth of electronics
magazines, books and journals, down to article level. Typing in the
article names, page numbers and synopses by hand isn't an option: for a
start it'd make my hands hurt (a lot!) and take a very long time.
We're talking on the order of 200 issues per journal across four separate
journals, plus about two dozen books, most of which aren't on Amazon (so
I can't just copy-paste the TOC from there).
Even the publisher of the journals doesn't have a full catalogue; from
what I've been told theirs only goes back to 1990.
I figure a better way to handle this is to scan the table of contents
from each issue, then OCR it with Tesseract, and use a Python script to
extract the page numbers and insert them into a MySQL database.
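A minimal sketch of the extraction step described above, assuming each
table-of-contents entry comes out of the OCR as a single line of the form
"title, optional dot leaders, page number". The names parse_toc_line and
parse_toc are hypothetical, the line layout is an assumption about the
scans rather than something checked against them, and the database insert
is left out:

```python
import re

# Assumed layout of one OCR'd TOC entry: "<title> [dot leaders] <page>".
# The title must end in a character that is neither whitespace nor a dot,
# so any run of leader dots and spaces lands in the separator group.
TOC_LINE = re.compile(r"^(?P<title>.*?[^\s.])[\s.]+(?P<page>\d{1,4})\s*$")


def parse_toc_line(line):
    """Return (title, page) for one entry line, or None if it doesn't match."""
    m = TOC_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("title"), int(m.group("page"))


def parse_toc(text):
    """Parse a whole OCR'd page and keep only lines that look like entries."""
    entries = []
    for line in text.splitlines():
        parsed = parse_toc_line(line)
        if parsed is not None:
            entries.append(parsed)
    return entries
```

The resulting (title, page) tuples could then be handed to whichever MySQL
driver is in use as parameters of an INSERT statement. Lines with no
trailing number are silently skipped, so this only helps if the OCR output
retains the page numbers.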
Problem is, Tesseract 2.04 doesn't like quoted text:
phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID
UNICHARSET::unichar_to_id(const char*, int) const: Assertion
`ids.contains(unichar_repr, length)' failed.
Aborted
This happens on any page which contains a double-quoted string (e.g. the
string "typing is easy" will crash Tesseract like this). If I OCR the
pages with Tesseract 3 (the current SVN HEAD) then Tesseract doesn't
crash, but the error rate is much worse. For example:
- Page numbers are missing from the table entirely. It seems any line
that ends with a number will have those trailing numbers stripped.
- Upper-case "M" characters tend to get decoded incorrectly, most often
in frequencies: "146-76 MHz" is decoded correctly by Tesseract 2.04, but
Tesseract 3.0 decodes it as "146-76 l\/lHz". That's fixable with a
regexp, but still a bit of a pain. This happens in ELEK0001.TIF.
- The top line of text sometimes gets garbled (as in, read as random
characters). This only seems to happen on ELEK0002.TIF.
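The regexp workaround for the misread "M" could look roughly like this. It
is only a sketch: fix_broken_m is a hypothetical helper name, and it
assumes the bad output is always exactly the four characters l \ / l, as
in the "146-76 l\/lHz" example above:

```python
import re

# Matches the literal sequence: l, backslash, forward slash, l.
BROKEN_M = re.compile(r"l\\/l")


def fix_broken_m(text):
    """Replace every misread 'l\\/l' sequence with an upper-case 'M'."""
    return BROKEN_M.sub("M", text)


print(fix_broken_m("146-76 l\\/lHz"))  # prints: 146-76 MHz
```

This would also rewrite any genuine occurrence of that character sequence,
which seems unlikely in a table of contents but is worth keeping in mind.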
It seems the crash is related to Issue #265:
http://code.google.com/p/tesseract-ocr/issues/detail?id=265
Unfortunately this Issue entry doesn't list the changeset/commit which
fixes the bug, so backporting will be somewhat difficult (I need to find
the fix, then work on backporting it).
Is there some way I might be able to improve Tesseract 3's recognition
accuracy, or does someone have a patch to fix the unicharset.cpp
assertion failure bug in Tesseract 2.04?
At the very least, I need the page numbers to be retained intact in the
.TXT file, like they are in Tesseract 2.04.
I've uploaded my sample images to:
http://www.philpem.me.uk/temp/tesseract
ELEK0001 works in both Tesseract 2.04 and Tesseract 3.0 (but seems to
have more OCR errors in the latter).
ELEK0002 only works in Tesseract 3.0.
Thanks,
Phil.