Hi,

I'm currently working on cataloguing about 20 years' worth of electronics magazines, books and journals, down to article level. Typing in the article names, page numbers and synopses by hand isn't an option -- for a start it'd make my hands hurt (a lot!) and take a very long time. We're talking on the order of 200 issues per journal, and four separate journals, plus about two dozen books, most of which aren't on Amazon (so I can't just copy-paste the TOC from there).

Even the publisher of the journals doesn't have a full catalogue; from what I've been told theirs only goes back to 1990.


I figure a better way to handle this is to scan the table of contents from each issue, then OCR it with Tesseract, and use a Python script to extract the page numbers and insert them into a MySQL database.
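To make the extraction step concrete, here is a rough sketch of what I have in mind. It assumes a TOC layout I haven't confirmed for every issue: a title, optional dot leaders or spaces, then a page number at the end of the line. The names TOC_LINE and parse_toc are just placeholders, and the database insert is left out.

```python
import re

# Assumed (hypothetical) TOC layout: title, optional dot leaders/spaces,
# then a 1-4 digit page number at the end of the line.
TOC_LINE = re.compile(r"^(?P<title>.+?)[\s.]*\s(?P<page>\d{1,4})$")


def parse_toc(text):
    """Return a list of (title, page) tuples from OCR'd TOC text."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            # Lines without a trailing page number are skipped.
            entries.append((match.group("title").strip(), int(match.group("page"))))
    return entries
```

Each (title, page) tuple would then become one row in the database. Note that this only works if the OCR output keeps the trailing page numbers.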


Problem is, Tesseract 2.04 doesn't like quoted text:

phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
Aborted

This happens on any page that contains a double-quoted string (e.g. the string "typing is easy" will crash Tesseract like this). If I OCR the pages with Tesseract 3 (the current SVN HEAD) then Tesseract doesn't crash, but the error rate is much higher. For example:

- Page numbers are missing from the table entirely. It seems any line that ends with a number will have those trailing numbers stripped.

- Upper-case "M" characters tend to get decoded incorrectly; this happens most often with frequencies -- e.g. "146-76 MHz" is decoded correctly by Tess2.04, but Tess3.0 decodes it as "146-76 l\/lHz". Fixable with a regexp, but still a bit of a pain. This happens in ELEK0001.TIF.

- The top line of text sometimes gets garbled (as in, read as random characters). This only seems to happen on ELEK0002.TIF.
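
The regexp fix for the "M" problem could look something like the sketch below. It assumes the misread always comes out as the exact four-character sequence seen in the ELEK0001 output (lower-case l, backslash, slash, lower-case l); fix_misread_m is a made-up name, and other misreadings of "M" would need their own patterns.

```python
import re

# Matches the literal sequence l \ / l that Tesseract 3 emitted in place
# of "M" (assumption: this is the only form the misread takes).
MISREAD_M = re.compile(r"l\\/l")


def fix_misread_m(text):
    """Replace the garbled 'l\\/l' sequence with an upper-case M."""
    return MISREAD_M.sub("M", text)
```

This does nothing about the stripped page numbers or the garbled top line; those are lost before any post-processing can see them.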


It seems the crash is related to Issue #265:
  http://code.google.com/p/tesseract-ocr/issues/detail?id=265
Unfortunately this Issue entry doesn't list the changeset/commit which fixes the bug, so backporting will be somewhat difficult (I need to find the fix, then work on backporting it).

Is there some way I might be able to improve Tesseract3's recognition accuracy, or does someone have a patch to fix the unicharset.cpp assertion failure bug in Tesseract 2.04? At the very least, I need the page numbers to be retained intact in the .TXT file, like they are in Tesseract 2.04.

I've uploaded my sample images to:
  http://www.philpem.me.uk/temp/tesseract

ELEK0001 works in both Tesseract 2.04 and Tesseract 3.0 (but seems to have more OCR errors in the latter)
ELEK0002 only works in Tesseract 3.0

Thanks,
Phil.
