Improving accuracy on Tesseract 3.0 (also Issue 265)

Philip Pemberton Mon, 26 Jul 2010 19:49:17 -0700

Hi,

I'm currently working on cataloguing about 20 years' worth of electronics magazines, books and journals, down to article level. Obviously, typing in the article names, page numbers and synopses isn't an option -- for a start it'd make my hands hurt (a lot!) and take a very long time... we're talking on the order of 200 issues per journal, and four separate journals, plus about two dozen books, most of which aren't on Amazon (so I can't just copy-paste the TOC from there).

Even the publisher of the journals doesn't have a full catalogue; from what I've been told theirs only goes back to 1990.

I figure a better way to handle this is to scan the table of contents from each issue, then OCR it with Tesseract, and use a Python script to extract the page numbers and insert them into a MySQL database.
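
As a rough sketch of the extraction step, the following assumes each contents line in the OCR output is a title, optionally followed by dot leaders, then whitespace and a trailing page number. That layout, the pattern and the function name parse_toc are illustrative assumptions, not taken from the actual scans:

```python
import re

# Assumed layout: "<title> <optional dot leaders> <page number>" on one line.
TOC_LINE = re.compile(r"^(?P<title>.*?\S)[\s.]*\s(?P<page>\d+)\s*$")


def parse_toc(text):
    """Return (title, page) tuples for every line that ends in a page number."""
    entries = []
    for line in text.splitlines():
        match = TOC_LINE.match(line)
        if match:
            # Lines without a trailing number (headings, blank lines) are skipped.
            entries.append((match.group("title"), int(match.group("page"))))
    return entries
```

The resulting tuples could then be handed to a parameterised INSERT through whichever MySQL driver is in use; that part is left out here because it depends on the schema.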



Problem is, Tesseract 2.04 doesn't like quoted text:

phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2
Tesseract Open Source OCR Engine

tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.

Aborted

This happens on any page which contains a double-quoted string (e.g. the string "typing is easy" will crash Tesseract like this). If I OCR the pages with Tesseract 3 (the current SVN HEAD) then Tesseract doesn't crash, but the error rate is much worse. For example:

- Page numbers are missing from the table entirely. It seems any line that ends with a number will have those trailing numbers stripped.

- Upper-case "M" characters tend to get decoded incorrectly; this happens most often with frequencies -- e.g. "146-76 MHz" is decoded correctly by Tess2.04, but Tess3.0 decodes it as "146-76 l\/lHz". Fixable with a regexp, but still a bit of a pain. This happens in ELEK0001.TIF.

- The top line of text sometimes gets garbled (as in, read as random characters). This only seems to happen on ELEK0002.TIF.
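
For the "M" problem, the regexp workaround could look something like the sketch below. It assumes the misread always comes out as exactly the four characters l \ / l, as in the "146-76 l\/lHz" example; the function name fix_broken_m is made up for illustration:

```python
import re

# Matches the literal sequence: lowercase l, backslash, slash, lowercase l.
BROKEN_M = re.compile(r"l\\/l")


def fix_broken_m(text):
    """Replace the misread 'l\\/l' sequence with a capital M."""
    return BROKEN_M.sub("M", text)
```

This only papers over that one misreading, and it would also rewrite any genuine "l\/l" sequence in the text, although that seems unlikely in a table of contents.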



It seems the crash is related to Issue #265:
  http://code.google.com/p/tesseract-ocr/issues/detail?id=265

Unfortunately this Issue entry doesn't list the changeset/commit which fixes the bug, so backporting will be somewhat difficult (I need to find the fix, then work on backporting it).

Is there some way I might be able to improve Tesseract 3's recognition accuracy, or does someone have a patch to fix the unicharset.cpp assertion failure bug in Tesseract 2.04? At the very least, I need the page numbers to be retained intact in the .TXT file, like they are in Tesseract 2.04.


I've uploaded my sample images to:
  http://www.philpem.me.uk/temp/tesseract

ELEK0001 works in both Tesseract 2.04 and Tesseract 3.0 (but seems to have more OCR errors in the latter)

ELEK0002 only works in Tesseract 3.0

Thanks,
Phil.
