On 26 July 2010 19:21, Philip Pemberton <[email protected]> wrote:
> Problem is, Tesseract 2.04 doesn't like quoted text:
>
> phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2
> Tesseract Open Source OCR Engine
> tesseract: unicharset.cpp:76: const UNICHAR_ID
> UNICHARSET::unichar_to_id(const char*, int) const: Assertion
> `ids.contains(unichar_repr, length)' failed.
> Aborted
>
> This happens on any page which contains a double-quoted string (e.g. the
> string "typing is easy" will crash Tesseract like this). If I OCR the pages
> with Tesseract 3 (the current SVN HEAD) then Tesseract doesn't crash, but
> the error rate is much worse. For example:
>
>  - Page numbers are missing from the table entirely. It seems any line that
> ends with a number will have those trailing numbers stripped.
>
>  - Upper-case "M" characters tend to get decoded incorrectly; this happens
> most often with frequencies -- e.g. "146-76 MHz" is decoded correctly by
> Tess2.04, but Tess3.0 decodes it as "146-76 l\/lHz". Fixable with a regexp,
> but still a bit of a pain. This happens in ELEK0001.TIF.
>

Have you tried adding 'MHz' to the user dictionary?
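
If the dictionary doesn't help, the regexp fix you mention could be as
small as the sketch below. It is Python, and it assumes the only
misreading to repair is the exact "l\/l" sequence from your example; the
name fix_broken_m is made up for illustration and is not anything
Tesseract provides.

```python
import re

# Assumption: the only misreading handled is the exact sequence l\/l
# reported for Tesseract 3 output (e.g. "146-76 l\/lHz").
_BROKEN_M = re.compile(r"l\\/l")


def fix_broken_m(text: str) -> str:
    """Replace every occurrence of the misread sequence with 'M'."""
    return _BROKEN_M.sub("M", text)


if __name__ == "__main__":
    print(fix_broken_m(r"146-76 l\/lHz"))
```

Run over the .txt output after OCR, this leaves correctly read text
alone, but it would also rewrite a genuine "l\/l" if one appeared on the
page, so anchoring the pattern to a following "Hz" may be safer.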

>  - The top line of text sometimes gets garbled (as in, read as random
> characters). This only seems to happen on ELEK0002.TIF.
>

The link you pointed to seems to be unavailable at the moment. Is the
text in the top line a different size to the rest of the text?

>
> It seems the crash is related to Issue #265:
>  http://code.google.com/p/tesseract-ocr/issues/detail?id=265
> Unfortunately this Issue entry doesn't list the changeset/commit which fixes
> the bug, so backporting will be somewhat difficult (I need to find the fix,
> then work on backporting it).
>

Issue 265? Are you sure? That refers to reading rotated images, which
is only possible in Tesseract 3 because of the addition of code to
read top-to-bottom languages. It's not a simple change, and it doesn't
lend itself to being backported.

> Is there some way I might be able to improve Tesseract3's recognition
> accuracy, or does someone have a patch to fix the unicharset.cpp assertion
> failure bug in Tesseract 2.04?
> At the very least, I need the page numbers to be retained intact in the .TXT
> file, like they are in Tesseract 2.04.
>
> I've uploaded my sample images to:
>  http://www.philpem.me.uk/temp/tesseract
>
> ELEK0001 works in both Tesseract 2.04 and Tesseract 3.0 (but seems to have
> more OCR errors in the latter)
> ELEK0002 only works in Tesseract 3.0

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
