On 26 July 2010 19:21, Philip Pemberton wrote:
> Problem is, Tesseract 2.04 doesn't like quoted text:
>
> phil...@cheetah:~/elektor$ tesseract elek0002.tif elek0002_tess2
> Tesseract Open Source OCR Engine
> tesseract: unicharset.cpp:76: const UNICHAR_ID
> UNICHARSET::unichar_to_id(const char*, int) const: Assertion
> `ids.contains(unichar_repr, length)' failed.
> Aborted
>
> This happens on any page which contains a double-quoted string (e.g. the
> string "typing is easy" will crash Tesseract like this). If I OCR the pages
> with Tesseract 3 (the current SVN HEAD) then Tesseract doesn't crash, but
> the error rate is much worse. For example:
>
> - Page numbers are missing from the table entirely. It seems any line that
> ends with a number will have those trailing numbers stripped.
>
> - Upper-case "M" characters tend to get decoded incorrectly; this happens
> most often with frequencies -- e.g. "146-76 MHz" is decoded correctly by
> Tess2.04, but Tess3.0 decodes it as "146-76 l\/lHz". Fixable with a regexp,
> but still a bit of a pain. This happens in ELEK0001.TIF.
>
Have you tried adding 'MHz' to the user dictionary?

> - The top line of text sometimes gets garbled (as in, read as random
> characters). This only seems to happen on ELEK0002.TIF.
>

The link you pointed to seems to be unavailable at the moment. Is the
text in the top line a different size to the rest of the text?

>
> It seems the crash is related to Issue #265:
> http://code.google.com/p/tesseract-ocr/issues/detail?id=265
> Unfortunately this Issue entry doesn't list the changeset/commit which fixes
> the bug, so backporting will be somewhat difficult (I need to find the fix,
> then work on backporting it).
>

Issue 265? Are you sure? That refers to reading rotated images, which
is only possible in Tesseract 3 because of the addition of code to
read top-to-bottom languages. It's not a simple change that easily
lends itself to being backported.

> Is there some way I might be able to improve Tesseract3's recognition
> accuracy, or does someone have a patch to fix the unicharset.cpp assertion
> failure bug in Tesseract 2.04?
> At the very least, I need the page numbers to be retained intact in the .TXT
> file, like they are in Tesseract 2.04.
>
> I've uploaded my sample images to:
> http://www.philpem.me.uk/temp/tesseract
>
> ELEK0001 works in both Tesseract 2.04 and Tesseract 3.0 (but seems to have
> more OCR errors in the latter)
> ELEK0002 only works in Tesseract 3.0

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
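
For the "fixable with a regexp" workaround mentioned in the quoted message, a minimal post-processing sketch in Python might look like the following. It assumes the only corruption is the literal sequence l\/l directly before "Hz", as reported for ELEK0001.TIF. The function name fix_mhz is made up for illustration and is not part of Tesseract.

```python
import re

# Assumption: Tesseract 3 output contains the literal characters
# l, backslash, slash, l where an "M" should precede "Hz".
_BROKEN_MHZ = re.compile(r"l\\/lHz\b")


def fix_mhz(text: str) -> str:
    """Replace the misread 'l\\/lHz' sequence with 'MHz'."""
    return _BROKEN_MHZ.sub("MHz", text)


if __name__ == "__main__":
    print(fix_mhz(r"146-76 l\/lHz"))  # prints "146-76 MHz"
```

This only repairs the one pattern quoted above; other misreadings of "M" would need their own patterns, and it does nothing for the missing page numbers.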

