Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

Jimmy O'Regan Tue, 27 Jul 2010 04:38:18 -0700

On 27 July 2010 11:28, Philip Pemberton <[email protected]> wrote:
> On 27/07/10 09:57, Jimmy O'Regan wrote:
>> Have you tried adding 'MHz' to the user dictionary?
>
> At the risk of sounding like an idiot... how do you do that?
> I didn't see anything about a user dictionary in the documentation...
>


It's a plain text file named eng.user-words, with one word per line.

(To be honest, I haven't needed to use it with tesseract 3, so I'm not
actually sure where it looks for it now - if putting the file in the
same directory as eng.traineddata doesn't work, I'll dig through the
code for it.)
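
For what it's worth, here is a rough Python sketch of producing such a
file. The function name and the choice of target directory are my own
assumptions for illustration; as I said above, I haven't confirmed where
Tesseract 3 looks for the file.

```python
from pathlib import Path


def write_user_words(tessdata_dir, words, lang="eng"):
    """Write one word per line to <lang>.user-words in tessdata_dir.

    Assumption: the file belongs next to <lang>.traineddata. That location
    is unverified for Tesseract 3.
    """
    path = Path(tessdata_dir) / f"{lang}.user-words"
    # Drop blank entries and duplicates while keeping the original order.
    unique = list(dict.fromkeys(w.strip() for w in words if w.strip()))
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return path
```

Calling write_user_words(some_dir, ["MHz", "kHz"]) would leave a two-line
eng.user-words file in some_dir.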

>>>   - The top line of text sometimes gets garbled (as in, read as random
>>> characters). This only seems to happen on ELEK0002.TIF.
>>
>> The link you pointed to seems to be unavailable at the moment. Is the
>> text in the top line a different size to the rest of the text?
>
> Yes and no... the TOC is laid out as a large-font heading, with a slightly
> smaller synopsis (which is sometimes omitted) below. I guess the font sizes
> are around 14pt and 10-12pt respectively.
>
> Try this link, I've tested it working here:
>  http://www.philpem.me.uk/temp/tesseract/
>

Ok, that link works now, and that's basically what I expected to see.

The basic issue - that Tesseract has trouble reading mixed text sizes
- is a known one, but your images add a new dimension to the problem:
it seems it's also 'trimming' the block boundary to the extent of the
smaller text. If I'm right, that's why the numbers are being dropped.
(Actually, I wish I'd seen your images two weeks ago, because I've
gone down a few dead ends on this problem.)

This is more a missing feature than a bug; actually splitting the
blocks into smaller blocks based on difference in text size is not
difficult, but determining *when* and by what threshold is; if you can
provide more of the same sort of image, it would help immensely.
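
To make the threshold problem concrete, here is a toy Python sketch of
splitting by text size. It is not what Tesseract does; the function
name, the per-line height input and the 1.25 ratio are all assumptions
picked for illustration.

```python
def split_by_height(line_heights, ratio=1.25):
    """Group consecutive line indices by similar text height.

    A new group starts when a line's height differs from the current
    group's mean height by more than `ratio` (hypothetical threshold).
    Heights must be positive numbers.
    """
    groups = []
    for i, h in enumerate(line_heights):
        if groups:
            current = groups[-1]
            mean = sum(line_heights[j] for j in current) / len(current)
            # Compare larger to smaller so the test is symmetric.
            if max(h, mean) / min(h, mean) <= ratio:
                current.append(i)
                continue
        groups.append([i])
    return groups
```

With the font sizes you mention, a 14pt heading over 10pt text is split
at this ratio, but 14pt over 12pt is not, which is exactly the sort of
case where picking the threshold gets awkward.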

I won't have time to look at it until next week, but if you absolutely
can't wait, what you could do is split the image into separate lines
and OCR them separately.
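
One simple way to find the lines is a horizontal projection: count the
ink in each pixel row and cut wherever there is a run of empty rows.
The Python sketch below assumes the image has already been binarised
into rows of 0 (background) and 1 (ink); the function name and the
min_ink / min_gap parameters are hypothetical, and real scans will
probably need deskewing and noise removal first.

```python
def find_line_bands(image, min_ink=1, min_gap=2):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    image is a list of rows, each a list of 0 (background) or 1 (ink).
    A band ends once min_gap consecutive rows hold fewer than min_ink
    inked pixels.
    """
    bands = []
    start = None  # first row of the band being built, if any
    gap = 0       # blank rows seen since the last inked row
    for y, row in enumerate(image):
        if sum(row) >= min_ink:
            if start is None:
                start = y
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the band just after its last inked row.
                bands.append((start, y - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Image ended inside a band; trim any trailing blank rows.
        bands.append((start, len(image) - gap))
    return bands
```

Each band can then be cropped with image[top:bottom], saved as its own
image, and run through tesseract on its own.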

> The two files I mentioned are in that directory; they're greyscale TIFFs,
> and fairly large (elek0001.tif is 4.7MB, elek0002.tif is 14MB).
>
>> Issue 265? Are you sure? That refers to reading rotated images, which
>> is only possible in Tesseract 3 because of the addition of code to
>> read top-to-bottom languages. It's not a simple change that easily
>> lends itself to being backported.
>
> I was thinking in terms of the error message. If flipping the text causes
> Tesseract to pick up a double-quote or two, then the same problem may well
> occur...
>

I think it's more likely that Tesseract 2 is just crapping out because
of some feature of the second file; Tesseract 3 uses Leptonica for
image handling, so more TIFF oddities are handled better.

> I've got a Mercurial version of the SVN repository; I'm going to see about
> running a "bisection test" (basically, divide-and-conquer testing) to try
> and find out which Tesseract commit fixed the quoting bug. This of course
> assumes that all (or at least most) of the commits are compilable...
>

I don't know Mercurial, so I'm just thinking of it as 'git-lite for
Python fans', but (thinking in terms of git's bisect, which Mercurial
most likely copied) that won't work. The commit in question was
basically a code dump from Google, which makes *a lot* of changes in a
lot of places, so bisecting would only point you at that one huge
commit.

> Thanks,
> --
> Phil.
> [email protected]
> http://www.philpem.me.uk/
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

