On 27/07/10 09:57, Jimmy O'Regan wrote:
> Have you tried adding 'MHz' to the user dictionary?

At the risk of sounding like an idiot... how do you do that?
I didn't see anything about a user dictionary in the documentation...

>>   - The top line of text sometimes gets garbled (as in, read as random
>> characters). This only seems to happen on ELEK0002.TIF.
>
> The link you pointed to seems to be unavailable at the moment. Is the
> text in the top line a different size to the rest of the text?

Yes and no... the TOC is laid out as a large-font heading, with a slightly smaller synopsis (which is sometimes omitted) below. I guess the font sizes are around 14pt and 10-12pt respectively.

Try this link; I've just tested it and it works from here:
  http://www.philpem.me.uk/temp/tesseract/

The two files I mentioned are in that directory; they're greyscale TIFFs, and fairly large (elek0001.tif is 4.7MB, elek0002.tif is 14MB).

> Issue 265? Are you sure? That refers to reading rotated images, which
> is only possible in Tesseract 3 because of the addition of code to
> read top-to-bottom languages. It's not a simple change that easily
> lends itself to being backported.

I was thinking in terms of the error message. If flipping the text causes Tesseract to pick up a double-quote or two, then the same problem may well occur...

I've got a Mercurial version of the SVN repository; I'm going to see about running a "bisection test" (basically, divide-and-conquer testing) to try and find out which Tesseract commit fixed the quoting bug. This of course assumes that all (or at least most) of the commits are compilable...
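For anyone unfamiliar with the idea, here is a rough sketch of the bisection in Python. It is only an illustration under some assumptions: the revisions are in chronological order, the oldest one has the bug and the newest one doesn't, and the bug stays fixed once it is fixed. The names `first_fixed` and `is_fixed` are made up for this example; `is_fixed` stands in for "build that revision and run the test image through it", returning None when the revision won't compile. (Mercurial's own `hg bisect` command does the same job, though it talks in terms of good/bad, so the labels would need to be swapped when hunting for a fix rather than a regression.)

```python
from typing import Callable, Optional, Sequence


def first_fixed(revisions: Sequence[str],
                is_fixed: Callable[[str], Optional[bool]]) -> str:
    """Return the earliest revision known to be fixed.

    Assumes revisions[0] is broken and revisions[-1] is fixed.
    is_fixed() returns True/False, or None if the revision
    cannot be built and therefore cannot be tested.
    """
    lo, hi = 0, len(revisions) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Try the midpoint first, then its nearest untested neighbours,
        # so that unbuildable revisions are simply skipped over.
        candidates = sorted(range(lo + 1, hi), key=lambda i: abs(i - mid))
        for idx in candidates:
            result = is_fixed(revisions[idx])
            if result is not None:
                break
        else:
            # Nothing between lo and hi can be built; the fix landed
            # somewhere in that gap, and hi is the best answer we have.
            break
        if result:
            hi = idx
        else:
            lo = idx
    return revisions[hi]


if __name__ == "__main__":
    # Hypothetical history: the fix landed in r6, and r5 does not compile.
    revs = ["r%d" % i for i in range(10)]

    def check(rev: str) -> Optional[bool]:
        n = int(rev[1:])
        return None if n == 5 else n >= 6

    print(first_fixed(revs, check))
```

Each round halves the range of suspect commits, so even a few hundred revisions need only a handful of builds. If a run of commits in the middle won't compile, the result narrows things down to that run rather than to a single commit.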

Thanks,
--
Phil.
[email protected]
http://www.philpem.me.uk/
