On 27/07/10 09:57, Jimmy O'Regan wrote:
> Have you tried adding 'MHz' to the user dictionary?
At the risk of sounding like an idiot... how do you do that?
I didn't see anything about a user dictionary in the documentation...
>> - The top line of text sometimes gets garbled (as in, read as random
>> characters). This only seems to happen on ELEK0002.TIF.
>
> The link you pointed to seems to be unavailable at the moment. Is the
> text in the top line a different size to the rest of the text?
Yes and no... the TOC is laid out as a large-font heading, with a
slightly smaller synopsis (which is sometimes omitted) below. I guess
the font sizes are around 14pt and 10-12pt respectively.
Try this link; I've tested it and it works from here:
http://www.philpem.me.uk/temp/tesseract/
The two files I mentioned are in that directory; they're greyscale
TIFFs, and fairly large (elek0001.tif is 4.7MB, elek0002.tif is 14MB).
> Issue 265? Are you sure? That refers to reading rotated images, which
> is only possible in Tesseract 3 because of the addition of code to
> read top-to-bottom languages. It's not a simple change that easily
> lends itself to being backported.
I was thinking in terms of the error message. If flipping the text
causes Tesseract to pick up a double-quote or two, then the same problem
may well occur...
I've got a Mercurial version of the SVN repository; I'm going to see
about running a "bisection test" (basically, divide-and-conquer testing)
to try and find out which Tesseract commit fixed the quoting bug. This
of course assumes that all (or at least most) of the commits are
compilable...
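
For illustration, the divide-and-conquer search could be sketched roughly as below. This is only a sketch under stated assumptions: the revision names and the `is_fixed` check are hypothetical placeholders (in practice the check would build that revision and run Tesseract on a test image), the oldest revision is assumed broken, the newest fixed, and the fix is assumed to stay in once it lands. Revisions that fail to compile are not handled here. Mercurial also ships an `hg bisect` command that automates the same kind of search.

```python
from typing import Callable, Sequence


def first_fixed(commits: Sequence[str], is_fixed: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_fixed() is true.

    Assumes commits are ordered oldest to newest, commits[0] is broken,
    commits[-1] is fixed, and the fix never regresses in between.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # fix is at mid or earlier
        else:
            lo = mid  # fix is after mid
    return commits[hi]


if __name__ == "__main__":
    # Hypothetical revision names; the lambda stands in for "build and test".
    revs = [f"r{n}" for n in range(100, 120)]
    print(first_fixed(revs, lambda rev: int(rev[1:]) >= 113))
```

Each step halves the range, so about 5 builds are enough to cover 20 revisions, rather than testing every one.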
Thanks,
--
Phil.
[email protected]
http://www.philpem.me.uk/