On 27 July 2010 11:28, Philip Pemberton <[email protected]> wrote:
> On 27/07/10 09:57, Jimmy O'Regan wrote:
>> Have you tried adding 'MHz' to the user dictionary?
>
> At the risk of sounding like an idiot... how do you do that?
> I didn't see anything about a user dictionary in the documentation...
>
It's a plain text file, one word per line, named eng.user-words. (To be
honest, I haven't needed to use it with Tesseract 3, so I'm not actually
sure where it looks for it now. If putting the file in the same directory
as eng.traineddata doesn't work, I'll dig through the code for it.)

>>> - The top line of text sometimes gets garbled (as in, read as random
>>> characters). This only seems to happen on ELEK0002.TIF.
>>
>> The link you pointed to seems to be unavailable at the moment. Is the
>> text in the top line a different size to the rest of the text?
>
> Yes and no... the TOC is laid out as a large-font heading, with a slightly
> smaller synopsis (which is sometimes omitted) below. I guess the font sizes
> are around 14pt and 10-12pt respectively.
>
> Try this link, I've tested it working here:
> http://www.philpem.me.uk/temp/tesseract/
>

Ok, that link works now, and that's basically what I expected to see.

The basic issue, that Tesseract has trouble reading mixed text sizes, is a
known one, but your images add a new dimension to the problem: it seems
it's also 'trimming' the block boundary to the extent of the smaller text.
If I'm right, that's why the numbers are being dropped. (Actually, I wish
I'd seen your images two weeks ago, because I've gone down a few dead ends
on this problem.)

This is more a missing feature than a bug. Actually splitting the blocks
into smaller blocks based on difference in text size is not difficult, but
determining *when* and by what threshold is. If you can provide more of the
same sort of image, it would help immensely.

I won't have time to look at it until next week, but if you absolutely
can't wait, what you could do is split the image into separate lines and
OCR them separately.

> The two files I mentioned are in that directory; they're greyscale TIFFs,
> and fairly large (elek0001.tif is 4.7MB, elek0002.tif is 14MB).
>
>> Issue 265? Are you sure?
>> That refers to reading rotated images, which
>> is only possible in Tesseract 3 because of the addition of code to
>> read top-to-bottom languages. It's not a simple change that easily
>> lends itself to being backported.
>
> I was thinking in terms of the error message. If flipping the text causes
> Tesseract to pick up a double-quote or two, then the same problem may well
> occur...
>

I think it's more likely that Tesseract 2 is just crapping out because of
some feature of the second file; Tesseract 3 uses Leptonica for image
handling, so more TIFF oddities are handled better.

> I've got a Mercurial version of the SVN repository; I'm going to see about
> running a "bisection test" (basically, divide-and-conquer testing) to try
> and find out which Tesseract commit fixed the quoting bug. This of course
> assumes that all (or at least most) of the commits are compilable...
>

I don't know Mercurial, so I'm just thinking of it as 'git-lite for Python
fans', but (thinking in terms of git's bisect, which Mercurial most likely
copied) that won't work. The commit in question was basically a code dump
from Google, which makes *a lot* of changes in a lot of places, so a
bisection would only land on that one commit without telling you which of
its changes fixed the bug.

> Thanks,
> --
> Phil.
> [email protected]
> http://www.philpem.me.uk/
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
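
For the "split the image into separate lines" workaround mentioned above, here is a rough sketch of one way to find where to cut. It assumes the page has already been loaded as a list of rows of greyscale values (0 = black, 255 = white). The function name, the threshold and the min_gap parameter are all made up for illustration; none of this is Tesseract's own code, and a real scan would probably need deskewing and noise cleanup before this gives sensible bands.

```python
def find_line_bands(rows, dark_threshold=128, min_gap=1):
    """Return (top, bottom) row index pairs, bottom exclusive, one per band
    of rows containing ink. A band ends once min_gap consecutive rows have
    no pixel darker than dark_threshold. (Hypothetical helper.)"""
    bands = []
    start = None      # first row of the band currently being collected
    blank_run = 0     # number of consecutive blank rows seen inside a band
    for y, row in enumerate(rows):
        has_ink = any(px < dark_threshold for px in row)
        if has_ink:
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # Close the band just after its last inked row.
                bands.append((start, y - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # The image ended while a band was still open.
        bands.append((start, len(rows) - blank_run))
    return bands


# Tiny synthetic "page": two text lines separated by one blank row.
W, B = 255, 0
page = [
    [W, W, W, W],
    [W, B, B, W],
    [W, B, W, W],
    [W, W, W, W],
    [B, W, W, W],
    [W, W, W, W],
]
print(find_line_bands(page))
```

Each (top, bottom) pair could then be cropped out with whatever image tool is to hand and OCRed on its own. Raising min_gap keeps lines with small internal gaps together, at the cost of possibly merging closely spaced lines.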

