On 27/07/10 12:38, Jimmy O'Regan wrote:
>> At the risk of sounding like an idiot... how do you do that?
>> I didn't see anything about a user dictionary in the documentation...
>>
> It's a plain text file, one word per line, eng.user-words
Ah, there it is. I can see it in the Ubuntu 10.04 package for Tesseract
2.04 (in /usr/share/tesseract-ocr/tessdata), but there isn't one for Tess 3.
The Ubuntu wordlist is pretty big... 921 user-added words...
> (To be honest, I haven't needed to use it with tesseract 3, so I'm not
> actually sure where it looks for it now - if putting the file in the
> same directory as eng.traineddata doesn't work, I'll dig through the
> code for it.)
I grepped the code and it seems to be looking for something called
LANG.user-words, but creating one didn't seem to do anything -- I got the
same garbled text when I ran Tesseract 3 again.
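For what it's worth, here's a tiny sketch of the two places I'd *guess* a
LANG.user-words file might be picked up from -- next to eng.traineddata, and
under TESSDATA_PREFIX. This is my own assumption, not taken from the
Tesseract source; both path layouts here are hypothetical:

```python
import os

def candidate_user_words_paths(lang, tessdata_dir):
    """Places a LANG.user-words file might plausibly live (an assumption,
    not verified against the Tesseract 3 source)."""
    return [
        # next to LANG.traineddata
        os.path.join(tessdata_dir, lang + ".user-words"),
        # under TESSDATA_PREFIX, if set
        os.path.join(os.environ.get("TESSDATA_PREFIX", ""),
                     "tessdata", lang + ".user-words"),
    ]

def find_user_words(lang, tessdata_dir):
    """Return the first candidate that actually exists, else None."""
    for path in candidate_user_words_paths(lang, tessdata_dir):
        if os.path.isfile(path):
            return path
    return None
```

If neither location works, grepping for where the real code builds that
filename is probably the only reliable answer.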
I even tried to unpack the traineddata file to see if the wordlist was
hidden in there, and combine_tessdata barfed:
phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u eng.traineddata eng
Extracting tessdata components from eng.traineddata
tesseract::TessdataManager::TessdataTypeFromFileName(filename, &type, &text_file):Error:Assert failed:in file tessdatamanager.cpp, line 241
Segmentation fault
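Going by the function name in that assert, my reading (a guess, not a quote
of the real code) is that combine_tessdata maps each filename suffix to a
component type and aborts on anything it doesn't recognise -- so a
traineddata file carrying a component this build doesn't know about would
trip it. Roughly, in sketch form (the suffix list below is illustrative;
the real one lives in tessdatamanager.cpp and varies between versions):

```python
# Illustrative only: these entries are assumptions, not the real table.
KNOWN_SUFFIXES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]

def tessdata_type_from_filename(filename):
    """Mimic the gist of TessdataTypeFromFileName: take the text after the
    last '.' and look it up; an unknown suffix is the failure mode above."""
    suffix = filename.rsplit(".", 1)[-1]
    if suffix not in KNOWN_SUFFIXES:
        raise AssertionError("unknown tessdata suffix: " + suffix)
    return KNOWN_SUFFIXES.index(suffix)
```

If that reading is right, a combine_tessdata built from the same revision
as the traineddata file would be needed to unpack it cleanly.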
> The basic issue - that Tesseract has trouble reading mixed text sizes
> - is a known one, but your images add a new dimension to the problem,
> as it seems it's also 'trimming' the block to boundary to the extent
> of the smaller text - if I'm right, that's why the numbers are being
> dropped. (Actually, I wish I'd seen your images two weeks ago, because
> I've gone down a few dead ends on this problem).
It's interesting that 2.04 doesn't exhibit the same issue... It looks to
me like the same font (a Helvetica variant?), size and weight are used
for the entire "article title" line.
> This is more a missing feature than a bug; actually splitting the
> blocks into smaller blocks based on difference in text size is not
> difficult, but determining *when* and by what threshold is; if you can
> provide more of the same sort of image, it would help immensely.
I can scan a few more issues of the journal in question -- as I said
previously, I've got the full run from 1974 to the present (with 1990
onwards on DVD), and every issue up to about 1976 uses a table of
contents in a similar format.
> I won't have time to look at it until next week, but if you absolutely
> can't wait, what you could do is split the image into separate lines
> and OCR them separately.
I'll have a look at that -- thanks. With a bit of luck I'll be able to
figure out the API...
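In case it helps anyone else trying the same workaround, here's a minimal
sketch of the split-the-image-into-lines idea, assuming a binarised image
represented as a list of rows of 0/1 pixels (1 = ink). It only finds the
horizontal bands; actually cropping each band and feeding it to the OCR
engine is left out, and none of this is Tesseract API code:

```python
def text_line_bands(bitmap):
    """Split a binarised bitmap (rows of 0/1, 1 = ink) into horizontal
    bands of consecutive inked rows; each (start, end) band is one text
    line to crop out and OCR separately."""
    inked = [any(row) for row in bitmap]
    bands, start = [], None
    for y, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = y                 # band opens at the first inked row
        elif not has_ink and start is not None:
            bands.append((start, y))  # band closes at the first blank row
            start = None
    if start is not None:
        bands.append((start, len(bitmap)))
    return bands
```

Since each line then arrives at the engine on its own, the mixed-text-size
problem never comes up -- at the cost of losing any layout analysis across
lines.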
> I think it's more likely that Tesseract 2 is just crapping out because
> of some feature of the second file; Tesseract 3 uses Leptonica for
> image handling, so more TIFF oddities are handled better.
What gets me is that both images were created with the same software. If
I load ELEK0002 into GIMP and save it again, I see the same effect. If I
use GIMP to white out the double quotes, the OCR goes perfectly.
> I don't know Mercurial, so I'm just thinking of it as 'git-lite for
> Python fans', but (thinking in terms of git's bisect, which Mercurial
> most likely copied) that won't work. The commit in question was
> basically a code dump from Google, which makes *a lot* of changes in a
> lot of places.
Cue scream track....
"AAARGH!"
Guess I'd better fire up Kdbg.
Thanks,
--
Phil.
[email protected]
http://www.philpem.me.uk/
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.