On 27/07/10 12:38, Jimmy O'Regan wrote:
>> At the risk of sounding like an idiot... how do you do that?
>> I didn't see anything about a user dictionary in the documentation...
>>
> It's a plain text file, one word per line, eng.user-words
Ah, there it is. I can see it in the Ubuntu 10.04 package for Tesseract
2.04 (in /usr/share/tesseract-ocr/tessdata), but there isn't one for Tess 3.
The Ubuntu wordlist is pretty big... 921 user-added words...
> (To be honest, I haven't needed to use it with tesseract 3, so I'm not
> actually sure where it looks for it now - if putting the file in the
> same directory as eng.traineddata doesn't work, I'll dig through the
> code for it.)
I grepped the code and it seems to be looking for something called
LANG.user-words, but creating one didn't seem to do anything -- I got the
same garbled text when I ran Tesseract 3 again.
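For what it's worth, here's a tiny sketch of the two places I'd *guess* a
LANG.user-words file might be picked up from -- next to eng.traineddata, and
under TESSDATA_PREFIX. This is my own assumption, not taken from the
Tesseract source; both path layouts here are hypothetical:

```python
import os

def candidate_user_words_paths(lang, tessdata_dir):
    """Places a LANG.user-words file might plausibly live (an assumption,
    not verified against the Tesseract 3 source)."""
    return [
        # next to LANG.traineddata
        os.path.join(tessdata_dir, lang + ".user-words"),
        # under TESSDATA_PREFIX, if set
        os.path.join(os.environ.get("TESSDATA_PREFIX", ""),
                     "tessdata", lang + ".user-words"),
    ]

def find_user_words(lang, tessdata_dir):
    """Return the first candidate that actually exists, else None."""
    for path in candidate_user_words_paths(lang, tessdata_dir):
        if os.path.isfile(path):
            return path
    return None
```

If neither location works, grepping for where the real code builds that
filename is probably the only reliable answer.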
I even tried to unpack the traineddata file to see if the wordlist was
hidden in there, and combine_tessdata barfed:
phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$ LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u eng.traineddata eng
Extracting tessdata components from eng.traineddata
tesseract::TessdataManager::TessdataTypeFromFileName(filename, &type, &text_file):Error:Assert failed:in file tessdatamanager.cpp, line 241
Segmentation fault
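Going by the function name in that assert, my reading (a guess, not a quote
of the real code) is that combine_tessdata maps each filename suffix to a
component type and aborts on anything it doesn't recognise -- so a
traineddata file carrying a component this build doesn't know about would
trip it. Roughly, in sketch form (the suffix list below is illustrative;
the real one lives in tessdatamanager.cpp and varies between versions):

```python
# Illustrative only: these entries are assumptions, not the real table.
KNOWN_SUFFIXES = [
    "config", "unicharset", "unicharambigs", "inttemp", "pffmtable",
    "normproto", "punc-dawg", "word-dawg", "number-dawg", "freq-dawg",
]

def tessdata_type_from_filename(filename):
    """Mimic the gist of TessdataTypeFromFileName: take the text after the
    last '.' and look it up; an unknown suffix is the failure mode above."""
    suffix = filename.rsplit(".", 1)[-1]
    if suffix not in KNOWN_SUFFIXES:
        raise AssertionError("unknown tessdata suffix: " + suffix)
    return KNOWN_SUFFIXES.index(suffix)
```

If that reading is right, a combine_tessdata built from the same revision
as the traineddata file would be needed to unpack it cleanly.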
> The basic issue - that Tesseract has trouble reading mixed text sizes
> - is a known one, but your images add a new dimension to the problem,
> as it seems it's also 'trimming' the block to boundary to the extent
> of the smaller text - if I'm right, that's why the numbers are being
> dropped. (Actually, I wish I'd seen your images two weeks ago, because
> I've gone down a few dead ends on this problem).
It's interesting that 2.04 doesn't exhibit the same issue... It looks to
me like the same font (a Helvetica variant?), size and weight are used
for the entire "article title" line.
> This is more a missing feature than a bug; actually splitting the
> blocks into smaller blocks based on difference in text size is not
> difficult, but determining *when* and by what threshold is; if you can
> provide more of the same sort of image, it would help immensely.
I can scan a few more issues of the journal in question -- as I said
previously, I've got the full run from 1974 to the present (with 1990
onwards on DVD), and every issue up to about 1976 uses a table of
contents in a similar format.
> I won't have time to look at it until next week, but if you absolutely
> can't wait, what you could do is split the image into separate lines
> and OCR them separately.
I'll have a look at that -- thanks. With a bit of luck I'll be able to
figure out the API...
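In case it helps anyone else trying the same workaround, here's a minimal
sketch of the split-the-image-into-lines idea, assuming a binarised image
represented as a list of rows of 0/1 pixels (1 = ink). It only finds the
horizontal bands; actually cropping each band and feeding it to the OCR
engine is left out, and none of this is Tesseract API code:

```python
def text_line_bands(bitmap):
    """Split a binarised bitmap (rows of 0/1, 1 = ink) into horizontal
    bands of consecutive inked rows; each (start, end) band is one text
    line to crop out and OCR separately."""
    inked = [any(row) for row in bitmap]
    bands, start = [], None
    for y, has_ink in enumerate(inked):
        if has_ink and start is None:
            start = y                 # band opens at the first inked row
        elif not has_ink and start is not None:
            bands.append((start, y))  # band closes at the first blank row
            start = None
    if start is not None:
        bands.append((start, len(bitmap)))
    return bands
```

Since each line then arrives at the engine on its own, the mixed-text-size
problem never comes up -- at the cost of losing any layout analysis across
lines.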
> I think it's more likely that Tesseract 2 is just crapping out because
> of some feature of the second file; Tesseract 3 uses Leptonica for
> image handling, so more TIFF oddities are handled better.
What gets me is that both images were created with the same software. If
I load ELEK0002 into GIMP and save it again, I see the same effect. If I
use GIMP to white out the double quotes, the OCR goes perfectly.
> I don't know Mercurial, so I'm just thinking of it as 'git-lite for
> Python fans', but (thinking in terms of git's bisect, which Mercurial
> most likely copied) that won't work. The commit in question was
> basically a code dump from Google, which makes *a lot* of changes in a
> lot of places.
Cue scream track....
"AAARGH!"
Guess I'd better fire up Kdbg.
Thanks,
--
Phil.
[email protected]
http://www.philpem.me.uk/
--
You received this message because you are subscribed to the Google Groups
"tesseract-ocr" group.