Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

Jimmy O'Regan Tue, 27 Jul 2010 09:31:00 -0700

On 27 July 2010 13:35, Philip Pemberton <[email protected]> wrote:
> On 27/07/10 12:38, Jimmy O'Regan wrote:
>>> At the risk of sounding like an idiot... how do you do that?
>>> I didn't see anything about a user dictionary in the documentation...
>>>
>> It's a plain text file, one word per line, eng.user-words
>
> Ah, there it is. I can see it in the Ubuntu 10.04 package for Tesseract 2.04
> (in /usr/share/tesseract-ocr/tessdata), but there isn't one for Tess 3.
>
> The Ubuntu wordlist is pretty big... 921 user-added words...
>


As wordlists go, that's tiny :)
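
In case it helps anyone following along, here is a minimal sketch of
building such a file in Python. It assumes only what was said above (plain
text, one word per line, named eng.user-words). The helper name, the sample
words, and the choice to drop blanks and duplicates are my own illustration,
and where Tesseract 3 actually looks for the file is still an open question.

```python
import os
import tempfile


def write_user_words(words, path):
    """Write words to path, one per line, skipping blanks and duplicates.

    Hypothetical helper: the only format assumption is "plain text, one
    word per line", as described in this thread.
    """
    seen = set()
    kept = []
    for word in words:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            kept.append(word)
    with open(path, "w", encoding="utf-8") as handle:
        for word in kept:
            handle.write(word + "\n")
    return kept


if __name__ == "__main__":
    # Sample words are made up for illustration; the temp directory is a
    # stand-in for wherever the tessdata files live on your system.
    target = os.path.join(tempfile.gettempdir(), "eng.user-words")
    print(write_user_words(["oscilloscope", "transistor", "oscilloscope", " "], target))
```

The resulting file would then be copied next to eng.traineddata, which is
the first place I'd try.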

>> (To be honest, I haven't needed to use it with tesseract 3, so I'm not
>> actually sure where it looks for it now - if putting the file in the
>> same directory as eng.traineddata doesn't work, I'll dig through the
>> code for it.)
>
> I grepped the code and it seems to be looking for something called
> LANG.user-words, but that didn't seem to do anything -- I got the same
> garbled text when I ran Tesseract 3 the second time.
>
> I even tried to unpack the traineddata file to see if it was hidden in
> there, and combine_tessdata barfed:
>
> phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$
> LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u
> eng.traineddata eng
>
> Extracting tessdata components from eng.traineddata
> tesseract::TessdataManager::TessdataTypeFromFileName( filename, &type,
> &text_file):Error:Assert failed:in file tessdatamanager.cpp, line 241
> Segmentation fault
>

I never got around to playing with that. I'll have a look at it,
either later, or tomorrow.

>
>> The basic issue - that Tesseract has trouble reading mixed text sizes
>> - is a known one, but your images add a new dimension to the problem,
>> as it seems it's also 'trimming' the block boundary to the extent
>> of the smaller text - if I'm right, that's why the numbers are being
>> dropped. (Actually, I wish I'd seen your images two weeks ago, because
>> I've gone down a few dead ends on this problem).
>
> It's interesting that 2.04 doesn't exhibit the same issue... It looks to me
> like the same font (a Helvetica variant?), size and weight is used for the
> entire "article title" line.
>

Lots of new features, lots of new bugs.

>> This is more a missing feature than a bug; actually splitting the
>> blocks into smaller blocks based on difference in text size is not
>> difficult, but determining *when* and by what threshold is; if you can
>> provide more of the same sort of image, it would help immensely.
>
> I can scan a few more issues of the journal in question -- as I said
> previously, I've got the full run from 1974 through present (with 1990
> onwards on DVD), and every issue up to about 1976 uses a table of contents
> with a similar format.
>

Cool, thanks.

>> I won't have time to look at it until next week, but if you absolutely
>> can't wait, what you could do is split the image into separate lines
>> and OCR them separately.
>
> I'll have a look at that -- thanks. With a bit of luck I'll be able to
> figure out the API...
>
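
A rough sketch of the "split into separate lines" workaround, in Python.
This is not Tesseract's API, just a horizontal projection profile over an
already binarised page: a row with no ink is treated as a gap between text
lines. The function name, the list-of-rows image representation (1 = ink,
0 = background) and the min_gap parameter are all assumptions for
illustration; a real scan would likely need deskewing and noise removal
first.

```python
def find_text_lines(rows, min_gap=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    rows:    list of lists of 0/1 pixels (1 = ink), one inner list per row.
    min_gap: number of consecutive blank rows needed to end a line.
    """
    bands = []
    start = None    # first row of the band currently being collected
    blank_run = 0   # blank rows seen since the last inked row
    for y, row in enumerate(rows):
        if any(row):
            if start is None:
                start = y
            blank_run = 0
        elif start is not None:
            blank_run += 1
            if blank_run >= min_gap:
                # The band ends just before the run of blank rows began.
                bands.append((start, y - blank_run + 1))
                start = None
                blank_run = 0
    if start is not None:
        # The page ended inside a band; drop any trailing blank rows.
        bands.append((start, len(rows) - blank_run))
    return bands


if __name__ == "__main__":
    # Tiny made-up "page": two inked bands separated by one blank row.
    page = [[0, 0], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
    print(find_text_lines(page))
```

Each (top, bottom) range can then be cropped out and OCRed on its own, so
the large and small text never share a block.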



>> I think it's more likely that Tesseract 2 is just crapping out because
>> of some feature of the second file; Tesseract 3 uses Leptonica for
>> image handling, so more TIFF oddities are handled better.
>
> What gets me is that both images were created with the same software. If I
> load ELEK0002 and save it with GIMP, I see the same effect. If I use GIMP to
> white-out the double-quotes, the OCR goes perfectly.
>

Actually... I just found this comment:

    /*
       The adaption step used to be here. It has been moved to after
       make_reject_map so that we know whether the word will be accepted in the
       first pass or not.   This move will PREVENT adaption to words containing
       double quotes because the word will not be identical to what tess thinks
       its best choice is. (See CurrentBestChoiceIs in
       danj/microfeatures/stopper.c which is used by AdaptableWord in
       danj/microfeatures/adaptmatch.c)
     */


>> I don't know Mercurial, so I'm just thinking of it as 'git-lite for
>> Python fans', but (thinking in terms of git's bisect, which Mercurial
>> most likely copied) that won't work. The commit in question was
>> basically a code dump from Google, which makes *a lot* of changes in a
>> lot of places.
>
> Cue scream track....
> "AAARGH!"
>
> Guess I'd better fire up Kdbg.

-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.
