Re: Improving accuracy on Tesseract 3.0 (also Issue 265)

Jimmy O'Regan Wed, 28 Jul 2010 08:03:05 -0700

On 27 July 2010 20:49, Philip Pemberton <[email protected]> wrote:
> On 27/07/10 17:30, Jimmy O'Regan wrote:
>>>
>>> The Ubuntu wordlist is pretty big... 921 user-added words...
>>
>> As wordlists go, that's tiny :)
>
> Aye, but it's an exceptions list :)
> Seems to contain a lot of fairly technical words and abbreviations which I
> assume aren't in the Tesseract base wordlist.
>


Yeah, that's a reasonable assumption.

>>> I grepped the code and it seems to be looking for something called
>>> LANG.user-words, but that didn't seem to do anything -- I got the same
>>> garbled text when I ran Tesseract 3 the second time.
>
> Turns out T3 doesn't even access $LANG.user-words. I suspect it's looking
> for it in the traineddata file...
>

Hmm... probably... which is quite a stupid thing to do, really, but I
presume nobody in Google actually uses this, so it's probably quite
neglected.

I'm toying with the idea of adding support for an actual *user* list -
i.e., that tesseract would check $HOME/.tesseract/lang.user-words -
because assuming a single-user system that the user has full control
over is still a braindamaged assumption.
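
To make the idea concrete, here is a rough sketch of the lookup order I
have in mind. It is only an illustration in Python, not anything that
exists in Tesseract: the function name find_user_words, the fallback to
the tessdata directory, and the "home" parameter (there so the lookup can
be exercised without touching a real home directory) are all assumptions.

```python
import os
from pathlib import Path
from typing import Optional


def find_user_words(lang: str, tessdata_dir: str,
                    home: Optional[str] = None) -> Optional[Path]:
    """Return the first existing word list for `lang`, or None.

    Hypothetical search order (not what Tesseract 3 actually does):
      1. $HOME/.tesseract/<lang>.user-words  (per-user list)
      2. <tessdata_dir>/<lang>.user-words    (system-wide list)
    """
    home_dir = Path(home) if home is not None else Path(os.path.expanduser("~"))
    candidates = [
        home_dir / ".tesseract" / f"{lang}.user-words",
        Path(tessdata_dir) / f"{lang}.user-words",
    ]
    for path in candidates:
        # The per-user list wins because it is checked first.
        if path.is_file():
            return path
    return None
```

The point of the ordering is that a user without write access to the
system tessdata directory can still supply their own words.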

>>> phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$
>>> LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u
>>> eng.traineddata eng
>
> [...]
>>
>> I never got around to playing with that. I'll have a look at it,
>> either later, or tomorrow.
>
> Turns out the issue is that combine_tessdata wants the prefix to end with a
> period. So 'eng' crashes it, but 'eng.' works fine (and produces a bunch of
> files in the CWD).
>

I should fix that, so it doesn't become the "tesseract only accepts
'.tif'" thing all over again.
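
The fix amounts to normalizing the prefix before building file names from
it. A minimal sketch of that logic, in Python for brevity: this is not the
combine_tessdata source, and normalize_prefix and component_path are
made-up names for illustration.

```python
def normalize_prefix(prefix: str) -> str:
    """Ensure the prefix ends with a period, so 'eng' and 'eng.' behave the same."""
    return prefix if prefix.endswith(".") else prefix + "."


def component_path(prefix: str, suffix: str) -> str:
    """Build the output name for one unpacked component (hypothetical helper)."""
    return normalize_prefix(prefix) + suffix
```

With this, component_path("eng", "unicharset") and
component_path("eng.", "unicharset") both give "eng.unicharset".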

>> Lots of new features, lots of new bugs.
>
> Ain't it always the way...
>
>>> I can scan a few more issues of the journal in question -- as I said
>>> previously, I've got the full run from 1974 through present (with 1990
>>> onwards on DVD), and every issue up to about 1976 uses a table of
>>> contents with a similar format.
>>
>> Cool, thanks.
>
> No problem. I just need to clear some space on the table and set the scanner
> up first...
>
>>     /*
>>        The adaption step used to be here. It has been moved to after
>>        make_reject_map so that we know whether the word will be accepted
>>        in the first pass or not.   This move will PREVENT adaption to
>>        words containing double quotes because the word will not be
>>        identical to what tess thinks its best choice is. (See
>>        CurrentBestChoiceIs in danj/microfeatures/stopper.c which is used
>>        by AdaptableWord in danj/microfeatures/adaptmatch.c)
>>      */
>
> I must confess I'm not 100% sure what that means...
>

It means that whoever did this knew it was going to screw up text with quotes.
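
A toy model of what the comment describes, in Python. It assumes that a
later step merges two adjacent single quotes into one double quote, so the
final word no longer matches the classifier's best choice; that assumption
and the names merge_quote_pairs and is_adaptable are mine for illustration,
not the real stopper.c / adaptmatch.c code.

```python
def merge_quote_pairs(text: str) -> str:
    """Assumed post-processing step: two adjacent single quotes become one double quote."""
    return text.replace("''", '"')


def is_adaptable(best_choice: str, final_word: str) -> bool:
    """Toy check: adapt only if the final word still equals the best choice."""
    return best_choice == final_word


best_choice = "''hello''"                     # classifier output: pairs of single quotes
final_word = merge_quote_pairs(best_choice)   # word after the quotes are merged

# The strings differ, so this word is skipped for adaption.
quoted_word_adapted = is_adaptable(best_choice, final_word)

# A word without quotes is unchanged by the merge and stays adaptable.
plain_word_adapted = is_adaptable("hello", merge_quote_pairs("hello"))
```

Under that assumption, any word containing double quotes fails the
comparison and is never used for adaption, which matches the "PREVENT"
in the comment.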

> Thanks,
> --
> Phil.
> [email protected]
> http://www.philpem.me.uk/
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>



-- 
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

