On 27/07/10 17:30, Jimmy O'Regan wrote:
The Ubuntu wordlist is pretty big... 921 user-added words...

As wordlists go, that's tiny :)

Aye, but it's an exceptions list :)
Seems to contain a lot of fairly technical words and abbreviations which I assume aren't in the Tesseract base wordlist.

I grepped the code and it seems to be looking for something called
LANG.user-words, but that didn't seem to do anything -- I got the same
garbled text when I ran Tesseract 3 the second time.

Turns out T3 doesn't even access $LANG.user-words. I suspect it's looking for it in the traineddata file...

phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$
LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u
eng.traineddata eng
[...]
I never got around to playing with that. I'll have a look at it,
either later, or tomorrow.

Turns out the issue is that combine_tessdata wants the output prefix to end with a period. So 'eng' crashes it, but 'eng.' works fine (and produces a bunch of files in the current working directory).
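
For anyone hitting the same crash, here is a minimal sketch of the workaround. It assumes combine_tessdata is on your PATH and eng.traineddata is in the current directory. The ensure_dot_prefix helper is hypothetical (it is not part of Tesseract) and only appends the trailing period that the tool appears to expect:

```shell
# Hypothetical helper (not part of Tesseract): append a trailing period
# to the output prefix unless it already has one.
ensure_dot_prefix() {
  case "$1" in
    *.) printf '%s\n' "$1" ;;
    *)  printf '%s.\n' "$1" ;;
  esac
}

prefix=$(ensure_dot_prefix eng)   # "eng."

# Unpack only if the tool is installed and the traineddata file is present.
if command -v combine_tessdata >/dev/null 2>&1 && [ -f eng.traineddata ]; then
  combine_tessdata -u eng.traineddata "$prefix"
fi
```

With the prefix 'eng.', the unpacked components should land next to the traineddata file with names starting 'eng.'; I haven't checked whether other builds behave the same way.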

Lots of new features, lots of new bugs.

Ain't it always the way...

I can scan a few more issues of the journal in question -- as I said
previously, I've got the full run from 1974 through present (with 1990
onwards on DVD), and every issue up to about 1976 uses a table of contents
with a similar format.

Cool, thanks.

No problem. I just need to clear some space on the table and set the scanner up first...

     /*
        The adaption step used to be here. It has been moved to after
        make_reject_map so that we know whether the word will be accepted in the
        first pass or not.   This move will PREVENT adaption to words containing
        double quotes because the word will not be identical to what tess thinks
        its best choice is. (See CurrentBestChoiceIs in
        danj/microfeatures/stopper.c which is used by AdaptableWord in
        danj/microfeatures/adaptmatch.c)
      */

I must confess I'm not 100% sure what that means...

Thanks,
--
Phil.
[email protected]
http://www.philpem.me.uk/
