On 27 July 2010 20:49, Philip Pemberton <[email protected]> wrote:
> On 27/07/10 17:30, Jimmy O'Regan wrote:
>>>
>>> The Ubuntu wordlist is pretty big... 921 user-added words...
>>
>> As wordlists go, that's tiny :)
>
> Aye, but it's an exceptions list :)
> Seems to contain a lot of fairly technical words and abbreviations which I
> assume aren't in the Tesseract base wordlist.
>
Yeah, that's a reasonable assumption.

>>> I grepped the code and it seems to be looking for something called
>>> LANG.user-words, but that didn't seem to do anything -- I got the same
>>> garbled text when I ran Tesseract 3 the second time.
>
> Turns out T3 doesn't even access $LANG.user-words. I suspect it's looking
> for it in the traineddata file...
>

Hmm... probably... which is quite a stupid thing to do, really, but I
presume nobody in Google actually uses this, so it's probably quite
neglected.

I'm toying with the idea of adding support for an actual *user* list -
i.e., that tesseract would check $HOME/.tesseract/lang.user-words -
because assuming a single user system that the user has full control
over is still a braindamaged assumption.

>>> phil...@cheetah:~/tesseract/tesseract-ocr-hg-trunk/tessdata$
>>> LD_LIBRARY_PATH=/tmp/tess/lib /tmp/tess/bin/combine_tessdata -u
>>> eng.traineddata eng
>
> [...]
>>
>> I never got around to playing with that. I'll have a look at it,
>> either later, or tomorrow.
>
> Turns out the issue is that combine_tessdata wants the prefix to end with a
> period. So 'eng' crashes it, but 'eng.' works fine (and produces a bunch of
> files in the CSD).
>

I should fix that, so it doesn't become the "tesseract only accepts
'.tif'" thing all over again.

>> Lots of new features, lots of new bugs.
>
> Ain't it always the way...
>
>>> I can scan a few more issues of the journal in question -- as I said
>>> previously, I've got the full run from 1974 through present (with 1990
>>> onwards on DVD), and every issue up to about 1976 uses a table of
>>> contents with a similar format.
>>
>> Cool, thanks.
>
> No problem. I just need to clear some space on the table and set the scanner
> up first...
>
>> /*
>> The adaption step used to be here. It has been moved to after
>> make_reject_map so that we know whether the word will be accepted
>> in the
>> first pass or not.
>> This move will PREVENT adaption to words containing
>> double quotes because the word will not be identical to what tess thinks
>> its best choice is. (See CurrentBestChoiceIs in
>> danj/microfeatures/stopper.c which is used by AdaptableWord in
>> danj/microfeatures/adaptmatch.c)
>> */
>
> I must confess I'm not 100% sure what that means...
>

It means that whoever did this knew it was going to screw up text with quotes.

> Thanks,
> --
> Phil.
> [email protected]
> http://www.philpem.me.uk/
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en.
>
>

--
<Leftmost> jimregan, that's because deep inside you, you are evil.
<Leftmost> Also not-so-deep inside you.

