Hello all,

I was able to train some new fonts thanks to the help I got here.

The Wiki is somewhat vague when it comes to dictionaries.

It mentions a few dictionaries, as well as a concern about their licenses.

Both aspell and ispell come with word lists of different sizes. Ispell is 
simpler: there is an extra-large list and a medium list.

Inside the file american.med+ I see that some words are followed by a slash 
(/) and then some letters, such as MS. For instance:

abc
abdicate/DGNS
abdomen/MS
abdominal/Y
abduct/DGS
abduction/MS
abductor/MS

I wonder if that is going to generate bad results with the wordlist2dawg 
application. 
http://tesseract-ocr.googlecode.com/svn-history/trunk/doc/wordlist2dawg.1.html 
has some pointers saying that it should be used as:

wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset

In this case, if I am using ispell, then the command should be:

wordlist2dawg american.med+ eng.freq-dawg eng.unicharset
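
If the /FLAGS suffixes do turn out to be a problem, one option might be to 
strip them before running wordlist2dawg. Below is a minimal sketch, assuming 
that everything after the first slash on a line is an ispell affix flag and 
that the file can be read as latin-1. The function name strip_affix_flags 
and the output file name eng.words are made up for illustration. Note that 
this keeps only the base words, so derived forms that the flags stand for 
would be lost.

```python
import sys


def strip_affix_flags(lines):
    """Return unique base words, dropping any '/FLAGS' suffix.

    Assumption: everything after the first '/' is an ispell affix flag
    and not part of the word itself.
    """
    seen = set()
    words = []
    for line in lines:
        # Keep only the text before the first slash, if there is one.
        word = line.strip().split("/", 1)[0]
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python strip_flags.py american.med+ > eng.words
    # latin-1 is an assumption about the word list's encoding.
    with open(sys.argv[1], encoding="latin-1") as src:
        for word in strip_affix_flags(src):
            print(word)
```

The resulting eng.words would then take the place of american.med+ in the 
command above.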

But I wonder whether I should use the med+ file or the xlg file instead.

There are also instructions that take into account the length of each word:

wordlist2dawg -l <short> <long> WORDLIST DAWG lang.unicharset

Does anyone with experience with this have some pointers?

Thanks in advance.

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
