On Thu, 29 Aug 2002, Eric Muller wrote:

> For my personal use, I would like to acquire electronic dictionaries, 
> principally for the major European languages, with the following 
> characteristics:
> 
> - reputable source
> 
> - "raw" datafiles accessible - I appreciate the interfaces that 
> dictionary vendors may provide, but I want to be able to write my own 
> code to find the data I am looking for
> 
> - the wordlist is the principal aspect; I can live without definitions.
> 
> - "markup" about the structure of words, for things like hyphenation, 
> etc. (or from which hyphenation can be derived)
> 
> - some form of frequency count would be nice
> 
> For example, I'd like to compute something like: "the average French 
> character occupies x bytes in UTF-8", with average defined in sync with 
> the frequency count. And I'd like to compute things like spelling 
> changes introduced by hyphenation in Dutch.
> 
> Any pointers?
> 
> Thanks,
> Eric.
                                             Friday, August 30, 2002
Eric,
    I have no sources to suggest, just a comment.  The average UTF-8
length of a French word will depend to some extent on whether each
diacritic is encoded as a separate combining character or as part of a
single precomposed letter + diacritic code. It will matter more if you
want the average length of Czech or Polish words. Fortunately Vietnamese
isn't European.
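
To make the difference concrete, here is a minimal Python sketch that
measures the same words in precomposed (NFC) and decomposed (NFD) form.
The names utf8_len, avg_bytes_per_letter and sample_counts are
hypothetical, and the frequency counts are invented purely for
illustration; a real run would read them from whatever wordlist Eric
ends up with.

```python
import unicodedata


def utf8_len(text, form):
    """UTF-8 byte length of text after normalizing to form ('NFC' or 'NFD')."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


def avg_bytes_per_letter(counts, form):
    """Frequency-weighted average UTF-8 bytes per letter.

    Assumption: a "letter" is one code point of the NFC form, so the
    denominator is the same whichever form the bytes are measured in.
    """
    total_bytes = 0
    total_letters = 0
    for word, count in counts.items():
        total_bytes += count * utf8_len(word, form)
        total_letters += count * len(unicodedata.normalize("NFC", word))
    return total_bytes / total_letters


# Invented counts, for illustration only (not real French frequencies).
sample_counts = {"\u00e9t\u00e9": 3, "le": 7}

for form in ("NFC", "NFD"):
    print(form, utf8_len("\u00e9t\u00e9", form),
          avg_bytes_per_letter(sample_counts, form))
```

With "été", the precomposed form takes 5 bytes and the decomposed form
takes 7, since each combining acute accent costs two bytes on top of
the one-byte base letter.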

     Regards,
          Jim Agenbroad ( [EMAIL PROTECTED] )
     "It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams." Adapted
from a letter by Gabriel Garcia Marquez.
     The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
     Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  

