On Thu, 29 Aug 2002, Eric Muller wrote:

> For my personal use, I would like to acquire electronic dictionaries,
> principally for the major European languages, with the following
> characteristics:
>
> - reputable source
>
> - "raw" datafiles accessible - I appreciate the interfaces that
> dictionary vendors may provide, but I want to be able to write my own
> code to find the data I am looking for
>
> - the wordlist is the principal aspect; I can live without definitions.
>
> - "markup" about the structure of words, for things like hyphenation,
> etc. (or from which hyphenation can be derived)
>
> - some form of frequency count would be nice
>
> For example, I'd like to compute something like: "the average French
> character occupies x bytes in UTF-8", with average defined in sync with
> the frequency count. And I'd like to compute things like spelling
> changes introduced by hyphenation in Dutch.
>
> Any pointers?
>
> Thanks,
> Eric.

Friday, August 30, 2002

Eric,

I have no sources to suggest, just a comment. The average UTF-8 length
of a French word will depend to some extent on whether separate codes
are used for combining characters/diacritics or a single code for a
precomposed letter + diacritic combination. It will matter more if you
want the average length of Czech or Polish words. Fortunately
Vietnamese isn't European.
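
The effect of precomposed versus combining codes can be checked with a short sketch like the one below. It is only an illustration under stated assumptions: the word counts are made up, the function name is hypothetical, and "character" is taken to mean one code point of the precomposed (NFC) form, which is one reasonable definition among several.

```python
import unicodedata

def avg_utf8_bytes_per_char(word_counts, form):
    """Frequency-weighted average UTF-8 bytes per character.

    `form` is a Unicode normalization form: "NFC" (precomposed letters)
    or "NFD" (base letter followed by combining diacritics).
    Assumption: a "character" is one code point of the NFC form, so the
    denominator is the same for both forms and only the byte count varies.
    """
    total_bytes = 0
    total_chars = 0
    for word, count in word_counts.items():
        encoded = unicodedata.normalize(form, word).encode("utf-8")
        total_bytes += count * len(encoded)
        total_chars += count * len(unicodedata.normalize("NFC", word))
    return total_bytes / total_chars

# Made-up counts for illustration only, not real frequency data.
# "\u00e9t\u00e9" is the French word "ete" with acute accents on both e's.
sample_counts = {"\u00e9t\u00e9": 3, "le": 10}

nfc_average = avg_utf8_bytes_per_char(sample_counts, "NFC")
nfd_average = avg_utf8_bytes_per_char(sample_counts, "NFD")
print(f"NFC: {nfc_average:.3f} bytes/char, NFD: {nfd_average:.3f} bytes/char")
```

With a real frequency list in place of the sample counts, the gap between the two figures shows how much the choice of encoding form matters for a given language.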
Regards,
Jim Agenbroad ( [EMAIL PROTECTED] )

"It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams."
Adapted from a letter by Gabriel Garcia Marquez.

The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.

Addresses:
Office: Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S.
Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE,
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.