Hi, I haven't seen it mentioned here before, so ...
In <URL:http://permalink.gmane.org/gmane.science.linguistics.corpora/13159>, Google has announced the public availability of Google Books corpora for several languages (English, Chinese, French, German, Hebrew, Russian, Spanish). The corpora are two years old (2009-07-15). The license is Creative Commons Attribution 3.0 Unported. The corpora contain only words that have been observed in at least 40 different books. For each word, frequencies are given per observed year.

But the years have to be taken with a grain of salt: searching for 'computer' in the German corpus with the Books Ngram Viewer <URL:http://ngrams.googlelabs.com/> and clicking the range '1800-1966' at the bottom turns up a computer lexicon dated 1902, which was obviously printed in 1992.

Additionally, the German corpus contains lots of typical OCR errors that I would have expected Google to handle better, e.g.:

  incorrect              correct
  ßrot                   Brot
  AVahrscheinlichkeit    Wahrscheinlichkeit

(Well, there are many such typical errors, but each with a low frequency, so in total they shouldn't skew the data significantly.)

A few numbers for the German corpus (the only one I have looked at so far):

* The list of 1-grams is 1 GB compressed, 5 GB uncompressed.
* The most frequent word is 'der' with a frequency of 1,167,791,242.
* The list contains 24 frequency classes, class 24 being incomplete (due to the 40-books limit).
* After consolidating the list (cumulating the frequencies of the same word over all years), there are 3.6 million words. The final list has a size of ca. 60 MB.
* The oldest books in the German corpus are from 1564.

Best regards,
Stephan Hennig
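P.S. In case anyone wants to reproduce the consolidation step, here is a minimal Python sketch. It assumes the per-year records come in a gzipped TSV with columns word, year, match_count, page_count, volume_count; the exact column layout of the downloadable files is an assumption on my part, so check it against the files you actually get.

```python
import csv
import gzip
from collections import defaultdict


def consolidate(path):
    """Sum per-year frequencies into one total per word.

    Assumed input format (TAB-separated, gzipped):
        word  year  match_count  page_count  volume_count
    Only `word` and `match_count` are used here.
    """
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            word, match_count = row[0], int(row[2])
            totals[word] += match_count  # cumulate over all years
    return totals
```

Sorting `totals.items()` by frequency afterwards gives the consolidated list described above.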
