Hi, I haven't seen it mentioned here before, so ...
In <URL:http://permalink.gmane.org/gmane.science.linguistics.corpora/13159>, Google has announced the public availability of Google Books corpora for several languages (English, Chinese, French, German, Hebrew, Russian, Spanish). The corpora are two years old (2009-07-15). The license is Creative Commons Attribution 3.0 Unported. The corpora contain only words that have been observed in at least 40 different books. For each word, frequencies are given per observed year.

But the years have to be taken with a grain of salt: searching for 'computer' in the German corpus with the Books Ngram Viewer <URL:http://ngrams.googlelabs.com/> and clicking the range '1800-1966' at the bottom turns up a computer lexicon dated 1902, which was obviously printed in 1992.

Additionally, the German corpus contains lots of typical OCR errors that I would have expected Google to handle better, e.g.:

  incorrect              correct
  ßrot                   Brot
  AVahrscheinlichkeit    Wahrscheinlichkeit

(Well, there are many such typical errors, but each with a low frequency, so in total they shouldn't skew the data significantly.)

A few numbers for the German corpus (the only one I have looked at so far):

* The list of 1-grams is 1 GB compressed, 5 GB uncompressed.
* The most frequent word is 'der' with a frequency of 1,167,791,242.
* The list contains 24 frequency classes, class 24 being incomplete (due to the 40-books limit).
* After consolidating the list (cumulating the frequencies of the same word over all years), there are 3.6 million words. The final list has a size of ca. 60 MB.
* The oldest books in the German corpus are from 1564.

Best regards,
Stephan Hennig
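P.S. In case anyone wants to reproduce the consolidation step, here is a minimal Python sketch. It assumes the per-year records come in a gzipped TSV with columns word, year, match_count, page_count, volume_count; the exact column layout of the downloadable files is an assumption on my part, so check it against the files you actually get.

```python
import csv
import gzip
from collections import defaultdict


def consolidate(path):
    """Sum per-year frequencies into one total per word.

    Assumed input format (TAB-separated, gzipped):
        word  year  match_count  page_count  volume_count
    Only `word` and `match_count` are used here.
    """
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            word, match_count = row[0], int(row[2])
            totals[word] += match_count  # cumulate over all years
    return totals
```

Sorting `totals.items()` by frequency afterwards gives the consolidated list described above.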
