DEAR SENDERS OF WIKIMEDIA ETC.
PLEASE STOP SENDING ME REQUESTS OR QUESTIONS ABOUT PARTICIPATING IN WIKIPEDIA
DELETE MY E-MAIL ADDRESS
I AM NOT INTERESTED!

On 30 Jul 2013, at 17:15, Mathieu Stumpf <[email protected]> wrote:

> On 2013-07-26 20:26, Amgine wrote:
>> The request is to create a web-based text corpus[1] from which to derive
>> frequencies and then compare with existing wiktionaries. Not a light
>> undertaking, but one which has been proposed and implemented previously
>> (e.g. Connel's Gutenberg project[2]).
>> 
>> Generically speaking, someone would need to determine the appropriate
>> size of the corpus sample, its temporal currency, and the method of
>> creating and maintaining it. This isn't easy to do, and having no
>> strictures results in unwieldy and mostly irrelevant products like
>> Google's n-grams[3] (on the other hand, if someone can figure out how to
>> filter n-grams usefully it would mean we don't have to build our own.)
> 
> Actually, I think it would be interesting to have a trend history of word 
> usage over centuries (current trends would also be interesting, but probably 
> harder to implement). Wikisource could be used to achieve that.
> 
>> 
>> Amgine
>> 
>> [1] https://en.wikipedia.org/wiki/Linguistic_corpus
>> [2] https://en.wiktionary.org/wiki/User:Connel_MacKenzie/Gutenberg
>> [3] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>> 
>> 
>> On 26/07/13 09:18, Lars Aronsson wrote:
>>> On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
>>>> Here is what I would like to do: generate reports which give, for
>>>> a given language, a list of words that are used on the web, with a
>>>> number estimating their occurrences, but which are not in a given
>>>> wiktionary.
>>>> 
>>>> How would you recommend implementing that within the Wikimedia
>>>> infrastructure?
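[Editor's note: the report Mathieu describes reduces to "frequent corpus words with no wiktionary entry". A minimal sketch, assuming you already have corpus counts and a set of existing entry titles (the function and parameter names here are hypothetical, not an existing Wikimedia tool):

```python
from collections import Counter

def missing_words_report(corpus_counts, wiktionary_entries, top_n=20):
    """Return the most frequent corpus words that lack a wiktionary entry.

    corpus_counts: Counter mapping word -> number of occurrences in the corpus.
    wiktionary_entries: set of words that already have entries.
    """
    # Keep only words with no entry, then rank them by frequency
    # so the highest-impact missing words come first.
    missing = Counter({w: c for w, c in corpus_counts.items()
                       if w not in wiktionary_entries})
    return missing.most_common(top_n)
```

The sorting by frequency matters: as Lars describes below, working down such a list is what let him raise coverage quickly.]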
>>> 
>>> Some years back, I undertook to add entries for
>>> Swedish words in the English Wiktionary. You can
>>> follow my diary at http://en.wiktionary.org/wiki/User:LA2
>>> 
>>> Among the things I did was to extract a list of all
>>> Swedish words that already had entries. The best
>>> way was to use CatScan to list entries in categories
>>> for Swedish words. Even if there is a page called
>>> "men", this doesn't mean the Swedish word "men"
>>> has an entry, because it could be the English word
>>> "men" that is in that page.
>>> 
>>> Then I extracted all words from some known texts,
>>> e.g. novels, the Bible, government reports, and the
>>> Swedish Wikipedia, counting the number of
>>> occurrences of each word. Case significance is
>>> a bit tricky. There should not be an entry for
>>> lower-case stockholm, so you can't just convert
>>> everything to lower case. But if a sentence begins
>>> with a capital letter, that word should not have
>>> a capitalized entry. Another tricky issue is
>>> abbreviations, which should keep the period,
>>> for example "i.e." rather than "i" and "e". But
>>> the period that ends a sentence should be removed.
>>> When splitting a text into words, I decided to keep
>>> all periods and initial capital letters, even if this
>>> leads to some false words.
>>> 
>>> When you have word frequency statistics for a text,
>>> and a list of existing entries from Wiktionary, you
>>> can compute the coverage, and I wrote a little
>>> script for this. I found that English Wiktionary already
>>> had Swedish entries covering 72% of the words in the
>>> Bible, and when I started to add entries for the most
>>> common of the missing words, I was able to increase
>>> this to 87% in just a single month (September 2010).
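[Editor's note: the "little script" Lars mentions is not published here, but the coverage figure he reports (72%, then 87%) is presumably token coverage: the fraction of running words in the text that have entries. A hedged sketch of that computation:

```python
def coverage(frequencies, entries):
    """Fraction of running words (tokens) covered by existing entries.

    frequencies: dict mapping word -> occurrence count in the text.
    entries: set of words that already have dictionary entries.
    """
    total = sum(frequencies.values())
    covered = sum(c for w, c in frequencies.items() if w in entries)
    # Weighting by frequency means covering a handful of very common
    # words moves the percentage far more than many rare ones.
    return covered / total if total else 0.0
```

The frequency weighting explains the jump from 72% to 87% in one month: adding entries for the most common missing words pays off disproportionately.]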
>>> 
>>> Many of the common words that were missing when
>>> I started were adverbs such as "thereof", "herein",
>>> which occur frequently in any text but are not very
>>> exciting to write entries about. This statistics-based
>>> approach gave me a reason to add those entries.
>>> 
>>> It is interesting to contrast a given text to a given
>>> dictionary in this way. The Swedish entries in the
>>> English Wiktionary form a different dictionary from the
>>> Swedish entries in the German or Danish Wiktionary.
>>> The kinds of words found in the Bible are different
>>> from those found in Wikipedia or in legal texts.
>>> There is not a single, universal text corpus that we
>>> can aim to cover. Google has released its ngram
>>> dataset. I'm not sure if it covers Swedish, but even
>>> if it does, it must differ from the corpus frequencies
>>> published by the Swedish Academy.
>>> 
>>> It is relatively easy to extract a list of existing entries
>>> from Wiktionary. But preparing a given text corpus
>>> for frequency and coverage analysis takes more work.
>> 
>> 
>> _______________________________________________
>> Wiktionary-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wiktionary-l
> 
> -- 
> Association Culture-Libre
> http://www.culture-libre.org/
> 


