DEAR SENDERS OF WIKIMEDIA ETC.: PLEASE STOP SENDING ME REQUESTS OR QUESTIONS ABOUT PARTICIPATING IN WIKIPEDIA. DELETE MY E-MAIL ADDRESS. I AM NOT INTERESTED!
On 30 Jul 2013, at 17:15, Mathieu Stumpf <[email protected]> wrote:

> On 2013-07-26 20:26, Amgine wrote:
>> The request is to create a web-based text corpus[1] from which to derive
>> frequencies and then compare with existing wiktionaries. Not a light
>> undertaking, but one which has been proposed and implemented previously
>> (e.g. Connel's Gutenberg project[2]).
>>
>> Generically speaking, someone would need to determine the appropriate
>> size of the corpus sample, its temporal currency, and the method of
>> creating and maintaining it. This isn't easy to do, and having no
>> strictures results in unwieldy and mostly irrelevant products like
>> Google's n-grams[3]. (On the other hand, if someone can figure out how
>> to filter n-grams usefully, it would mean we don't have to build our
>> own.)
>
> Actually, I think it would be interesting to have a trend history of
> word usage over the centuries (the current trend would also be
> interesting, but probably harder to implement). Wikisource could be
> used to achieve that.
>
>> Amgine
>>
>> [1] https://en.wikipedia.org/wiki/Linguistic_corpus
>> [2] https://en.wiktionary.org/wiki/User:Connel_MacKenzie/Gutenberg
>> [3] http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
>>
>> On 26/07/13 09:18, Lars Aronsson wrote:
>>> On 07/23/2013 11:23 AM, Mathieu Stumpf wrote:
>>>> Here is what I would like to do: generate reports which give, for a
>>>> given language, a list of words that are used on the web, each with
>>>> a number estimating its occurrences, but which are not in a given
>>>> wiktionary.
>>>>
>>>> How would you recommend implementing that within the Wikimedia
>>>> infrastructure?
>>>
>>> Some years back, I undertook to add entries for Swedish words in the
>>> English Wiktionary. You can follow my diary at
>>> http://en.wiktionary.org/wiki/User:LA2
>>>
>>> Among the things I did was to extract a list of all Swedish words
>>> that already had entries. The best way was to use CatScan to list
>>> entries in categories for Swedish words. Even if there is a page
>>> called "men", this doesn't mean the Swedish word "men" has an entry,
>>> because it could be the English word "men" that is on that page.
>>>
>>> Then I extracted all words from some known texts, e.g. novels, the
>>> Bible, government reports, and the Swedish Wikipedia, counting the
>>> number of occurrences of each word. Case significance is a bit
>>> tricky. There should not be an entry for lower-case "stockholm", so
>>> you can't just convert everything to lower case. But if a sentence
>>> begins with a capital letter, that word should not get a capitalized
>>> entry. Another tricky issue is abbreviations, which should keep the
>>> period, for example "i.e." rather than "i" and "e". But the period
>>> that ends a sentence should be removed. When splitting a text into
>>> words, I decided to keep all periods and initial capital letters,
>>> even if this leads to some false words.
>>>
>>> When you have word frequency statistics for a text, and a list of
>>> existing entries from Wiktionary, you can compute the coverage, and
>>> I wrote a little script for this. I found that the English Wiktionary
>>> already had Swedish entries covering 72% of the words in the Bible,
>>> and when I started to add entries for the most common of the missing
>>> words, I was able to increase this to 87% in just a single month
>>> (September 2010).
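A minimal sketch of the kind of coverage script Lars describes, assuming a plain-text corpus file and a one-title-per-line list of existing entries (both file arguments are hypothetical); it follows his stated choice of keeping all periods and initial capitals when tokenizing:

    import sys
    from collections import Counter

    def tokenize(text):
        # Lars's choice: keep all periods and initial capital letters,
        # even at the cost of some false words; strip other punctuation.
        words = []
        for w in text.split():
            w = w.strip('"\'()[]{},;:!?')
            if w:
                words.append(w)
        return words

    def coverage(corpus_path, entries_path):
        # entries_path holds one existing entry title per line,
        # e.g. as exported with CatScan.
        with open(entries_path, encoding='utf-8') as f:
            entries = {line.strip() for line in f}
        with open(corpus_path, encoding='utf-8') as f:
            freq = Counter(tokenize(f.read()))
        total = sum(freq.values())
        covered = sum(n for w, n in freq.items() if w in entries)
        missing = sorted(((n, w) for w, n in freq.items()
                          if w not in entries), reverse=True)
        return covered / total, missing

    if __name__ == '__main__':
        ratio, missing = coverage(sys.argv[1], sys.argv[2])
        print('token coverage: %.0f%%' % (100 * ratio))
        for n, w in missing[:50]:  # most frequent missing words first
            print(n, w)

The "missing" list is exactly the report Mathieu asked for: words frequent in the corpus but absent from the wiktionary, sorted so the most rewarding entries to write come first.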
>>> Many of the common words that were missing when I started were
>>> adverbs such as "thereof" and "herein", which occur frequently in any
>>> text but are not very exciting to write entries about. This
>>> statistics-based approach gave me a reason to add those entries.
>>>
>>> It is interesting to contrast a given text with a given dictionary in
>>> this way. The Swedish entries in the English Wiktionary are a
>>> different dictionary than the Swedish entries in the German or Danish
>>> Wiktionary. The kinds of words found in the Bible are different from
>>> those found in Wikipedia or in legal texts. There is no single,
>>> universal text corpus that we can aim to cover. Google has released
>>> its n-gram dataset. I'm not sure if it covers Swedish, but even if it
>>> does, it must differ from the corpus frequencies published by the
>>> Swedish Academy.
>>>
>>> It is relatively easy to extract a list of existing entries from
>>> Wiktionary (a sketch follows below). But preparing a given text
>>> corpus for frequency and coverage analysis takes more work.
>
> --
> Association Culture-Libre
> http://www.culture-libre.org/
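For the "extract a list of existing entries" step, a sketch using the standard MediaWiki API rather than CatScan; the category name and output file are assumptions (today's English Wiktionary has per-language categories such as Category:Swedish lemmas), so adjust them to the categories you actually want:

    import requests

    API = 'https://en.wiktionary.org/w/api.php'

    def category_titles(category):
        # Walk a category with the MediaWiki API, following the
        # standard 'continue' mechanism until all members are seen.
        params = {
            'action': 'query',
            'list': 'categorymembers',
            'cmtitle': category,
            'cmlimit': 'max',
            'format': 'json',
        }
        while True:
            data = requests.get(API, params=params).json()
            for member in data['query']['categorymembers']:
                yield member['title']
            if 'continue' not in data:
                break
            params.update(data['continue'])

    if __name__ == '__main__':
        # Hypothetical output file, matching the coverage sketch above.
        with open('swedish_entries.txt', 'w', encoding='utf-8') as f:
            for title in category_titles('Category:Swedish lemmas'):
                f.write(title + '\n')

The output feeds straight into the coverage script as its entries file.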
