Re: [Wikimedia-l] [Languages] Wikipedia article per speaker

2015-06-14 Thread Amir E. Aharoni
Dry article creation with little actual community interaction, such as
discussions, arguments and reverts, is problematic, but it does have one
overlooked advantage, which I myself didn't quite realize until just a few
months ago: a large body of texts that are known to correspond (a.k.a.
parallel texts) can be used by machine translation developers to build
statistical MT engines. Once an engine exists, it may make translation of
more articles easier and faster.

Creating enough articles to bootstrap MT can be a goal for a content
creation project. I'm not sure how many are enough: 10,000?

And either I missed it, or nobody mentioned it yet, but ahem ahem ahem
ContentTranslation. It is already helping Wikipedias in minorized languages
to create a lot of meaningful articles more easily, and with future
features like task lists and suggestions, it will be possible to use it for
tracking success conveniently.
On 8 June 2015 01:23, Milos Rancic mill...@gmail.com wrote:

 When you get data, at some point you start thinking about quite
 fringe comparisons. But those can actually yield some useful
 conclusions, as happened this time [1].

 We did the following:
 * Used the number of primary speakers from Ethnologue. (Erik Zachte
 uses the approximate number of primary + secondary speakers; that
 could be useful for correcting this data.)
 * Categorized languages by the order of magnitude (logarithm) of the
 number of speakers: >=10k, >=100k, >=1M, >=10M, >=100M.
 * Took the number of articles of the Wikipedia in each language and
 computed the ratio (number of articles / number of speakers).
 * The list consists only of languages with Ethnologue status 1
 (national), 2 (provincial) or 3 (wider communication). In fact, we
 have a lot of projects (more than 100) in languages with a worse
 status; a number of them are actually threatened or even on the edge
 of extinction.
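
A minimal sketch of the categorization and ratio steps above, in Python. The function names and the clamping of everything above 100M into the top category are assumptions made for illustration; they are not taken from the spreadsheet.

```python
def speaker_category(speakers: int, top: int = 100_000_000) -> int:
    """Return the lower bound of the power-of-ten category, e.g. 13_000 -> 10_000."""
    if speakers < 1:
        raise ValueError("speakers must be a positive integer")
    # Counting digits avoids floating-point log10 rounding on exact powers of ten.
    lower_bound = 10 ** (len(str(speakers)) - 1)
    # Assumption: everything at or above `top` falls into the top category.
    return min(lower_bound, top)


def articles_per_speaker(articles: int, speakers: int) -> float:
    """Ratio used in the list above: number of articles / number of speakers."""
    return articles / speakers
```

For example, `speaker_category(13_000)` gives `10_000`, so a language with thirteen thousand primary speakers would be compared only with other languages in the 10k category.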

 These are preliminary results and I will definitely have to go
 through all the numbers. I manually fixed some serious errors, like
 English Wikipedia itself being missing from the data :D

 Putting the languages into the logarithmic categories proved to be
 useful, as we are now able to compare the Wikipedias according to
 their gross capacity (number of speakers). I suppose somebody well
 versed in statistics could even create a function that could be used
 to check how well a project is doing, regardless of those strict
 categories.
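
One possible shape for such a function, sketched in Python: fit a straight line to log10(ratio) against log10(speakers) by ordinary least squares, then score each project by its distance from that line. The log-log linear model and the names `fit_log_log` and `score` are assumptions made for illustration, not something proposed in the thread.

```python
import math


def fit_log_log(points):
    """Least-squares fit of log10(ratio) = intercept + slope * log10(speakers).

    points: list of (speakers, ratio) pairs with positive values.
    Returns (slope, intercept).
    """
    xs = [math.log10(speakers) for speakers, _ in points]
    ys = [math.log10(ratio) for _, ratio in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def score(speakers, ratio, slope, intercept):
    """Positive: more articles per speaker than the fitted trend predicts."""
    expected = intercept + slope * math.log10(speakers)
    return math.log10(ratio) - expected
```

A score of about 0.3 would mean roughly twice as many articles per speaker as the trend predicts for a language of that size, with no dependence on category boundaries.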

 It's obvious that the more speakers a language has, the harder it is
 for the community to keep up the ratio.

 So, the winners per category are:
 1) >= 1k: Hawaiian, ratio 0.96900
 2) >= 10k: Mirandese, ratio 0.18073
 3) >= 100k: Basque, ratio 0.38061
 4) >= 1M: Swedish, ratio 0.21381
 5) >= 10M: Dutch, ratio 0.08305
 6) >= 100M: English, ratio 0.01447
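
Picking a winner per category can be sketched like this in Python. The function name and the row layout are assumptions, and the rows in the usage example are made-up placeholders, not figures from the spreadsheet.

```python
def winners_per_category(rows):
    """Return {category lower bound: (language, ratio)} with the best ratio per category.

    rows: iterable of (language, articles, speakers) tuples.
    """
    best = {}
    for language, articles, speakers in rows:
        # Power-of-ten category by digit count, e.g. 13_000 -> 10_000.
        category = 10 ** (len(str(speakers)) - 1)
        ratio = articles / speakers
        if category not in best or ratio > best[category][1]:
            best[category] = (language, ratio)
    return dict(sorted(best.items()))


# Placeholder rows for illustration only (hypothetical languages and counts).
example_rows = [
    ("lang_a", 500, 2_000),
    ("lang_b", 900, 4_000),
    ("lang_c", 3_000, 30_000),
]
print(winners_per_category(example_rows))
```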

 However, keep in mind that we removed languages outside status
 categories 1, 2 or 3. That affected the >=10k languages: for example,
 Upper Sorbian does much better than Mirandese (0.67). (I will fix it
 while creating the full report. Obviously, in this case the
 logarithmic categories of numbers of speakers are much more important
 than the status of the language.)

 It's obvious that we could draw a line from 1:1 for 1-10k speakers to
 10:1 for >=100M speakers. But, again, I would like to get input from
 somebody more competent.

 One very important category is missing here: the level of development
 of the speakers. That could be added: GDP/PPP per capita for the
 country or countries where the language is spoken would be a useful
 measure. And I suppose somebody with statistical knowledge would be
 able to give us a number that meaningfully expresses the ability to
 create a Wikipedia article.

 Completed in this way, it would let us measure the success of
 particular Wikimedia groups and organizations. OK, articles per
 speaker are not the only way to do so; we could use other parameters
 as well: number of new/active/very active editors etc. And we could
 track them over time.

 I'll produce some other results. And as a reminder: I'd like to have
 a formula for the ability to create a Wikipedia article, and then to
 derive from it the level of a particular community's success in
 creating Wikipedia articles. And, of course, to apply it to editors.

 [1]
 https://docs.google.com/spreadsheets/d/1TYyhETevEJ5MhfRheRn-aGc4cs_6k45Gwk_ic14TXY4/edit?usp=sharing

 ___
 Languages mailing list
 langua...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/languages

___
Wikimedia-l mailing list, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe

Re: [Wikimedia-l] [Languages] Wikipedia article per speaker

2015-06-14 Thread Milos Rancic
On Sun, Jun 14, 2015 at 5:17 PM, Amir E. Aharoni
amir.ahar...@mail.huji.ac.il wrote:
 And either I missed it, or nobody mentioned it yet, but ahem ahem ahem
 ContentTranslation. It is already helping Wikipedias in minorized languages
 to create a lot of meaningful articles more easily, and with future features
 like task lists and suggestions, it will be possible to use it for tracking
 success conveniently.

Just a short note here... The complexity of the task, which I think I
comprehend, is so great that I made what is, from my own perspective,
the lamest mistake. Please note that the page Names of Wikimedia
languages [1] assumes that there is only one variant of Serbian
(although some languages have a full four written variants in Serbian:
Немачка / Nemačka / Њемачка / Njemačka).

So, yes, ContentTranslation. (To be honest, one of my priorities
should be to actually see how it works...) Besides the tools (and I
think there are some other tools, as well), there is a lot of
documentation, which should be gathered into one user-friendly
howto.

Correlating Wikimedia projects data with data about languages is not
a simple task. For the languages, we know which information we need,
but we often don't have enough data; for Wikimedia, we have data, but
we often don't know what to do with it. And the biggest danger in
dealing with such sets is to both lack enough data and not know what
to do with it.

The lack of reliable data about languages could be worked around with
the necessary approximations while we search for more relevant data.
The other part, knowing what we should do with the data, could easily
be fixed by sharing ideas here. That's the main reason why I am
sharing work in progress here.

(Now back to linking languages: 208th Wikipedia edition by size,
Karachay-Balkar...)

[1] https://meta.wikimedia.org/wiki/Names_of_Wikimedia_languages


Re: [Wikimedia-l] [Languages] Wikipedia article per speaker

2015-06-14 Thread Milos Rancic
On Sun, Jun 14, 2015 at 5:38 PM, Milos Rancic mill...@gmail.com wrote:
 One more lame mistake: It's not about countries, but about languages.
 Thus: немачки, njemački, њемачки, njemački,

Khm... немачки, nemački, њемачки, njemački,

Re: [Wikimedia-l] [Languages] Wikipedia article per speaker

2015-06-14 Thread Milos Rancic
On Sun, Jun 14, 2015 at 5:35 PM, Milos Rancic mill...@gmail.com wrote:
 Just a short note here... The complexity of the task, which I think I
 comprehend, is so great that I made what is, from my own perspective,
 the lamest mistake. Please note that the page Names of Wikimedia
 languages [1] assumes that there is only one variant of Serbian
 (although some languages have a full four written variants in Serbian:
 Немачка / Nemačka / Њемачка / Njemačka).

One more lame mistake: It's not about countries, but about languages.
Thus: немачки, njemački, њемачки, njemački,


Re: [Wikimedia-l] [Languages] Wikipedia article per speaker

2015-06-12 Thread Milos Rancic
Illario, Latin doesn't have L1 speakers. And data about languages is such
a mess that I would stick with Ethnologue's data for L1 speakers, although
it is not reliable. Ethnologue counts 100,000 speakers of language X in
country A and 34 in country B, and thus 100,034 speakers in total (although
the likely error margin of the first number is 150 times larger than the
second number), and it has numerous other flaws, such as the fringe
macrolanguage category. However, besides counting the same way, English
Wikipedia has much worse failures once we leave the safety of the ~50 major
languages, wherever it is not based on Ethnologue's data. (It's mostly
about the wishful thinking of ethnic nationalists and a chronic lack of
manpower to fix that bullshit promptly.)

Nemo, yes, I was thinking about various data besides article count and
GDP/PPP per capita, so here are my thoughts, including those two
parameters:

* Article count per speaker gives one nice pseudo-hyperbolic curve.
Basically, you can see a hyperbolic curve by drawing a line over the
highest points: Hawaiian-Upper Sorbian-Basque-Swedish-Dutch-English. By
normalizing the numbers, we could get targets per language.

* However, edit count seems like a better idea. I think, though it has to
be proved, that such numbers won't have to be adjusted for the number of
speakers themselves.

* We could count various numbers related to users. For example, it seems
that the smaller the ratio between the number of active and very active
users, the healthier the community. Also, the number of editors per
million speakers per GDP or HDI could be a useful parameter.

* I was thinking yesterday about HDI. But then I realized that it would
be good to create all of the possibly relevant charts and see what
information they bring. I am interested in comparing Wikipedia stats with
the Gini coefficient, for example.
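
The normalization mentioned in the first bullet could be sketched as follows in Python, taking the best ratio quoted so far in each speaker category as the upper line. The ratios are the ones given earlier in this thread, with Upper Sorbian's 0.67 used for the 10k category. The names `ENVELOPE` and `normalized_ratio` are illustrative, and dividing by the category's best value is only one assumed way to normalize.

```python
# Best articles-per-speaker ratio per speaker category, as quoted in this thread.
# Keys are the lower bounds of the categories (number of primary speakers).
ENVELOPE = {
    1_000: 0.96900,        # Hawaiian
    10_000: 0.67,          # Upper Sorbian
    100_000: 0.38061,      # Basque
    1_000_000: 0.21381,    # Swedish
    10_000_000: 0.08305,   # Dutch
    100_000_000: 0.01447,  # English
}


def normalized_ratio(ratio: float, category: int) -> float:
    """Scale a ratio so that the best project in its category scores 1.0."""
    return ratio / ENVELOPE[category]


# Mirandese (ratio 0.18073, 10k category) relative to Upper Sorbian.
print(normalized_ratio(0.18073, 10_000))
```

A per-language target could then be read off as the envelope value for the language's category, though a smooth curve through these points would avoid jumps at the category boundaries.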

And I will do that, after I finish the most frustrating part of the
job: drawing the lines between Wikipedia editions, Ethnologue data and
actual languages. The good news is that I am at roughly the 150th of ~280
Wikipedia editions and will likely finish during the next week. (After
almost eight years of dealing with this matter, whenever someone says that
there are two hundred eighty something Wikipedia languages or that there
are 7000 languages in the world, I reach for my revolver.)
 On Jun 12, 2015 20:51, Federico Leva (Nemo) nemow...@gmail.com wrote:

 Milos Rancic, 08/06/2015 00:23:

 And I suppose somebody with statistical knowledge would be able to
 give us a number that meaningfully expresses the ability to create a
 Wikipedia article.


 Why not use the human development index (HDI) as a factor? Also, instead
 of the number of articles I'd rather use database size or number of words.

 Nemo
