Re: [Wikitech-l] Your questions and votes for our CTO and VP of Product Q&A

2016-12-30 Thread Gergo Tisza
On Tue, Dec 20, 2016 at 12:45 AM, Quim Gil  wrote:

> The questions for this session are being crowdsourced at
> http://www.allourideas.org/wikidev17-product-technology-questions. Anyone
> can propose questions and vote, anonymously, as many times as they want. At
> the moment, we have 25 questions and 451 votes.
>
> An important technical detail: questions posted later also have a good
> chance of making it to the top of the list, as long as new voters select
> them. The ranking is based on comparisons between questions, not on
> accumulated vote counts. For instance, the current top question is in fact
> one of the last to have been submitted so far.
>

Right now the top question has a score of 70 based on 88 votes; the second
question has a score of 67 based on 1 vote. (This is not some super-rare
accident, either: numbers 8 and 9 on the popularity list both have 4 votes.)
I argued that All Our Ideas is too experimental to be relied on back when
it was considered as the voting tool for an early iteration of what ended
up being the Community Tech Wishlist, and I still think that's the case.

The way their voting system works is that they assume each idea has some
appeal (an arbitrary real number) for each voter, that the appeals for a
given idea are normally distributed across voters, and that when a voter is
shown a question pair, their probability of voting a given way is a certain
function of the difference in appeals. They then use various statistical
methods to sample random values for the appeals that are consistent with the
observed votes, and using those values they can calculate, for each question,
the probability that a randomly selected voter would prefer it to a randomly
selected alternative; those probabilities are used to score the questions.
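
To make that concrete, here is a minimal toy sketch in Python of this kind of
pairwise-comparison model (my own reconstruction for illustration, with
made-up priors and a deliberately crude sampler; it is not All Our Ideas'
actual code):

import math
import random

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def score(mu, i):
    """Probability that a random voter prefers question i to a randomly
    selected other question, given appeal means mu."""
    others = [j for j in range(len(mu)) if j != i]
    return sum(phi(mu[i] - mu[j]) for j in others) / len(others)

def crude_posterior_sample(votes, n_questions, steps=20000):
    """Very rough Metropolis sampler for the appeal means, with a standard
    normal prior; votes is a list of (winner, loser) index pairs."""
    def log_post(m):
        lp = sum(-0.5 * x * x for x in m)              # prior
        lp += sum(math.log(phi(m[w] - m[l]) + 1e-12)   # vote likelihood
                  for w, l in votes)
        return lp
    mu = [random.gauss(0, 1) for _ in range(n_questions)]
    current = log_post(mu)
    for _ in range(steps):
        k = random.randrange(n_questions)
        proposal = mu[:]
        proposal[k] += random.gauss(0, 0.3)
        candidate = log_post(proposal)
        if math.log(random.random()) < candidate - current:
            mu, current = proposal, candidate
    return mu

# Question 0 beat question 1 in 50 comparisons; question 2 beat question 3 in
# a single comparison. With so little data on questions 2 and 3, their sampled
# appeals stay close to the prior, so their scores are largely random.
random.seed(0)
votes = [(0, 1)] * 50 + [(2, 3)]
mu = crude_posterior_sample(votes, n_questions=4)
print([round(score(mu, i), 2) for i in range(4)])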

That means that the scores can be heavily underspecified (i.e. they mostly
result from the random numbers generated by the algorithm rather than from
actual votes) for some questions; this is especially true for recently
submitted questions, which have very few votes and so end up in an
essentially random position in the ranking. As far as I can see, the journal
article [1] where they present their method doesn't discuss this problem at
all. This is not terribly useful as a real-world ranking model IMO, so I hope
that 1) there will be some human oversight when evaluating the results, and
2) we don't intend to use this system for any voting that actually matters
(getting weirdly prioritized results for a Q&A session is, of course, not a
huge deal).


[1] http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0123483
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Now live: Shared structured data

2016-12-30 Thread mathieu stumpf guntz


Since this is, to my mind, a very interesting topic, I searched a bit more:

https://www.w3.org/International/articles/article-text-size.en
which quotes 
http://www-01.ibm.com/software/globalization/guidelines/a3.html


According to the latter, for English source strings that are over 70
characters, you might expect a 130% average expansion. So, with an
admittedly very loose inference, the 400 character limit for all languages
is roughly equivalent to a 307 character limit for English. Would you say it
would seem OK to have a 307 character limit there?
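
(The 307 above is just 400 divided by the 1.3 expansion ratio; a quick Python
check, with the ratio taken from the IBM guideline:)

limit = 400
expansion_ratio = 1.30  # translated text averages ~130% of the English length
print(int(limit / expansion_ratio))  # 307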



On 29/12/2016 at 12:11, mathieu stumpf guntz wrote:



On 28/12/2016 at 23:08, Yuri Astrakhan wrote:

The 400 char limit is to be in sync with Wikidata, which has the same
limitation. The origin of this limit is to encourage storage of "values"
rather than full strings (sentences).
Well, that's probably not the best constraint for a glossary then. To my
mind, a 400 char limit regardless of the language is rather surprising.
Surely you can tell much more with a set of 400 ideograms than with, well,
whatever language happens to have the longest average sentence length (any
idea?). Also, at least for some translation pairs, there is a tendency for
translations to be longer than the original [1].


[1] http://www.sid.ir/en/VEWSSID/J_pdf/53001320130303.pdf

Also, it discourages storage of wiki markup.
What about disallowing it explicitly? You could even enforce that with a
quick parse that prevents saving, or simply show a reminder when such a
string is detected, so as not to block users in legitimate corner cases.
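
Something as simple as the sketch below would catch the obvious cases (a
hypothetical illustration in Python, not part of any existing extension; the
patterns and names are mine):

import re

WIKI_MARKUP = re.compile(
    r"\[\[.*?\]\]"          # internal links
    r"|\{\{.*?\}\}"         # templates
    r"|'''|''"              # bold / italics
    r"|<\s*/?\s*\w+[^>]*>"  # HTML-ish tags
)

def looks_like_wikitext(value):
    return bool(WIKI_MARKUP.search(value))

# Warn rather than hard-block, so legitimate corner cases stay editable.
for cell in ["plain value", "see [[Glossary]]", "{{convert|5|km}}"]:
    if looks_like_wikitext(cell):
        print("warning: cell looks like wiki markup: %r" % cell)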




On Wed, Dec 28, 2016, 16:45 mathieu stumpf guntz <
psychosl...@culture-libre.org> wrote:

Thank you Yuri. Is there some rational explanation behind these limits? I
understand the limit as far as performance concerns go, and 2 MB already
seems very large for the intended glossaries. But 400 chars might be
problematic for some definitions I guess, especially since translations can
lead to varying length needs.


On 25/12/2016 at 17:03, Yuri Astrakhan wrote:

Hi Mathieu, yes, I think you can totally build up this glossary in a
dataset. Just remember that each string can be no longer than 400 chars,
and the total size must stay under 2 MB.
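
(As an aside, a crude pre-upload check against those two limits might look
like the snippet below; it assumes the data is a list of rows of cells and
that the 2 MB cap applies to the serialized JSON, both of which are my
assumptions rather than documented behaviour.)

import json

MAX_STRING_CHARS = 400
MAX_TOTAL_BYTES = 2 * 1024 * 1024  # assuming the cap is on serialized size

def within_limits(rows):
    # every string cell must respect the per-string limit
    for row in rows:
        for cell in row:
            if isinstance(cell, str) and len(cell) > MAX_STRING_CHARS:
                return False
    # and the dataset as a whole must respect the total size limit
    total = len(json.dumps(rows, ensure_ascii=False).encode("utf-8"))
    return total <= MAX_TOTAL_BYTES

print(within_limits([["term", "a short definition"],
                     ["x" * 500, "fails the per-string limit"]]))  # False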

On Sun, Dec 25, 2016, 10:45 mathieu stumpf guntz <
psychosl...@culture-libre.org> wrote:


Hi Yuri,

Seems very interesting. Am I wrong in thinking this could help to create a
multilingual glossary as drafted in
https://phabricator.wikimedia.org/T150263#2860014 ?


On 22/12/2016 at 20:30, Yuri Astrakhan wrote:

Gift season! We have launched structured data on Commons, available from
all wikis.

TL;DR: One data store. Use everywhere. Upload table data to Commons, with
localization, and use it to create wiki tables and lists, or use it directly
in graphs. Works for GeoJSON maps too. Must be licensed as CC0. Try this
per-state GDP map demo, and select multiple years. More demos at the bottom.

US Map state highlight

Data can now be stored as *.tab and *.map pages in the data namespace on
Commons. That data may contain localization, so a table cell can be in
multiple languages. And that data is accessible from any wiki, by Lua
scripts, Graphs, and Maps.

Lua lets you generate wiki tables from the data by filtering, converting,
mixing, and formatting the raw data. Lua also lets you generate lists, or
any other wiki markup.
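
To give a feel for what such a page contains, here is a small Python sketch
that reads one from outside the wiki (my own approach, not the Lua interface
described above; the page name is hypothetical, and the "schema"/"data"
field names reflect my reading of the .tab format, so treat them as
assumptions):

import json
import urllib.parse
import urllib.request

PAGE = "Data:Sandbox/Example.tab"  # hypothetical page name

url = ("https://commons.wikimedia.org/w/index.php?"
       + urllib.parse.urlencode({"title": PAGE, "action": "raw"}))
with urllib.request.urlopen(url) as resp:
    page = json.loads(resp.read().decode("utf-8"))

# Tabular pages are stored as JSON; print the declared columns and a few rows.
for field in page.get("schema", {}).get("fields", []):
    print(field.get("name"), field.get("type"))
for row in page.get("data", [])[:5]:
    print(row)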

Graphs can use both .tab and .map directly to visualize the data and let
users interact with it. The GDP demo above uses a map from Commons, and
colors each segment based on a data table.

Kartographer can use the .map data as an extra layer on top of the base
map. This way we can show an endangered species' habitat.

== Demo ==
* Raw data example
* Interactive Weather data
* Same data in Weather template
* Interactive GDP map
* Endangered Jemez Mountains salamander - habitat
* Population history
* Line chart

== Getting started ==
* Try creating a page at data:Sandbox/.tab on Commons. Don't forget the
.tab extension, or it won't work.
* Try using some data with the Line chart graph template
A thorough guide is needed, help is welcome!

== Documentation links ==
* Tabular help
* Map help
If you find a bug, create a Phabricator ticket with the #tabular-data tag,
or comment on the documentation talk pages.

== FAQ ==
* Relation to Wikidata:  Wikidata is about "facts"