Hi All,
 
Brent Hecht here :-) This has been a really interesting discussion, and I 
wanted to chime in with a few notes.
 
The 99.2% is based on a quick script I wrote that looked at reciprocity among a 
sample of interlanguage links (ILLs) in 25 languages to address some questions 
that Denny and I brainstormed. It did not consider commented links, redirects, 
etc. However, in my lab’s published work that takes much more involved 
approaches to this problem, we also find that complex interlanguage link 
situations are the obvious minority, with simple 1:1 relationships being the 
norm. For instance, only 1% of connected components of the ILL graph (groups of 
articles linked together by ILLs) had more than one article per language 
edition in our 25-language dataset.
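
For anyone who wants to try this kind of check themselves, here is a minimal sketch of the two measurements mentioned above: link reciprocity and the share of connected components with more than one article per language. It is not the script or the pipeline from our papers. The function names, the (language, title) tuple representation, and the toy data are all assumptions made for illustration, and it ignores redirects and commented-out links.

```python
from collections import defaultdict

def reciprocity(links):
    """Fraction of directed ILLs (src, dst) whose reverse link also exists."""
    link_set = set(links)
    if not link_set:
        return 0.0
    reciprocated = sum(1 for (a, b) in link_set if (b, a) in link_set)
    return reciprocated / len(link_set)

def components(articles, links):
    """Connected components of the ILL graph (links treated as undirected)."""
    parent = {a: a for a in articles}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    groups = defaultdict(set)
    for a in parent:
        groups[find(a)].add(a)
    return list(groups.values())

def is_one_to_one(component):
    """True if the component has at most one article per language edition."""
    langs = [lang for (lang, _title) in component]
    return len(langs) == len(set(langs))

# Toy data, invented for illustration: articles are (language, title) pairs.
articles = [
    ("en", "Móði and Magni"), ("da", "Magni"), ("da", "Modi"),
    ("en", "River"), ("de", "Fluss"),
    ("en", "Solo"),  # a concept covered by a single language edition
]
links = [
    (("en", "Móði and Magni"), ("da", "Magni")),
    (("da", "Magni"), ("en", "Móði and Magni")),
    (("en", "Móði and Magni"), ("da", "Modi")),  # no reverse link in this toy set
    (("en", "River"), ("de", "Fluss")),
    (("de", "Fluss"), ("en", "River")),
]

comps = components(articles, links)
non_1to1 = [c for c in comps if not is_one_to_one(c)]
print(f"reciprocity: {reciprocity(links):.2f}")
print(f"non-1:1 components: {len(non_1to1)} of {len(comps)}")
```

On this toy data, 4 of the 5 links are reciprocated, and 1 of the 3 components (the one with two Danish articles) is non-1:1.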
 
There are some important details to note here. For instance, most connected 
components contain only one article (most concepts are covered by only a single 
language edition) and non-1:1 cases by definition involve more articles than 
1:1 cases (all other things being equal). Also, general concepts of global 
interest are likely disproportionately represented in the non-1:1 situations 
(e.g. river, canal, high school, diplomacy). 
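
To make the first two caveats concrete, here is a small sketch of why a 1% share of components can correspond to a larger share of articles. The component counts and sizes below are entirely hypothetical and are not figures from our dataset; the function name is also just an illustration.

```python
def non_one_to_one_shares(component_sizes):
    """Given (n_articles, is_one_to_one) per component, return the non-1:1
    share of components and the non-1:1 share of articles."""
    total_components = len(component_sizes)
    total_articles = sum(n for n, _ in component_sizes)
    non_1to1 = [n for n, ok in component_sizes if not ok]
    return len(non_1to1) / total_components, sum(non_1to1) / total_articles

# Hypothetical mix: many single-article components, some small 1:1
# components, and a few larger non-1:1 components.
toy = [(1, True)] * 700 + [(3, True)] * 290 + [(8, False)] * 10

component_share, article_share = non_one_to_one_shares(toy)
print(f"non-1:1 share of components: {component_share:.1%}")
print(f"non-1:1 share of articles:   {article_share:.1%}")
```

With these made-up numbers, the non-1:1 components are 1.0% of all components but contain about 4.8% of all articles (80 of 1,650).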
 
That said, given our data, I also think Denny is spot on with the “let’s start 
with the 1:1s” approach to building Wikidata, with solutions for more complex 
situations coming later. These solutions could be fascinating and important, 
but make sense as a second step, IMHO. Given that each language edition will be 
able to pick and choose from statements (last I checked, at least), this might 
provide additional flexibility as well, allowing greater variation to be 
included in the 1:1 model.
 
If interested, I'd encourage folks to check out our CHI 2012 paper [1], as well 
as some excellent work done by Gerard de Melo and Gerhard Weikum that preceded 
us [3]. De Melo and Weikum establish an interesting taxonomy for causes of 
non-1:1 links: conceptual drift, different granularities, and mistakes made by 
editors.
 
In my view, perhaps a greater problem is the one of missing interlanguage 
links, which I hope Wikidata’s popularity will help to solve. We’ve done some 
work to show that missing links can be somewhat substantial between certain 
language editions [2], although that was based on data from 2009.
 
It's important to note, too, that some of the differences in coverage of a 
given concept across articles in different languages are addressed not with 
ILLs, but simply by describing concepts differently in each language edition. 
We call this "sub-concept diversity", and it can be substantial [2]. Our CHI 
2012 paper describes a system we built, Omnipedia, that allows folks to browse 
the content about a single concept in 25 language editions. We’re hoping to 
launch the system sometime soon, but we have some practical considerations to 
deal with first (funding, finishing my thesis, etc. :-)).
 
Lastly, I've been digging into the social science of this stuff a bit lately 
and many folks believe that, as Ziko said, different languages "divide 
knowledge in different ways" (even apart from any effects introduced in the 
Wikipedia context specifically). For instance, the linguist Anna Wierzbicka 
talks about the granularity differences as "cultural elaboration" and has all 
sorts of fun examples in her book "Understanding Cultures Through Their 
Keywords". You can also make arguments about this from a geographic perspective 
(my social science roots), psycholinguistics, and I'm sure other fields as 
well. This stuff perhaps explains some of the 1% of non-1:1 concepts, as well 
as some of the sub-concept diversity, although I am still brainstorming.
 
In any case, hopefully this helps some! Happy to answer questions. Thanks again 
for a great discussion.
 
-       Brent

p.s. Don't forget to support Denny and crew in the Knight News Challenge 
proposal :-) : 
http://newschallenge.tumblr.com/post/25575917516/wikidata-as-a-central-free-repository-of-identifiers

Brent Hecht
Ph.D. Candidate in Computer Science @ Northwestern University
Asst. Prof. of Comp. Sci @ Univ. Minnesota beginning 2013
w: http://www.brenthecht.com
e: br...@u.northwestern.edu
t: @bhecht
 
[1] Bao, P., Hecht, B., Carton, S., Quaderi, M., Horn, M. and Gergle, D. 
2012. Omnipedia: Bridging the Wikipedia Language Gap. CHI ’12: 30th 
International Conference on Human Factors in Computing Systems (2012).
[2] Hecht, B. and Gergle, D. 2010. The Tower of Babel Meets Web 2.0: 
User-Generated Content and Its Applications in a Multilingual Context. CHI 
’10: 28th International Conference on Human Factors in Computing Systems 
(Atlanta, GA, 2010), 291–300.
[3] de Melo, G. and Weikum, G. 2010. Untangling the Cross-Lingual Link 
Structure of Wikipedia. ACL ’10: 48th Annual Meeting of the Association for 
Computational Linguistics (Uppsala, Sweden, 2010).

On Jun 26, 2012, at 7:56 AM, Denny Vrandečić wrote:

> I got the number from Brent Hecht, a researcher at Northwestern, who
> has a number of great papers published on Wikipedia-related topics.
> 
> CC-ing him, so he knows I am blam.., er, referencing him :)
> 
> Cheers,
> Denny
> 
> 
> 
> 2012/6/26 Martijn Hoekstra <martijnhoeks...@gmail.com>:
>> This number, 99.2% was also mentioned on the Berlin Hackathon. It
>> sounds much higher than what my (very scientifically relevant,
>> obviously) gut feeling tells me. Could you indicate where this number
>> is coming from?
>> 
>> On Tue, Jun 26, 2012 at 2:45 PM, Denny Vrandečić
>> <denny.vrande...@wikimedia.de> wrote:
>>> Ziko,
>>> 
>>> it does not jeopardize the Wikidata goal -- the current language link
>>> system won't be switched off, but can be further used. Everything that
>>> is working currently will still be possible afterwards. Wikidata can
>>> still be used to represent the 99.2% of language links that are simple
>>> -- this would still be a huge improvement over the current state.
>>> 
>>> As soon as these are out of the way, we can think about if and how to
>>> extend the system in order to deal with the rest.
>>> 
>>> Cheers,
>>> Denny
>>> 
>>> 2012/6/25 Ziko van Dijk <vand...@wmnederland.nl>:
>>>> Hello,
>>>> 
>>>> So may I guess that "double links" are usually the result of a
>>>> Wikipedian who was not sure which language link to set, so in doubt,
>>>> he simply put in the language links for two different articles?
>>>> 
>>>> And in general, is it imaginable that different languages divide the
>>>> knowledge in different ways, which could jeopardize the whole goal of
>>>> Wikidata unifying the language links?
>>>> 
>>>> Kind regards
>>>> Ziko
>>>> 
>>>> 
>>>> 2012/6/25 Delirium <delir...@hackish.org>:
>>>>> Thanks for this list. For the languages I know, I've started going through
>>>>> and fixing ones that are clearly wrong. If a number of people do that, that
>>>>> should improve the general quality/consistency of interwiki links. I second
>>>>> the other comment that it'd be nice if the parsing could be re-run to
>>>>> exclude commented-out links, but the list is still useful as is.
>>>>> 
>>>>> There are some difficult cases, though, when languages make different
>>>>> choices on how to group subjects, so the articles aren't actually in 1-to-1
>>>>> correspondence. For example, the English article [[en:Móði and Magni]]
>>>>> unsurprisingly has two outgoing interwiki links, when linking to languages
>>>>> that split them, such as [[da:Magni]] and [[da:Modi]]. It's not clear what
>>>>> to do about these cases.
>>>>> 
>>>>> Best,
>>>>> Mark
>>>>> 
>>>>> 
>>>>> On 6/25/12 12:29 PM, Denny Vrandečić wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> I ran some analysis last week, to get some numbers out of the
>>>>>> Wikipedia language links. One type of report that was generated was
>>>>>> the list of all articles in the main namespaces of the Wikipedias that
>>>>>> link to more than one article in another language edition of Wikipedia
>>>>>> (so-called double language links). There are not that many of them
>>>>>> (about 19,000 in total), split by language, all available here:
>>>>>> 
>>>>>> <http://simia.net/languagelinks/>
>>>>>> 
>>>>>> Double language links are not errors per se, but they contain a few
>>>>>> nuisances
>>>>>> * they lead to two links in the language links list that just look the
>>>>>> same (you have to hover over them to see that they link to different
>>>>>> languages), which is not really optimal from the user experience side
>>>>>> * they are not saved in the langlinks table and thus are ignored in
>>>>>> certain reports and also in the respective export
>>>>>> 
>>>>>> I am not sure how to reach out to the respective Wikipedia
>>>>>> communities, or if I should at all. Should I post to their respective
>>>>>> version of the village pump? Remembering from the time I was active on
>>>>>> the Croatian Wikipedia, I would have appreciated that list to check
>>>>>> the entries. I reckoned the wikipedia-l list would be the right place,
>>>>>> but that list looks rather dead.
>>>>>> 
>>>>>> Cheers,
>>>>>> Denny
>>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Wikimedia-l mailing list
>>>>> Wikimedia-l@lists.wikimedia.org
>>>>> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l
>>>> 
>>>> 
>>>> 
>>>> --
>>>> 
>>>> -----------------------------------------------------------
>>>> Vereniging Wikimedia Nederland
>>>> dr. Ziko van Dijk, voorzitter
>>>> http://wmnederland.nl/
>>>> 
>>>> Wikimedia Nederland
>>>> Postbus 167
>>>> 3500 AD Utrecht
>>>> -----------------------------------------------------------
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> Project director Wikidata
>>> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>> 
>>> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
>>> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
>>> unter der Nummer 23855 B. Als gemeinnützig anerkannt durch das
>>> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>>> 
>> 
> 
> 
> 

