The next Research Showcase will be live-streamed this Wednesday, June 21,
2017 at 11:30 AM (PDT) / 18:30 UTC.

YouTube stream: https://www.youtube.com/watch?v=i2jpKRwPT-Q

You can join the conversation on IRC at #wikimedia-research.
You can watch our past research showcases here.

This month's presentations:

Title: Problematizing and Addressing the Article-as-Concept Assumption in

By *Allen Yilun Lin*

Abstract: Wikipedia-based studies and systems frequently assume that each
article describes a separate concept. However, in this paper, we show that
this article-as-concept assumption is problematic due to editors’ tendency
to split articles into parent articles and sub-articles when articles get
too long for readers (e.g. “United States” and “American literature” in the
English Wikipedia). In this paper, we present evidence that this issue can
have significant impacts on Wikipedia-based studies and systems and
introduce the sub-article matching problem. The goal of the sub-article
matching problem is to automatically connect sub-articles to parent
articles to help Wikipedia-based studies and systems retrieve complete
information about a concept. We then describe the first system to address
the sub-article matching problem. We show that, using a diverse feature set
and standard machine learning techniques, our system can achieve good
performance on most of our ground truth datasets, significantly
outperforming baseline approaches.

Title: Understanding Wikidata Queries

By *Markus Kroetzsch*

Abstract: Wikimedia provides a public service that lets anyone answer
complex questions over the sum of all knowledge stored in Wikidata. These
questions are expressed in the query language SPARQL and range from the
most simple fact retrievals ("What is the birthday of Douglas Adams?") to
complex analytical queries ("Average lifespan of people by occupation").
The talk presents ongoing efforts to analyse the server logs of the
millions of queries that are answered each month. It is an important but
difficult challenge to draw meaningful conclusions from this dataset. One
might hope to learn relevant information about the usage of the service and
Wikidata in general, but at the same time one has to be careful not to be
misled by the data. Indeed, the dataset turned out to be highly
heterogeneous and unpredictable, with strongly varying usage patterns that
make it difficult to draw conclusions about "normal" usage. The talk will
give a status report, present preliminary results, and discuss possible
next steps.

