[Wikitech-l] [GlobalFactSync] User Script, Data Browser, Reference web service - WMF Grant project

Sebastian Hellmann Thu, 15 Aug 2019 08:40:08 -0700

Dear all,

we would like to share consolidated updates for the GlobalFactSync (GFS) project with you (copied from https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/News).

We polished everything for our presentation at Wikimania tomorrow: https://wikimania.wikimedia.org/wiki/2019:Technology_outreach_%26_innovation/GlobalFactSync


All feedback welcome!

-- Sebastian (with the team: Tina, Włodzimierz, Krzysztof, Johannes and Marvin)



     User Script, Data Browser, Reference web service (15. August 2019)

After the kick-off note at the end of July <https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE/News#Kick-off_note_(25._Juli_2019)>, which described our first edit and explained the concept in more detail, we shaped the technical microservices and data into more concise tools that are easier to use and demo during our Wikimania presentation <https://wikimania.wikimedia.org/wiki/2019:Technology_outreach_%26_innovation/GlobalFactSync>:


1. User Script <https://en.wikipedia.org/wiki/User_scripts> available
   at User:JohannesFre/global.js
   <https://meta.wikimedia.org/wiki/User:JohannesFre/global.js> shows
   links from each article and Wikidata to the Data Browser and
   Reference Web Service
   <https://meta.wikimedia.org/wiki/User:JohannesFre/global.js>

2. GFS Data Browser <https://global.dbpedia.org/> (GitHub
   <https://github.com/dbpedia/gfs>) now accepts any URI from
   Wikipedia, DBpedia or Wikidata as subject; see the Boys Don't Cry
   example from the kick-off note
   
<https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F2nrbo&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FreleaseDate&src=general>,
   Berlin/Geo-coords lat
   
<https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F4pafr&p=http%3A%2F%2Fwww.w3.org%2F2003%2F01%2Fgeo%2Fwgs84_pos%23lat&src=general>
   long
   
<https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F4pafr&p=http%3A%2F%2Fwww.w3.org%2F2003%2F01%2Fgeo%2Fwgs84_pos%23long&src=general>,
   Albert Einstein's Religion
   
<https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F55LmB&p=http%3A%2F%2Fdbpedia.org%2Fontology%2Freligion&src=general>.
   *Not live yet; edits/fixes are not reflected*
3. Reference Web Service (Albert Einstein:
   
http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Albert_Einstein&format=json&dbpedia)
   extracts (1) all references from a Wikipedia page, (2) matches them
   to the infobox parameters and (3) also extracts the corresponding
   facts. The service will remain stable, so you can use it.
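
As a minimal sketch of how a client could address the Reference Web Service, the snippet below only rebuilds the request URL from the Albert Einstein example above. The helper name `reference_service_url` is hypothetical, the `&dbpedia` flag from the example link is left out because its meaning is not described here, and the response format is not documented in this mail, so nothing is parsed:

```python
from urllib.parse import urlencode

# Endpoint copied from the example link in item 3 above.
REFERENCE_SERVICE = "http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references"


def reference_service_url(article_url: str, fmt: str = "json") -> str:
    """Build a request URL for one Wikipedia article (hypothetical helper)."""
    # safe=":/" leaves the article URL unescaped, as in the example link.
    params = urlencode({"article": article_url, "format": fmt}, safe=":/")
    return REFERENCE_SERVICE + "?" + params


url = reference_service_url("https://en.wikipedia.org/wiki/Albert_Einstein")
print(url)
```

The resulting URL can be opened in a browser or fetched with any HTTP client.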

Furthermore, we are designing a friendly fork of HarvestTemplates <https://github.com/Pascalco/harvesttemplates> to effectively import all that data into Wikidata.



     Kick-off note (25. July 2019)


*GlobalFactSync - Synchronizing Wikidata and Wikipedia's infoboxes*

How is data edited in Wikipedia/Wikidata? Where does it come from? And how can we synchronize it globally?

The GlobalFactSync <https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE> (GFS) project, funded by the Wikimedia Foundation, started in June 2019 and has two goals:


 * Answer the above-mentioned three questions.
 * Build an information system to synchronize facts between all
   Wikipedia language-editions and Wikidata.

Now we are seven weeks into the project (10+ more months to go) and we are releasing our first prototypes to gather feedback.



/How – Synchronization vs Consensus/

We follow a strict *Human(s)-in-the-loop* approach when we talk about synchronization. The final decision whether to synchronize a value or not should rest with a human editor who understands consensus and the implications. There will be no automatic imports. Our focus is to drastically reduce the time needed to research all references for individual facts.

A trivial example is the release date of the single “Boys Don’t Cry” (March 16th, 1989) in the English <https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song)>, Japanese <https://ja.wikipedia.org/wiki/%E6%B6%99%E3%82%92%E3%81%BF%E3%81%9B%E3%81%AA%E3%81%84%E3%81%A7_%E3%80%9CBoys_Don't_Cry%E3%80%9C>, and French <https://fr.wikipedia.org/wiki/Namida_wo_Misenaide_(Boys_Don%27t_Cry)> Wikipedia, Wikidata <https://www.wikidata.org/wiki/Q3020026#P577> and finally in the external open database MusicBrainz <https://musicbrainz.org/artist/e57182dc-2693-46fc-a739-a81c734a4326>. A human editor might need 15-30 minutes to find and open all the different sources, while our current prototype can spot differences and display them in 5 seconds.
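
To illustrate what "spotting differences" between sources can mean, here is a minimal sketch that groups reported values by source and flags disagreement. It is not the GFS implementation; the function names, the source labels and the sample values are all invented for illustration:

```python
from collections import defaultdict


def group_by_value(observations):
    """Map each distinct value to the list of sources that report it."""
    groups = defaultdict(list)
    for source, value in observations:
        groups[value].append(source)
    return dict(groups)


def has_discrepancy(observations):
    """True if the sources report more than one distinct value."""
    return len(group_by_value(observations)) > 1


# Invented sample values, not real GFS output.
observations = [
    ("source_a", "1989-03-16"),
    ("source_b", "1989-03-16"),
    ("source_c", "1989-03-16"),
    ("source_d", "1989-03-18"),
]

for value, sources in group_by_value(observations).items():
    print(value, "reported by", ", ".join(sources))
print("needs human review:", has_discrepancy(observations))
```

In line with the human-in-the-loop approach, such a check would only point an editor to a candidate; it would not decide which value is correct.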

We already had our first successful edit where a Wikipedia editor fixed the discrepancy with our prototype: “I’ve updated Wikidata so that all five sources are in agreement.” We are now working on the following tasks:


 * Scaling the system to all infoboxes, Wikidata and selected external
   databases (see below on the difficulties there)
 * Making the system:
     o “live” without stale information
     o “reliable” with fewer technical errors when extracting and
       indexing data
     o “better referenced” by not only synchronizing facts but also
       references


/Contributions and Feedback/

To ensure that GlobalFactSync will serve and help the Wikiverse, we encourage everyone to try our data and microservices and leave us some feedback, either on our Meta-Wiki page <https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE> or via [email protected] <mailto:[email protected]>. In the following 10+ months, we intend to improve and build upon these initial results. At the same time, these microservices are available to every developer to use and to build useful applications on. The most promising contributions will be rewarded and receive the book “Engineering Agile Big-Data Systems”. Please post feedback or any tool or GUI here. In case you need changes to be made to the API, please let us know, too. For the ambitious future developers among you, we have some budget left that we will dedicate to an internship. In order to apply, just mention it in your feedback post.

Finally, to talk to us and other GlobalFactSync users, you may want to visit WikidataCon and Wikimania, where we will present the latest developments and the progress of our project.



/Data, APIs & Microservices (Technical prototypes)/

Data Processing and Infobox Extraction:

For GlobalFactSync we use data from Wikipedia infoboxes in different languages, as well as Wikidata and DBpedia, and fuse them into one big, consolidated dataset: the PreFusion dataset <https://databus.dbpedia.org/dbpedia/prefusion> (in JSON-LD). More information on the fusion process, which is the engine behind GFS, can be found in the FlexiFusion paper <https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf>. One of our next steps is to integrate MusicBrainz into this process as an external dataset. We hope to integrate even more such external datasets to increase the amount of available information and references.



*First microservices:*

We deployed a set of microservices to show the current state of our toolchain.


 * [Initial User Interface] The GFS Data Browser is our GlobalFactSync
   UI prototype (available at http://global.dbpedia.org) which shows
   all extracted information available for one entity from different
   sources. It can be used to analyze the factual consensus between
   different Wikipedia articles for the same thing. Example: Look at
   the variety of population counts for Grimma
   
<https://global.dbpedia.org/?s=https%3A%2F%2Fglobal.dbpedia.org%2Fid%2F9QwA&p=http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal&src=general>.

 * [PreFusion JSON API] While the UI allows simple, fast and easy
   browsing for one entity at a time, we also provide raw access to the
   underlying data (PreFusion dump). The query UI
   (http://global.dbpedia.org:8990, user: read, pw: gfs) can be
   utilized to run simple analytical queries. Thus, we can determine
   the number of locations having at least one population value
   
<http://global.dbpedia.org:8990/db/prefusion/provenance?query=%7B%0D%0A++++%22predicate.%40id%22%3A+%22http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal%22%2C%0D%0A%7D&projection=%7B%0D%0A++%22subject.%40id%22+%3A+1%0D%0A++%22objects.object.%40value%22%3A+1%0D%0A%7D>
   (1,194,007) but can also focus on examples with data quality
   problems (e.g. one of the 4,268 locations with more than 10
   population values
   
<http://global.dbpedia.org:8990/db/prefusion/provenance?query=%7B%0D%0A++++%22predicate.%40id%22%3A+%22http%3A%2F%2Fdbpedia.org%2Fontology%2FpopulationTotal%22%2C%0D%0A++++%24where%3A+%22this.objects.length+%3E++10%22%0D%0A%7D&projection=%7B%0D%0A++%22subject.%40id%22+%3A+1%0D%0A++%22objects.object.%40value%22%3A+1%0D%0A%7D>).
   Moreover, documentation about the PreFusion dataset and the download
   link for the data are available on the Databus website
   <https://databus.dbpedia.org/dbpedia/prefusion>.

 * [Reference Data Download] We ran the Reference Extraction Service
   over 10 Wikipedia languages. Download dumps here
   
<http://dbpedia.informatik.uni-leipzig.de/repo/lewoniewski/gfs/infobox-refs/2019.07.01/>.

 * [Reference Extraction Service] Good references are crucial for an
   import of facts from Wikipedia to Wikidata. We are currently working
   with colleagues from Poznań University of Economics and Business on
   reference extraction for facts from Wikipedia. A current development
   reference extraction microservice
   
<http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json>
   shows all references and the location where they were spotted in the
   Infobox – ad hoc – for a given article:
   
http://dbpedia.informatik.uni-leipzig.de:8111/infobox/references?article=https://en.wikipedia.org/wiki/Facebook&format=json
   ( ‘&format=tsv’ also available)

 * [Infobox Extraction Service] A similar ad hoc extraction of factual
   information from infoboxes and other Wikipedia article information
   is available here. This microservice displays information which can
   be extracted with the help of DBpedia mappings from an infobox, e.g.
   from the English Facebook Wikipedia article:
   
http://dbpedia.informatik.uni-leipzig.de:9998/server/extraction/en/extract?title=Facebook&revid=&format=trix&extractors=mappings.
   See here for more options:
   http://dbpedia.informatik.uni-leipzig.de:9999/server/extraction/.

 * [ID service] Last but not least, we offer the Global ID Resolution
   Service
   
<https://global.dbpedia.org/same-thing/lookup/?uri=http://dbpedia.org/resource/Facebook>.
   It ties together all available identifiers for one thing (i.e. at
   the moment all DBpedia/Wikipedia and Wikidata identifiers –
   MusicBrainz coming soon…) and shows their stable DBpedia Global ID.


/Finding sync targets/

In order to test our algorithms, we started by looking at various groups of subjects, our so-called sync targets. Based on the different subjects, a set of problems was identified with varying layers of complexity:


 * identity check/check for ambiguity — Are we talking about the same
   entity?
 * fixed vs. varying property — Some properties vary depending on
   nationality (e.g., release dates), or point in time (e.g.,
   population count).
 * reference — Depending on the entity’s identity check and the
   property’s fixed or varying state the reference might vary. Also,
   for some targets, no query-able online reference might be available.
 * normalization/conversion of values — Depending on
   language/nationality of the article, properties can have varying
   units (e.g., currency, metric vs imperial system).
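
The normalization point can be sketched as follows: values are converted to a common unit before they are compared, and a small tolerance absorbs rounding differences. This is only an illustration under our own assumptions; the unit table, the function names and the 1% tolerance are hypothetical and not part of GFS:

```python
import math

# Conversion factors to metres (exact by definition of foot and inch).
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048, "in": 0.0254}


def to_metres(value: float, unit: str) -> float:
    """Convert a length to metres; raises KeyError for unknown units."""
    return value * TO_METRES[unit]


def same_length(a, b, rel_tol: float = 0.01) -> bool:
    """Compare two (value, unit) pairs after normalization.

    rel_tol is a hypothetical tolerance for rounding in the sources.
    """
    return math.isclose(to_metres(*a), to_metres(*b), rel_tol=rel_tol)


# 6.5 ft is 1.9812 m, which is within 1% of 198 cm.
print(same_length((6.5, "ft"), (198, "cm")))
# 6.0 ft is 1.8288 m, which is not.
print(same_length((6.0, "ft"), (198, "cm")))
```

Currencies would additionally need a reference date for the exchange rate, which is one reason why normalization interacts with the "fixed vs. varying property" problem.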

The check for ambiguity is the most crucial step to ensure that the infoboxes being compared do refer to the same entity. We found instances where the Wikipedia page and the infobox shown on that page were presenting information about different subjects (e.g., see here <https://en.wikipedia.org/wiki/Boys_Don%27t_Cry_(Moulin_Rouge_song)>).



/Examples/

The group ‘NBA players’ was identified as a good sync target to start with. There are no ambiguity issues, it is a clearly defined group of persons, and the number of varying properties is very limited. Information seems to be derived mainly from two web sites (nba.com and basketball-reference.com) and normalization is only a minor issue. ‘Video games’ also proved to be an easy sync target, with the main problem being varying properties such as different release dates for different platforms (Microsoft Windows, Linux, Mac OS X, Xbox) and different regions (NA vs EU).

More difficult topics, such as ‘cars’, ‘music albums’, and ‘music singles’ showed more potential for ambiguity as well as property variability. A major concern we found was Wikipedia pages that contain multiple infoboxes (often seen for pages referring to a certain type of car, such as this one <https://en.wikipedia.org/wiki/Volkswagen_Polo>). Reference and fact extraction can be done for each infobox, but currently, we run into trouble once we fuse this data.

Further information about sync targets and their challenges can be found on our Meta-Wiki discussion page <https://meta.wikimedia.org/wiki/Grants_talk:Project/DBpedia/GlobalFactSyncRE/Timeline/Tasks#Preliminary_study_-_sync_targets>, where Wikipedians that deal with infoboxes on a regular basis can also share their insights on the matter. Some issues were also found regarding the mapping of properties. In order to make GlobalFactSync as applicable as possible, we rely on the DBpedia community to help us improve the mappings. If you are interested in participating, we will connect with you at http://mappings.dbpedia.org and in the DBpedia forum <https://forum.dbpedia.org/>.



Bottom line – We value your feedback!


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
