Hi Aidan,

DBpedia has been around for twelve years now, and we have spent the last three years intensively re-engineering it to solve problems like this.

Last week, we finished the Virtuoso DBpedia Docker [1] to work on Databus Collections [2], [3]. The Databus contains different repartitions of all the datasets, i.e. the Wikipedia/Wikidata extractions as well as external data. The idea is that datasets or graphs are stored in a granular manner, and you then build your own collection (a DCAT catalog) or re-use collections made by others.

This is a step towards our goal of building 1 billion derived knowledge graphs by 2025: https://databus.dbpedia.org/dbpedia/publication/strategy/2019.09.09/strategy_databus_initiative.pdf

We analysed a lot of problems in the GlobalFactSync project [5] and studied Wikidata intensively. Our conclusion is that we will turn DBpedia around by 180 degrees in the future: instead of taking the main data from Wikipedia and Wikidata, we will take it directly from the sources, since all data in Wikipedia and Wikidata comes from somewhere else. The new direction is LOD -> DBpedia -> Wikipedia/Wikidata, via sameAs and equivalentClass/equivalentProperty mappings.

This is not a solution to the dump-size problem per se, because via FlexiFusion [4] we are creating even bigger, more varied and more domain-specific knowledge graphs and dumps. Besides the flexible source partitions, we offer a partition by property, so you can simply pick the properties you want for your knowledge graph and then load them into the SPARQL store of your choice via Docker. There is no manual yet, but this query gives you all 3.8 million birth dates of the new, big fused graph (run it at https://databus.dbpedia.org/yasgui/):

PREFIX dataid:    <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dct:       <http://purl.org/dc/terms/>
PREFIX dcat:      <http://www.w3.org/ns/dcat#>

SELECT DISTINCT ?file WHERE {
    ?dataset dataid:version <https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15> .
    ?dataset dcat:distribution ?distribution .
    ?distribution dataid-cv:tag 'birthDate'^^<http://www.w3.org/2001/XMLSchema#string> .
    ?distribution dcat:downloadURL ?file .
}

You can filter further here: https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15
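If you want to script this workflow, below is a minimal Python sketch that runs the query above against the Databus SPARQL endpoint and downloads the resulting files. The endpoint URL and the output directory name are assumptions for illustration; use whatever endpoint backs https://databus.dbpedia.org/yasgui/.

# Minimal sketch: fetch the birthDate partition of the fused graph.
# The endpoint URL below is an assumption; adjust it as needed.
import os
import requests

DATABUS_SPARQL = "https://databus.dbpedia.org/repo/sparql"  # assumed endpoint

QUERY = """
PREFIX dataid:    <http://dataid.dbpedia.org/ns/core#>
PREFIX dataid-cv: <http://dataid.dbpedia.org/ns/cv#>
PREFIX dcat:      <http://www.w3.org/ns/dcat#>
SELECT DISTINCT ?file WHERE {
    ?dataset dataid:version <https://databus.dbpedia.org/vehnem/flexifusion/fusion/2019.11.15> .
    ?dataset dcat:distribution ?distribution .
    ?distribution dataid-cv:tag 'birthDate'^^<http://www.w3.org/2001/XMLSchema#string> .
    ?distribution dcat:downloadURL ?file .
}
"""

# Standard SPARQL protocol request, asking for JSON results.
resp = requests.post(
    DATABUS_SPARQL,
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
urls = [b["file"]["value"] for b in resp.json()["results"]["bindings"]]
print(len(urls), "files in the birthDate partition")

# Download each file into a local directory (name is illustrative).
os.makedirs("birthdate-partition", exist_ok=True)
for url in urls:
    target = os.path.join("birthdate-partition", url.rsplit("/", 1)[-1])
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(target, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)
    print("downloaded", target)

The downloaded files can then be bulk-loaded into the SPARQL store of your choice, e.g. with the Virtuoso Docker setup from [1].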

Over the next year, we will include all European library data, several national statistical datasets and other data in the syncing process, and refine the way to extract exactly the partition you need. It is an opportunistic extension to Linked Open Data, where you can select the partition you need independently of the IDs or vocabularies used.

-- Sebastian


[1] https://github.com/dbpedia/Dockerized-DBpedia

[2] https://forum.dbpedia.org/t/dbpedia-dataset-2019-08-30-pre-release/219

[3] https://github.com/dbpedia/minimal-download-client

[4] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

[5] https://meta.wikimedia.org/wiki/Grants:Project/DBpedia/GlobalFactSyncRE

On 19.12.19 23:15, Aidan Hogan wrote:
Hey all,

Just a general response to all the comments thus far.

- @Marco et al., regarding the WDumper by Benno, this is a very cool initiative! In fact I spotted it just *after* posting so I think this goes quite some ways towards addressing the general issue raised.

- @Markus, I partially disagree regarding the importance of rubber-stamping a "notable dump" on the Wikidata side. I would see its value as being something like the "truthy dump", which I believe has been widely used in research for working with a concise subset of Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be generated by WDumper and published on Zenodo. This may be sufficient in terms of making the dump available and reusable for research purposes (or even better than the current dumps, given the permanence you mention). It would also reduce costs on the Wikidata side (I don't think a notable dump would need to be generated on a weekly basis, for example).

- @Lydia, good point! I was thinking that filtering by wikilinks would just drop some more obscure nodes (like Q51366847, for example), but I had not considered that there are some more general "concepts" that do not have a corresponding Wikipedia article. All the same, in a lot of the research we use Wikidata for, we are not particularly interested in one thing or another, but rather in facilitating what other people are interested in. Examples would be query performance, finding paths, versioning, finding references, etc. But point taken! Maybe there is a way to identify "general entities" that do not have wikilinks, but do have a high degree or centrality, for example? Would a degree-based or centrality-based filter be possible in something like WDumper (perhaps it goes beyond the original purpose; certainly it does not seem trivial in terms of resources used)? Would it be a good idea?

In summary, I like the idea of using WDumper to sporadically generate -- and publish on Zenodo -- a "notable version" of Wikidata filtered by sitelinks (perhaps also allowing other high-degree or high-PageRank nodes to pass the filter). At least I know I would use such a dump.

Best,
Aidan

On 2019-12-19 6:46, Lydia Pintscher wrote:
On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan <aid...@gmail.com> wrote:

Hey all,

As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and try
things quickly.

More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked to
Wikipedia, the statement is removed.
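A minimal sketch of this filtering rule in Python, assuming a truthy
N-Triples dump and a pre-computed list of Q-IDs that have a Wikipedia
sitelink (both file names below are hypothetical):

# Keep a statement only if every Q item it mentions (as subject or
# object) has a Wikipedia sitelink; file names are placeholders.
import re

ENTITY = re.compile(r"<http://www\.wikidata\.org/entity/(Q\d+)>")

with open("sitelinked-qids.txt") as f:   # one Q-ID per line
    notable = set(line.strip() for line in f)

with open("truthy.nt") as src, open("truthy-notable.nt", "w") as dst:
    for triple in src:
        qids = ENTITY.findall(triple)
        if all(q in notable for q in qids):
            dst.write(triple)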

I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.

... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.

Best,
Aidan

Hi Aidan,

That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue is
deciding how to slice and dice it though in a way that works for many
use cases. We have https://phabricator.wikimedia.org/T46581 to
brainstorm about that and figure it out. Input from several people is
very welcome. I also added a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about,
like professions, so I'm not sure that's a good thing to offer
officially for a larger audience.


Cheers
Lydia



--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) Competence Center
at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, http://linguistics.okfn.org, https://www.w3.org/community/ld4lt
Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
