[Wikidata] Concise/Notable Wikidata Dump

Aidan Hogan Tue, 17 Dec 2019 10:15:26 -0800

Hey all,

As someone who likes to use Wikidata in their research, and likes to give students projects relating to Wikidata, I am finding it more and more difficult to (recommend to) work with recent versions of Wikidata due to the increasing dump sizes, where even the truthy version now costs considerable time and machine resources to process and handle. In some cases we just grin and bear the costs, while in other cases we apply an ad hoc sampling to be able to play around with the data and try things quickly.

More generally, I think the growing data volumes might inadvertently scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project while keeping the most notable parts of Wikidata was to only keep claims that involve an item linked to Wikipedia; in other words, if the statement involves a Q item (in the "subject" or "object") not linked to Wikipedia, the statement is removed.
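
To make the filter concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the code used for the student project: it assumes an N-Triples input in which Wikipedia sitelinks appear as `schema:about` triples pointing at the item and Q items use the `http://www.wikidata.org/entity/` prefix, and the function names (`split_triple`, `linked_items`, `concise`) are made up for this example.

```python
import re

# Assumed IRI shapes; adjust if the dump at hand differs.
ENTITY = re.compile(r"^<http://www\.wikidata\.org/entity/(Q\d+)>$")
SITELINK = re.compile(r"^<https://[a-z0-9-]+\.wikipedia\.org/wiki/")
ABOUT = "<http://schema.org/about>"


def split_triple(line):
    """Naively split one N-Triples line into (subject, predicate, object)."""
    s, p, rest = line.strip().split(" ", 2)
    rest = rest.strip()
    if rest.endswith("."):
        rest = rest[:-1].rstrip()  # drop the terminating " ."
    return s, p, rest


def q_id(term):
    """Return the Q identifier if the term is a Wikidata item IRI, else None."""
    m = ENTITY.match(term)
    return m.group(1) if m else None


def linked_items(lines):
    """First pass: collect Q items that have at least one Wikipedia sitelink."""
    linked = set()
    for line in lines:
        if not line.strip():
            continue
        s, p, o = split_triple(line)
        if p == ABOUT and SITELINK.match(s) and q_id(o):
            linked.add(q_id(o))
    return linked


def concise(lines, linked):
    """Second pass: drop any triple whose subject or object is an unlinked Q item."""
    for line in lines:
        if not line.strip():
            continue
        s, _, o = split_triple(line)
        items = [q for q in (q_id(s), q_id(o)) if q]
        if all(q in linked for q in items):
            yield line
```

Both passes stream over the lines, so for a dump on disk the file would be read twice and only the set of linked Q identifiers is held in memory. Triples that mention no Q item at all are kept by this sketch; whether they belong in a concise dump is a separate choice.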

I wonder would it be possible for Wikidata to provide such a dump to download (e.g., in RDF) for people who prefer to work with a more concise sub-graph that still maintains the most "notable" parts? While of course one could compute this from the full dump locally, making such a version available as a dump directly would save clients some resources, potentially encourage more research using/on Wikidata, and having such a version "rubber-stamped" by Wikidata would also help to justify the use of such a dataset for research purposes.

... just an idea I thought I would float out there. Perhaps there is another (better) way to define a concise dump.


Best,
Aidan

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
