See also this recent discussion/brainstorm on "Wikidata subsetting"

https://docs.google.com/document/d/1MmrpEQ9O7xA6frNk6gceu_IbQrUiEYGI9vcQjDvTL9c/edit#heading=h.7xg3cywpkgfq

In a geographical context, whether or not an item has a Wikipedia entry has been considered as a criterion for filtering Wikidata into a gazetteer, e.g. by the @gbhgis team behind the Vision of Britain site (Humphrey Southall)

But I would caution against using the criterion blindly -- Wikidata notability includes "structural need" as an inclusion criterion for good reason. You definitely wouldn't want items becoming disconnected from their P31/P279* subclass tree because one of the intermediate items had been omitted.

Other such items might be very valuable (or not valuable at all), depending on the end-user's application. So it is definitely non-trivial to decide what the WD team should leave in any generic dump intended to have wide usefulness. Even for a specific end use, one might want to think quite carefully about what not to cut out.
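To make that caution concrete, a subsetting pipeline can re-add the class items needed to keep every retained item connected to its P31/P279* tree. A minimal sketch (the function name, the in-memory edge mapping, and all but the first two item IDs are illustrative assumptions, not taken from a real dump):

```python
def close_over_hierarchy(kept, parents):
    """Given the set of items selected for the subset and a mapping
    item -> list of P31/P279 targets, add every ancestor class so that
    no retained item is cut off from its subclass tree."""
    closed = set(kept)
    frontier = list(kept)
    while frontier:
        item = frontier.pop()
        for parent in parents.get(item, ()):
            if parent not in closed:
                closed.add(parent)
                frontier.append(parent)
    return closed

# Hypothetical chain: Q42 is an instance of Q5 (human); the class IDs
# above Q5 here are made up purely for illustration. Even if an
# intermediate class had no Wikipedia article, it must stay in the
# subset, or items below it become disconnected.
edges = {"Q42": ["Q5"], "Q5": ["QCLASS1"], "QCLASS1": ["QCLASS2"]}
print(close_over_hierarchy({"Q42"}, edges))
```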

  -- James.



On 18/12/2019 12:37, Edgard Marx wrote:
It certainly helps; however, I think Aidan's suggestion goes in the
direction of having an official dump distribution.

Imagine how much CO2 could be spared just by avoiding the computational
resources needed to recreate this dump every time one needs it.

Besides, it would standardise the dataset used for research purposes.

On Wed, Dec 18, 2019, 11:26 Marco Fossati <foss...@spaziodati.eu> wrote:

Hi everyone,

Benno (in CC) has recently announced this tool:
https://tools.wmflabs.org/wdumps/

I haven't checked it out yet, but it sounds related to Aidan's inquiry.
Hope this helps.

Cheers,

Marco

On 12/18/19 8:01 AM, Edgard Marx wrote:
+1

On Tue, Dec 17, 2019, 19:14 Aidan Hogan <aid...@gmail.com
<mailto:aid...@gmail.com>> wrote:

     Hey all,

     As someone who likes to use Wikidata in their research, and likes to
     give students projects relating to Wikidata, I am finding it more and
     more difficult to (recommend to) work with recent versions of
     Wikidata due to the increasing dump sizes, where even the truthy
     version now costs considerable time and machine resources to process
     and handle. In some cases we just grin and bear the costs, while in
     other cases we apply an ad hoc sampling to be able to play around
     with the data and try things quickly.

     More generally, I think the growing data volumes might inadvertently
     scare people off taking the dumps and using them in their research.

     One idea we had recently to reduce the data size for a student
     project while keeping the most notable parts of Wikidata was to only
     keep claims that involve an item linked to Wikipedia; in other words,
     if the statement involves a Q item (in the "subject" or "object") not
     linked to Wikipedia, the statement is removed.
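     For what it's worth, the filter described above can be sketched in a
     few lines. This is only an illustration: the function name is made
     up, the statements are held as in-memory (subject, predicate, object)
     tuples rather than streamed from a real dump, and the set of
     Wikipedia-linked items is assumed to be given.

```python
def filter_to_wikipedia_linked(statements, wikipedia_items):
    """Keep a (subject, predicate, object) statement only if every
    Q item it mentions has a Wikipedia sitelink. Simplified sketch:
    a real dump would be streamed, not held in a list."""
    kept = []
    for subj, pred, obj in statements:
        if subj.startswith("Q") and subj not in wikipedia_items:
            continue
        if obj.startswith("Q") and obj not in wikipedia_items:
            continue
        kept.append((subj, pred, obj))
    return kept

# Illustrative data: Q42 and Q5 have sitelinks, QUNLINKED does not,
# so only the first statement survives the filter.
linked = {"Q42", "Q5"}
triples = [("Q42", "P31", "Q5"), ("Q42", "P361", "QUNLINKED")]
print(filter_to_wikipedia_linked(triples, linked))
```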

     I wonder would it be possible for Wikidata to provide such a dump to
     download (e.g., in RDF) for people who prefer to work with a more
     concise sub-graph that still maintains the most "notable" parts?
     While of course one could compute this from the full dump locally,
     making such a version available as a dump directly would save clients
     some resources, potentially encourage more research using/on
     Wikidata, and having such a version "rubber-stamped" by Wikidata
     would also help to justify the use of such a dataset for research
     purposes.

     ... just an idea I thought I would float out there. Perhaps there is
     another (better) way to define a concise dump.

     Best,
     Aidan

     _______________________________________________
     Wikidata mailing list
     Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
     https://lists.wikimedia.org/mailman/listinfo/wikidata





