Hi all,

Yes, Benno's WDumper could be used for this purpose. The motivation for the whole project was very similar to what Aidan describes. We realised, though, that there won't be a single good way to build a smaller dump that would serve every conceivable use in research, which is why the UI lets users make custom dumps.

In general, we are happy to hear more ideas for smaller dumps that would be useful or interesting to researchers. We are also accepting pull requests.

Benno, could you add a feature to include only items with a Wikipedia page (in some specific language, or in any language)?

Edgard, I don't think making this more "official" will be very important for most researchers. Benno spent quite some time aligning the RDF export with the official dumps, so in practice, WDumper mostly produces a subset of the triples of the official dump (which one could also have extracted manually). If there are differences left between the formats, we will be happy to hear about them (a GitHub issue would be the best way to report them).
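
To give an idea of what "extracted manually" means here (and this also covers the "items with a Wikipedia page" filter above), a rough, untested sketch in Python could look as follows. The file names are placeholders; it assumes the gzipped N-Triples dumps, and relies on the fact that sitelinks appear in the full dump as schema:about triples (the truthy dump does not contain them).

import gzip
import re

# File names below are placeholders; point them at the dumps you actually have.
FULL_DUMP = "latest-all.nt.gz"       # full RDF dump (contains sitelink triples)
TRUTHY_DUMP = "latest-truthy.nt.gz"  # truthy RDF dump (direct wdt: statements)
OUTPUT = "truthy-sitelinked.nt.gz"

SCHEMA_ABOUT = "<http://schema.org/about>"
ITEM_IRI = re.compile(r"<(http://www\.wikidata\.org/entity/Q\d+)>")

# Pass 1: collect items that have at least one Wikipedia sitelink.
# In the full dump a sitelink is encoded as: <article URL> schema:about <item> .
sitelinked = set()
with gzip.open(FULL_DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        if SCHEMA_ABOUT in line and ".wikipedia.org/" in line:
            items = ITEM_IRI.findall(line)
            if items:
                sitelinked.add(items[-1])  # the object of the schema:about triple

# Pass 2: keep a truthy triple only if every item it mentions is sitelinked
# (i.e. drop statements whose subject or object item has no Wikipedia page).
with gzip.open(TRUTHY_DUMP, "rt", encoding="utf-8") as fin, \
     gzip.open(OUTPUT, "wt", encoding="utf-8") as fout:
    for line in fin:
        items = ITEM_IRI.findall(line)
        if items and all(i in sitelinked for i in items):
            fout.write(line)

A two-pass approach like this only keeps a set of item IRIs in memory rather than the whole graph, but it still needs one full read of the complete dump, which is of course exactly the kind of cost a ready-made dump would save everyone.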

As Benno already wrote, WDumper connects to Zenodo to ensure that exported datasets are archived in a permanent and citable fashion. This is very important for research. As far as I know, none of the existing dumps (official or not) guarantee long-term availability at the moment.

Cheers,

Markus



On 18/12/2019 13:37, Edgard Marx wrote:
It certainly helps; however, I think Aidan's suggestion goes in the direction of having an official dump distribution.

Imagine how much CO2 could be spared just by avoiding the computational resources needed to recreate this dump every time someone needs it.

Besides, it would standardise the dataset used for research purposes.

On Wed, Dec 18, 2019, 11:26 Marco Fossati <foss...@spaziodati.eu> wrote:

    Hi everyone,

    Benno (in CC) has recently announced this tool:
    https://tools.wmflabs.org/wdumps/

    I haven't checked it out yet, but it sounds related to Aidan's inquiry.
    Hope this helps.

    Cheers,

    Marco

    On 12/18/19 8:01 AM, Edgard Marx wrote:
     > +1
     >
     > On Tue, Dec 17, 2019, 19:14 Aidan Hogan <aid...@gmail.com> wrote:
     >
     >     Hey all,
     >
     >     As someone who likes to use Wikidata in their research, and
     >     likes to give students projects relating to Wikidata, I am
     >     finding it more and more difficult to (recommend to) work with
     >     recent versions of Wikidata due to the increasing dump sizes,
     >     where even the truthy version now costs considerable time and
     >     machine resources to process and handle. In some cases we just
     >     grin and bear the costs, while in other cases we apply an ad
     >     hoc sampling to be able to play around with the data and try
     >     things quickly.
     >
     >     More generally, I think the growing data volumes might
     >     inadvertently scare people off taking the dumps and using them
     >     in their research.
     >
     >     One idea we had recently to reduce the data size for a student
     >     project while keeping the most notable parts of Wikidata was to
     >     only keep claims that involve an item linked to Wikipedia; in
     >     other words, if the statement involves a Q item (in the
     >     "subject" or "object") not linked to Wikipedia, the statement
     >     is removed.
     >
     >     I wonder would it be possible for Wikidata to provide such a
     >     dump to download (e.g., in RDF) for people who prefer to work
     >     with a more concise sub-graph that still maintains the most
     >     "notable" parts? While of course one could compute this from
     >     the full-dump locally, making such a version available as a
     >     dump directly would save clients some resources, potentially
     >     encourage more research using/on Wikidata, and having such a
     >     version "rubber-stamped" by Wikidata would also help to justify
     >     the use of such a dataset for research purposes.
     >
     >     ... just an idea I thought I would float out there. Perhaps
     >     there is another (better) way to define a concise dump.
     >
     >     Best,
     >     Aidan

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
