Re: [Wikidata] Concise/Notable Wikidata Dump Wikidata Digest, Vol 97, Issue 13

PWN Sat, 21 Dec 2019 19:31:41 -0800

Hello all,
Regarding the limiting of dumps, I fear it nullifies one of the huge advantages 
of Wikidata, which is to expand structured, referenced data beyond the often 
too narrow confines of Wikipedia. Women and marginalized communities who are 
frequently eliminated for lack of “notability” by overzealous or misguided 
Wikipedia editors risk being accidentally re-eliminated by confining dumps to 
items with wikilinks. (Remember the female researcher whose Wikipedia page was 
rejected for “lack of notability” - just before she won a Nobel Prize?)


I think Wikidata dumps should be complete, with a possibility of 
user-controlled selection by topic or period or other query, but not by what 
amounts to a kind of a “hidden” filter of approval by a Wikipedia editor 
somewhere outside of Wikidata in a widely disseminated dump marked, 
misleadingly, as “notable”. 

Selection is very powerful in the digital world, where people assume (wrongly) 
that what they see is what exists.




> On Dec 20, 2019, at 13:00, [email protected] wrote:
> 
> Today's Topics:
> 
>   1. Re: Concise/Notable Wikidata Dump (Aidan Hogan)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Thu, 19 Dec 2019 19:15:09 -0300
> From: Aidan Hogan <[email protected]>
> To: [email protected]
> Subject: Re: [Wikidata] Concise/Notable Wikidata Dump
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=utf-8; format=flowed
> 
> Hey all,
> 
> Just a general response to all the comments thus far.
> 
> - @Marco et al., regarding the WDumper by Benno, this is a very cool 
> initiative! In fact I spotted it just *after* posting so I think this 
> goes quite some ways towards addressing the general issue raised.
> 
> - @Markus, I partially disagree regarding the importance of 
> rubber-stamping a "notable dump" on the Wikidata side. I would see its 
> value as being something like the "truthy dump", which I believe has 
> been widely used in research for working with a concise sub-set of 
> Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be 
> generated by WDumper and published on Zenodo. This may be sufficient in 
> terms of making the dump available and reusable for research purposes 
> (or even better than the current dumps, given the permanence you 
> mention). Also it would reduce costs on the Wikidata side (I don't think 
> a notable dump would be necessary to generate on a weekly basis, for 
> example).
> 
> - @Lydia, good point! I was thinking that filtering by wikilinks will 
> just drop some more obscure nodes (like Q51366847 for example), but had 
> not considered that there are some more general "concepts" that do not 
> have a corresponding Wikipedia article. All the same, in a lot of the 
> research we use Wikidata for, we are not particularly interested in one 
> thing or another, but more interested in facilitating what other people 
> are interested in. Examples would be query performance, finding paths, 
> versioning, finding references, etc. But point taken! Maybe there is a 
> way to identify "general entities" that do not have wikilinks, but do 
> have a high degree or centrality, for example? Would a degree-based or 
> centrality-based filter be possible in something like WDumper (perhaps 
> it goes beyond the original purpose; certainly it does not seem trivial 
> in terms of resources used)? Would it be a good idea?
> 
> In summary, I like the idea of using WDumper to sporadically generate -- 
> and publish on Zenodo -- a "notable version" of Wikidata filtered by 
> sitelinks (perhaps also allowing other high-degree or high-PageRank 
> nodes to pass the filter). At least I know I would use such a dump.
> 
> Best,
> Aidan
> 
>> On 2019-12-19 6:46, Lydia Pintscher wrote:
>>> On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan <[email protected]> wrote:
>>> 
>>> Hey all,
>>> 
>>> As someone who likes to use Wikidata in their research, and likes to
>>> give students projects relating to Wikidata, I am finding it more and
>>> more difficult to (recommend to) work with recent versions of Wikidata
>>> due to the increasing dump sizes, where even the truthy version now
>>> costs considerable time and machine resources to process and handle. In
>>> some cases we just grin and bear the costs, while in other cases we
>>> apply an ad hoc sampling to be able to play around with the data and try
>>> things quickly.
>>> 
>>> More generally, I think the growing data volumes might inadvertently
>>> scare people off taking the dumps and using them in their research.
>>> 
>>> One idea we had recently to reduce the data size for a student project
>>> while keeping the most notable parts of Wikidata was to only keep claims
>>> that involve an item linked to Wikipedia; in other words, if the
>>> statement involves a Q item (in the "subject" or "object") not linked to
>>> Wikipedia, the statement is removed.
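
For readers who want to try the filter described in the paragraph above, here is a minimal sketch in Python. It is not the code used for the student project and it is not how WDumper works. It assumes statements have already been parsed into (subject, property, object) triples and that the set of items with at least one Wikipedia sitelink has been computed separately. The names is_item, filter_by_sitelinks and linked are hypothetical, and the sample triples are toy data.

```python
from typing import Iterable, Iterator, Set, Tuple

# A statement reduced to (subject, property, object); a simplifying assumption,
# real dumps also carry qualifiers, references and ranks.
Triple = Tuple[str, str, str]


def is_item(term: str) -> bool:
    """Return True if the term looks like a Q item ID such as 'Q42'."""
    return len(term) > 1 and term[0] == "Q" and term[1:].isdigit()


def filter_by_sitelinks(triples: Iterable[Triple], linked: Set[str]) -> Iterator[Triple]:
    """Yield only statements in which every Q item (subject or object) is in `linked`."""
    for s, p, o in triples:
        if is_item(s) and s not in linked:
            continue  # subject is an item without a Wikipedia sitelink
        if is_item(o) and o not in linked:
            continue  # object is an item without a Wikipedia sitelink
        yield (s, p, o)


# Toy data for illustration only; not taken from a real dump.
triples = [
    ("Q42", "P31", "Q5"),            # both items linked: kept
    ("Q42", "P106", "Q51366847"),    # object treated as unlinked here: dropped
    ("Q51366847", "P31", "Q5"),      # subject treated as unlinked here: dropped
    ("Q42", "P569", "1952-03-11"),   # literal object, linked subject: kept
]
linked = {"Q42", "Q5"}  # hypothetical set of items that have a sitelink

kept = list(filter_by_sitelinks(triples, linked))
for triple in kept:
    print(triple)
```

Because the function is a generator, it can be fed a stream of triples from a large file without loading everything into memory; only the set of linked items has to fit in RAM.
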
>>> 
>>> I wonder would it be possible for Wikidata to provide such a dump to
>>> download (e.g., in RDF) for people who prefer to work with a more
>>> concise sub-graph that still maintains the most "notable" parts? While
>>> of course one could compute this from the full-dump locally, making such
>>> a version available as a dump directly would save clients some
>>> resources, potentially encourage more research using/on Wikidata, and
>>> having such a version "rubber-stamped" by Wikidata would also help to
>>> justify the use of such a dataset for research purposes.
>>> 
>>> ... just an idea I thought I would float out there. Perhaps there is
>>> another (better) way to define a concise dump.
>>> 
>>> Best,
>>> Aidan
>> 
>> Hi Aidan,
>> 
>> That the dumps are becoming too big is an issue I've heard a number of
>> times now. It's something we need to tackle. My biggest issue is
>> deciding how to slice and dice it though in a way that works for many
>> use cases. We have https://phabricator.wikimedia.org/T46581 to
>> brainstorm about that and figure it out. Input from several people
>> very welcome. I also added a link to Benno's tool there.
>> As for the specific suggestion: I fear relying on the existence of
>> sitelinks will kick out a lot of important things you would care about
>> like professions so I'm not sure that's a good thing to offer
>> officially for a larger audience.
>> 
>> 
>> Cheers
>> Lydia
>> 
> 
> 
> 
> ------------------------------
> 
> Subject: Digest Footer
> 
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata
> 
> 
> ------------------------------
> 
> End of Wikidata Digest, Vol 97, Issue 13
> ****************************************

