On Sat, 21 Dec 2019 at 17:25, Lydia Pintscher <[email protected]>
wrote:

> On Thu, Dec 19, 2019 at 11:16 PM Aidan Hogan <[email protected]> wrote:
> > - @Lydia, good point! I was thinking that filtering by wikilinks will
> > just drop some more obscure nodes (like Q51366847 for example), but had
> > not considered that there are some more general "concepts" that do not
> > have a corresponding Wikipedia article. All the same, in a lot of the
> > research we use Wikidata for, we are not particularly interested in one
> > thing or another, but more interested in facilitating what other people
> > are interested in. Examples would be query performance, finding paths,
> > versioning, finding references, etc. But point taken! Maybe there is a
> > way to identify "general entities" that do not have wikilinks, but do
> > have a high degree or centrality, for example? Would a degree-based or
> > centrality-based filter be possible in something like WDumper (perhaps
> > it goes beyond the original purpose; certainly it does not seem trivial
> > in terms of resources used)? Would it be a good idea?
>
> I think it's definitely worth exploring but I fear it needs someone to
> actually sit down and collect the different dumps use-cases and talk
> to people to figure out which part of the data they need. Based on
> that we could identify common patterns.


Yeah, there are a bunch of quite varied motivations for subsets. I have
found the topic of Wikidata subsetting and data dumps coming up again and
again, most recently in a life science/bioinformatics setting, which is how
we ended up collecting raw materials in the doc already shared here,
https://docs.google.com/document/d/1MmrpEQ9O7xA6frNk6gceu_IbQrUiEYGI9vcQjDvTL9c
but also in other domains. If people here care to drop use cases, thoughts
and notes (*however scrappy*) into that doc, I will make a pass over it to
try to pull together a more readable summary of the various motivations for
subsetting.

The work Adam wrote up at
https://addshore.com/2019/10/your-own-wikidata-query-service-with-no-limits-part-1/
is also very relevant...

> (I think this is something
> that needs to be done but unfortunately can't dedicate time to it in
> the foreseeable future. https://phabricator.wikimedia.org/T46581 is a
> good place for people who want to help think it through.


That is also a fine place to record things! I don’t mean to fork the
discussion. Maybe we could have a call for interested parties in the new
year?

Dan



>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
> Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.
>
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>