I hope that splitting the Wikidata dump into smaller, more functional
chunks is something the Wikidata project considers.

It's probably less about splitting the dumps up and more about starting
to split the main Wikidata namespace into more discrete areas, because
without that the full Wikidata graph is hard to partition and the dumps
are hard to split up along functional lines. For example, the latest
Wikidata news was "The
sixty-three millionth item, about a protein, is created." (yay!) - but
there are lots and lots of proteins. If someone is mirroring Wikidata
locally to speed up their queries for, say, an astronomy use case, having
to download, store, and process a bunch of triples about a huge collection
of proteins only makes their life harder. Maybe some of these specialized
collections should go into their own namespace, like "wikidata-proteins" or
"wikidata-biology". The project can have some guidelines about how
"notable" an item has to be before it gets moved into "wikidata-core".
Hemoglobin, yeah, that probably belongs in "wikidata-core".
"MGG_03181-t26_1" aka Q63000000 (which is some protein that's been found in
rice blast fungus) - well, maybe that's not quite notable enough just yet,
but is certainly still valuable to some subset of the community.

Federated queries mean that this isn't too much harder to manage from a
usability standpoint. If my local graph query processor/database knows that
it has large chunks of Wikidata mirrored into it, it doesn't need to use
federated SPARQL to make remote network calls to wikidata.org's WDQS to
resolve my query - but if it stumbles across a graph item that it needs to
follow back across the network to wikidata.org, it can.
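
For example, here's roughly what that looks like as a federated SPARQL
query against a hypothetical local astronomy mirror (just a sketch - the
endpoint split and the choice of properties are my own illustration, not
anything Wikidata prescribes). The star patterns are answered from the
local mirror; only the SERVICE block crosses the network to
wikidata.org's WDQS:

  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?star ?discoverer ?discovererLabel WHERE {
    # Answered locally from the mirrored astronomy triples.
    ?star wdt:P31 wd:Q523 ;        # instance of: star
          wdt:P61 ?discoverer .    # discoverer or inventor
    # The discoverer items are assumed not to be mirrored, so follow
    # them back across the network to wikidata.org's query service.
    SERVICE <https://query.wikidata.org/sparql> {
      ?discoverer rdfs:label ?discovererLabel .
      FILTER(LANG(?discovererLabel) = "en")
    }
  }
  LIMIT 10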

And wikidata.org could, and still should, strive to manage as many
entities in its knowledge base as possible, and load as many of these
different datasets as it can into its local graph database to feed the
WDQS, potentially even knowledge bases that aren't from wikidata.org.
That way, federated queries that previously would have had to make
network calls can instead just be integrated into the local query plan
and hopefully run much faster.
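
To make that concrete (same caveats as the sketch above), once the
relevant triples have been loaded into the local store the SERVICE block
simply disappears, and the whole query becomes a single local join for
the query planner:

  PREFIX wd:   <http://www.wikidata.org/entity/>
  PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  SELECT ?star ?discoverer ?discovererLabel WHERE {
    ?star wdt:P31 wd:Q523 ;        # instance of: star
          wdt:P61 ?discoverer .    # discoverer or inventor
    # With the discoverer items loaded locally there is no SERVICE
    # call; the optimizer sees one local join.
    ?discoverer rdfs:label ?discovererLabel .
    FILTER(LANG(?discovererLabel) = "en")
  }
  LIMIT 10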

-Erik


On Fri, May 3, 2019 at 9:50 AM Darren Cook <dar...@dcook.org> wrote:

> > Wikidata grows like mad. This is something we all experience in the
> > really bad response times we are suffering. It is so bad that people
> > are asked what kind of updates they are running because it makes a
> > difference in the lag times there are.
> >
> > Given that Wikidata is growing like a weed, ...
>
> As I've delved deeper into Wikidata I get the feeling it is being
> developed with the assumption of infinite resources, and no strong
> guidelines on exactly what the scope is (i.e. where you draw the line
> between what belongs in Wikidata and what does not).
>
> This (and concerns about it being open to data vandalism) has personally
> made me back off a bit. I'd originally planned to have Wikidata be the
> primary data source, but I'm now leaning towards keeping data tables and
> graphs outside, with scheduled scripts to import into Wikidata, and
> export from Wikidata.
>
> > For the technical guys, consider our growth and plan for at least
> > one year.
>
> The 37GB (json, bz2) data dump file (it was already 33GB, twice the size
> of the English Wikipedia dump, when I grabbed it last November) is
> unwieldy. And, as no incremental changes are being published, it is hard
> to create a mirror.
>
> Can that dump file be split up in some functional way, I wonder?
>
> Darren
>
>
> --
> Darren Cook, Software Researcher/Developer
>
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
