Some datatypes already support federation, in a limited way. The globe used
with the globecoordinate datatype can be set to any URL. While it's usually
http://www.wikidata.org/entity/Q2 (Earth) on Wikidata, it could refer to a
foreign entity on a different Wikibase instance, allowing for some federation.
The same goes for the unit in a quantity, and the calendarmodel in a date/time.
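In the RDF mapping the globe ends up as wikibase:geoGlobe on the coordinate’s
value node, so you can already see which globe IRIs are in use with a query
like this (a small sketch, runnable against query.wikidata.org):

    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX p:   <http://www.wikidata.org/prop/>
    PREFIX psv: <http://www.wikidata.org/prop/statement/value/>

    # Coordinate (P625) statements whose globe is something other than Earth (Q2)
    SELECT ?item ?globe WHERE {
      ?item p:P625 ?stmt .
      ?stmt psv:P625 ?valueNode .
      ?valueNode wikibase:geoGlobe ?globe .
      FILTER(?globe != wd:Q2)
    }
    LIMIT 10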
---- On Sat, 22 Feb 2020 20:05:57 -0500 Erik Paulson <[email protected]>
wrote ----
[Apologies for length]
I hope to build a knowledge graph of current and past public officials in
Wisconsin, both elected and appointed. This entire dataset is not appropriate
for Wikidata, as much of it will not meet the notability guidelines - very few
people care about who was on the planning commission of the Town of Westport
here in Dane County, Wisconsin back in 2006.
I would like to use Wikibase as the software to manage this hyper-specific
knowledge graph. Wikibase is easy to use for displaying and editing data.
Wikibase is type-aware: when a datatype is richer than a string or an integer,
Wikibase can give a richer display, such as showing a coordinate on a map.
Wikibase can export data to a SPARQL query service. With upcoming projects like
WBStack, Wikibase could be available as a service and might be a great place
for me to host my hyper-specific knowledge graph, provided there is a very
clear path for how I could export my data and seamlessly import it into a new
Wikibase install with minimal disruption, should WBStack ever shut down.
I would like to use data from Wikidata (or federate back to Wikidata) as much
as possible, for two main reasons.
First, while much of my data is not notable, there are items that will overlap
with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer
not to have to duplicate the work that has already been done collecting
statements about Westport, such as population or geographical footprint. Those
statements, however, would be useful in queries in my hyper-specific knowledge
graph - a typical query might be to find members of planning commissions of
towns with populations between 2500 and 5000 people.
Second, and possibly more important, there is a very useful ontology that has
developed through Wikidata, defining a set of classes for the world, and the
properties and datatypes to describe those classes, and the relationships that
connect those classes together. For example, ‘member of’, P463 on Wikidata, is
a property that my hyper-specific knowledge graph would also use quite heavily.
Equally important is that Wikidata has talk pages, mailing lists, and wiki
pages that have hashed out over the years exactly how to use this ontology and
what are best practices. Does P463 apply only to people or can other things be
‘members of’? Five years ago Wikidata editors thought about it and agreed that
it’s probably OK for things other than humans to be ‘members of’ other things.
The knowledge and best practices from Wikidata can be reused in other knowledge
graphs even if they’re not stored in Wikidata, and people who know how to model
and query data in Wikidata will feel at home in non-Wikidata knowledge graphs.
I’ve been thinking about what my options are to mix data between Wikidata and
my hyper-specific knowledge graph. I am not a Wikidata expert so parts of this
might be incorrect.
The simplest possible thing I can do is to federate through SPARQL engines. I
can dump my hyper-specific graph into a local triple store, and use SERVICE
statements in SPARQL to reach out into Wikidata and fetch data in my queries to
combine the two datasets.
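A sketch of what that could look like for the planning-commission query above,
assuming my local triples reuse Wikidata IRIs for the towns; the ex: prefix
and the ex:memberOfPlanningCommission property are hypothetical stand-ins for
whatever my local graph actually uses:

    PREFIX ex:  <https://example.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    # Local triples: who sits on which town's planning commission.
    # Wikidata side: the town's population (P1082), fetched via SERVICE.
    SELECT ?person ?town ?population WHERE {
      ?person ex:memberOfPlanningCommission ?town .
      SERVICE <https://query.wikidata.org/sparql> {
        ?town wdt:P1082 ?population .
        FILTER(?population >= 2500 && ?population <= 5000)
      }
    }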
As an improvement, I can load the Wikidata triples into my triplestore and
query the unified dataset directly, with fully namespaced IRIs in the triples.
As an added bonus, any query that worked on http://query.wikidata.org would
work unchanged against my local query service. Queries that use both datasets
should be faster than federating through SERVICE, but the challenge is that
the Wikidata dumps are huge and take days to load. Worse, it’s difficult to
get incremental changes, so to stay current I’d have to reload many gigabytes
every few days.
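With everything in one store, the same hypothetical sketch collapses to a
plain join, no SERVICE needed:

    PREFIX ex:  <https://example.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?person ?town ?population WHERE {
      ?person ex:memberOfPlanningCommission ?town .   # my local triples
      ?town wdt:P1082 ?population .                   # imported Wikidata triples
      FILTER(?population >= 2500 && ?population <= 5000)
    }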
However, what I really want is to have as much relevant Wikidata data as
possible in my local Wikibase, so all the great Wikibase features work, like
fulltext autocomplete to edit entities, or ensuring that property constraints
can be applied, or that qualifiers are cleanly handled. One red flag is that I
don’t think I’ve ever seen anyone with a good import of Wikidata into a local
Wikibase, or even a dump of a smaller Wikibase instance into a new Wikibase
instance. It’s probably even harder than importing into the query service. I am
also pretty sure that I don’t want to try to import an entire Wikidata dataset
into a local Wikibase, because Wikidata is huge and there’s no good way to get
an incremental set of changes.
One idea that probably isn’t possible today is to store a statement in a local
Wikibase that references a remote Wikibase. I wish that I could create a
statement in my local Wikibase: Q2 P1
<http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities created
in my local Wikibase. As a workaround, I could do something like put a
statement of Q2 P1 Q1 in my local Wikibase, and then put a property on my
local Q1 to say that the local Q1 has an external identifier of Q3505274 on
Wikidata. A fancier option would be for Wikibase to notice when an entity
references a remote Wikibase and automatically mirror that entity from the
remote Wikibase into the local Wikibase.
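Even with just the workaround, the query side seems workable. A sketch, where
lwdt:P10 is a hypothetical local ‘Wikidata QID’ external-identifier property
and the lwdt: prefix stands in for my local Wikibase’s direct-property
namespace:

    PREFIX lwdt: <https://example.org/prop/direct/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

    # Follow the stored QID out to Wikidata at query time.
    SELECT ?localItem ?population WHERE {
      ?localItem lwdt:P10 ?qid .                       # e.g. "Q3505274"
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem wdt:P1082 ?population .                # population of the town
      }
    }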
I don’t think that I would care that my local Q and P numbers don’t match up
with any data federated into my local Wikibase, though to reduce cognitive load
I would probably try to load common properties to matching P numbers at least
(I think I will always want P31 and P279 to mean the same thing as Wikidata’s
P31 and P279, even if I have to create a bunch of empty properties to ensure
the numbering is right.)
Being able to customize the IRI prefixes in my RDF exports is a must-have,
though. It looks like that is possible today, if a bit clunky.
When editing items through the Wikibase UI, P and Q numbers already don’t
matter much, because autocomplete hides all of that. If I’m adding a statement
about ‘educated at’, I don’t need to know the P number for ‘educated at’ or
the Q number for the school involved. It would be nice if that were easier in
the query service, so I could use labels in place of P and Q numbers.
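Some of this already exists on the query side: the label service takes care of
output labels, and matching on rdfs:label gets partway there for input. A
sketch that should run against query.wikidata.org:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX bd: <http://www.bigdata.com/rdf#>

    # Look the property up by its English label instead of its P number.
    SELECT ?person ?personLabel ?school WHERE {
      ?prop rdfs:label "educated at"@en ;
            wikibase:directClaim ?p .
      ?person ?p ?school .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10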
What is probably most realistic today for my use case is just manually
“federating” to Wikidata by copying Wikidata entities into my Wikibase as
needed, with new local Q and P numbers. For my hyper-specific knowledge graph,
I probably don’t need to pull that many entities from Wikidata, so as I
discover that I need a new entity, I can check whether Wikidata already has
one and import it first.
My subset is probably much smaller than any of the subsets envisioned in the
recent discussions about “concise” or “notable” dumps of Wikidata. I assume
that my hyper-specific knowledge graph would want a few hundred properties from
Wikidata and only a few thousand items from Wikidata at most.
I will treat the entities that I pull from Wikidata as “read-only” copies -
most of them would be anyway. If there is a new statement I need to make about
one of these, like updating something about the Town of Westport, it is
probably of interest to other Wikidata consumers and the edit should be made on
Wikidata and federated back. I can track the (much smaller) set that I mirrored
from Wikidata and periodically refresh them, so I don’t need to try to process
an entire dump.
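One way to drive that refresh, sketched with the same hypothetical lwdt:P10
‘Wikidata QID’ property plus a hypothetical lwdt:P11 ‘date mirrored’ property;
schema:dateModified is what Wikidata actually exposes in its RDF:

    PREFIX schema: <http://schema.org/>
    PREFIX lwdt:   <https://example.org/prop/direct/>

    # Mirrored items whose Wikidata source has changed since we copied them.
    SELECT ?localItem ?qid ?remoteModified WHERE {
      ?localItem lwdt:P10 ?qid ;          # Wikidata QID, e.g. "Q3505274"
                 lwdt:P11 ?mirroredAt .   # when we last copied it
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem schema:dateModified ?remoteModified .
      }
      FILTER(?remoteModified > ?mirroredAt)
    }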
A nice potential feature of Wikibase would be the ability to explicitly flag
an entity as mirrored, so that Wikibase prevents it from being edited locally
and includes provenance information pointing back to the Wikibase (here,
Wikidata) it came from. Another feature that might be nice (though I might be
wrong and maybe it’s a bad idea) is that if Wikibase knew an entity was an
import, it could dump the triples using the same IRIs as the source system
when exporting to RDF.
So, in summary, I would like to be able to reuse some data from Wikidata in my
local Wikibase, but I am concerned about:
- I want a subset of Wikidata, but only a very small one
- How do I track the entities copied from Wikidata into my local Wikibase so I
can update them as new statements are added or updated on Wikidata?
- How can I make it easy for people who know the ontology and data already in
Wikidata to be able to edit the hyper-specific knowledge graph in my local
Wikibase?
- How can I make it easy to query my hyper-specific knowledge graph using
SPARQL while maximizing query reuse between the Wikidata query service and my
local query service, potentially to the point of having the same query work on
both if the query only involves the Wikidata subset?
Thanks,
-Erik
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata