Some datatypes already support federation, in a limited way. The globe used
with the globecoordinate datatype can be set to any URL. While it's usually
http://www.wikidata.org/entity/Q2 (Earth) on Wikidata, it could refer to a
foreign entity on a different Wikibase instance, allowing for some federation.
The same goes for the unit in a quantity, and the calendarmodel in a date/time.
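In the RDF mapping the globe ends up as wikibase:geoGlobe on the coordinate’s
value node, so you can already see which globe IRIs are in use with a query
like this (a small sketch, runnable against query.wikidata.org):

    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX p:   <http://www.wikidata.org/prop/>
    PREFIX psv: <http://www.wikidata.org/prop/statement/value/>

    # Coordinate (P625) statements whose globe is something other than Earth (Q2)
    SELECT ?item ?globe WHERE {
      ?item p:P625 ?stmt .
      ?stmt psv:P625 ?valueNode .
      ?valueNode wikibase:geoGlobe ?globe .
      FILTER(?globe != wd:Q2)
    }
    LIMIT 10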
---- On Sat, 22 Feb 2020 20:05:57 -0500 Erik Paulson <[email protected]>
wrote ----
[Apologies for length]
I hope to build a knowledge graph of current and past public officials in
Wisconsin, both elected and appointed. This entire dataset is not appropriate
for Wikidata, as much of it will not meet the notability guidelines - very few
people care about who was on the planning commission of the Town of Westport
here in Dane County, Wisconsin back in 2006.
I would like to use Wikibase as the software to manage this hyper-specific
knowledge graph. Wikibase is easy to use for displaying and editing data.
Wikibase is type-aware: when a datatype is richer than a string or an integer,
Wikibase can give a richer display, such as showing a coordinate on a map.
Wikibase can export data to a SPARQL query service. With upcoming projects like
WBStack, Wikibase could be available as a service and might be a great place
for me to host my hyper-specific knowledge graph, provided there is a very
clear path for how I could export my data and seamlessly import it into a new
Wikibase install with minimal disruption, should WBStack ever shut down.
I would like to use data from Wikidata (or federate back to Wikidata) as much
as possible, for two main reasons.
First, while much of my data is not notable, there are items that will overlap
with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer
not to have to duplicate the work that has already been done collecting
statements about Westport, such as population or geographical footprint. Those
statements, however, would be useful in queries in my hyper-specific knowledge
graph - a typical query might be to find members of planning commissions of
towns with populations between 2500 and 5000 people.
Second, and possibly more important, there is a very useful ontology that has
developed through Wikidata, defining a set of classes for the world, and the
properties and datatypes to describe those classes, and the relationships that
connect those classes together. For example, ‘member of’, P463 on Wikidata, is
a property that my hyper-specific knowledge graph would also use quite heavily.
Equally important is that Wikidata has talk pages, mailing lists, and wiki
pages that have hashed out over the years exactly how to use this ontology and
what are best practices. Does P463 apply only to people or can other things be
‘members of’? Five years ago Wikidata editors thought about it and agreed that
it’s probably OK for things other than humans to be ‘members of’ other things.
The knowledge and best practices from Wikidata can be reused in other knowledge
graphs even if they’re not stored in Wikidata, and people who know how to model
and query data in Wikidata will feel at home in non-Wikidata knowledge graphs.
I’ve been thinking about what my options are to mix data between Wikidata and
my hyper-specific knowledge graph. I am not a Wikidata expert so parts of this
might be incorrect.
The simplest possible thing I can do is to federate through SPARQL engines. I
can dump my hyper-specific graph into a local triple store, and use SERVICE
statements in SPARQL to reach out into Wikidata and fetch data in my queries to
combine the two datasets.
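A sketch of what that could look like for the planning-commission query above,
assuming my local triples reuse Wikidata IRIs for the towns; the ex: prefix
and the ex:memberOfPlanningCommission property are hypothetical stand-ins for
whatever my local graph actually uses:

    PREFIX ex:  <https://example.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    # Local triples: who sits on which town's planning commission.
    # Wikidata side: the town's population (P1082), fetched via SERVICE.
    SELECT ?person ?town ?population WHERE {
      ?person ex:memberOfPlanningCommission ?town .
      SERVICE <https://query.wikidata.org/sparql> {
        ?town wdt:P1082 ?population .
        FILTER(?population >= 2500 && ?population <= 5000)
      }
    }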
As an improvement, I can load the Wikidata triples into my triplestore and
query the unified dataset directly, with fully namespaced IRIs in the triples.
As an added bonus, any query that worked on http://query.wikidata.org would
work unchanged against my local query service. Queries that use both datasets
should be faster than federating through SERVICE, but the challenge is that
the Wikidata dumps are huge and take days to load. Worse, it’s difficult to
get incremental changes, so to stay current I’d have to reload many gigabytes
every few days.
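With everything in one store, the same hypothetical sketch collapses to a
plain join, no SERVICE needed:

    PREFIX ex:  <https://example.org/prop/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?person ?town ?population WHERE {
      ?person ex:memberOfPlanningCommission ?town .   # my local triples
      ?town wdt:P1082 ?population .                   # imported Wikidata triples
      FILTER(?population >= 2500 && ?population <= 5000)
    }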
However, what I really want is to have as much relevant Wikidata data as
possible in my local Wikibase, so all the great Wikibase features work, like
fulltext autocomplete to edit entities, or ensuring that property constraints
can be applied, or that qualifiers are cleanly handled. One red flag is that I
don’t think I’ve ever seen anyone with a good import of Wikidata into a local
Wikibase, or even a dump of a smaller Wikibase instance into a new Wikibase
instance. It’s probably even harder than importing into the query service. I am
also pretty sure that I don’t want to try to import an entire Wikidata dataset
into a local Wikibase, because Wikidata is huge and there’s no good way to get
an incremental set of changes.
One idea that probably isn’t possible today is to store a statement in a local
Wikibase that references a remote Wikibase. I wish that I could create a
statement in my local Wikibase: Q2 P1
<http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities created
in my local Wikibase. As a workaround, I could do something like put a
statement of Q2 P1 Q1 in my local Wikibase, and then put a property on my
local Q1 to say that the local Q1 has an external identifier of Q3505274 on
Wikidata. A fancier option would be for Wikibase to notice when an entity
references a remote Wikibase and automatically mirror that entity from the
remote Wikibase into the local Wikibase.
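Even with just the workaround, the query side seems workable. A sketch, where
lwdt:P10 is a hypothetical local ‘Wikidata QID’ external-identifier property
and the lwdt: prefix stands in for my local Wikibase’s direct-property
namespace:

    PREFIX lwdt: <https://example.org/prop/direct/>
    PREFIX wdt:  <http://www.wikidata.org/prop/direct/>

    # Follow the stored QID out to Wikidata at query time.
    SELECT ?localItem ?population WHERE {
      ?localItem lwdt:P10 ?qid .                       # e.g. "Q3505274"
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem wdt:P1082 ?population .                # population of the town
      }
    }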
I don’t think that I would care that my local Q and P numbers don’t match up
with any data federated into my local Wikibase, though to reduce cognitive load
I would probably try to load common properties to matching P numbers at least
(I think I will always want P31 and P279 to mean the same thing as Wikidata’s
P31 and P279, even if I have to create a bunch of empty properties to ensure
the numbering is right.)
Being able to customize the IRI prefixes in my RDF exports is a must-have,
though. It looks like that is possible today, if a bit clunky.
When editing items through the Wikibase UI, P and Q numbers already don’t
matter much, because autocomplete hides all of that. If I’m adding a statement
about ‘educated at’, I don’t need to know the P number for ‘educated at’ or
the Q number for the school involved. It would be nice if that were easier in
the query service, so I could use labels in place of P and Q numbers.
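Some of this already exists on the query side: the label service takes care of
output labels, and matching on rdfs:label gets partway there for input. A
sketch that should run against query.wikidata.org:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX bd: <http://www.bigdata.com/rdf#>

    # Look the property up by its English label instead of its P number.
    SELECT ?person ?personLabel ?school WHERE {
      ?prop rdfs:label "educated at"@en ;
            wikibase:directClaim ?p .
      ?person ?p ?school .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    LIMIT 10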
What is probably most realistic today for my use case is just manually
“federating” to Wikidata by copying Wikidata entities into my Wikibase as
needed, with new local Q and P numbers. For my hyper-specific knowledge graph,
I probably don’t need to pull that many entities from Wikidata, so as I
discover that I need a new entity, I can check whether Wikidata already has
one and import it first.
My subset is probably much smaller than any of the subsets envisioned in the
recent discussions about “concise” or “notable” dumps of Wikidata. I assume
that my hyper-specific knowledge graph would want a few hundred properties from
Wikidata and only a few thousand items from Wikidata at most.
I will treat the entities that I pull from Wikidata as “read-only” copies -
most of them would be anyway. If there is a new statement I need to make about
one of these, like updating something about the Town of Westport, it is
probably of interest to other Wikidata consumers and the edit should be made on
Wikidata and federated back. I can track the (much smaller) set that I mirrored
from Wikidata and periodically refresh them, so I don’t need to try to process
an entire dump.
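One way to drive that refresh, sketched with the same hypothetical lwdt:P10
‘Wikidata QID’ property plus a hypothetical lwdt:P11 ‘date mirrored’ property;
schema:dateModified is what Wikidata actually exposes in its RDF:

    PREFIX schema: <http://schema.org/>
    PREFIX lwdt:   <https://example.org/prop/direct/>

    # Mirrored items whose Wikidata source has changed since we copied them.
    SELECT ?localItem ?qid ?remoteModified WHERE {
      ?localItem lwdt:P10 ?qid ;          # Wikidata QID, e.g. "Q3505274"
                 lwdt:P11 ?mirroredAt .   # when we last copied it
      BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdItem)
      SERVICE <https://query.wikidata.org/sparql> {
        ?wdItem schema:dateModified ?remoteModified .
      }
      FILTER(?remoteModified > ?mirroredAt)
    }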
A nice potential feature of Wikibase would be the ability to explicitly flag
an entity as mirrored, so that Wikibase prevents it from being edited locally
and includes provenance information pointing back to the Wikibase (here,
Wikidata) it came from. Another feature that might be nice (though I might be
wrong and maybe it’s a bad idea) is that if Wikibase knew an entity was an
import, it could dump the triples using the same IRIs as the source system
when exporting to RDF.
So, in summary, I would like to be able to reuse some data from Wikidata in my
local Wikibase, but I am concerned about:
- I want a subset of Wikidata, but only a very small one
- How do I track the entities copied from Wikidata into my local Wikibase so I
can update them as new statements are added or updated on Wikidata?
- How can I make it easy for people who know the ontology and data already in
Wikidata to be able to edit the hyper-specific knowledge graph in my local
Wikibase?
- How can I make it easy to query my hyper-specific knowledge graph using
SPARQL while maximizing query reuse between the Wikidata query service and my
local query service, potentially to the point of having the same query work on
both if the query only involves the Wikidata subset?
Thanks,
-Erik
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata