Hi Erik,

Thanks for taking the time to so thoroughly document your envisioned use case for Wikibase. Feel free to reach out to me ([email protected]) for further discussion -- I'd really like to learn more about how you'll be using Wikibase for your project.
In case you haven't yet seen the Wikidata/Wikibase development plans for 2020, there are some federation-related enhancements to Wikibase planned that could be relevant here -- for example, <https://www.wikidata.org/wiki/Wikidata:Development_plan#Access_Wikidata_Properties_in_custom_Wikibase_instance_(aka_Federation)>.

Best,
Sam

On Sun, 23 Feb 2020 at 02:08, Erik Paulson <[email protected]> wrote:

> [Apologies for length]
>
> I hope to build a knowledge graph of current and past public officials in Wisconsin, both elected and appointed. This entire dataset is not appropriate for Wikidata, as much of it will not meet the notability guidelines - very few people care about who was on the planning commission of the Town of Westport here in Dane County, Wisconsin back in 2006.
>
> I would like to use Wikibase as the software to manage this hyper-specific knowledge graph. Wikibase is easy to use for displaying and editing data. Wikibase is type-aware: if a type is richer than a string or an integer, Wikibase can give a richer display, like showing a coordinate on a map. Wikibase can export data to a SPARQL query service. With upcoming projects like WBStack, Wikibase could be available as a service and might be a great place for me to host my hyper-specific knowledge graph, provided there is a very clear path for exporting my data and seamlessly importing it into a new Wikibase install with minimal disruption, should WBStack ever shut down.
>
> I would like to use data from Wikidata (or federate back to Wikidata) as much as possible, for two main reasons.
>
> First, while much of my data is not notable, there are items that will overlap with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer not to duplicate the work that has already been done collecting statements about Westport, such as population or geographical footprint. Those statements, however, would be useful in queries against my hyper-specific knowledge graph - a typical query might be to find members of planning commissions of towns with populations between 2500 and 5000 people.
>
> Second, and possibly more important, there is a very useful ontology that has developed through Wikidata: a set of classes for the world, the properties and datatypes to describe those classes, and the relationships that connect those classes together. For example, 'member of' (P463 on Wikidata) is a property that my hyper-specific knowledge graph would also use quite heavily. Equally important, Wikidata has talk pages, mailing lists, and wiki pages that have hashed out over the years exactly how to use this ontology and what the best practices are. Does P463 apply only to people, or can other things be 'members of'? Five years ago Wikidata editors thought about it and agreed that it's probably OK for things other than humans to be 'members of' other things. The knowledge and best practices from Wikidata can be reused in other knowledge graphs even if they're not stored in Wikidata, and people who know how to model and query data in Wikidata will feel at home in non-Wikidata knowledge graphs.
>
> I've been thinking about what my options are for mixing data between Wikidata and my hyper-specific knowledge graph. I am not a Wikidata expert, so parts of this might be incorrect.
>
> The simplest possible thing I can do is to federate through SPARQL engines. I can dump my hyper-specific graph into a local triple store and use SERVICE statements in SPARQL to reach out into Wikidata and fetch data in my queries, combining the two datasets.
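> A minimal sketch of that kind of query, assuming made-up local predicates (ex:memberOf, ex:planningCommissionOf) and assuming my local graph reuses Wikidata IRIs for the towns it shares with Wikidata - this is the "members of planning commissions of towns with populations between 2500 and 5000" query from above:
>
>   PREFIX ex:  <https://example.org/wi-kg/>
>   PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
>   SELECT ?person ?town WHERE {
>     # local triples: who sits on which town's planning commission
>     ?person ex:memberOf ?commission .
>     ?commission ex:planningCommissionOf ?town .
>     # fetch each town's population (P1082) from Wikidata
>     SERVICE <https://query.wikidata.org/sparql> {
>       ?town wdt:P1082 ?population .
>     }
>     FILTER(2500 <= ?population && ?population <= 5000)
>   }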
> As an improvement, I can load the Wikidata triples into my triplestore and query the unified dataset directly, with all triples using fully namespaced IRIs. As an added bonus, any query that worked on query.wikidata.org would work unchanged against my local triplestore. Queries that use both datasets should be faster than federating through SERVICE, but the challenge is that the Wikidata dumps are huge and take days to load. Worse, it's difficult to get incremental changes, so to stay current I'd have to reload many gigabytes every few days.
>
> However, what I really want is to have as much relevant Wikidata data as possible in my local Wikibase, so that all the great Wikibase features work, like full-text autocomplete when editing entities, property constraints being applied, and qualifiers being handled cleanly. One red flag is that I don't think I've ever seen anyone with a good import of Wikidata into a local Wikibase, or even a dump of a smaller Wikibase instance imported into a new Wikibase instance. It's probably even harder than importing into the query service. I am also pretty sure that I don't want to try to import the entire Wikidata dataset into a local Wikibase, because Wikidata is huge and there's no good way to get an incremental set of changes.
>
> One idea that probably isn't possible today is to store a statement in a local Wikibase that references a remote Wikibase. I wish that I could create a statement in my local Wikibase: Q2 P1 <http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities created in my local Wikibase. As a workaround, I could put a statement Q2 P1 Q1 in my local Wikibase, and then put a property on my local Q1 to say that the local Q1 has an external identifier of Q3505274 on Wikidata (I sketch below how queries could follow this link back to Wikidata). A fancier option would be for Wikibase to notice that an entity references a remote Wikibase and mirror that entity from the remote Wikibase into the local one.
>
> I don't think I would care that my local Q and P numbers don't match up with any data federated into my local Wikibase, though to reduce cognitive load I would probably try to load common properties at matching P numbers at least (I think I will always want P31 and P279 to mean the same thing as Wikidata's P31 and P279, even if I have to create a bunch of empty properties to ensure the numbering is right).
>
> Being able to customize the IRI prefixes in my RDF exports is a must-have, though. It looks like that is possible today, if a bit clunky.
>
> P and Q numbers already don't matter much when editing items through the Wikibase UI, because autocomplete hides all of that. If I'm adding a statement about 'educated at', I don't need to know the P number for 'educated at' or the Q number for the school involved. It would be nice if that were easier in the query service too, so I could use labels in place of P and Q numbers.
>
> What is probably most realistic today for my use case is just manually "federating" to Wikidata by copying Wikidata entities into my Wikibase as needed, with new local Q and P numbers. For my hyper-specific knowledge graph, I probably don't need to pull that many entities from Wikidata, so as I discover that I need to create an entity, I can check whether Wikidata already has one and import it first.
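> With the external-identifier workaround above, queries can still join my local copies back to live Wikidata data. A sketch of what I have in mind, with a made-up local prefix and a hypothetical local property P10 ("Wikidata QID") stored as an external identifier:
>
>   PREFIX lwdt: <https://my-wikibase.example/prop/direct/>
>   PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
>
>   SELECT ?localTown ?population WHERE {
>     ?localTown lwdt:P10 ?qid .    # e.g. "Q3505274" on my local Westport item
>     # rebuild the Wikidata IRI from the stored QID
>     BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdTown)
>     SERVICE <https://query.wikidata.org/sparql> {
>       ?wdTown wdt:P1082 ?population .    # population, fetched live from Wikidata
>     }
>   }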
> My subset is probably much smaller than any of the subsets envisioned in the recent discussions about "concise" or "notable" dumps of Wikidata. I assume that my hyper-specific knowledge graph would want a few hundred properties and, at most, a few thousand items from Wikidata.
>
> I will treat the entities that I pull from Wikidata as "read-only" copies - most of them would be anyway. If there is a new statement I need to make about one of these, like updating something about the Town of Westport, it is probably of interest to other Wikidata consumers, and the edit should be made on Wikidata and federated back. I can track the (much smaller) set that I mirrored from Wikidata and periodically refresh it, so I don't need to try to process an entire dump.
>
> A nice potential feature for Wikibase would be the ability to explicitly flag an entity as mirrored, so that Wikibase prevents it from being edited locally and includes provenance information pointing back to the Wikibase (here, Wikidata) it came from. Another feature that might be nice (but I might be wrong, and maybe this is a bad feature) is that if Wikibase knew an entity was an import, it could also dump the triples using the same IRIs as the source system when exporting to RDF.
>
> So, perhaps in summary, I would like to be able to reuse some data from Wikidata in my local Wikibase, but I am concerned about:
>
> - I want a subset of Wikidata, but only a very small subset.
> - How do I track the entities copied from Wikidata into my local Wikibase so I can update them as new statements are added or updated on Wikidata?
> - How can I make it easy for people who already know the ontology and data in Wikidata to edit the hyper-specific knowledge graph in my local Wikibase?
> - How can I make it easy to query my hyper-specific knowledge graph using SPARQL while maximizing query reuse between the Wikidata query service and my local query service, potentially to the point of having the same query work on both if the query only involves the Wikidata subset? (A sketch of this follows below.)
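> To illustrate that last point: assuming my RDF export is configured to reuse Wikidata's wd:/wdt: prefixes for the mirrored entities, a query that touches only the Wikidata subset should run unchanged on query.wikidata.org and on my local query service, e.g.:
>
>   PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
>   # every 'member of' (P463) relationship in the graph - the property
>   # means the same thing in both places, so the query does too
>   SELECT ?person ?body WHERE {
>     ?person wdt:P463 ?body .
>   }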
> Thanks,
>
> -Erik
>
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
Samantha Alipio
Product Manager, Wikibase

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de

Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us to achieve our vision!
https://spenden.wikimedia.de

Wikimedia Deutschland – Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
