Hi Erik,

Thanks for taking the time to so thoroughly document your envisioned use case for Wikibase. Feel free to reach out to me ([email protected]) for further discussion -- I'd really like to learn more about how you'll be using Wikibase for your project.
In case you haven't yet seen the Wikidata/Wikibase development plans for 2020, there are some federation-related enhancements to Wikibase planned that could be relevant here -- for example, <https://www.wikidata.org/wiki/Wikidata:Development_plan#Access_Wikidata_Properties_in_custom_Wikibase_instance_(aka_Federation)>.

Best,
Sam

On Sun, 23 Feb 2020 at 02:08, Erik Paulson <[email protected]> wrote:

> [Apologies for length]
>
> I hope to build a knowledge graph of current and past public officials in Wisconsin, both elected and appointed. This entire dataset is not appropriate for Wikidata, as much of it will not meet the notability guidelines - very few people care about who was on the planning commission of the Town of Westport here in Dane County, Wisconsin back in 2006.
>
> I would like to use Wikibase as the software to manage this hyper-specific knowledge graph. Wikibase is easy to use for displaying and editing data. Wikibase is type-aware: if a type is richer than a string or an integer, Wikibase can give a richer display, like showing a coordinate on a map. Wikibase can export data to a SPARQL query service. With upcoming projects like WBStack, Wikibase could be available as a service and might be a great place for me to host my hyper-specific knowledge graph, provided there is a very clear path for exporting my data and seamlessly importing it into a new Wikibase install with minimal disruption, should WBStack ever shut down.
>
> I would like to use data from Wikidata (or federate back to Wikidata) as much as possible, for two main reasons.
>
> First, while much of my data is not notable, there are items that will overlap with Wikidata. The Town of Westport is Q3505274 in Wikidata, and I would prefer not to duplicate the work that has already been done collecting statements about Westport, such as population or geographical footprint. Those statements, however, would be useful in queries against my hyper-specific knowledge graph - a typical query might be to find members of planning commissions of towns with populations between 2500 and 5000 people.
>
> Second, and possibly more important, there is a very useful ontology that has developed through Wikidata: a set of classes for the world, the properties and datatypes to describe those classes, and the relationships that connect those classes together. For example, 'member of' (P463 on Wikidata) is a property that my hyper-specific knowledge graph would also use quite heavily. Equally important, Wikidata has talk pages, mailing lists, and wiki pages that have hashed out over the years exactly how to use this ontology and what the best practices are. Does P463 apply only to people, or can other things be 'members of'? Five years ago Wikidata editors thought about it and agreed that it's probably OK for things other than humans to be 'members of' other things. The knowledge and best practices from Wikidata can be reused in other knowledge graphs even if they're not stored in Wikidata, and people who know how to model and query data in Wikidata will feel at home in non-Wikidata knowledge graphs.
>
> I've been thinking about what my options are for mixing data between Wikidata and my hyper-specific knowledge graph. I am not a Wikidata expert, so parts of this might be incorrect.
>
> The simplest possible thing I can do is to federate through SPARQL engines. I can dump my hyper-specific graph into a local triple store and use SERVICE statements in SPARQL to reach out into Wikidata and fetch data in my queries, combining the two datasets.
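> A minimal sketch of that kind of query, assuming made-up local predicates (ex:memberOf, ex:planningCommissionOf) and assuming my local graph reuses Wikidata IRIs for the towns it shares with Wikidata - this is the "members of planning commissions of towns with populations between 2500 and 5000" query from above:
>
>   PREFIX ex:  <https://example.org/wi-kg/>
>   PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
>   SELECT ?person ?town WHERE {
>     # local triples: who sits on which town's planning commission
>     ?person ex:memberOf ?commission .
>     ?commission ex:planningCommissionOf ?town .
>     # fetch each town's population (P1082) from Wikidata
>     SERVICE <https://query.wikidata.org/sparql> {
>       ?town wdt:P1082 ?population .
>     }
>     FILTER(2500 <= ?population && ?population <= 5000)
>   }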
> As an improvement, I can load the Wikidata triples into my triplestore and query the unified dataset directly, with all triples using fully namespaced IRIs. As an added bonus, any query that worked on query.wikidata.org would work unchanged against my local triplestore. Queries that use both datasets should be faster than federating through SERVICE, but the challenge is that the Wikidata dumps are huge and take days to load. Worse, it's difficult to get incremental changes, so to stay current I'd have to reload many gigabytes every few days.
>
> However, what I really want is to have as much relevant Wikidata data as possible in my local Wikibase, so that all the great Wikibase features work, like full-text autocomplete when editing entities, property constraints being applied, and qualifiers being handled cleanly. One red flag is that I don't think I've ever seen anyone with a good import of Wikidata into a local Wikibase, or even a dump of a smaller Wikibase instance imported into a new Wikibase instance. It's probably even harder than importing into the query service. I am also pretty sure that I don't want to try to import the entire Wikidata dataset into a local Wikibase, because Wikidata is huge and there's no good way to get an incremental set of changes.
>
> One idea that probably isn't possible today is to store a statement in a local Wikibase that references a remote Wikibase. I wish that I could create a statement in my local Wikibase: Q2 P1 <http://www.wikidata.org/entity/Q3505274>, where Q2 and P1 are entities created in my local Wikibase. As a workaround, I could put a statement Q2 P1 Q1 in my local Wikibase, and then put a property on my local Q1 to say that the local Q1 has an external identifier of Q3505274 on Wikidata (I sketch below how queries could follow this link back to Wikidata). A fancier option would be for Wikibase to notice that an entity references a remote Wikibase and mirror that entity from the remote Wikibase into the local one.
>
> I don't think I would care that my local Q and P numbers don't match up with any data federated into my local Wikibase, though to reduce cognitive load I would probably try to load common properties at matching P numbers at least (I think I will always want P31 and P279 to mean the same thing as Wikidata's P31 and P279, even if I have to create a bunch of empty properties to ensure the numbering is right).
>
> Being able to customize the IRI prefixes in my RDF exports is a must-have, though. It looks like that is possible today, if a bit clunky.
>
> P and Q numbers already don't matter much when editing items through the Wikibase UI, because autocomplete hides all of that. If I'm adding a statement about 'educated at', I don't need to know the P number for 'educated at' or the Q number for the school involved. It would be nice if that were easier in the query service too, so I could use labels in place of P and Q numbers.
>
> What is probably most realistic today for my use case is just manually "federating" to Wikidata by copying Wikidata entities into my Wikibase as needed, with new local Q and P numbers. For my hyper-specific knowledge graph, I probably don't need to pull that many entities from Wikidata, so as I discover that I need to create an entity, I can check whether Wikidata already has one and import it first.
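> With the external-identifier workaround above, queries can still join my local copies back to live Wikidata data. A sketch of what I have in mind, with a made-up local prefix and a hypothetical local property P10 ("Wikidata QID") stored as an external identifier:
>
>   PREFIX lwdt: <https://my-wikibase.example/prop/direct/>
>   PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
>
>   SELECT ?localTown ?population WHERE {
>     ?localTown lwdt:P10 ?qid .    # e.g. "Q3505274" on my local Westport item
>     # rebuild the Wikidata IRI from the stored QID
>     BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?qid)) AS ?wdTown)
>     SERVICE <https://query.wikidata.org/sparql> {
>       ?wdTown wdt:P1082 ?population .    # population, fetched live from Wikidata
>     }
>   }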
> My subset is probably much smaller than any of the subsets envisioned in the recent discussions about "concise" or "notable" dumps of Wikidata. I assume that my hyper-specific knowledge graph would want a few hundred properties and, at most, a few thousand items from Wikidata.
>
> I will treat the entities that I pull from Wikidata as "read-only" copies - most of them would be anyway. If there is a new statement I need to make about one of these, like updating something about the Town of Westport, it is probably of interest to other Wikidata consumers, and the edit should be made on Wikidata and federated back. I can track the (much smaller) set that I mirrored from Wikidata and periodically refresh it, so I don't need to try to process an entire dump.
>
> A nice potential feature for Wikibase would be the ability to explicitly flag an entity as mirrored, so that Wikibase prevents it from being edited locally and includes provenance information pointing back to the Wikibase (here, Wikidata) it came from. Another feature that might be nice (but I might be wrong, and maybe this is a bad feature) is that if Wikibase knew an entity was an import, it could also dump the triples using the same IRIs as the source system when exporting to RDF.
>
> So, perhaps in summary, I would like to be able to reuse some data from Wikidata in my local Wikibase, but I am concerned about:
>
> - I want a subset of Wikidata, but only a very small subset.
> - How do I track the entities copied from Wikidata into my local Wikibase so I can update them as new statements are added or updated on Wikidata?
> - How can I make it easy for people who already know the ontology and data in Wikidata to edit the hyper-specific knowledge graph in my local Wikibase?
> - How can I make it easy to query my hyper-specific knowledge graph using SPARQL while maximizing query reuse between the Wikidata query service and my local query service, potentially to the point of having the same query work on both if the query only involves the Wikidata subset? (A sketch of this follows below.)
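> To illustrate that last point: assuming my RDF export is configured to reuse Wikidata's wd:/wdt: prefixes for the mirrored entities, a query that touches only the Wikidata subset should run unchanged on query.wikidata.org and on my local query service, e.g.:
>
>   PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>
>   # every 'member of' (P463) relationship in the graph - the property
>   # means the same thing in both places, so the query does too
>   SELECT ?person ?body WHERE {
>     ?person wdt:P463 ?body .
>   }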
> Thanks,
>
> -Erik
>
> _______________________________________________
> Wikidata mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata

--
Samantha Alipio
Product Manager, Wikibase

Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24 | 10963 Berlin
Phone: +49 (0)30 219 158 26-0
https://wikimedia.de

Imagine a world in which every single human being can freely share in the sum of all knowledge. Help us to achieve our vision!
https://spenden.wikimedia.de

Wikimedia Deutschland – Gesellschaft zur Förderung Freien Wissens e. V. Registered in the register of associations of the Amtsgericht Berlin-Charlottenburg under number 23855 B. Recognized as a non-profit organization by the Finanzamt für Körperschaften I Berlin, tax number 27/029/42207.

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
