[Wikidata] Re: Inconsistencies on WDQS data - data reload on WDQS

Kingsley Idehen via Wikidata Fri, 24 Feb 2023 06:27:27 -0800


On 2/23/23 4:17 PM, James Heald wrote:

On 23/02/2023 20:08, Kingsley Idehen via Wikidata wrote:
On 2/23/23 12:19 PM, James Heald wrote:
I have to say I am a bit concerned by this talk, since some of Blazegraph's "features and quirks" can be exceedingly useful.
That isn't a justification for tightly coupling a Query Tool to a Query Service Endpoint, especially when an open standard (in the form of SPARQL) exists.
Of course it's a good thing to be able to swap out the back-end and to be able to run essentially the same queries against other realisations of the database.
It's also a good thing to be able to clone the user interface and use essentially the same UI with a different back-end. (As I understand it, this should be very possible).

Good to hear, since that's my fundamental point re loosely-coupled architecture enabled by open standards.

But. There are features which have been listed in the desiderata for WDQS from the very start, that go beyond what the out-of-the-box SPARQL 1.1 standard offers.

Therein lies the problem. A standards-based client can include extensions for a specific back-end in configurable form, based on loose-coupling principles. Doing it any other way is what's generally known as a leaky abstraction, which ultimately racks up technical debt.
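As a rough illustration of back-end extensions held in configurable form, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the EndpointProfile class, the extension names, the marker strings and the example.org URLs are hypothetical and are not part of WDQS or of any SPARQL implementation.

```python
from dataclasses import dataclass

# Hypothetical mapping from an extension name to the query text that signals it.
EXTENSION_MARKERS = {
    "geo-search": ("wikibase:around", "wikibase:box"),
    "named-subquery": ("INCLUDE %",),
}


@dataclass(frozen=True)
class EndpointProfile:
    """Configuration entry describing one SPARQL endpoint (illustrative only)."""
    url: str
    extensions: frozenset = frozenset()


def required_extensions(query: str) -> set:
    """Return the names of non-standard extensions the query text appears to use."""
    return {
        name
        for name, markers in EXTENSION_MARKERS.items()
        if any(marker in query for marker in markers)
    }


def unsupported_extensions(query: str, profile: EndpointProfile) -> set:
    """Extensions the query needs that the configured endpoint does not declare."""
    return required_extensions(query) - profile.extensions
```

A client built this way could warn about, rewrite or reroute a query when unsupported_extensions() is non-empty, instead of assuming one particular back-end.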

An example of technical debt that's manifesting right now is an inability to diffuse the costs of the Wikidata Knowledge Graph across a federation of SPARQL query service providers. This doesn't have to be the case at all, bearing in mind the nature of SPARQL and structured data represented using RDF.

Most notable among these is the ability to retrieve items with coordinates close to a particular point on the earth's surface. (Something which, as the Blazegraph developers discovered, can be implemented fairly easily if you add a "Z-order curve" index on coordinate values https://en.wikipedia.org/wiki/Z-order_curve ).
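For readers unfamiliar with the idea, a Z-order (Morton) key can be sketched in a few lines of Python. This is a generic illustration and not Blazegraph's implementation; the function names, the 16-bit grid resolution and the quantisation scheme are all assumptions.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # x bits occupy even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # y bits occupy odd positions
    return code


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] onto an integer grid cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return round((value - lo) / (hi - lo) * cells)


def z_order_key(lat: float, lon: float, bits: int = 16) -> int:
    """One sortable integer key for a coordinate pair (illustrative scheme)."""
    return interleave_bits(
        quantize(lon, -180.0, 180.0, bits),
        quantize(lat, -90.0, 90.0, bits),
        bits,
    )
```

Because the key never decreases when either coordinate increases, every point inside a bounding box has a key between the keys of the box's south-west and north-east corners. An ordinary sorted index on the key can therefore narrow a box search to one key range, with an exact coordinate check to discard false positives.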

None of that would be lost in a WDQS instance configured to discover the SPARQL query endpoint and associated capabilities.

Not all users will have an interest in geographical objects. Those who don't will lose little if they hook up a back-end that doesn't provide this, because presumably they won't be running queries which require it. But those who do need this functionality need this indexing.



See my comment above.

Given that this was something the Blazegraph developers (all 3 of them) found they could add relatively easily; and given that it seems to me that any database back-end would gain considerable cachet by being able to run wikidata queries, it seems to me not unreasonable to approach potential alternative back-ends and see how easily they too might be able to add a Z-order curve index for coordinate values, plus basic functionality to make use of it. (Where wikibase:box and wikibase:around are about as basic as it gets).
Andrea suggested a more GeoSPARQL-orientated solution ( https://wikitech.wikimedia.org/wiki/User:AndreaWest/Blazegraph_Features_and_Capabilities#SPARQL_Functional_Extensions ), but that seems to me a much much bigger ask; I do suspect that (for almost all contending projects) the simple wikibase:box and wikibase:around services would be a lot more easily implemented, to free us from our tight-coupling to Blazegraph, yet still provide this functionality, which I do believe is a needed requirement.
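To show how little the two services need to compute, here is a minimal Python sketch of "around" and "box" filtering over plain coordinate pairs. The function and parameter names are illustrative and are not the WDQS service parameters; the sketch assumes a spherical Earth with a mean radius of 6371 km and ignores boxes that cross the antimeridian.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def around(points: dict, center: tuple, radius_km: float) -> list:
    """Names of points within radius_km of center; points maps name -> (lat, lon)."""
    clat, clon = center
    return [
        name
        for name, (lat, lon) in points.items()
        if haversine_km(lat, lon, clat, clon) <= radius_km
    ]


def box(points: dict, south_west: tuple, north_east: tuple) -> list:
    """Names of points inside the lat/lon rectangle (no antimeridian handling)."""
    (south, west), (north, east) = south_west, north_east
    return [
        name
        for name, (lat, lon) in points.items()
        if south <= lat <= north and west <= lon <= east
    ]
```

A real back-end would run these checks only on candidates already narrowed down by a spatial index, rather than scanning every coordinate.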
As for named subqueries, as well as making queries much more readable, IMO they may be particularly valuable as a way to specify particular optimisations (i.e. sequencing of query execution, which may be absolutely *crucial* if a query is to run) in a particularly readable and **portable** way -- certainly when compared to optimiser "hint" syntaxes, which may be tied *very* specifically to a particular back-end.
Why do I think named subqueries are so portable, if they are not part of the SPARQL 1.1 standard, and most providers don't support them?
The answer is that, if necessary, it would require only a fairly simple pre-processor script to turn them into inline sub-queries, which *are* supported by the standard.
Named sub-queries do have the advantage, though, of making the query a lot more readable; they can also be useful to indicate to the back-end that the sub-query need only be retrieved once, rather than repeatedly each time it is referenced (which may be helpful for some back-ends).
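A pre-processor of the kind described above might look roughly like the following Python sketch. It assumes the Blazegraph-style syntax "WITH { ... } AS %name" and "INCLUDE %name". It also assumes that braces never appear inside string literals, IRIs or comments, and that a named subquery does not itself include another one. The function names are hypothetical.

```python
import re

_WITH = re.compile(r"\bWITH\s*\{", re.IGNORECASE)
_AS = re.compile(r"\s*AS\s+%(\w+)", re.IGNORECASE)
_INCLUDE = re.compile(r"\bINCLUDE\s+%(\w+)", re.IGNORECASE)


def _matching_brace(text: str, open_pos: int) -> int:
    """Return the index of the '}' that closes the '{' found at open_pos."""
    depth = 0
    for i in range(open_pos, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced braces in named subquery")


def inline_named_subqueries(query: str) -> str:
    """Rewrite named subqueries as standard SPARQL 1.1 inline subqueries."""
    named = {}
    while True:
        with_match = _WITH.search(query)
        if with_match is None:
            break
        open_pos = with_match.end() - 1
        close_pos = _matching_brace(query, open_pos)
        as_match = _AS.match(query, close_pos + 1)
        if as_match is None:
            raise ValueError("expected 'AS %name' after WITH { ... }")
        # Remember the subquery body, then cut the whole WITH clause out.
        named[as_match.group(1)] = query[open_pos + 1:close_pos].strip()
        query = query[:with_match.start()] + query[as_match.end():]

    # Each INCLUDE becomes a braced copy of the subquery it names.
    return _INCLUDE.sub(lambda m: "{ " + named[m.group(1)] + " }", query)
```

Note that inlining repeats the subquery text at every INCLUDE, so any "evaluate once" benefit then depends on the target back-end's optimiser.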

These implementation details aren't really relevant to the fundamental point I am trying to make about the virtues of loosely-coupled architecture facilitated by existing open standards (e.g., SPARQL).

So: I don't disagree that it would be useful if WDQS was less tightly dependent on Blazegraph.
But: rather than going straight to removing good features, I think there is a lot of scope for seeing whether the dev teams for other back-ends could be persuaded to match the features on those back-ends without too much difficulty; and that this would be a better path to at least investigate, in preference to breaking swathes of queries that are in active use.

Nothing I've said amounts to feature removal. Everything I've said is simply about loosely-coupled architecture as a guiding principle for making WDQS usable against other SPARQL endpoints :)



Kingsley

   --  James.







_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/G4YAKE5S5JU22SPKHFIMWAG43MKKPSIL/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org



--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
              http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
        : http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this


_______________________________________________
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/ZF6JVCYLSCUDALV7S3MZILUFLWRGXA6T/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org
