Wikidata has a huge number of labels in a large number of languages. Would indexing strategies based on the language of the string literal be a good thing? Encoding the language in the literal is an RDF design choice, and it might indeed not be the best choice for performance. But shouldn't a query planner/rewriter be able to detect a pattern like « FILTER(LANG(?label) = "en") » and take advantage of such an index?
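For instance, the pattern such a rewriter would need to spot looks like the sketch below (a hypothetical query shape, not any tool's actual query; ?prop and ?label are illustrative names):

  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

  # The language filter a planner/rewriter could detect: instead of
  # materializing every label literal and testing its language tag,
  # it could probe a per-language label index directly (assuming such
  # an index existed).
  SELECT ?prop ?label WHERE {
    ?prop a wikibase:Property ;
          rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }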
Retrieving labels is important in general, and doing this efficiently might be something that makes a difference…

On Fri, Nov 5, 2021 at 11:55, David Causse <[email protected]> wrote:

> Hi Thad,
>
> I looked at this query and I have nothing to add to what was already
> suggested to make it run faster.
> I think the main issue is the size of the intermediate results that have
> to have the language filter applied; sadly, almost every time a FILTER is
> used on a string literal, Blazegraph might have to fetch its
> representation from its lexicon, which incurs a huge slowdown.
> Regarding indices and ordering, I believe the right indices are being
> used (otherwise the query would certainly time out), but I doubt it can
> filter all English labels before joining them to the property labels.
>
> The criterion ?prop wdt:P31/wdt:P279* wd:Q18616576 does indeed seem
> useless to me and is pulling a couple of false positives [1] into the
> join (totally harmless regarding query perf, but perhaps it should be
> cleaned up in Wikidata?).
>
> So filtering & fetching the textual data is indeed what makes this query
> slow. I tried various combinations but could not come up with reasonable
> & stable sub-second response times. Fetching the textual data (possibly
> lazily) from another service might help, but that would certainly be a
> substantial rewrite of the client relying on this query.
>
> Caching is definitely going to help, especially if this data is not
> subject to rapid/frequent changes. The WDQS infrastructure has a caching
> layer, but retention might not be long enough to be useful for this
> particular tool. The JSON output does indeed seem quite big (almost
> 5 MB); while not enormous, it's still sizeable, and if this data is
> relatively stable there might be value in refreshing it on purpose
> (daily, as you suggest) and making it available on static storage.
>
> Another note about response times: you may see varying response times
> from the query service, and the reasons might be one of the following:
> - it's cached in the query service caching layer (generally sub-100ms
>   response time)
> - the server the query hits is heavily loaded
> - the server the query hits is an old hardware generation (we have 2
>   different kinds of hardware setups in the cluster at the moment, which
>   might explain some of the variance you see).
>
> Hope it helps a bit,
>
> Regards,
>
> David.
>
>
> 1: https://w.wiki/4Lae
>
> On Wed, Nov 3, 2021 at 11:39 PM Thad Guidry <[email protected]> wrote:
>
>> Thanks Kingsley, Thomas, Jeff,
>>
>> From what I see, the live query is never sub-second, and that's likely
>> because of 2 things:
>> 1. indexing not prioritizing this kind of query and aligning to it
>> (which David Causse might know if that could be changed); essentially
>> it's metadata about Wikidata (its available properties).
>> 2. it's 2.2 MB of data
>>
>> I think that Yi Liu's Wikidata Property Explorer service might then
>> want to cache the results for 24 hours instead, for the best of both
>> worlds.
>>
>> To be fair, the raw amount of data requested seems to be approximately
>> 2.2 MB, so it should probably be cached locally by his tool for some
>> determined time (like 24 hours).
>>
>> Thad
>> https://www.linkedin.com/in/thadguidry/
>> https://calendly.com/thadguidry/
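(Coming back to David's point above about fetching the textual data from another service: WDQS already exposes the wikibase label service, which resolves labels after the join rather than filtering every literal in the main pattern. A minimal sketch, assuming the tool only needs English property labels:

  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX bd: <http://www.bigdata.com/rdf#>

  # Let the label service bind ?propLabel in English, instead of
  # applying FILTER(LANG(?label) = "en") over every label literal.
  SELECT ?prop ?propLabel WHERE {
    ?prop a wikibase:Property .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }

Whether this path actually materializes fewer literals than the FILTER does is an open question, though.)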
