Hi!

With Wikidata Query Service usage rising and more use cases being
found, it is time to consider caching infrastructure for results, since
queries are expensive. One of the questions I would like to solicit
feedback on is the following:

Should the default SPARQL endpoint be cached or uncached? If cached,
which default cache duration would be good for most users? The cache, of
course, applies only to the results of the same (identical) query.
Please also note that the following is not an implementation plan but
rather an opinion poll. Whatever we end up deciding, we will announce
the actual plan before we do it.

Also, whichever default we choose, it should be possible to get both
cached and uncached results. The question is which one you get when you
access the endpoint with no options. So the possible variants are:

1. query.wikidata.org/sparql is uncached; to get a cached result you use
something like query.wikidata.org/sparql?cached=120 to get a result no
more than 120 seconds old.
PRO: least surprise for default users.
CON: relies on the goodwill of tool writers; if somebody doesn't know
about the cache option and uses the same query heavily, we would have to
ask them to use the parameter.

2. query.wikidata.org/sparql is cached for a short duration (e.g. 1
minute) by default; if you'd like a fresh result, you do something like
query.wikidata.org/sparql?cached=0. If you're fine with an older result,
you can use query.wikidata.org/sparql?cached=3600 and get the cached
result if it's still in the cache, but by default you never get a result
older than 1 minute. This of course assumes Varnish magic can do this;
if not, the scheme has to be amended.
PRO: performance improvement while keeping default results reasonably
fresh.
CON: it is not obvious that the result may be stale rather than the
freshest data, so if you update something in Wikidata and query again
within a minute, you can be surprised.

3. query.wikidata.org/sparql is cached for a long duration (e.g. hours)
by default; if you'd like a fresher result you do something like
query.wikidata.org/sparql?cached=120 to get a result no older than 2
minutes, or cached=0 if you want an uncached one.
PRO: best performance improvement for most queries, works well with
queries that display data that rarely changes, such as lists, etc.
CON: for people who don't know about the cache option, it may be rather
confusing not to be able to get up-to-date results.
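
To make the difference between the three variants a bit more concrete,
here is a rough Python sketch of the intended semantics. It is only an
illustration, not how Varnish or WDQS would implement it: the class
name, the `cached` argument (standing in for the "something like
?cached=N" parameter above) and the default durations of 0, 60 and 3600
seconds are all assumptions made up for this example.

```python
import time

# Assumed default maximum ages, in seconds, for variants 1, 2 and 3.
# These numbers are illustrative only ("uncached", "1 minute", "hours").
DEFAULT_MAX_AGE = {1: 0, 2: 60, 3: 3600}


class ResultCache:
    """Toy in-memory cache keyed by the exact (identical) query text."""

    def __init__(self, variant, run_query, clock=time.time):
        self.default_max_age = DEFAULT_MAX_AGE[variant]
        self.run_query = run_query  # callable that actually runs the query
        self.clock = clock          # injectable so the sketch can be tested
        self.entries = {}           # query text -> (stored_at, result)

    def get(self, query, cached=None):
        """Return (result, from_cache).

        `cached` mirrors the hypothetical ?cached=N parameter: the oldest
        result, in seconds, the caller is willing to accept. None means
        "no option given", so the variant's default applies.
        """
        max_age = self.default_max_age if cached is None else cached
        now = self.clock()
        entry = self.entries.get(query)
        if entry is not None and max_age > 0 and now - entry[0] <= max_age:
            return entry[1], True
        # Cache miss, entry too old, or caller asked for fresh data.
        result = self.run_query(query)
        self.entries[query] = (now, result)
        return result, False
```

With variant 2, for example, a plain `get(query)` repeated within a
minute is served from the cache, `get(query, cached=0)` always reruns
the query, and `get(query, cached=3600)` accepts anything up to an hour
old.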

So we'd like to hear - especially from current SPARQL endpoint users -
what do you think about these and which would work for you?

Also, for the users of the WDQS GUI - provided we have cached and
uncached options, which one should the GUI return by default? Should it
always be uncached? Performance there is not a major question - the
traffic to the GUI is pretty low - but rather convenience. Of course, if
you run a cached query from the GUI and the data is in the cache, you
can get results much faster for some queries. OTOH, it may be important
in many cases to be able to access the actual up-to-date content, not
the cached version.

I also created a poll: https://phabricator.wikimedia.org/V8
so please feel free to vote for your favorite option.

OK, this letter is long enough already so I'll stop here and wait to
hear what everybody's thinking.

Thanks in advance,
-- 
Stas Malyshev
smalys...@wikimedia.org

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
