The earth is not flat :)

I appreciate all of your thoughts in this thread, Amirouche.
A linked data fragments approach w/ a [thick client] seems to me to be one
useful option to explore and benchmark.

Other possible thoughts:
~ have some core highly-used subgraphs within which queries are lightning
fast.
~ give queriers the option to search only the fast subgraphs, and the
option to set a short query timeout (see the sketch below).
~ give queriers quick estimates of how much load a query will impose.
~ set the default query timeout to be quite short (while letting any user
raise their default to some cap, just like we can set how many results we
want to see on RC / history pages).
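
To make the timeout idea concrete, here is a minimal sketch of a client
that gives up on a WDQS query after a 5-second budget (the query and the
budget are illustrative; the public endpoint keeps its own server-side
cap of roughly 60 seconds regardless):

    import requests

    QUERY = """
    SELECT ?item ?itemLabel WHERE {
      ?item wdt:P31 wd:Q146 .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    }
    LIMIT 5
    """

    try:
        # Client-side budget: stop waiting after 5 seconds.
        r = requests.get(
            "https://query.wikidata.org/sparql",
            params={"query": QUERY, "format": "json"},
            timeout=5,
        )
        r.raise_for_status()
        for row in r.json()["results"]["bindings"]:
            print(row["itemLabel"]["value"])
    except requests.exceptions.Timeout:
        print("query exceeded the 5 s budget; narrow it or raise the cap")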

//S

On Mon, Feb 10, 2020 at 1:53 PM Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> On Mon, Feb 10, 2020 at 18:23, Marco Neumann <marco.neum...@gmail.com>
> wrote:
> >
> > why all the sad faces?
>
> > the Semantic Web will be distributed after all
>
> The semantic Web is already distributed.
>
> > and there is no need to stuff everything into one graph.
>
> Putting everything into one graph, or if you prefer into one place, is
> the gist of the idea of a library or an encyclopedia.
>
> > it just requires us as an RDF community to spend more time developing
> ideas around efficient query distribution
>
> Maybe. But that does not preclude the aggregation, or sum, of knowledge
> from happening.
>
> > and focus on relationships and links in wikidata
>
> As I wrote above, a distributed knowledge base is already the state of
> things. I am not sure how to understand that part of the
> sentence.
>
> > rather than building a monolithic database
>
> That is the gist of my proposal.  Without the ability to run wikidata
> at a small scale, WMF will fail at knowledge equity.
>
> > for humongous arbitrary joins and table scans
>
> I proposed something along the lines of
> https://linkeddatafragments.org, also known as "thin server, thick
> client", but I had no feedback :(
>
> > as a free for all.
>
> With that, I heartily agree.  The ability to downscale the wikidata
> infrastructure, and to make companies and institutions pay for the
> stream of changes to apply to their local instances, will make things
> much easier.
>
> > The slogan "sum of all human knowledge" in one place should not be taken
> too literally.
>
> I disagree.
>
> >
> > it's, I believe, what wikidata as a project already does in any event.
> Actually, the SPARQL endpoint, as an extension to the wikidata
> architecture around wikibase, should be used more pro-actively to connect
> multiple RDF data providers for search.
>
> Read my proposal at
> https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
>
> The title is misleading; I intend to change it to Future-proof
> WikiData.  WDQS, or querying in general, is an integral part of wikidata
> and must not be merely an add-on.
>
> > I would think that this is already a common use case for wikidata users
> who enrich their remote queries with wikidata data.
>
> I do not understand.  Yes, people enrich wikidata queries with their data.
> And?
>
> > All that said it's quite an achievement to scale the wikidata SPARQL
> endpoint to where it is now.
> > Congratulations to the team and I look forward to seeing more of it in
> the future.
>
> Yes, I agree with that.  Congratulations!  I am very proud to be part
> of the Wikimedia community.
>
> The current WMF proposal is called "sharding"; see details at:
>
>   https://en.wikipedia.org/wiki/Shard_(database_architecture)
>
> It is not future-proof.  I have not done any analysis, but I bet that
> most of the 2 TB of wikidata is English, so even if you shard by
> language, you will still end up with a gigantic graph.  Also, most of
> the data is not specific to a natural language, so one cannot
> possibly split the data by language.
>
> If WMF comes up with another sharding strategy, how will edits that
> span multiple regions happen?
>
> How will it make entering the wikidata party easier?
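>
> To make the question concrete, the textbook strategy is to hash-shard
> triples by their subject. A toy sketch (the shard count is arbitrary):
>
>     import hashlib
>
>     N_SHARDS = 4
>
>     def shard_for(subject: str) -> int:
>         """Route every triple to a shard based on its subject IRI."""
>         return int(hashlib.sha1(subject.encode()).hexdigest(), 16) % N_SHARDS
>
>     # A join or edit touching Q42 and Q64 may span two shards, which is
>     # exactly where the coordination problem starts.
>     print(shard_for("http://www.wikidata.org/entity/Q42"),
>           shard_for("http://www.wikidata.org/entity/Q64"))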
>
> I dare to write it in the open: it seems to me that we are witnessing an
> "Earth is flat vs. Earth is not flat" kind of event.
>
>
> Thanks for the reply!
>
>
> > On Mon, Feb 10, 2020 at 4:11 PM Amirouche Boubekki <
> amirouche.boube...@gmail.com> wrote:
> >>
> >> Hello Guillaume,
> >>
> >> On Fri, Feb 7, 2020 at 14:33, Guillaume Lederrey
> >> <gleder...@wikimedia.org> wrote:
> >> >
> >> > Hello all!
> >> >
> >> > First of all, my apologies for the long silence. We need to do better
> in terms of communication. I'll try my best to send a monthly update from
> now on. Keep me honest, remind me if I fail.
> >> >
> >>
> >> It would be nice to have some feedback on my grant request at:
> >>
> >>   https://meta.wikimedia.org/wiki/Grants:Project/Future-proof_WDQS
> >>
> >> Or one of the other threads on the very same mailing list.
> >>
> >> > Another attempt to get update lag under control is to apply back
> pressure on edits, by adding the WDQS update lag to the Wikidata maxlag
> [6]. This is obviously less than ideal (at least as long as WDQS updates
> are lagging as often as they are), but does allow the service to recover
> from time to time. We probably need to iterate on this, provide better
> granularity, differentiate better between operations that have an impact on
> update lag and those which don't.
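> >>
> >> For context, this is what that back pressure looks like from a
> >> well-behaved editing client: a minimal sketch of the standard maxlag
> >> retry loop (the 5-second threshold is the conventional default):
> >>
> >>     import time
> >>     import requests
> >>
> >>     API = "https://www.wikidata.org/w/api.php"
> >>
> >>     def call_with_backoff(params):
> >>         # maxlag=5 asks the API to refuse the request while the
> >>         # tracked lag exceeds 5 seconds.
> >>         params = {**params, "maxlag": 5, "format": "json"}
> >>         while True:
> >>             reply = requests.post(API, data=params, timeout=30).json()
> >>             if reply.get("error", {}).get("code") != "maxlag":
> >>                 return reply
> >>             time.sleep(5)  # lagged: back off, then retry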
> >> >
> >> > On the slightly better news side, we now have a much better
> understanding of the update process and of its shortcomings. The current
> process does a full diff between each updated entity and what we have in
> blazegraph. Even if a single triple needs to change, we still read tons of
> data from Blazegraph. While this approach is simple and robust, it is
> obviously not efficient. We need to rewrite the updater to take a more
> event streaming / reactive approach, and only work on the actual changes.
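> >>
> >> A toy illustration of that full-diff mechanic: dump the entity's
> >> triples, diff them against what the store holds, and write only the
> >> delta. Computing the diff still means reading everything first,
> >> which is the inefficiency described above:
> >>
> >>     stored = {("Q42", "P31", "Q5"),
> >>               ("Q42", "rdfs:label", "Douglas Adams")}
> >>     updated = {("Q42", "P31", "Q5"),
> >>                ("Q42", "rdfs:label", "Douglas Adams (author)")}
> >>
> >>     to_delete = stored - updated
> >>     to_insert = updated - stored
> >>     print(to_delete, to_insert)  # one triple out, one triple in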
> >>
> >> Even when that is done, it will still be a short-term solution.
> >>
> >> > This is a big chunk of work, almost a complete rewrite of the updater,
> >>
> >> > and we need a new solution to stream changes with guaranteed ordering
> (something that our kafka queues don't offer). This is where we are
> focusing our energy at the moment; this looks like the best option to
> improve the situation in the medium term. This change will probably have
> some functional impacts [3].
> >>
> >> Guaranteed ordering in a multi-party distributed setting has no easy
> >> solution, and apparently it is not provided by Kafka.  For a
> >> non-technical introduction, read
> >> https://en.wikipedia.org/wiki/Two_Generals%27_Problem
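> >>
> >> What Kafka does guarantee is ordering within a single partition, so
> >> keying every change event by its entity id gives per-entity ordering
> >> (though not a global order). A sketch using kafka-python; the broker
> >> address and topic name are illustrative:
> >>
> >>     from kafka import KafkaProducer  # pip install kafka-python
> >>
> >>     producer = KafkaProducer(bootstrap_servers="localhost:9092")
> >>     producer.send(
> >>         "wikidata-changes",
> >>         key=b"Q42",  # all Q42 events land in one partition, in order
> >>         value=b'{"entity": "Q42", "revision": 1234}',
> >>     )
> >>     producer.flush()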
> >>
> >> > Some longer term thoughts:
> >> >
> >> > Keeping all of Wikidata in a single graph is most probably not going
> to work long term.
> >>
> >> :(
> >>
> >> > We have not found examples of public SPARQL endpoints with > 10 B
> triples and there is probably a good reason for that.
> >>
> >> Because Wikimedia is the only non-profit in the field?
> >>
> >> > We will probably need to split the graphs at some point.
> >>
> >> :(
> >>
> >> > We don't know how yet
> >>
> >> :(
> >>
> >> > (that's why we loaded the dumps into Hadoop, that might give us some
> more insight).
> >>
> >> :(
> >>
> >> > We might expose a subgraph with only truthy statements. Or have
> language-specific graphs, with only language-specific labels.
> >>
> >> :(
> >>
> >> > Or something completely different.
> >>
> >> :)
> >>
> >> > Keeping WDQS / Wikidata as open as they are at the moment might not
> be possible in the long term. We need to think if / how we want to
> implement some form of authentication and quotas.
> >>
> >> That could be done with blacklists and whitelists, but it is a huge
> >> undertaking anyway.
> >>
> >> > Potentially increasing quotas for some use cases, but keeping them
> strict for others. Again, we don't know what this will look like, but we're
> thinking about it.
> >>
> >> > What you can do to help:
> >> >
> >> > Again, we're not sure. Of course, reducing the load (both in terms of
> edits on Wikidata and of reads on WDQS) will help. But not using those
> services makes them useless.
> >>
> >> What about making the lag part of the service?  I mean, you could
> >> reload WDQS periodically, for instance daily, and drop the updater
> >> altogether. Who needs to see edits appear live in WDQS as soon as
> >> they are made in wikidata?
> >>
> >> > We suspect that some use cases are more expensive than others (a
> single property change to a large entity will require a comparatively
> insane amount of work to update it on the WDQS side). We'd like to have
> real data on the cost of various operations, but we only have guesses at
> this point.
> >> >
> >> > If you've read this far, thanks a lot for your engagement!
> >> >
> >> >   Have fun!
> >> >
> >>
> >> Will do.
> >>
> >
> > --
> >
> >
> > ---
> > Marco Neumann
> > KONA
> >
>
>
>
> --
> Amirouche ~ https://hyper.dev
>
>


-- 
Samuel Klein          @metasj           w:user:sj          +1 617 529 4266
_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
