Hello!

Thanks for the suggestions!

On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <[email protected]> wrote:

> Is the Write Retention Queue adequate?
> Is the branching factor for the lexicon indices too large, resulting in a
> non-linear slowdown in the write rate over time?
> Did you look into Small Slot Optimization?
> Are the Write Cache Buffers adequate?
> Is there a lot of Heap pressure?
> Does the MemoryManager have the maximum amount of RAM it can handle? 4TB?
> Is the RWStore handling the recycling well?
> Is the SAIL Buffer Capacity adequate?
> Are you not using exact range counts where you could be using fast range
> counts?
>
>
> Start at the hardware side first, however.
> Is the disk activity for writes really low... and CPU very high? In that
> case you have identified a bottleneck; discover WHY that would be the
> case by looking into any of the above.
>

Those sound like good questions, but they are outside of my area of
expertise. I've created https://phabricator.wikimedia.org/T238362 to track
them, and I'll see if someone can have a look. I know that we did multiple
passes at tuning Blazegraph properties, with limited success so far.
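
For context, most of that tuning lives in Blazegraph's RWStore.properties.
Here is a minimal sketch of the kind of knobs involved, roughly matching
your questions above; the values are illustrative only, not our production
configuration:

  # B+Tree write retention queue: larger values keep more dirty nodes
  # in memory before they are evicted to disk.
  com.bigdata.btree.writeRetentionQueue.capacity=8000
  # Branching factor for the B+Tree indices; a poor value can hurt the
  # write rate as the indices grow.
  com.bigdata.btree.BTree.branchingFactor=128
  # Number of statements the SAIL buffers before an incremental flush.
  com.bigdata.rdf.sail.bufferCapacity=100000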


> and 100+ other things that should be looked at, all of which affect WRITE
> performance during UPDATES.
>
> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>
> I would also suggest you start monitoring some of the internals of
> Blazegraph (JAVA) while in production with tools such as XRebel or
> AppDynamics.
>

Both XRebel and AppDynamics are proprietary, so there is no way we'll
deploy them in our environment. We are tracking a few JMX-based metrics,
but so far we don't really know what to look for.
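
In case someone wants to poke at the same data, here is a minimal sketch
of how such metrics can be polled over plain JMX, no proprietary tooling
required. The host, port and class name are illustrative, not our actual
setup; it assumes the Blazegraph JVM was started with remote JMX enabled
on port 9999:

  import java.lang.management.ManagementFactory;
  import java.lang.management.MemoryMXBean;
  import javax.management.MBeanServerConnection;
  import javax.management.ObjectName;
  import javax.management.remote.JMXConnector;
  import javax.management.remote.JMXConnectorFactory;
  import javax.management.remote.JMXServiceURL;

  public class JmxPoll {
      public static void main(String[] args) throws Exception {
          JMXServiceURL url = new JMXServiceURL(
              "service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi");
          try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
              MBeanServerConnection mbsc = connector.getMBeanServerConnection();
              // Heap usage, from the standard JVM memory MBean.
              MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                  mbsc, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
              System.out.println("heap used: "
                  + memory.getHeapMemoryUsage().getUsed());
              // Collection counts per garbage collector, as a rough
              // indicator of heap pressure.
              for (ObjectName gc : mbsc.queryNames(
                      new ObjectName("java.lang:type=GarbageCollector,*"), null)) {
                  System.out.println(gc.getKeyProperty("name") + ": "
                      + mbsc.getAttribute(gc, "CollectionCount"));
              }
          }
      }
  }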

Thanks!

  Guillaume

> Thad
> https://www.linkedin.com/in/thadguidry/
>
>
> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
> [email protected]> wrote:
>
>> Thanks for the feedback!
>>
>> On Thu, Nov 14, 2019 at 11:11 AM <[email protected]> wrote:
>>
>>>
>>> Besides waiting for the new updater, it may be useful to tell us what
>>> we as users can do too. It is unclear to me what the problem is. For
>>> instance, at one point I was worried that the many parallel requests to
>>> the SPARQL endpoint that we make in Scholia were a problem. As far as I
>>> understand, that is not a problem at all. Another issue could be the
>>> way that we use Magnus Manske's QuickStatements and approve bots for
>>> high-frequency editing. Perhaps a better overview of, and constraints
>>> on, large-scale editing could be discussed?
>>>
>>
>> To be (again) completely honest, we don't entirely understand the issue
>> either. There are clearly multiple related issues. In high-level terms,
>> we have at least:
>>
>> * Some part of the update process on Blazegraph is CPU-bound and
>> single-threaded. Even with low query load, if we have a high edit rate,
>> Blazegraph can't keep up and saturates a single CPU (with plenty of
>> available resources on other CPUs). This is a hard issue to fix,
>> requiring either splitting the processing over multiple CPUs or sharding
>> the data over multiple servers, neither of which Blazegraph supports (at
>> least not in our current configuration).
>> * There is a race for resources between edits and queries: a high query
>> load will impact the update rate. This could to some extent be mitigated by
>> reducing the query load: if no one is using the service, it works great!
>> Obviously that's not much of a solution.
>>
>> What you can do (short term):
>>
>> * Keep bot usage well behaved (don't do parallel queries, provide a
>> meaningful user agent, smooth the load over time if possible, ...); see
>> the sketch after this list. As far as I can see, most usage is already
>> well behaved.
>> * Optimize your queries: better queries will use fewer resources, which
>> should help. Time to completion is a good approximation of the resources
>> used. I don't really have any more specific advice; SPARQL is not my
>> area of expertise.
>>
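>> For illustration, here is a minimal sketch of what I mean by well
>> behaved, in Java 11+. The bot name and contact details in the user agent
>> are placeholders, not a real tool; the point is sequential, throttled
>> requests with an identifiable agent:
>>
>>   import java.net.URI;
>>   import java.net.URLEncoder;
>>   import java.net.http.HttpClient;
>>   import java.net.http.HttpRequest;
>>   import java.net.http.HttpResponse;
>>   import java.nio.charset.StandardCharsets;
>>
>>   public class PoliteWdqsClient {
>>       public static void main(String[] args) throws Exception {
>>           HttpClient client = HttpClient.newHttpClient();
>>           // Identify yourself: tool name, version, contact (placeholder).
>>           String userAgent =
>>               "MyBot/0.1 (https://example.org/mybot; [email protected])";
>>           String query = "SELECT ?p ?o WHERE { wd:Q42 ?p ?o } LIMIT 10";
>>           HttpRequest request = HttpRequest.newBuilder()
>>               .uri(URI.create("https://query.wikidata.org/sparql?format=json&query="
>>                   + URLEncoder.encode(query, StandardCharsets.UTF_8)))
>>               .header("User-Agent", userAgent)
>>               .GET()
>>               .build();
>>           // One request at a time, with a pause before the next one:
>>           // no parallel queries, load smoothed over time.
>>           HttpResponse<String> response =
>>               client.send(request, HttpResponse.BodyHandlers.ofString());
>>           System.out.println(response.body());
>>           Thread.sleep(1000);
>>       }
>>   }
>>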
>> What you can do (longer term):
>>
>> * Help us think out of the box. Can we identify higher-level use cases?
>> Could we implement some of our workflows on a higher-level API than
>> SPARQL, which might allow for more internal optimizations?
>> * Help us better understand the constraints. Document use cases on [1].
>>
>> Sadly, we don't have the bandwidth right now to engage meaningfully in
>> this conversation. Feel free to send your thoughts already, but don't
>> expect a timely response.
>>
>>> Yet another thought is the large discrepancy between the Virginia and
>>> Texas data centers, as I could see on Grafana [1]. As far as I
>>> understand, the hardware (and software) are the same. So why is there
>>> this large difference? Rather than editing or Blazegraph, could the
>>> issue be some form of network problem?
>>>
>>
>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works,
>> we see more traffic on eqiad than on codfw.
>>
>> Thanks for the help!
>>
>>    Guillaume
>>
>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>
>>
>>
>>>
>>>
>>> [1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>
>>> /Finn
>>>
>>>
>>>
>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>> > Hello all!
>>> >
>>> > As you've probably noticed, the update lag on the public WDQS
>>> > endpoint [1] is not doing well [2], with lag climbing to > 12h for
>>> > some servers. We are tracking this on phabricator [3]; subscribe to
>>> > that task if you want to stay informed.
>>> >
>>> > To be perfectly honest, we don't have a good short-term solution. The
>>> > graph database that we are using at the moment (Blazegraph [4]) does
>>> > not easily support sharding, so even throwing hardware at the problem
>>> > isn't really an option.
>>> >
>>> > We are working on a few medium-term improvements:
>>> >
>>> > * A dedicated updater service in Blazegraph, which should help
>>> > increase the update throughput [5]. Fingers crossed, this should be
>>> > ready for initial deployment and testing by next week (no promises,
>>> > we're doing the best we can).
>>> > * Some improvement in the parallelism of the updater [6]. This has
>>> > just been identified. While it will probably also provide some
>>> > improvement in throughput, we haven't actually started working on it
>>> > and we don't have any numbers at this point.
>>> >
>>> > Longer term:
>>> >
>>> > We are hiring a new team member to work on WDQS. It will take some
>>> > time to get this person up to speed, but we should have more capacity
>>> > to address the deeper issues of WDQS by January.
>>> >
>>> > The two main points we want to address are:
>>> >
>>> > * Finding a triple store that scales better than our current solution.
>>> > * Better understanding what the use cases on WDQS are, and seeing if
>>> > we can provide a technical solution that is better suited. Our
>>> > intuition is that some of the use cases that require synchronous (or
>>> > quasi-synchronous) updates would be better implemented outside of a
>>> > triple store. Honestly, we have no idea yet if this makes sense or
>>> > what those alternate solutions might be.
>>> >
>>> > Thanks a lot for your patience during this tough time!
>>> >
>>> >     Guillaume
>>> >
>>> >
>>> > [1] https://query.wikidata.org/
>>> > [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>> > [3] https://phabricator.wikimedia.org/T238229
>>> > [4] https://blazegraph.com/
>>> > [5] https://phabricator.wikimedia.org/T212826
>>> > [6] https://phabricator.wikimedia.org/T238045
>>> >
>>> > --
>>> > Guillaume Lederrey
>>> > Engineering Manager, Search Platform
>>> > Wikimedia Foundation
>>> > UTC+1 / CET
>>> >
>>>
>>
>>
>> --
>> Guillaume Lederrey
>> Engineering Manager, Search Platform
>> Wikimedia Foundation
>> UTC+1 / CET


-- 
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
