I don't know if there is actually someone who is capable and has the time to do so; I just hope there are such people. But it probably makes sense to check whether there actually are volunteers before doing the work to enable them :)
On Fri, Nov 15, 2019 at 5:17 AM Guillaume Lederrey <gleder...@wikimedia.org> wrote:

> On Fri, Nov 15, 2019 at 12:49 AM Denny Vrandečić <vrande...@google.com> wrote:
>
>> Just wondering, is there a way to let volunteers look into the issue? (I guess not, because it would potentially give access to the query stream, but maybe the answer is more optimistic.)
>
> There are ways, none of them easy. There are precedents for volunteers having access to our production environment. I'm not really sure what the process looks like; there is at least an NDA to sign and some vetting process. As you pointed out, this would give access to sensitive information, and the ability to do great damage (power, responsibility, and those kinds of things).
>
> More realistically, we could provide more information for analysis. Heap dumps do contain private information, but thread dumps are pretty safe, so we could publish those. We would need to automate this on our side, but that might be an option. Of course, having access to limited information and no way to experiment with changes seriously limits the ability to investigate.
>
> I'll check with the team whether that's something we are ready to invest in.
>
>> On Thu, Nov 14, 2019 at 2:39 PM Thad Guidry <thadgui...@gmail.com> wrote:
>>
>>> In the enterprise, most folks use either Java Mission Control or the Java VisualVM profiler. Looking at sleeping threads is often a good place to start, and taking a snapshot or even a heap dump when things are really grinding slowly is useful; you can later share those snapshots / heap dumps with the community or with Java profiling experts to analyze.
>>>
>>> https://visualvm.github.io/index.html
>>>
>>> Thad
>>> https://www.linkedin.com/in/thadguidry/
>>>
>>> On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>>>
>>>> Hello!
>>>>
>>>> Thanks for the suggestions!
>>>>
>>>> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadgui...@gmail.com> wrote:
>>>>
>>>>> Is the write retention queue adequate?
>>>>> Is the branching factor for the lexicon indices too large, resulting in a non-linear slowdown in the write rate over time?
>>>>> Did you look into small slot optimization?
>>>>> Are the write cache buffers adequate?
>>>>> Is there a lot of heap pressure?
>>>>> Does the MemoryManager have the maximum amount of RAM it can handle (4TB)?
>>>>> Is the RWStore handling recycling well?
>>>>> Is the SAIL buffer capacity adequate?
>>>>> Are you using exact range counts where you could be using fast range counts?
>>>>>
>>>>> Start on the hardware side first, however. Is disk activity for writes really low while CPU is very high? In that case you have identified a bottleneck; figure out WHY that is by looking into any of the above.
>>>>
>>>> These sound like good questions, but they are outside my area of expertise. I've created https://phabricator.wikimedia.org/T238362 to track them, and I'll see if someone can have a look. I know that we did multiple passes at tuning Blazegraph properties, with limited success so far.
>>>>
>>>>> ...and 100+ other things that should be looked at, all of which affect WRITE performance during UPDATES.
>>>>>
>>>>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>>>>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
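Most of the knobs behind these questions live in Blazegraph's RWStore.properties file. Below is a minimal sketch of the kind of settings involved; the property names are the ones used in the Blazegraph documentation as best I recall them, and the values are purely illustrative, so the two wiki pages above remain the authority on what to tune and by how much.

    # RWStore.properties (illustrative values only)
    # B+Tree branching factor and write retention queue drive write amplification.
    com.bigdata.btree.BTree.branchingFactor=128
    com.bigdata.btree.writeRetentionQueue.capacity=4000
    # Number of statements buffered by the SAIL before an incremental flush.
    com.bigdata.rdf.sail.bufferCapacity=100000
    # Write cache buffers used by the journal for batched disk writes.
    com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000

Whether any of these actually matter for the WDQS write workload would have to be confirmed by measurement, which is where the profiling and thread dumps discussed in this thread come in.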
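On the earlier point about publishing thread dumps and automating their collection: the JDK exposes this programmatically, so a small scheduled task could capture dumps without giving anyone access to the servers. A rough sketch follows; the class name and its use are hypothetical, not part of the WDQS codebase, and scripting `jstack <pid>` or `jcmd <pid> Thread.print` would work just as well.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical helper: capture a thread dump of the current JVM so it can
    // be written to a public location. The same MXBean is reachable over JMX
    // for a remote Blazegraph process.
    public class ThreadDumpPublisher {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // dumpAllThreads(lockedMonitors, lockedSynchronizers) includes lock
            // information, which is what you want when hunting for contention.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                // ThreadInfo.toString() truncates very deep stacks; use
                // getStackTrace() if the full trace is needed.
                System.out.print(info);
            }
        }
    }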
>>>>> I would also suggest you start monitoring some of the internals of Blazegraph (Java) in production with tools such as XRebel or AppDynamics.

>>>> Both XRebel and AppDynamics are proprietary, so there is no way we'll deploy them in our environment. We are tracking a few JMX-based metrics, but so far we don't really know what to look for.
>>>>
>>>> Thanks!
>>>>
>>>> Guillaume

>>>>> Thad
>>>>> https://www.linkedin.com/in/thadguidry/
>>>>>
>>>>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>>>>>
>>>>>> Thanks for the feedback!
>>>>>>
>>>>>> On Thu, Nov 14, 2019 at 11:11 AM <f...@imm.dtu.dk> wrote:
>>>>>>
>>>>>>> Besides waiting for the new updater, it may be useful to tell us what we as users can do too. It is unclear to me what the problem is. For instance, at one point I was worried that the many parallel requests to the SPARQL endpoint that we make in Scholia are a problem. As far as I understand, they are not a problem at all. Another issue could be the way that we use Magnus Manske's QuickStatements and approve bots for high-frequency editing. Perhaps a better overview of, and constraints on, large-scale editing could be discussed?
>>>>>>
>>>>>> To be (again) completely honest, we don't entirely understand the issue either. There are clearly multiple related issues. In high-level terms, we have at least:
>>>>>>
>>>>>> * Some part of the update process on Blazegraph is CPU bound and single threaded. Even with low query load, if we have a high edit rate, Blazegraph can't keep up and saturates a single CPU (with plenty of resources available on the other CPUs). This is a hard issue to fix, requiring either splitting the processing over multiple CPUs or sharding the data over multiple servers, neither of which Blazegraph supports (at least not in our current configuration).
>>>>>> * There is a race for resources between edits and queries: a high query load will impact the update rate. This could to some extent be mitigated by reducing the query load: if no one is using the service, it works great! Obviously that's not much of a solution.
>>>>>>
>>>>>> What you can do (short term):
>>>>>>
>>>>>> * Keep bot usage well behaved: don't run queries in parallel, provide a meaningful user agent, smooth the load over time if possible, ... (see the sketch after these lists). As far as I can see, most usage is already well behaved.
>>>>>> * Optimize your queries: better queries use fewer resources, which should help. Time to completion is a good approximation of the resources used. I don't really have any more specific advice; SPARQL is not my area of expertise.
>>>>>>
>>>>>> What you can do (longer term):
>>>>>>
>>>>>> * Help us think out of the box. Can we identify higher-level use cases? Could we implement some of our workflows on a higher-level API than SPARQL, which might allow for more internal optimizations?
>>>>>> * Help us better understand the constraints. Document use cases on [1].
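To make the "well behaved" advice concrete, here is a minimal sketch of a polite client in Java (11+, for java.net.http): one query at a time, a descriptive User-Agent, and a pause between requests. The bot name and contact address are placeholders, and the query is only an example; the endpoint is the public WDQS SPARQL endpoint behind https://query.wikidata.org/.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.List;

    public class PoliteWdqsClient {
        private static final String ENDPOINT = "https://query.wikidata.org/sparql";

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Example query; WDQS pre-declares the wd:/wdt: prefixes.
            List<String> queries = List.of(
                    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10");
            for (String sparql : queries) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(ENDPOINT + "?query="
                                + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
                        // Meaningful User-Agent with contact info (placeholder values).
                        .header("User-Agent", "ExampleBot/0.1 (someone@example.org)")
                        .header("Accept", "application/sparql-results+json")
                        .timeout(Duration.ofSeconds(60))
                        .GET()
                        .build();
                // Queries are sent one at a time, never in parallel.
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
                Thread.sleep(1_000); // smooth the load instead of bursting
            }
        }
    }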
>>>>>> Sadly, we don't have the bandwidth right now to engage meaningfully in this conversation. Feel free to send thoughts already, but don't expect any timely response.

>>>>>>> Yet another thought is the large discrepancy between the Virginia and Texas data centers that I can see on Grafana [1]. As far as I understand, the hardware (and software) are the same, so why is there this large difference? Rather than editing or Blazegraph, could the issue be some form of network issue?

>>>>>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works, we see more traffic on eqiad than on codfw.
>>>>>>
>>>>>> Thanks for the help!
>>>>>>
>>>>>> Guillaume
>>>>>>
>>>>>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage

>>>>>>> [1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>>>>>
>>>>>>> /Finn
>>>>>>>
>>>>>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>>>>>>
>>>>>>>> Hello all!
>>>>>>>>
>>>>>>>> As you've probably noticed, the update lag on the public WDQS endpoint [1] is not doing well [2], with lag climbing to more than 12h for some servers. We are tracking this on Phabricator [3]; subscribe to that task if you want to stay informed.
>>>>>>>>
>>>>>>>> To be perfectly honest, we don't have a good short-term solution. The graph database that we are using at the moment (Blazegraph [4]) does not easily support sharding, so even throwing hardware at the problem isn't really an option.
>>>>>>>>
>>>>>>>> We are working on a few medium-term improvements:
>>>>>>>>
>>>>>>>> * A dedicated updater service in Blazegraph, which should help increase the update throughput [5]. Fingers crossed, this should be ready for initial deployment and testing by next week (no promises, we're doing the best we can).
>>>>>>>> * Some improvement in the parallelism of the updater [6]. This has just been identified; while it will probably also provide some improvement in throughput, we haven't actually started working on it and we don't have any numbers at this point.
>>>>>>>>
>>>>>>>> Longer term:
>>>>>>>>
>>>>>>>> We are hiring a new team member to work on WDQS. It will take some time to get this person up to speed, but we should have more capacity to address the deeper issues of WDQS by January.
>>>>>>>>
>>>>>>>> The two main points we want to address are:
>>>>>>>>
>>>>>>>> * Finding a triple store that scales better than our current solution.
>>>>>>>> * Better understanding what the use cases on WDQS are, and seeing if we can provide a technical solution that is better suited. Our intuition is that some of the use cases that require synchronous (or quasi-synchronous) updates would be better implemented outside of a triple store. Honestly, we have no idea yet whether this makes sense or what those alternate solutions might be.
>>>>>>>>
>>>>>>>> Thanks a lot for your patience during this tough time!
>>>>>>>> Guillaume
>>>>>>>>
>>>>>>>> [1] https://query.wikidata.org/
>>>>>>>> [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>>>>>>> [3] https://phabricator.wikimedia.org/T238229
>>>>>>>> [4] https://blazegraph.com/
>>>>>>>> [5] https://phabricator.wikimedia.org/T212826
>>>>>>>> [6] https://phabricator.wikimedia.org/T238045
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Lederrey
>>>>>>>> Engineering Manager, Search Platform
>>>>>>>> Wikimedia Foundation
>>>>>>>> UTC+1 / CET

> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET