I don't know if there is actually someone who is capable and has the time to do so; I just hope there are such people. But it probably makes sense to check whether there actually are volunteers before doing the work to enable them :)
On Fri, Nov 15, 2019 at 5:17 AM Guillaume Lederrey <gleder...@wikimedia.org> wrote:

> On Fri, Nov 15, 2019 at 12:49 AM Denny Vrandečić <vrande...@google.com> wrote:
>
>> Just wondering, is there a way to let volunteers look into the issue? (I guess not, because it would potentially give access to the query stream, but maybe the answer is more optimistic.)
>
> There are ways, none of them easy. There are precedents for volunteers having access to our production environment. I'm not really sure what the process looks like; there is at least an NDA to sign and some vetting process. As you pointed out, this would give access to sensitive information, and the ability to do great damage (power, responsibility, and those kinds of things).
>
> More realistically, we could provide more information for analysis. Heap dumps do contain private information, but thread dumps are pretty safe, so we could publish those. We would need to automate this on our side, but that might be an option. Of course, having access to limited information and no way to experiment with changes seriously limits the ability to investigate.
>
> I'll check with the team whether that's something we are ready to invest in.
>
>> On Thu, Nov 14, 2019 at 2:39 PM Thad Guidry <thadgui...@gmail.com> wrote:
>>
>>> In the enterprise, most folks use either Java Mission Control or the Java VisualVM profiler. Looking at sleeping threads is often a good place to start, and taking a snapshot or even a heap dump when things are really grinding slowly is useful; you can later share those snapshots / heap dumps with the community or with Java profiling experts to analyze.
>>>
>>> https://visualvm.github.io/index.html
>>>
>>> Thad
>>> https://www.linkedin.com/in/thadguidry/
>>>
>>> On Thu, Nov 14, 2019 at 1:46 PM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>>>
>>>> Hello!
>>>>
>>>> Thanks for the suggestions!
>>>>
>>>> On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <thadgui...@gmail.com> wrote:
>>>>
>>>>> Is the write retention queue adequate?
>>>>> Is the branching factor for the lexicon indices too large, resulting in a non-linear slowdown in the write rate over time?
>>>>> Did you look into small slot optimization?
>>>>> Are the write cache buffers adequate?
>>>>> Is there a lot of heap pressure?
>>>>> Does the MemoryManager have the maximum amount of RAM it can handle (4TB)?
>>>>> Is the RWStore handling recycling well?
>>>>> Is the SAIL buffer capacity adequate?
>>>>> Are you using exact range counts where you could be using fast range counts?
>>>>>
>>>>> Start on the hardware side first, however. Is disk activity for writes really low while CPU is very high? In that case you have identified a bottleneck; figure out WHY that is by looking into any of the above.
>>>>
>>>> These sound like good questions, but they are outside my area of expertise. I've created https://phabricator.wikimedia.org/T238362 to track them, and I'll see if someone can have a look. I know that we did multiple passes at tuning Blazegraph properties, with limited success so far.
>>>>
>>>>> ...and 100+ other things that should be looked at, all of which affect WRITE performance during UPDATES.
>>>>>
>>>>> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
>>>>> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
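Most of the knobs behind these questions live in Blazegraph's RWStore.properties file. Below is a minimal sketch of the kind of settings involved; the property names are the ones used in the Blazegraph documentation as best I recall them, and the values are purely illustrative, so the two wiki pages above remain the authority on what to tune and by how much.

    # RWStore.properties (illustrative values only)
    # B+Tree branching factor and write retention queue drive write amplification.
    com.bigdata.btree.BTree.branchingFactor=128
    com.bigdata.btree.writeRetentionQueue.capacity=4000
    # Number of statements buffered by the SAIL before an incremental flush.
    com.bigdata.rdf.sail.bufferCapacity=100000
    # Write cache buffers used by the journal for batched disk writes.
    com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000

Whether any of these actually matter for the WDQS write workload would have to be confirmed by measurement, which is where the profiling and thread dumps discussed in this thread come in.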
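On the earlier point about publishing thread dumps and automating their collection: the JDK exposes this programmatically, so a small scheduled task could capture dumps without giving anyone access to the servers. A rough sketch follows; the class name and its use are hypothetical, not part of the WDQS codebase, and scripting `jstack <pid>` or `jcmd <pid> Thread.print` would work just as well.

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Hypothetical helper: capture a thread dump of the current JVM so it can
    // be written to a public location. The same MXBean is reachable over JMX
    // for a remote Blazegraph process.
    public class ThreadDumpPublisher {
        public static void main(String[] args) {
            ThreadMXBean threads = ManagementFactory.getThreadMXBean();
            // dumpAllThreads(lockedMonitors, lockedSynchronizers) includes lock
            // information, which is what you want when hunting for contention.
            for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
                // ThreadInfo.toString() truncates very deep stacks; use
                // getStackTrace() if the full trace is needed.
                System.out.print(info);
            }
        }
    }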
>>>>> I would also suggest you start monitoring some of the internals of Blazegraph (Java) in production with tools such as XRebel or AppDynamics.

>>>> Both XRebel and AppDynamics are proprietary, so there is no way we'll deploy them in our environment. We are tracking a few JMX-based metrics, but so far we don't really know what to look for.
>>>>
>>>> Thanks!
>>>>
>>>> Guillaume

>>>>> Thad
>>>>> https://www.linkedin.com/in/thadguidry/
>>>>>
>>>>> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <gleder...@wikimedia.org> wrote:
>>>>>
>>>>>> Thanks for the feedback!
>>>>>>
>>>>>> On Thu, Nov 14, 2019 at 11:11 AM <f...@imm.dtu.dk> wrote:
>>>>>>
>>>>>>> Besides waiting for the new updater, it may be useful to tell us what we as users can do too. It is unclear to me what the problem is. For instance, at one point I was worried that the many parallel requests to the SPARQL endpoint that we make in Scholia are a problem. As far as I understand, they are not a problem at all. Another issue could be the way that we use Magnus Manske's QuickStatements and approve bots for high-frequency editing. Perhaps a better overview of, and constraints on, large-scale editing could be discussed?
>>>>>>
>>>>>> To be (again) completely honest, we don't entirely understand the issue either. There are clearly multiple related issues. In high-level terms, we have at least:
>>>>>>
>>>>>> * Some part of the update process on Blazegraph is CPU bound and single threaded. Even with low query load, if we have a high edit rate, Blazegraph can't keep up and saturates a single CPU (with plenty of resources available on the other CPUs). This is a hard issue to fix, requiring either splitting the processing over multiple CPUs or sharding the data over multiple servers, neither of which Blazegraph supports (at least not in our current configuration).
>>>>>> * There is a race for resources between edits and queries: a high query load will impact the update rate. This could to some extent be mitigated by reducing the query load: if no one is using the service, it works great! Obviously that's not much of a solution.
>>>>>>
>>>>>> What you can do (short term):
>>>>>>
>>>>>> * Keep bot usage well behaved: don't run queries in parallel, provide a meaningful user agent, smooth the load over time if possible, ... (see the sketch after these lists). As far as I can see, most usage is already well behaved.
>>>>>> * Optimize your queries: better queries use fewer resources, which should help. Time to completion is a good approximation of the resources used. I don't really have any more specific advice; SPARQL is not my area of expertise.
>>>>>>
>>>>>> What you can do (longer term):
>>>>>>
>>>>>> * Help us think out of the box. Can we identify higher-level use cases? Could we implement some of our workflows on a higher-level API than SPARQL, which might allow for more internal optimizations?
>>>>>> * Help us better understand the constraints. Document use cases on [1].
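To make the "well behaved" advice concrete, here is a minimal sketch of a polite client in Java (11+, for java.net.http): one query at a time, a descriptive User-Agent, and a pause between requests. The bot name and contact address are placeholders, and the query is only an example; the endpoint is the public WDQS SPARQL endpoint behind https://query.wikidata.org/.

    import java.net.URI;
    import java.net.URLEncoder;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.time.Duration;
    import java.util.List;

    public class PoliteWdqsClient {
        private static final String ENDPOINT = "https://query.wikidata.org/sparql";

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Example query; WDQS pre-declares the wd:/wdt: prefixes.
            List<String> queries = List.of(
                    "SELECT ?item WHERE { ?item wdt:P31 wd:Q146 } LIMIT 10");
            for (String sparql : queries) {
                HttpRequest request = HttpRequest.newBuilder()
                        .uri(URI.create(ENDPOINT + "?query="
                                + URLEncoder.encode(sparql, StandardCharsets.UTF_8)))
                        // Meaningful User-Agent with contact info (placeholder values).
                        .header("User-Agent", "ExampleBot/0.1 (someone@example.org)")
                        .header("Accept", "application/sparql-results+json")
                        .timeout(Duration.ofSeconds(60))
                        .GET()
                        .build();
                // Queries are sent one at a time, never in parallel.
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println(response.statusCode());
                Thread.sleep(1_000); // smooth the load instead of bursting
            }
        }
    }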
>>>>>> Sadly, we don't have the bandwidth right now to engage meaningfully in this conversation. Feel free to send thoughts already, but don't expect any timely response.

>>>>>>> Yet another thought is the large discrepancy between the Virginia and Texas data centers that I can see on Grafana [1]. As far as I understand, the hardware (and software) are the same, so why is there this large difference? Rather than editing or Blazegraph, could the issue be some form of network issue?

>>>>>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works, we see more traffic on eqiad than on codfw.
>>>>>>
>>>>>> Thanks for the help!
>>>>>>
>>>>>> Guillaume
>>>>>>
>>>>>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage

>>>>>>> [1] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>>>>>
>>>>>>> /Finn
>>>>>>>
>>>>>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>>>>>>
>>>>>>>> Hello all!
>>>>>>>>
>>>>>>>> As you've probably noticed, the update lag on the public WDQS endpoint [1] is not doing well [2], with lag climbing to more than 12h for some servers. We are tracking this on Phabricator [3]; subscribe to that task if you want to stay informed.
>>>>>>>>
>>>>>>>> To be perfectly honest, we don't have a good short-term solution. The graph database that we are using at the moment (Blazegraph [4]) does not easily support sharding, so even throwing hardware at the problem isn't really an option.
>>>>>>>>
>>>>>>>> We are working on a few medium-term improvements:
>>>>>>>>
>>>>>>>> * A dedicated updater service in Blazegraph, which should help increase the update throughput [5]. Fingers crossed, this should be ready for initial deployment and testing by next week (no promises, we're doing the best we can).
>>>>>>>> * Some improvement in the parallelism of the updater [6]. This has just been identified; while it will probably also provide some improvement in throughput, we haven't actually started working on it and we don't have any numbers at this point.
>>>>>>>>
>>>>>>>> Longer term:
>>>>>>>>
>>>>>>>> We are hiring a new team member to work on WDQS. It will take some time to get this person up to speed, but we should have more capacity to address the deeper issues of WDQS by January.
>>>>>>>>
>>>>>>>> The two main points we want to address are:
>>>>>>>>
>>>>>>>> * Finding a triple store that scales better than our current solution.
>>>>>>>> * Better understanding what the use cases on WDQS are, and seeing if we can provide a technical solution that is better suited. Our intuition is that some of the use cases that require synchronous (or quasi-synchronous) updates would be better implemented outside of a triple store. Honestly, we have no idea yet whether this makes sense or what those alternate solutions might be.
>>>>>>>>
>>>>>>>> Thanks a lot for your patience during this tough time!
>>>>>>>> Guillaume
>>>>>>>>
>>>>>>>> [1] https://query.wikidata.org/
>>>>>>>> [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>>>>>>> [3] https://phabricator.wikimedia.org/T238229
>>>>>>>> [4] https://blazegraph.com/
>>>>>>>> [5] https://phabricator.wikimedia.org/T212826
>>>>>>>> [6] https://phabricator.wikimedia.org/T238045
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Lederrey
>>>>>>>> Engineering Manager, Search Platform
>>>>>>>> Wikimedia Foundation
>>>>>>>> UTC+1 / CET

> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+1 / CET