Hello! Thanks for the suggestions!
On Thu, Nov 14, 2019 at 5:02 PM Thad Guidry <[email protected]> wrote:

> Is the Write Retention Queue adequate?
> Is the branching factor for the lexicon indices too large, resulting in a
> non-linear slowdown in the write rate over time?
> Did you look into Small Slot Optimization?
> Are the Write Cache Buffers adequate?
> Is there a lot of Heap pressure?
> Does the MemoryManager have the maximum amount of RAM it can handle? 4TB?
> Is the RWStore handling the recycling well?
> Is the SAIL Buffer Capacity adequate?
> Are you using exact range counts where you could be using fast range
> counts?
>
> Start at the hardware side first, however.
> Is the disk activity for writes really low... and the CPU very high? You
> have identified a bottleneck in that case; discover WHY that would be the
> case by looking into any of the above.

Sounds like good questions, but outside of my area of expertise. I've
created https://phabricator.wikimedia.org/T238362 to track it, and I'll see
if someone can have a look. I know that we did multiple passes at tuning
Blazegraph properties, with limited success so far.

> ...and 100+ other things that should be looked at, all of which affect
> WRITE performance during UPDATES.
>
> https://wiki.blazegraph.com/wiki/index.php/IOOptimization
> https://wiki.blazegraph.com/wiki/index.php/PerformanceOptimization
>
> I would also suggest you start monitoring some of the internals of
> Blazegraph (Java) while in production with tools such as XRebel or
> AppDynamics.

Both XRebel and AppDynamics are proprietary, so there is no way that we'll
deploy them in our environment. We are tracking a few JMX-based metrics,
but so far we don't really know what to look for.

Thanks!

    Guillaume

> Thad
> https://www.linkedin.com/in/thadguidry/
>
> On Thu, Nov 14, 2019 at 7:31 AM Guillaume Lederrey <
> [email protected]> wrote:
>
>> Thanks for the feedback!
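For anyone following along: most of the knobs Thad lists live in the journal's
RWStore.properties file. A minimal sketch of where they sit — the property
names are taken from the Blazegraph wiki pages linked above, and the values
here are purely illustrative, not recommendations; check them against your own
deployment before changing anything:

```properties
# Use the RWStore journal (the backend the RWStore/recycling questions refer to).
com.bigdata.journal.AbstractJournal.bufferMode=DiskRW

# Write retention queue: how many dirty B+Tree nodes stay in memory before
# eviction. Larger values trade heap for fewer incremental disk writes.
com.bigdata.btree.writeRetentionQueue.capacity=8000

# Write cache buffers: more buffers can absorb larger write bursts.
com.bigdata.journal.AbstractJournal.writeCacheBufferCount=1000

# Default B+Tree branching factor (per-index overrides are also possible,
# which is what the lexicon-index question is about).
com.bigdata.btree.BTree.branchingFactor=128
```

Heap pressure and MemoryManager sizing are JVM-side concerns (heap and
direct-memory limits passed to the JVM) rather than journal properties.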
>>
>> On Thu, Nov 14, 2019 at 11:11 AM <[email protected]> wrote:
>>
>>> Besides waiting for the new updater, it may be useful to tell us what
>>> we as users can do too. It is unclear to me what the problem is. For
>>> instance, at one point I was worried that the many parallel requests to
>>> the SPARQL endpoint that we make in Scholia were a problem. As far as I
>>> understand, they are not a problem at all. Another issue could be the
>>> way that we use Magnus Manske's QuickStatements and approve bots for
>>> high-frequency editing. Perhaps a better overview of, and constraints
>>> on, large-scale editing could be discussed?
>>
>> To be (again) completely honest, we don't entirely understand the issue
>> either. There are clearly multiple related issues. In high-level terms,
>> we have at least:
>>
>> * Some part of the update process on Blazegraph is CPU-bound and
>> single-threaded. Even with low query load, if we have a high edit rate,
>> Blazegraph can't keep up and saturates a single CPU (with plenty of
>> available resources on other CPUs). This is a hard issue to fix,
>> requiring either splitting the processing over multiple CPUs or sharding
>> the data over multiple servers, neither of which Blazegraph supports (at
>> least not in our current configuration).
>> * There is a race for resources between edits and queries: a high query
>> load will impact the update rate. This could to some extent be mitigated
>> by reducing the query load: if no one is using the service, it works
>> great! Obviously that's not much of a solution.
>>
>> What you can do (short term):
>>
>> * Keep bot usage well behaved (don't run parallel queries, provide a
>> meaningful user agent, smooth the load over time if possible, ...). As
>> far as I can see, most usage is already well behaved.
>> * Optimize your queries: better queries will use fewer resources, which
>> should help. Time to completion is a good approximation of the resources
>> used.
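The "well behaved bot" advice above can be made concrete. A minimal sketch of
a polite SPARQL client — the bot name, contact URL, and one-second interval
are made-up examples, not policy; the idea is simply: query serially, identify
yourself with a meaningful User-Agent, and space requests out:

```python
import time
import urllib.parse
import urllib.request

# Hypothetical identity -- substitute your own tool name and contact info.
USER_AGENT = "MyCoolBot/1.0 (https://example.org/my-bot; [email protected])"
ENDPOINT = "https://query.wikidata.org/sparql"
MIN_INTERVAL = 1.0  # seconds between requests: smooth the load, no parallelism

_last_request = 0.0  # timestamp of the previous request


def throttle_delay(now, last, min_interval=MIN_INTERVAL):
    """How long to sleep so requests are at least min_interval apart."""
    return max(0.0, min_interval - (now - last))


def run_query(sparql):
    """Run one SPARQL query, serially and politely."""
    global _last_request
    time.sleep(throttle_delay(time.time(), _last_request))
    data = urllib.parse.urlencode({"query": sparql, "format": "json"}).encode()
    req = urllib.request.Request(
        ENDPOINT, data=data, headers={"User-Agent": USER_AGENT}
    )
    _last_request = time.time()
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Calling `run_query` in a plain loop (never from multiple threads) gives you
the serial, throttled access pattern described above.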
>> I don't really have any more specific advice; SPARQL is not my area of
>> expertise.
>>
>> What you can do (longer term):
>>
>> * Help us think out of the box. Can we identify higher-level use cases?
>> Could we implement some of our workflows on a higher-level API than
>> SPARQL, which might allow for more internal optimizations?
>> * Help us better understand the constraints. Document use cases on [1].
>>
>> Sadly, we don't have the bandwidth right now to engage meaningfully in
>> this conversation. Feel free to send thoughts already, but don't expect
>> any timely response.
>>
>>> Yet another thought is the large discrepancy between the Virginia and
>>> Texas data centers, as I can see on Grafana [1]. As far as I
>>> understand, the hardware (and software) are the same. So why is there
>>> this large difference? Rather than editing or Blazegraph, could the
>>> issue be some form of network issue?
>>
>> As pointed out by Lucas, this is expected. Due to how our GeoDNS works,
>> we see more traffic on eqiad than on codfw.
>>
>> Thanks for the help!
>>
>>    Guillaume
>>
>> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Usage
>>
>>> [1]
>>> https://grafana.wikimedia.org/d/000000489/wikidata-query-service?panelId=8&fullscreen&orgId=1&from=now-7d&to=now
>>>
>>> /Finn
>>>
>>> On 14/11/2019 10:50, Guillaume Lederrey wrote:
>>> > Hello all!
>>> >
>>> > As you've probably noticed, the update lag on the public WDQS
>>> > endpoint [1] is not doing well [2], with lag climbing to > 12h for
>>> > some servers. We are tracking this on Phabricator [3]; subscribe to
>>> > that task if you want to stay informed.
>>> >
>>> > To be perfectly honest, we don't have a good short-term solution. The
>>> > graph database that we are using at the moment (Blazegraph [4]) does
>>> > not easily support sharding, so even throwing hardware at the problem
>>> > isn't really an option.
>>> >
>>> > We are working on a few medium-term improvements:
>>> >
>>> > * A dedicated updater service in Blazegraph, which should help
>>> > increase the update throughput [5]. Fingers crossed, this should be
>>> > ready for initial deployment and testing by next week (no promises,
>>> > we're doing the best we can).
>>> > * Some improvement in the parallelism of the updater [6]. This has
>>> > just been identified. While it will probably also provide some
>>> > improvement in throughput, we haven't actually started working on it
>>> > and we don't have any numbers at this point.
>>> >
>>> > Longer term:
>>> >
>>> > We are hiring a new team member to work on WDQS. It will take some
>>> > time to get this person up to speed, but we should have more capacity
>>> > to address the deeper issues of WDQS by January.
>>> >
>>> > The 2 main points we want to address are:
>>> >
>>> > * Finding a triple store that scales better than our current
>>> > solution.
>>> > * Better understanding the use cases on WDQS and seeing if we can
>>> > provide a technical solution that is better suited. Our intuition is
>>> > that some of the use cases that require synchronous (or
>>> > quasi-synchronous) updates would be better implemented outside of a
>>> > triple store. Honestly, we have no idea yet if this makes sense or
>>> > what those alternate solutions might be.
>>> >
>>> > Thanks a lot for your patience during this tough time!
>>> >
>>> > Guillaume
>>> >
>>> > [1] https://query.wikidata.org/
>>> > [2] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1571131796906&to=1573723796906&var-cluster_name=wdqs&panelId=8&fullscreen
>>> > [3] https://phabricator.wikimedia.org/T238229
>>> > [4] https://blazegraph.com/
>>> > [5] https://phabricator.wikimedia.org/T212826
>>> > [6] https://phabricator.wikimedia.org/T238045
>>> >
>>> > --
>>> > Guillaume Lederrey
>>> > Engineering Manager, Search Platform
>>> > Wikimedia Foundation
>>> > UTC+1 / CET
>>> >
>>> > _______________________________________________
>>> > Wikidata mailing list
>>> > [email protected]
>>> > https://lists.wikimedia.org/mailman/listinfo/wikidata

--
Guillaume Lederrey
Engineering Manager, Search Platform
Wikimedia Foundation
UTC+1 / CET
