Hi,

Looking into the diff between 2.7 and 2.10, pretty much the only differences in the existing data and DB APIs are related to changes. So I think it's very likely there's something non-obvious we didn't consider.
I will build a test case and see if I can reproduce the problem. Long term, I think we need to have tests to prevent this kind of issue. I'll think about how this can be done.

Yngve, thanks a lot for your work investigating this issue.

Cheers,
Povilas

On 1/13/21 12:26 PM, Pierre Tardy wrote:
> Hello,
>
> getPrevSuccessfulBuild is called by getChanges for a build, which in turn is called by the /builds/NN/changes REST API. The bug Vlad was referring to was a perf issue on the /changes API, which was fixed a while back.
>
> Indeed, this algorithm is far from optimized, but I don't see why it would lead to main-thread blocking. Looking at the code, I see that there are no big loops that do not yield to the main reactor loop.
>
> I insist on the buildbot profiler. What I was saying before is that you need to hit the record button before the problem appears, and set a large enough record time to be sure to catch a spike. Then you will be able to zoom in on the CPU spike and catch the issue precisely.
>
> If the spike is on the order of minutes like you said, you can configure it like this and get enough samples to get enough evidence of where the code is actually spending time:
>
> ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual', basepath=None, wantBuilds=100)
>
> This will record for one hour, and mitigate the memory used if you worry about it.
>
> Pierre
>
> On Wed, 13 Jan 2021 at 11:01, Yngve N. Pettersen <[email protected]> wrote:
>
> > Hi again,
> >
> > I was just able to get a partial database log from a freeze incident when refreshing the Builds->Builders page.
> >
> > It looks like Vlad is on the right track.
> >
> > There are a *lot* of individual source stamp requests, but also requests to the builds and change tables.
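For anyone wanting to try Pierre's suggestion, a master.cfg sketch might look like the following. The parameters are the ones Pierre gives above; the import path and the www plugin flag are my assumptions based on the buildbot-profiler documentation, so check them against the version you install.

```python
# Sketch of wiring buildbot-profiler into master.cfg (config fragment;
# import path and 'profiler' www plugin key are assumptions).
from buildbot_profiler import ProfilerService

c['services'].append(ProfilerService(
    frequency=500,         # samples per second
    gatherperiod=60 * 60,  # keep one hour of samples
    mode='virtual',        # sample CPU time rather than wall time
    basepath=None,
    wantBuilds=100,
))

c['www']['plugins']['profiler'] = True  # enables the profiler UI tab
```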
> > An interesting part of the builds request is this query:
> >
> > SELECT builds.id, builds.number, builds.builderid, builds.buildrequestid,
> >        builds.workerid, builds.masterid, builds.started_at, builds.complete_at,
> >        builds.state_string, builds.results
> > FROM builds
> > WHERE builds.builderid = 51 AND builds.number < 46 AND builds.results = 0
> > ORDER BY builds.complete_at DESC
> > LIMIT 1000 OFFSET 0
> >
> > which appears to then be followed by a lot of changes and source stamp requests.
> >
> > The log contains a lot of these requests per second; according to the DB graph, 200 to 400 per second.
> >
> > The 1000 limit appears to come from db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but that value seems to have been that way for a while, so the problem is likely caused by something else. This function did show up at the beginning of my traces related to these freezes.
> >
> > One possibility I can think of is that several of these pages, or the functions they are using, are no longer restricting how far back in the build history they fetch build information. E.g. the Builders page is only supposed to show a couple of days of builds for each builder, so there should be no need to fetch data for 1000 builds (making sure you have the build ids is one thing; fetching all the associated data even for builds that are not to be displayed is something else).
> >
> > BTW, I have noticed that another page, the waterfall, is not displaying anything, even after waiting for a very long time.
> >
> > On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Thanks for that info.
> > >
> > > In my case the problem is apparently something that happens now and then.
> > >
> > > As mentioned, I have seen it on the Builds->Builders and Builds->Workers pages, neither of which includes any changelog access AFAIK.
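To make the shape of the problem concrete, here is a toy simulation of the query pattern described above: one `SELECT ... LIMIT 1000` for candidate builds, then separate sourcestamp and change queries per build. This is not Buildbot's actual code — all names are invented — it just shows how a per-build fan-out multiplies into thousands of round trips for a single API call, consistent with the 200-400 queries/second seen in the DB log.

```python
# Hypothetical sketch of the N+1 query fan-out described in the log
# above; CountingDb only counts queries, it runs no real SQL.

class CountingDb:
    """Fake DB layer that counts how many queries it receives."""
    def __init__(self, n_builds):
        self.n_builds = n_builds
        self.queries = 0

    def get_prev_builds(self, limit=1000):
        self.queries += 1  # one SELECT ... LIMIT 1000
        return list(range(min(self.n_builds, limit)))

    def get_sourcestamps(self, buildid):
        self.queries += 1  # one SELECT per build
        return [buildid]   # pretend each build has one sourcestamp

    def get_changes(self, ssid):
        self.queries += 1  # one SELECT per sourcestamp
        return []

def changes_for_build(db):
    # Walk candidate builds the way a /builds/NN/changes-style call might.
    changes = []
    for buildid in db.get_prev_builds():
        for ssid in db.get_sourcestamps(buildid):
            changes.extend(db.get_changes(ssid))
    return changes

db = CountingDb(n_builds=1000)
changes_for_build(db)
print(db.queries)  # 2001: 1 build query + 1000 sourcestamp + 1000 change queries
```

A few hundred such API calls would already account for tens of thousands of short selects, which is why restricting how far back the pages look matters more than the cost of any single query.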
> > > I have also seen it occasionally on individual build pages, which have a lot of steps with logs, and a changelog panel.
> > >
> > > Just a few minutes ago I saw this freeze/spike happen while the buildbot manager was completely idle, since all active tasks had completed, and I had paused all workers since I needed to restart the manager (due to the hanging build).
> > >
> > > I have also had reports about the Grid view and Console pages displaying this issue, but have not seen it myself.
> > >
> > > At present I have enabled logging in the postgresql server, so maybe I can figure out what requests are handled during the spike.
> > >
> > > On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <[email protected]> wrote:
> > >
> >> Hi,
> >>
> >> I have experienced some similar interface freezes while trying to configure our version of buildbot. I now remember two cases:
> >>
> >> 1) A "changes" API problem where it seemed that the "limit" argument was ignored in some cases, which translated into a full changes table scan. This was reproducible when hitting the "Builds > Last Changes" dashboard, and then all the other pages were frozen. There are other requests to changes, so this may be related to the Builds page too. Also, this only happened when the number of changes in the db was high. I was planning on submitting a proper fix, but we are running a custom version of 2.7.1 where I implemented a fast workaround and did not manage to submit a proper fix (hope to be able to do it next week).
> >>
> >> 2) We experienced the same issue as you describe when a lot of logs were coming in (which seems to be your case too) and the master process was overwhelmed when multiple builds were running at the same time (constant CPU usage around ~120%).
> >> We solved the issue by switching to multi-master and limiting the amount of logs, but if you say that this was not an issue in 2.7, I would really be interested in finding out what the root cause is (I thought it was the high amount of logs). You can test this hypothesis by limiting the number of running builds and seeing if the issue keeps reproducing.
> >>
> >> What worked for me in order to find the "changes" API problem was visiting each dashboard and seeing whether the freeze occurs or not.
> >>
> >> Hope this helps!
> >>
> >> Cheers,
> >> Vlad
> >>
> >> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen <[email protected]> wrote:
> >>
> >>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy <[email protected]> wrote:
> >>>
> >>> > Thanks for the update.
> >>> >
> >>> > Some random thoughts...
> >>> >
> >>> > You should probably leave the profiler open until you get the performance spike. If you are inside the spike when starting, indeed, you won't be able to start the profiler, but if it is started before the spike it will for sure detect exactly where the code is.
> >>>
> >>> I did have the profiler open in this latest case; as far as I could tell it still didn't start recording until after the spike ended (there was no progress information in the recorder line).
> >>>
> >>> The two major items showing up were
> >>>
> >>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild
> >>> /buildbot/db/pool.py+190:__thd
> >>>
> >>> but I think they were recorded after the spike.
> >>>
> >>> I am planning to activate more detailed logging in the postgresql server, but have not done that yet (probably need to shut down and restart buildbot when I do).
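Vlad's case 1) above — a "limit" argument that is silently dropped, turning a bounded query into a full changes-table scan — can be illustrated with a toy query builder. This is hypothetical code, not Buildbot's actual data API; the point is only that the fix is a defensive clamp so an unbounded query is never emitted against a large table.

```python
# Toy illustration of the dropped-"limit" failure mode (hypothetical
# code, not Buildbot's data layer). DEFAULT_LIMIT is an arbitrary
# example value.

DEFAULT_LIMIT = 50

def build_changes_query(limit=None):
    # The guard below is the essence of the fix: if the caller's limit
    # was lost somewhere upstream, clamp it instead of scanning the
    # whole table.
    if limit is None:
        limit = DEFAULT_LIMIT
    return ("SELECT * FROM changes ORDER BY changeid DESC LIMIT %d"
            % limit)

print(build_changes_query())          # falls back to LIMIT 50
print(build_changes_query(limit=10))  # caller's limit is respected
```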
> >>> BTW, I suspect that this issue can also cause trouble for builds whose steps end at the time the problem is occurring; I just noticed a task that is still running more than 4 hours after it started a step that should have been killed after 20 minutes if it was hanging. It should have ended at about the time one of the hangs was occurring. And it is impossible to stop the task for some reason; even shutting down the worker process did not work. AFAIK the only way to fix the issue is to shut the buildbot manager down.
> >>>
> >>> > Statistical profiling uses timer interrupts, which will preempt anything that is running and make a call stack trace.
> >>> >
> >>> > While waiting for a repro: if, from the db log, you manage to get the info of what kind of db data that is, maybe we can narrow down the usual suspects..
> >>> >
> >>> > If there are lots of short selects like you said, usually you would have a back and forth from the reactor thread to the db thread, so it sounds weird. What could be leading to your behavior is that whatever is halting the processing, everything is queued up in between and unqueued when it is finished, which could lead to a spike of db actions at the end of the event.
> >>>
> >>> The DB actions were going on for the entire 3 minutes that the spike lasted; it is not a burst at either end, but a ~180 second long continuous sequence (or barrage) of approximately 70,000-90,000 transactions, if I am interpreting the graph data correctly.
> >>>
> >>> > Regards
> >>> > Pierre
> >>> >
> >>> > On Tue, 12 Jan 2021 at 21:49, Yngve N. Pettersen <[email protected]> wrote:
> >>> >
> >>> >> Hi again,
> >>> >>
> >>> >> A bit of an update.
> >>> >>
> >>> >> I have not been able to locate the issue using the profiler.
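Pierre's description of statistical profiling can be sketched in a few lines: periodically snapshot every thread's call stack and count what you see; code that is hot shows up in many samples. The toy version below samples from a helper thread rather than a timer interrupt, and is only an illustration of the principle, not how buildbot-profiler is actually implemented.

```python
# Toy statistical profiler: a background thread snapshots the top
# stack frame of every other thread at a fixed interval. Hot code
# dominates the sample counts. Illustration only.

import collections
import sys
import threading
import time

samples = collections.Counter()

def sampler(stop, interval=0.001):
    me = threading.get_ident()
    while not stop.is_set():
        # Snapshot the current top frame of every other thread.
        for ident, frame in sys._current_frames().items():
            if ident != me:
                samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_function(deadline):
    while time.time() < deadline:
        pass  # simulated CPU-bound work hogging the "main" thread

stop = threading.Event()
t = threading.Thread(target=sampler, args=(stop,))
t.start()
busy_function(time.time() + 0.2)  # run hot code for ~200 ms
stop.set()
t.join()

# busy_function should dominate the sample counts
print(samples.most_common(3))
```

Because the sampler preempts whatever is running, it catches the hot code even when that code never voluntarily yields — which is exactly why it should work during a reactor-starvation spike, and why the profiler failing to record during the spike is itself an interesting symptom.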
> >>> >> It seems that when Buildbot gets into the problematic mode, the profiler is not able to work at all. It only starts collecting after the locked mode is resolved.
> >>> >>
> >>> >> It does seem like the locked mode occurs when Buildbot is fetching a lot of data from the DB and then spends a lot of time processing that data, without yielding to other processing needs.
> >>> >>
> >>> >> Looking at the monitoring of the server, it also appears that buildbot is fetching a lot of data. During the most recent instance, the returned-tuples count in the graph for the server indicates three minutes of, on average, 25000 tuples returned per second, with spikes to 80K and 100K.
> >>> >>
> >>> >> The number of open connections rose to 6 or 7, and the transaction count was 400-500 per second during the whole time (rolled-back transactions, which I assume is just one or more selects).
> >>> >>
> >>> >> IMO this makes it look like, while requesting these data, Buildbot is *synchronously* querying the DB and processing the returned data, not yielding. It might also be that it is requesting more data than it needs, and requesting other data earlier than it is actually needed.
> >>> >>
> >>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N.
> >>> >> Pettersen <[email protected]> wrote:
> >>> >>
> >>> >> > Hi,
> >>> >> >
> >>> >> > IIRC the only real processing in our system that might be heavy is done via logobserver.LineConsumerLogObserver in a class (now) derived from ShellCommandNewStyle, so if that is the issue, and deferToThread is the solution, then, if it isn't already done, my suggestion would be to implement that inside the code handling the log observers.
> >>> >> >
> >>> >> > I've tested the profiler a little, but haven't seen any samples within our code so far, just inside buildbot: quite a lot of log DB actions, also some TLS activity.
> >>> >> >
> >>> >> > The performance issue for those pages seems to be a bit flaky; at present it's not happening, AFAICT.
> >>> >> >
> >>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy <[email protected]> wrote:
> >>> >> >
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >> A lot of things happened between 2.7 and 2.10, although I don't see anything which could impact the performance that much (maybe the new reporter framework, but I'm really not convinced). If you see that the db is underutilized, this must be classical reactor starvation. With asynchronous systems like buildbot, you shouldn't do any heavy computation in the main event loop thread; that must be done in a thread via deferToThread and co.
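Pierre's advice is Twisted-specific (`twisted.internet.threads.deferToThread`), but the principle can be sketched with the stdlib's asyncio equivalent: hand the CPU-heavy work to a worker thread so the event loop stays free to service other requests. The function names below are invented for illustration and are not Buildbot's API (note also that `asyncio.to_thread` needs Python 3.9+, newer than the 3.6.9 mentioned in this thread).

```python
# Stdlib sketch of the deferToThread pattern: keep CPU-bound work off
# the event-loop thread. Illustrative names, not Buildbot's API.

import asyncio

def parse_log_lines(lines):
    # CPU-bound work: run inline on the loop/reactor thread, this
    # would starve every other pending task.
    return sum(1 for line in lines if "error" in line)

async def summarize(lines):
    # asyncio.to_thread plays the role of Twisted's deferToThread:
    # the loop keeps running while the parsing happens in a
    # thread-pool thread, and we get the result back as an awaitable.
    return await asyncio.to_thread(parse_log_lines, lines)

lines = ["error: link failed", "ok", "error: timeout"]
print(asyncio.run(summarize(lines)))  # 2
```

In Twisted the equivalent call site would be `d = threads.deferToThread(parse_log_lines, lines)`, with the result delivered via the returned Deferred.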
> >>> >> >> These are the common issues you can have with performance, independently of upgrade regressions:
> >>> >> >>
> >>> >> >> 1) Custom steps: a lot of the time, we see people struggling with performance when they just have some custom step doing heavy computation that blocks the main thread constantly, preventing all the very quick tasks from running in parallel.
> >>> >> >>
> >>> >> >> 2) Too many logs: in this case, there is not much to do besides reducing the log amount. This would be the time to switch to a multi-master setup, where you put 2 masters for builds and one master for the web UI. You can put those on the same machine/VM, no problem; the only work is to have separate processes that each have their own event queues. You can use docker-compose or kubernetes in order to more easily create such a deployment. We don't have anything readily usable for that, but several people have done and documented it, for example https://github.com/pop/buildbot-on-kubernetes
> >>> >> >>
> >>> >> >> I have developed the buildbot profiler in order to quickly find those. You just have to install it as a plugin and start a profile whenever buildbot feels slow. It is a statistical profiler, so it will not significantly change the actual performance, and it is safe to run in production.
> >>> >> >>
> >>> >> >> https://pypi.org/project/buildbot-profiler/
> >>> >> >>
> >>> >> >> Regards,
> >>> >> >> Pierre
> >>> >> >>
> >>> >> >> On Tue, 12 Jan 2021 at 01:29, Yngve N.
> >>> >> >> Pettersen <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> Hello all,
> >>> >> >>>
> >>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
> >>> >> >>>
> >>> >> >>> However, I am noticing performance issues when loading these pages:
> >>> >> >>>
> >>> >> >>> Builds->Builders
> >>> >> >>> Builds->Workers
> >>> >> >>> individual builds
> >>> >> >>>
> >>> >> >>> Loading these can take several minutes, although there are periods of immediate responses.
> >>> >> >>>
> >>> >> >>> What I am seeing on the buildbot manager machine is that the Python3 process hits 90-100% for the entire period.
> >>> >> >>>
> >>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04.
> >>> >> >>>
> >>> >> >>> As far as I can tell, the Postgresql database is mostly idle during this period. I did do a full vacuum a few hours ago, in case that was the issue.
> >>> >> >>>
> >>> >> >>> There are about 40 builders and 30 workers in the system; only about 10-15 of these have 10-20 builds for the past few days, although most of these have active histories of 3000 builds (which does make me wonder if the problem could be a lack of limits on the DB queries; at present I have not inspected the DB queries).
> >>> >> >>>
> >>> >> >>> The individual builds can have very large log files in the build steps, in many cases tens of thousands of lines (we _are_ talking about a Chromium-based project).
> >>> >> >>> Our changes in the builders and workers JS code are minimal (we are using a custom build of www-base): just using different information for the build labels (build version number), and grouping the builders, which should not be causing any performance issues. (We have larger changes in the individual builder view, where we include Git commit messages, and I have so far not seen any performance issues there.)
> >>> >> >>>
> >>> >> >>> BTW: The line plots for build time and successes on builders seem to be MIA. Not sure if that is an upstream issue, or due to something in our www-base build.
> >>> >> >>>
> >>> >> >>> Do you have any suggestions for where to look for the cause of the problem?
> >>> >> >>>
> >>> >> >>> --
> >>> >> >>> Sincerely,
> >>> >> >>> Yngve N. Pettersen
> >>> >> >>> Vivaldi Technologies AS
> >>> >> >>> _______________________________________________
> >>> >> >>> users mailing list
> >>> >> >>> [email protected]
> >>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
> >>> >>
> >>> >> --
> >>> >> Sincerely,
> >>> >> Yngve N. Pettersen
> >>> >> Vivaldi Technologies AS
> >>>
> >>> --
> >>> Sincerely,
> >>> Yngve N. Pettersen
> >>> Vivaldi Technologies AS
>
> --
> Sincerely,
> Yngve N.
> Pettersen
> Vivaldi Technologies AS
