Hi again,

A bit of an update.

I have not been able to locate the issue using the profiler.

It seems that when Buildbot gets into the problematic mode, then the profiler is not able to work at all. It only starts collecting after the locked mode is resolved.

It does seem like the locked mode occurs when Buildbot is fetching a lot of data from the DB and then spends a lot of time processing that data, without yielding to other processing needs.

Looking at the monitoring of the server, it also appears that buildbot is fetching a lot of data. During the most recent instance, the returned tuples count in the graph for the server indicates three minutes of, on average 25000 tuples returned, with spikes to 80K and 100K, per second.

The number of open connections rose to 6 or 7, and the transaction count was 400-500 per second during the whole time (rolled back transactions, which I assume is just one or more selects).

IMO this makes it look like, while requesting these data, Buildbot is *synchronously* querying the DB and processing the returned data, not yielding. It might also be that it is requesting data more data than it needs, and also requesting other data earlier than it is actually needed.



On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen <[email protected]> wrote:

Hi,

IIRC the only real processing in our system that might be heavy is done via logobserver.LineConsumerLogObserver in a class (now) derived from ShellCommandNewStyle, so if that is the issue, and deferToThread is the solution, then if it isn't already done, my suggestion would be to implement that inside the code handling the log observers.

I've tested the profiler a little, but haven't seen any samples within our code so far, just inside buildbot, quite a lot of log DB actions, also some TLS activity.

The performance issue for those pages seems to be a bit flaky; at present its not happening AFAICT

On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy <[email protected]> wrote:

Hello,

A lot of things happen between 2.7 and 2.10, although I don't see anything
which could impact the performance that much. (maybe new reporter
framework, but really not convinced)
If you see that the db is underutilized this must be a classical reactor
starvation.
With asynchronous systems like buildbot, you shouldn't do any heavy
computation in the main event loop thread, those must be done in a thread
via deferToThread and co.

Those are the common issues you can have with performance
independantly from upgrade regressions:

1) Custom steps:
A lot of time, we see people struggling with performance when they just
have some custom step doing heavy computation that block the main thread
constantly, preventing all the very quick tasks to run in //.

2) too much logs
In this case, there is not much to do beside reducing the log amount. This
would be the time to switch to a multi-master setup, where you put 2
masters for builds, and one master for web UI.
You can put those in the same machine/VM, no problem, the only work is to
have separate processes that each have several event queues. You can use
docker-compose or kubernetes in order to more easily create such
deployment. We don't have readily useable for that, but several people have
done and documented it, for example
https://github.com/pop/buildbot-on-kubernetes


I have developed the buildbot profiler in order to quickly find those. You
just have to install it as a plugin and start a profile whenever the
buildbot feels slow.
It is a statistical profiler, so it will not significantly change the
actual performance so it is safe to run in production.

https://pypi.org/project/buildbot-profiler/


Regards,
Pierre


Le mar. 12 janv. 2021 à 01:29, Yngve N. Pettersen <[email protected]> a
écrit :

Hello all,

We have just upgraded our buildbot system from 2.7 to 2.10.

However, I am noticing performance issues when loading these pages:

  Builds->Builders
  Builds->Workers
  individual builds

Loading these can take several minutes, although there are periods of
immediate responses.

What I am seeing on the buildbot manager machine is that the Python3
process hits 90-100% for the entire period.

The Python version is 3.6.9 running on Ubuntu 18.04

As far as I can tell, the Postgresql database is mostly idle during this period. I did do a full vacuum a few hours ago, in case that was the issue.

There are about 40 builders, and 30 workers in the system, only about
10-15 of these have a 10-20 builds for the past few days, although most
of
these have active histories of 3000 builds (which do make me wonder if
the
problem could be a lack of limiting the DB queries, at present I have not
inspected the DB queries).

The individual builds can have very large log files in the build steps,
in
many cases tens of thousands of lines (we _are_ talking about a Chromium
based project).

Our changes in the builders and workers JS code are minimal (we are using
a custom build of www-base), just using different information for the
build labels (build version number), and grouping the builders, which
should not be causing any performance issues. (we have larger changes in the individual builder view, where we include Git commit messages, and I
have so far not seen any performance issues there)

BTW: The line plots for build time and successes on builders seems to be
MIA. Not sure if that is an upstream issue, or due to something in our
www-base build.

Do you have any suggestions for where to look for the cause of the problem?


--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users





--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users

Reply via email to