Hello,
getPrevSuccessfulBuild is called by getChanges for a build, which in turn
is called by the /builds/NN/changes REST API.
The bug Vlad was referring to was a perf issue on the /changes API,
which was fixed a while back.
Indeed, this algorithm is far from optimized, but I don't see why this
would lead to main-thread blocking. Looking at the code, I see that
there are no big loops that fail to yield to the main reactor loop.
I still recommend the buildbot profiler. What I was saying before is that
you need to hit the record button before the problem appears, and set a
large enough recording time to be sure to catch a spike.
Then you will be able to zoom in on the CPU spike and pinpoint the issue
precisely.
If the spike is in the order of minutes like you said, you can configure
the profiler like this and get enough samples to show where the code is
actually spending its time:
ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual',
                basepath=None, wantBuilds=100)
This will record for one hour, and mitigate the memory used if you worry
about it.
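For reference, a minimal master.cfg registration sketch. The parameters are the ones above; the `c['services']` line follows buildbot's usual service-registration pattern, so check the buildbot-profiler README for the exact setup on your version:

```python
# master.cfg -- minimal sketch, not a drop-in config.
from buildbot_profiler import ProfilerService

c['services'].append(ProfilerService(
    frequency=500,         # samples per second
    gatherperiod=60 * 60,  # record for one hour
    mode='virtual',
    basepath=None,
    wantBuilds=100,
))
```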
Pierre
On Wed, 13 Jan 2021 at 11:01, Yngve N. Pettersen <[email protected]>
wrote:
Hi again,
I was just able to get a partial database log from a freeze incident when
refreshing the Builds->Builders page.
It looks like Vlad is on the right track.
There are a *lot* of individual source stamp requests, but also requests
to the builds and changes tables.
An interesting part of the builds request is this request:
SELECT builds.id, builds.number, builds.builderid, builds.buildrequestid,
       builds.workerid, builds.masterid, builds.started_at,
       builds.complete_at, builds.state_string, builds.results
FROM builds
WHERE builds.builderid = 51 AND builds.number < 46 AND builds.results = 0
ORDER BY builds.complete_at DESC
LIMIT 1000 OFFSET 0
which appears to then be followed by a lot of changes and source stamp
requests.
The log contains a lot of these requests per second; according to the DB
graph, 200 to 400 per second.
The 1000 limit appears to come from
db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but that
value seems to have been that way for a while, so the problem is likely
caused by something else. This function did show up at the beginning of
my traces related to these freezes.
One possibility that I can think of is that several of these pages,
or the functions they are using, are no longer restricting how far back
in the build history they fetch build information.
E.g. the Builders page is only supposed to show a couple of days of
builds for each builder, so there should be no need to fetch data for
1000 builds (making sure you have the build ids is one thing; fetching
all the associated data, even for builds that are not to be displayed, is
something else).
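To illustrate the kind of bounding I have in mind, here is a toy sketch against an in-memory SQLite table (the table and column names follow the query quoted above; the data and the two-day cutoff are made up): adding a lower bound on complete_at keeps the query from walking the whole 3000-build history.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE builds (id INTEGER PRIMARY KEY, builderid INTEGER,"
    " number INTEGER, results INTEGER, complete_at INTEGER)"
)

now = int(time.time())
day = 86400
# 3000 historical successful builds for one builder, one per hour.
rows = [(i, 51, i, 0, now - i * 3600) for i in range(3000)]
conn.executemany("INSERT INTO builds VALUES (?, ?, ?, ?, ?)", rows)

# Unbounded variant (like the query in the DB log): walks far into history.
unbounded = conn.execute(
    "SELECT id FROM builds WHERE builderid = 51 AND results = 0"
    " ORDER BY complete_at DESC LIMIT 1000"
).fetchall()

# Bounded variant: only the last two days the page actually displays.
bounded = conn.execute(
    "SELECT id FROM builds WHERE builderid = 51 AND results = 0"
    " AND complete_at >= ? ORDER BY complete_at DESC LIMIT 1000",
    (now - 2 * day,),
).fetchall()

print(len(unbounded), len(bounded))  # 1000 vs 49
```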
BTW, I have noticed that another page, the waterfall, is not
displaying anything, even after waiting for a very long time.
On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen
<[email protected]> wrote:
> Hi,
>
> Thanks for that info.
>
> In my case the problem is apparently something that happens now and
> then.
>
> As mentioned, I have seen it on the Builds->Builders and Builds->Workers
> pages, neither of which includes any changelog access AFAIK.
>
> I have also seen it occasionally on individual build pages, which have a
> list of steps with logs, and a changelog panel.
>
> Just a few minutes ago I saw this freeze/spike happen while the buildbot
> manager was completely idle, since all active tasks had completed, and I
> had paused all workers since I needed to restart the manager (due to the
> hanging build).
>
> I have also had reports about the Grid View and Console pages displaying
> this issue, but have not seen it myself.
>
> At present I have enabled logging in the postgresql server, so maybe I
> can figure out what requests are handled during the spike.
>
>
>
> On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have experienced some similar interface freezes while trying to
>> configure
>> our version of buildbot. I now remember two cases:
>>
>> 1) A "changes" API problem where it seemed that the "limit" argument
>> was ignored in some cases, which translated into a full changes table
>> scan. This was reproducible when hitting the "Builds > Last Changes"
>> dashboard, and then all the other pages were frozen. There are other
>> requests to changes, so this may be related to the Builds page too.
>> Also, this only happened when the number of changes in the db was high.
>> I was planning on submitting a proper fix, but we are running a custom
>> version of 2.7.1 where I implemented a fast workaround and did not
>> manage to submit a proper fix (hope to be able to do it next week).
>>
>> 2) We experienced the same issue as you describe when a lot of logs
>> were coming in (which seems to be your case too) and the master process
>> was overwhelmed when multiple builds were running at the same time
>> (constant CPU usage around ~120%). We solved the issue by switching to
>> multi-master and limiting the amount of logs, but if you say that this
>> was not an issue in 2.7 I would really be interested in finding out
>> what the root cause is (I thought it was the high amount of logs). You
>> can test this hypothesis by limiting the number of running builds and
>> seeing if the issue keeps reproducing.
>>
>> What worked for me in order to find the "changes" API problem was
>> visiting each dashboard and seeing if the freeze occurs or not.
>>
>> Hope this helps!
>>
>> Cheers,
>> Vlad
>>
>> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen
>> <[email protected]> wrote:
>>
>>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy
>>> <[email protected]> wrote:
>>>
>>> > Thanks for the update.
>>> >
>>> > Some random thoughts...
>>> >
>>> > You should probably leave the profiler open until you get the
>>> > performance spike.
>>> > If you are inside the spike when starting, indeed, you won't be able
>>> > to start the profiler, but if it is started before the spike it will
>>> > for sure detect exactly where the code is.
>>>
>>> I did have the profiler open in this latest case; as far as I could
>>> tell it still didn't start recording until after the spike ended
>>> (there was no progress information in the recorder line).
>>>
>>> The two major items showing up were
>>>
>>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild
>>> /buildbot/db/pool.py+190:__thd
>>>
>>> but I think they were recorded after the spike.
>>>
>>> I am planning to activate more detailed logging in the postgresql
>>> server, but have not done that yet (probably need to shut down and
>>> restart buildbot when I do).
>>>
>>>
>>> BTW, I suspect that this issue can also cause trouble for builds whose
>>> steps end at the time the problem is occurring; I just noticed a task
>>> that is still running more than 4 hours after it started a step that
>>> should have been killed after 20 minutes if it was hanging. It should
>>> have ended at about the time one of the hangs was occurring. And it is
>>> impossible to stop the task for some reason; even shutting down the
>>> worker process did not work. AFAIK the only way to fix the issue is to
>>> shut the buildbot manager down.
>>>
>>> > Statistical profiling uses timer interrupts which will preempt
>>> > anything that is running, and make a call stack trace.
>>> >
>>> > Waiting for a repro; if, from the db log, you manage to get the info
>>> > of what kind of db data that is, maybe we can narrow down the usual
>>> > suspects..
>>> >
>>> > If there are lots of short selects like you said, usually you would
>>> > have a back and forth from reactor thread to db thread, so it sounds
>>> > weird.
>>> > What can be leading to your behavior is that whatever is halting the
>>> > processing, everything is queued up in between, and unqueued when it
>>> > is finished, which could lead to a spike of db actions at the end of
>>> > the event.
>>>
>>> The DB actions were going on for the entire 3 minutes that the spike
>>> lasted; it is not a burst at either end, but a ~180 second long
>>> continuous sequence (or barrage) of approximately 70-90000
>>> transactions, if I am interpreting the graph data correctly.
>>>
>>> > Regards
>>> > Pierre
>>> >
>>> >
>>> > On Tue, 12 Jan 2021 at 21:49, Yngve N. Pettersen
>>> > <[email protected]> wrote:
>>> >
>>> >> Hi again,
>>> >>
>>> >> A bit of an update.
>>> >>
>>> >> I have not been able to locate the issue using the profiler.
>>> >>
>>> >> It seems that when Buildbot gets into the problematic mode, the
>>> >> profiler is not able to work at all. It only starts collecting
>>> >> after the locked mode is resolved.
>>> >>
>>> >> It does seem like the locked mode occurs when Buildbot is fetching
>>> >> a lot of data from the DB and then spends a lot of time processing
>>> >> that data, without yielding to other processing needs.
>>> >>
>>> >> Looking at the monitoring of the server, it also appears that
>>> >> buildbot is fetching a lot of data. During the most recent
>>> >> instance, the returned tuples count in the graph for the server
>>> >> indicates three minutes of, on average, 25000 tuples returned per
>>> >> second, with spikes to 80K and 100K.
>>> >>
>>> >> The number of open connections rose to 6 or 7, and the transaction
>>> >> count was 400-500 per second during the whole time (rolled back
>>> >> transactions, which I assume is just one or more selects).
>>> >>
>>> >> IMO this makes it look like, while requesting these data, Buildbot
>>> >> is *synchronously* querying the DB and processing the returned
>>> >> data, not yielding. It might also be that it is requesting more
>>> >> data than it needs, and also requesting other data earlier than it
>>> >> is actually needed.
>>> >>
>>> >>
>>> >>
>>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
>>> >> <[email protected]> wrote:
>>> >>
>>> >> > Hi,
>>> >> >
>>> >> > IIRC the only real processing in our system that might be heavy
>>> >> > is done via logobserver.LineConsumerLogObserver in a class (now)
>>> >> > derived from ShellCommandNewStyle, so if that is the issue, and
>>> >> > deferToThread is the solution, then if it isn't already done, my
>>> >> > suggestion would be to implement that inside the code handling
>>> >> > the log observers.
>>> >> >
>>> >> > I've tested the profiler a little, but haven't seen any samples
>>> >> > within our code so far, just inside buildbot: quite a lot of log
>>> >> > DB actions, also some TLS activity.
>>> >> >
>>> >> > The performance issue for those pages seems to be a bit flaky;
>>> >> > at present it's not happening, AFAICT.
>>> >> >
>>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy
>>> >> > <[email protected]> wrote:
>>> >> >
>>> >> >> Hello,
>>> >> >>
>>> >> >> A lot of things happened between 2.7 and 2.10, although I don't
>>> >> >> see anything which could impact the performance that much (maybe
>>> >> >> the new reporter framework, but I'm really not convinced).
>>> >> >> If you see that the db is underutilized, this must be a
>>> >> >> classical reactor starvation.
>>> >> >> With asynchronous systems like buildbot, you shouldn't do any
>>> >> >> heavy computation in the main event loop thread; that work must
>>> >> >> be done in a thread via deferToThread and co.
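The deferToThread pattern can be sketched with the stdlib asyncio analogue, run_in_executor, so the example is self-contained; buildbot itself uses Twisted's twisted.internet.threads.deferToThread, and the ticker task below is a hypothetical stand-in for the reactor's other quick callbacks:

```python
import asyncio

def heavy_computation(n):
    # CPU-bound work; run directly on the event loop thread, this would
    # block every other callback (the "reactor starvation" above).
    return sum(i * i for i in range(n))

async def main():
    loop = asyncio.get_running_loop()
    ticks = 0

    async def ticker():
        # Stand-in for the loop's other quick tasks (web UI requests, etc.).
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    t = asyncio.ensure_future(ticker())
    # Offload the heavy work to a worker thread (the asyncio analogue of
    # Twisted's deferToThread); the loop keeps servicing ticker meanwhile.
    result = await loop.run_in_executor(None, heavy_computation, 3_000_000)
    t.cancel()
    return result, ticks

result, ticks = asyncio.run(main())
print(result, ticks)
```

Calling heavy_computation directly inside main() instead would leave ticks at zero for the whole duration: nothing else on the loop runs until it returns.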
>>> >> >>
>>> >> >> These are the common issues you can have with performance,
>>> >> >> independently of upgrade regressions:
>>> >> >>
>>> >> >> 1) Custom steps:
>>> >> >> A lot of the time, we see people struggling with performance
>>> >> >> when they just have some custom step doing heavy computation
>>> >> >> that blocks the main thread constantly, preventing all the very
>>> >> >> quick tasks from running in parallel.
>>> >> >>
>>> >> >> 2) Too many logs:
>>> >> >> In this case, there is not much to do besides reducing the log
>>> >> >> amount. This would be the time to switch to a multi-master
>>> >> >> setup, where you put 2 masters for builds, and one master for
>>> >> >> the web UI.
>>> >> >> You can put those on the same machine/VM, no problem; the only
>>> >> >> work is to have separate processes that each have their own
>>> >> >> event queues. You can use docker-compose or kubernetes in order
>>> >> >> to more easily create such a deployment. We don't have anything
>>> >> >> readily usable for that, but several people have done and
>>> >> >> documented it, for example
>>> >> >> https://github.com/pop/buildbot-on-kubernetes
>>> >> >>
>>> >> >>
>>> >> >> I have developed the buildbot profiler in order to quickly find
>>> >> >> those. You just have to install it as a plugin and start a
>>> >> >> profile whenever buildbot feels slow.
>>> >> >> It is a statistical profiler, so it will not significantly
>>> >> >> change the actual performance, and it is safe to run in
>>> >> >> production.
>>> >> >>
>>> >> >> https://pypi.org/project/buildbot-profiler/
>>> >> >>
>>> >> >>
>>> >> >> Regards,
>>> >> >> Pierre
>>> >> >>
>>> >> >>
>>> >> >> On Tue, 12 Jan 2021 at 01:29, Yngve N. Pettersen
>>> >> >> <[email protected]> wrote:
>>> >> >>
>>> >> >>> Hello all,
>>> >> >>>
>>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
>>> >> >>>
>>> >> >>> However, I am noticing performance issues when loading these
>>> >> >>> pages:
>>> >> >>>
>>> >> >>> Builds->Builders
>>> >> >>> Builds->Workers
>>> >> >>> individual builds
>>> >> >>>
>>> >> >>> Loading these can take several minutes, although there are
>>> >> >>> periods of immediate responses.
>>> >> >>>
>>> >> >>> What I am seeing on the buildbot manager machine is that the
>>> >> >>> Python3 process hits 90-100% for the entire period.
>>> >> >>>
>>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
>>> >> >>>
>>> >> >>> As far as I can tell, the Postgresql database is mostly idle
>>> >> >>> during this period. I did do a full vacuum a few hours ago, in
>>> >> >>> case that was the issue.
>>> >> >>>
>>> >> >>> There are about 40 builders and 30 workers in the system; only
>>> >> >>> about 10-15 of these have 10-20 builds for the past few days,
>>> >> >>> although most of these have active histories of 3000 builds
>>> >> >>> (which does make me wonder if the problem could be a lack of
>>> >> >>> limiting on the DB queries; at present I have not inspected the
>>> >> >>> DB queries).
>>> >> >>>
>>> >> >>> The individual builds can have very large log files in the
>>> >> >>> build steps, in many cases tens of thousands of lines (we _are_
>>> >> >>> talking about a Chromium-based project).
>>> >> >>>
>>> >> >>> Our changes in the builders and workers JS code are minimal
>>> >> >>> (we are using a custom build of www-base): just using different
>>> >> >>> information for the build labels (build version number), and
>>> >> >>> grouping the builders, which should not be causing any
>>> >> >>> performance issues. (We have larger changes in the individual
>>> >> >>> builder view, where we include Git commit messages, and I have
>>> >> >>> so far not seen any performance issues there.)
>>> >> >>>
>>> >> >>> BTW: The line plots for build time and successes on builders
>>> >> >>> seem to be MIA. Not sure if that is an upstream issue, or due
>>> >> >>> to something in our www-base build.
>>> >> >>>
>>> >> >>> Do you have any suggestions for where to look for the cause of
>>> >> >>> the problem?
>>> >> >>>
>>> >> >>>
>>> >> >>> --
>>> >> >>> Sincerely,
>>> >> >>> Yngve N. Pettersen
>>> >> >>> Vivaldi Technologies AS
>>> >> >>> _______________________________________________
>>> >> >>> users mailing list
>>> >> >>> [email protected]
>>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
>>> >> >>>
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >> --
>>> >> Sincerely,
>>> >> Yngve N. Pettersen
>>> >> Vivaldi Technologies AS
>>> >>
>>>
>>>
>>> --
>>> Sincerely,
>>> Yngve N. Pettersen
>>> Vivaldi Technologies AS
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://lists.buildbot.net/mailman/listinfo/users
>>>
>
>
--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users