Hi,

Looking into the diff between 2.7 and 2.10, pretty much the only differences in the existing data and DB APIs are related to changes. So I think it's very likely there's something non-obvious we didn't consider.
I will build a test case and see if I can reproduce the problem. Long term, I think we need to have tests to prevent this kind of issue. I'll think about how this can be done.

Yngve, thanks a lot for your work investigating this issue.

Cheers,
Povilas

On 1/13/21 12:26 PM, Pierre Tardy wrote:
> Hello,
>
> getPrevSuccessfulBuild is called by getChanges for a build, which in turn is called by the /builds/NN/changes REST API. The bug Vlad was referring to was a perf issue on the /changes API, which was fixed a while back.
>
> Indeed, this algorithm is far from optimized, but I don't see why it would lead to main-thread blocking. Looking at the code, I see that there are no big loops that do not yield to the main reactor loop.
>
> I insist on the buildbot profiler. What I was saying before is that you need to hit the record button before the problem appears, and set a large enough record time to be sure to catch a spike. Then you will be able to zoom in on the CPU spike and catch the issue precisely.
>
> If the spike is on the order of minutes like you said, you can configure it like this and get enough samples to get enough evidence of where the code is actually spending time:
>
> ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual', basepath=None, wantBuilds=100)
>
> This will record for one hour, and mitigate the memory used if you worry about it.
>
> Pierre
>
> On Wed, 13 Jan 2021 at 11:01, Yngve N. Pettersen <[email protected]> wrote:
>
> > Hi again,
> >
> > I was just able to get a partial database log from a freeze incident when refreshing the Builds->Builders page.
> >
> > It looks like Vlad is on the right track.
> >
> > There are a *lot* of individual source stamp requests, but also requests to the builds and change tables.
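For anyone wanting to try Pierre's suggestion, a master.cfg sketch might look like the following. The parameters are the ones Pierre gives above; the import path and the www plugin flag are my assumptions based on the buildbot-profiler documentation, so check them against the version you install.

```python
# Sketch of wiring buildbot-profiler into master.cfg (config fragment;
# import path and 'profiler' www plugin key are assumptions).
from buildbot_profiler import ProfilerService

c['services'].append(ProfilerService(
    frequency=500,         # samples per second
    gatherperiod=60 * 60,  # keep one hour of samples
    mode='virtual',        # sample CPU time rather than wall time
    basepath=None,
    wantBuilds=100,
))

c['www']['plugins']['profiler'] = True  # enables the profiler UI tab
```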
> > An interesting part of the builds request is this query:
> >
> > SELECT builds.id, builds.number, builds.builderid, builds.buildrequestid,
> >        builds.workerid, builds.masterid, builds.started_at, builds.complete_at,
> >        builds.state_string, builds.results
> > FROM builds
> > WHERE builds.builderid = 51 AND builds.number < 46 AND builds.results = 0
> > ORDER BY builds.complete_at DESC
> > LIMIT 1000 OFFSET 0
> >
> > which appears to then be followed by a lot of changes and source stamp requests.
> >
> > The log contains a lot of these requests per second; according to the DB graph, 200 to 400 per second.
> >
> > The 1000 limit appears to come from db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), but that value seems to have been that way for a while, so the problem is likely caused by something else. This function did show up at the beginning of my traces related to these freezes.
> >
> > One possibility I can think of is that several of these pages, or the functions they are using, are no longer restricting how far back in the build history they fetch build information. E.g. the Builders page is only supposed to show a couple of days of builds for each builder, so there should be no need to fetch data for 1000 builds (making sure you have the build ids is one thing; fetching all the associated data even for builds that are not to be displayed is something else).
> >
> > BTW, I have noticed that another page, the waterfall, is not displaying anything, even after waiting for a very long time.
> >
> > On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > Thanks for that info.
> > >
> > > In my case the problem is apparently something that happens now and then.
> > >
> > > As mentioned, I have seen it on the Builds->Builders and Builds->Workers pages, neither of which includes any changelog access AFAIK.
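To make the shape of the problem concrete, here is a toy simulation of the query pattern described above: one `SELECT ... LIMIT 1000` for candidate builds, then separate sourcestamp and change queries per build. This is not Buildbot's actual code — all names are invented — it just shows how a per-build fan-out multiplies into thousands of round trips for a single API call, consistent with the 200-400 queries/second seen in the DB log.

```python
# Hypothetical sketch of the N+1 query fan-out described in the log
# above; CountingDb only counts queries, it runs no real SQL.

class CountingDb:
    """Fake DB layer that counts how many queries it receives."""
    def __init__(self, n_builds):
        self.n_builds = n_builds
        self.queries = 0

    def get_prev_builds(self, limit=1000):
        self.queries += 1  # one SELECT ... LIMIT 1000
        return list(range(min(self.n_builds, limit)))

    def get_sourcestamps(self, buildid):
        self.queries += 1  # one SELECT per build
        return [buildid]   # pretend each build has one sourcestamp

    def get_changes(self, ssid):
        self.queries += 1  # one SELECT per sourcestamp
        return []

def changes_for_build(db):
    # Walk candidate builds the way a /builds/NN/changes-style call might.
    changes = []
    for buildid in db.get_prev_builds():
        for ssid in db.get_sourcestamps(buildid):
            changes.extend(db.get_changes(ssid))
    return changes

db = CountingDb(n_builds=1000)
changes_for_build(db)
print(db.queries)  # 2001: 1 build query + 1000 sourcestamp + 1000 change queries
```

A few hundred such API calls would already account for tens of thousands of short selects, which is why restricting how far back the pages look matters more than the cost of any single query.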
> > > I have also seen it occasionally on individual build pages, which have a lot of steps with logs, and a changelog panel.
> > >
> > > Just a few minutes ago I saw this freeze/spike happen while the buildbot manager was completely idle, since all active tasks had completed, and I had paused all workers since I needed to restart the manager (due to the hanging build).
> > >
> > > I have also had reports about the Grid view and Console pages displaying this issue, but have not seen it myself.
> > >
> > > At present I have enabled logging in the postgresql server, so maybe I can figure out what requests are handled during the spike.
> > >
> > > On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <[email protected]> wrote:
> > >
> >> Hi,
> >>
> >> I have experienced some similar interface freezes while trying to configure our version of buildbot. I now remember two cases:
> >>
> >> 1) A "changes" API problem where it seemed that the "limit" argument was ignored in some cases, which translated into a full changes table scan. This was reproducible when hitting the "Builds > Last Changes" dashboard, and then all the other pages were frozen. There are other requests to changes, so this may be related to the Builds page too. Also, this only happened when the number of changes in the db was high. I was planning on submitting a proper fix, but we are running a custom version of 2.7.1 where I implemented a fast workaround and did not manage to submit a proper fix (hope to be able to do it next week).
> >>
> >> 2) We experienced the same issue as you describe when a lot of logs were coming in (which seems to be your case too) and the master process was overwhelmed when multiple builds were running at the same time (constant CPU usage around ~120%).
> >> We solved the issue by switching to multi-master and limiting the amount of logs, but if you say that this was not an issue in 2.7, I would really be interested in finding out what the root cause is (I thought it was the high amount of logs). You can test this hypothesis by limiting the number of running builds and seeing if the issue keeps reproducing.
> >>
> >> What worked for me in order to find the "changes" API problem was visiting each dashboard and seeing whether the freeze occurs or not.
> >>
> >> Hope this helps!
> >>
> >> Cheers,
> >> Vlad
> >>
> >> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen <[email protected]> wrote:
> >>
> >>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy <[email protected]> wrote:
> >>>
> >>> > Thanks for the update.
> >>> >
> >>> > Some random thoughts...
> >>> >
> >>> > You should probably leave the profiler open until you get the performance spike. If you are inside the spike when starting, indeed, you won't be able to start the profiler, but if it is started before the spike it will for sure detect exactly where the code is.
> >>>
> >>> I did have the profiler open in this latest case; as far as I could tell it still didn't start recording until after the spike ended (there was no progress information in the recorder line).
> >>>
> >>> The two major items showing up were
> >>>
> >>> /buildbot/db/builds.py+91:getPrevSuccessfulBuild
> >>> /buildbot/db/pool.py+190:__thd
> >>>
> >>> but I think they were recorded after the spike.
> >>>
> >>> I am planning to activate more detailed logging in the postgresql server, but have not done that yet (probably need to shut down and restart buildbot when I do).
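Vlad's case 1) above — a "limit" argument that is silently dropped, turning a bounded query into a full changes-table scan — can be illustrated with a toy query builder. This is hypothetical code, not Buildbot's actual data API; the point is only that the fix is a defensive clamp so an unbounded query is never emitted against a large table.

```python
# Toy illustration of the dropped-"limit" failure mode (hypothetical
# code, not Buildbot's data layer). DEFAULT_LIMIT is an arbitrary
# example value.

DEFAULT_LIMIT = 50

def build_changes_query(limit=None):
    # The guard below is the essence of the fix: if the caller's limit
    # was lost somewhere upstream, clamp it instead of scanning the
    # whole table.
    if limit is None:
        limit = DEFAULT_LIMIT
    return ("SELECT * FROM changes ORDER BY changeid DESC LIMIT %d"
            % limit)

print(build_changes_query())          # falls back to LIMIT 50
print(build_changes_query(limit=10))  # caller's limit is respected
```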
> >>> BTW, I suspect that this issue can also cause trouble for builds whose steps end at the time the problem is occurring; I just noticed a task that is still running more than 4 hours after it started a step that should have been killed after 20 minutes if it was hanging. It should have ended at about the time one of the hangs was occurring. And it is impossible to stop the task for some reason; even shutting down the worker process did not work. AFAIK the only way to fix the issue is to shut the buildbot manager down.
> >>>
> >>> > Statistical profiling uses timer interrupts, which will preempt anything that is running and make a call stack trace.
> >>> >
> >>> > While waiting for a repro: if, from the db log, you manage to get the info of what kind of db data that is, maybe we can narrow down the usual suspects..
> >>> >
> >>> > If there are lots of short selects like you said, usually you would have a back and forth from the reactor thread to the db thread, so it sounds weird. What could be leading to your behavior is that whatever is halting the processing, everything is queued up in between and unqueued when it is finished, which could lead to a spike of db actions at the end of the event.
> >>>
> >>> The DB actions were going on for the entire 3 minutes that the spike lasted; it is not a burst at either end, but a ~180 second long continuous sequence (or barrage) of approximately 70,000-90,000 transactions, if I am interpreting the graph data correctly.
> >>>
> >>> > Regards
> >>> > Pierre
> >>> >
> >>> > On Tue, 12 Jan 2021 at 21:49, Yngve N. Pettersen <[email protected]> wrote:
> >>> >
> >>> >> Hi again,
> >>> >>
> >>> >> A bit of an update.
> >>> >>
> >>> >> I have not been able to locate the issue using the profiler.
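Pierre's description of statistical profiling can be sketched in a few lines: periodically snapshot every thread's call stack and count what you see; code that is hot shows up in many samples. The toy version below samples from a helper thread rather than a timer interrupt, and is only an illustration of the principle, not how buildbot-profiler is actually implemented.

```python
# Toy statistical profiler: a background thread snapshots the top
# stack frame of every other thread at a fixed interval. Hot code
# dominates the sample counts. Illustration only.

import collections
import sys
import threading
import time

samples = collections.Counter()

def sampler(stop, interval=0.001):
    me = threading.get_ident()
    while not stop.is_set():
        # Snapshot the current top frame of every other thread.
        for ident, frame in sys._current_frames().items():
            if ident != me:
                samples[frame.f_code.co_name] += 1
        time.sleep(interval)

def busy_function(deadline):
    while time.time() < deadline:
        pass  # simulated CPU-bound work hogging the "main" thread

stop = threading.Event()
t = threading.Thread(target=sampler, args=(stop,))
t.start()
busy_function(time.time() + 0.2)  # run hot code for ~200 ms
stop.set()
t.join()

# busy_function should dominate the sample counts
print(samples.most_common(3))
```

Because the sampler preempts whatever is running, it catches the hot code even when that code never voluntarily yields — which is exactly why it should work during a reactor-starvation spike, and why the profiler failing to record during the spike is itself an interesting symptom.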
> >>> >> It seems that when Buildbot gets into the problematic mode, the profiler is not able to work at all. It only starts collecting after the locked mode is resolved.
> >>> >>
> >>> >> It does seem like the locked mode occurs when Buildbot is fetching a lot of data from the DB and then spends a lot of time processing that data, without yielding to other processing needs.
> >>> >>
> >>> >> Looking at the monitoring of the server, it also appears that buildbot is fetching a lot of data. During the most recent instance, the returned-tuples count in the graph for the server indicates three minutes of, on average, 25000 tuples returned per second, with spikes to 80K and 100K.
> >>> >>
> >>> >> The number of open connections rose to 6 or 7, and the transaction count was 400-500 per second during the whole time (rolled-back transactions, which I assume is just one or more selects).
> >>> >>
> >>> >> IMO this makes it look like, while requesting these data, Buildbot is *synchronously* querying the DB and processing the returned data, not yielding. It might also be that it is requesting more data than it needs, and requesting other data earlier than it is actually needed.
> >>> >>
> >>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N.
> >>> >> Pettersen <[email protected]> wrote:
> >>> >>
> >>> >> > Hi,
> >>> >> >
> >>> >> > IIRC the only real processing in our system that might be heavy is done via logobserver.LineConsumerLogObserver in a class (now) derived from ShellCommandNewStyle, so if that is the issue, and deferToThread is the solution, then, if it isn't already done, my suggestion would be to implement that inside the code handling the log observers.
> >>> >> >
> >>> >> > I've tested the profiler a little, but haven't seen any samples within our code so far, just inside buildbot: quite a lot of log DB actions, also some TLS activity.
> >>> >> >
> >>> >> > The performance issue for those pages seems to be a bit flaky; at present it's not happening, AFAICT.
> >>> >> >
> >>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy <[email protected]> wrote:
> >>> >> >
> >>> >> >> Hello,
> >>> >> >>
> >>> >> >> A lot of things happened between 2.7 and 2.10, although I don't see anything which could impact the performance that much (maybe the new reporter framework, but I'm really not convinced). If you see that the db is underutilized, this must be classical reactor starvation. With asynchronous systems like buildbot, you shouldn't do any heavy computation in the main event loop thread; that must be done in a thread via deferToThread and co.
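Pierre's advice is Twisted-specific (`twisted.internet.threads.deferToThread`), but the principle can be sketched with the stdlib's asyncio equivalent: hand the CPU-heavy work to a worker thread so the event loop stays free to service other requests. The function names below are invented for illustration and are not Buildbot's API (note also that `asyncio.to_thread` needs Python 3.9+, newer than the 3.6.9 mentioned in this thread).

```python
# Stdlib sketch of the deferToThread pattern: keep CPU-bound work off
# the event-loop thread. Illustrative names, not Buildbot's API.

import asyncio

def parse_log_lines(lines):
    # CPU-bound work: run inline on the loop/reactor thread, this
    # would starve every other pending task.
    return sum(1 for line in lines if "error" in line)

async def summarize(lines):
    # asyncio.to_thread plays the role of Twisted's deferToThread:
    # the loop keeps running while the parsing happens in a
    # thread-pool thread, and we get the result back as an awaitable.
    return await asyncio.to_thread(parse_log_lines, lines)

lines = ["error: link failed", "ok", "error: timeout"]
print(asyncio.run(summarize(lines)))  # 2
```

In Twisted the equivalent call site would be `d = threads.deferToThread(parse_log_lines, lines)`, with the result delivered via the returned Deferred.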
> >>> >> >> These are the common issues you can have with performance, independently of upgrade regressions:
> >>> >> >>
> >>> >> >> 1) Custom steps: a lot of the time, we see people struggling with performance when they just have some custom step doing heavy computation that blocks the main thread constantly, preventing all the very quick tasks from running in parallel.
> >>> >> >>
> >>> >> >> 2) Too many logs: in this case, there is not much to do besides reducing the log amount. This would be the time to switch to a multi-master setup, where you put 2 masters for builds and one master for the web UI. You can put those on the same machine/VM, no problem; the only work is to have separate processes that each have their own event queues. You can use docker-compose or kubernetes in order to more easily create such a deployment. We don't have anything readily usable for that, but several people have done and documented it, for example https://github.com/pop/buildbot-on-kubernetes
> >>> >> >>
> >>> >> >> I have developed the buildbot profiler in order to quickly find those. You just have to install it as a plugin and start a profile whenever buildbot feels slow. It is a statistical profiler, so it will not significantly change the actual performance, and it is safe to run in production.
> >>> >> >>
> >>> >> >> https://pypi.org/project/buildbot-profiler/
> >>> >> >>
> >>> >> >> Regards,
> >>> >> >> Pierre
> >>> >> >>
> >>> >> >> On Tue, 12 Jan 2021 at 01:29, Yngve N.
> >>> >> >> Pettersen <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> Hello all,
> >>> >> >>>
> >>> >> >>> We have just upgraded our buildbot system from 2.7 to 2.10.
> >>> >> >>>
> >>> >> >>> However, I am noticing performance issues when loading these pages:
> >>> >> >>>
> >>> >> >>> Builds->Builders
> >>> >> >>> Builds->Workers
> >>> >> >>> individual builds
> >>> >> >>>
> >>> >> >>> Loading these can take several minutes, although there are periods of immediate responses.
> >>> >> >>>
> >>> >> >>> What I am seeing on the buildbot manager machine is that the Python3 process hits 90-100% for the entire period.
> >>> >> >>>
> >>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04.
> >>> >> >>>
> >>> >> >>> As far as I can tell, the Postgresql database is mostly idle during this period. I did do a full vacuum a few hours ago, in case that was the issue.
> >>> >> >>>
> >>> >> >>> There are about 40 builders and 30 workers in the system; only about 10-15 of these have 10-20 builds for the past few days, although most of these have active histories of 3000 builds (which does make me wonder if the problem could be a lack of limits on the DB queries; at present I have not inspected the DB queries).
> >>> >> >>>
> >>> >> >>> The individual builds can have very large log files in the build steps, in many cases tens of thousands of lines (we _are_ talking about a Chromium-based project).
> >>> >> >>> Our changes in the builders and workers JS code are minimal (we are using a custom build of www-base): just using different information for the build labels (build version number), and grouping the builders, which should not be causing any performance issues. (We have larger changes in the individual builder view, where we include Git commit messages, and I have so far not seen any performance issues there.)
> >>> >> >>>
> >>> >> >>> BTW: The line plots for build time and successes on builders seem to be MIA. Not sure if that is an upstream issue, or due to something in our www-base build.
> >>> >> >>>
> >>> >> >>> Do you have any suggestions for where to look for the cause of the problem?
> >>> >> >>>
> >>> >> >>> --
> >>> >> >>> Sincerely,
> >>> >> >>> Yngve N. Pettersen
> >>> >> >>> Vivaldi Technologies AS
> >>> >> >>> _______________________________________________
> >>> >> >>> users mailing list
> >>> >> >>> [email protected]
> >>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
> >>> >>
> >>> >> --
> >>> >> Sincerely,
> >>> >> Yngve N. Pettersen
> >>> >> Vivaldi Technologies AS
> >>>
> >>> --
> >>> Sincerely,
> >>> Yngve N. Pettersen
> >>> Vivaldi Technologies AS
>
> --
> Sincerely,
> Yngve N.
> Pettersen
> Vivaldi Technologies AS
