Re: [[email protected]] Performance issues in 2.10

Yngve N. Pettersen Wed, 13 Jan 2021 06:42:35 -0800

Hi,

I've been thinking a bit.

I am not familiar with the defer system, so I don't know the deep detailsabout it, so please take the following speculation with some big spoons ofsalt.

What I started wondering about is, what happens with the deferring systemif the following happens?


 - The list of builders is requested
 - for each builder the 1000 most recent builds are requested
 - for each build the change entries are requests
 - for each build or change entry the source stamps are requested

Could this lead to such a long queue of deferred function calls that theserver gets bogged down handling them, not being able to even forward datathat are ready to the web front-end?

I currently suspect that there is a missing restriction (either numbers ofbuilds or time-range) on requesting information about the individualbuilds.

Requesting a list of 1000 builds for each builder is probably OK as thequery should be quick, but the question is how many of those builds oneshould requests deep details of?

Another, related aspect of what I saw in the DB log is that there weremany requests for single items, especially for source stamps and changes.Maybe they could be gathered together into a single requests for X numberof records? That does require keeping a separate list/queue with the itemsthat has been requested, and might be complicated to implement.

On Wed, 13 Jan 2021 11:32:37 +0100, Povilas Kanapickas <[email protected]>wrote:

Hi,

Looking into diff between 2.7 and 2.10, pretty much the only difference
in the existing data and DB APIs are related to changes. So I think it's
very likely there's something not obvious we didn't consider.

I will build a test case and see if I can reproduce the problem. Long
term I think we need to have tests to prevent this kind of issues. I'll
think about how this can be done.

Yngve, thanks a lot for your work investigating this issue.

Cheers,
Povilas

On 1/13/21 12:26 PM, Pierre Tardy wrote:

Hello,

getPrevSuccessfulBuild is called by getChanges for build which in turn
is called by /builds/NN/changes Rest API.
the bug Vlad was referring to was a perf issue on the /changes API,
which has been fixed a while back.

Indeed, this algorithm is far from optimized,but I don't see why this
would lead to main thread blocking. Looking at the code, I see that
there are no big loops that do not yield to the main reactor loop.

I insist on the buildbot profiler. What I was saying before is that you
need to hit the record button before the problem appears, and put a
large enough record time to be sure to catch a spike.
Then, you will be able to zoom to the cpu spike and catch the issue
precisely.

If the spike is in the order of minutes like you said, you can configure
it like this and get enough samples to get enough evidence to where the
code is actually spending time:

ProfilerService(frequency=500, gatherperiod=60 * 60, mode='virtual',basepath=None, wantBuilds=100


This will record for one hour, and mitigate the memory used if you worry
about it.

Pierre


Le mer. 13 janv. 2021 à 11:01, Yngve N. Pettersen <[email protected]
<mailto:[email protected]>> a écrit :


    Hi again,

    I was just able to get a partial database log from a freeze incident
    when
    refreshing the Builds->Builders page.

    It looks like Vlad is on the right track.

There are a *lot* of individual source stamp requests, but alsorequests

    to the builds and change tables.

    An interesting part of the builds request is this request:

    SELECT builds.id <http://builds.id>, builds.number,
    builds.builderid, builds.buildrequestid,

builds.workerid, builds.masterid, builds.started_at,builds.complete_at,

    builds.state_string, builds.results
                 FROM builds
                 WHERE builds.builderid = 51 AND builds.number < 46 AND
    builds.results = 0 ORDER BY builds.complete_at DESC
                  LIMIT 1000 OFFSET 0

which appears to then be followed by a lot of changes and sourcestamp

    requests.

The log contains a lot of these requests per second; according tothe DB

    graph 200 to 400 per second.

    The 1000 limit appears to come from

db.builds.BuildsConnectorComponent.getPrevSuccessfulBuild(), butthatvalue seems to have been that way for a while, so the problem islikely

    caused by something else. This function did show up at the beginning
    of my
    traces related to these freezes.


    One possibility that I can think of, is that several of these pages,

or the functions they are using, are no longer restricting howfar backin the build history they are fetching build information for.E.g the Builders page is only supposed to show a couple of days ofbuildsfor each builder, so there should be no need to fetch data for a1000 builds (making sure you have the build ids is one thing,fetching all the associated data even for builds that are not to bedisplayed is

    something    else).

    BTW, I have noticed that another page, the waterfall, is not
    displaying    anything, even after waiting for a very long time.


    On Wed, 13 Jan 2021 01:34:57 +0100, Yngve N. Pettersen
    <[email protected] <mailto:[email protected]>>
    wrote:

    > Hi,
    >
    > Thanks for that info.
    >
    > In my case the problem is apparently something that happens now
    and then.
    >
    > As mentioned, I have seen it on the Builds->Builders and

Builds->Workers > pages, neither of which includes any changelogaccess AFAIK.

    >
    > I have also seen it occasionally on individual build pages, which
    has a    > log of steps with logs, and a changelog panel.
    >
    > Just a few minutes ago I saw this freeze/spike happen while the

buildbot > manager was completely idle, since all active taskshad completed,and I > had paused all workers since I needed to restart themanager (due

    to the    > hanging build).
    >
    > I have also had reports about the Grid view and Console pages
    displaying    > this issue, but have not seen it myself.
    >
    > At present I have enabled logging in the postgresql server, so

maybe I > can figure out what requests are handled during thespike.

    >
    >
    >
    > On Wed, 13 Jan 2021 00:38:34 +0100, Vlad Bogolin <[email protected]
    <mailto:[email protected]>>    > wrote:
    >
    >> Hi,
    >>

>> I have experienced some similar interface freezes while trying to>> configure

    >> our version of buildbot. I now remember two cases:
    >>
    >> 1) A "changes" API problem where it seemed that the "limit"
    argument was
    >> ignored in some cases which translated into a full changes table
    scan.    >> This
    >> was reproducible when hitting the "Builds > Last Changes"
    dashboard and

>> then all the other pages were frozen. There are other requests to>> changes,

    >> so this may be related to the Builds page too. Also, this only

>> happened when the number of changes from the db was high. I was>> planning on

    >> submitting a proper fix, but we are running a custom version of
    2.7.1    >> where
    >> I implemented a fast workaround and did not managed to submit a
    proper    >> fix
    >> (hope to be able to do it next week).
    >>

>> 2) We experienced the same issue as you describe when a lot oflogs >> where>> coming (which seems to be your case too) and the master processwas

    >> overwhelmed when multiple builds were running in the same time
    (constant

>> CPU usage around ~120%). We solved the issue by switching tomulti >> master

    >> and limiting the amount of logs, but if you say that this was not
    an    >> issue

>> in 2.7 I would really be interested in finding out what is theroot >> cause>> (I thought it was the high amount of logs). You can test this>> hypothesis by

    >> limiting the amount of running builds and see if the issue keeps
    >> reproducing.
    >>

>> What worked for me in order to find out the "changes" APIproblem was

    >> visiting each dashboard and see if the freeze occurs or not.
    >>
    >> Hope this helps!
    >>
    >> Cheers,
    >> Vlad
    >>
    >> On Wed, Jan 13, 2021 at 12:40 AM Yngve N. Pettersen
    <[email protected] <mailto:[email protected]>>
    >> wrote:
    >>
    >>> On Tue, 12 Jan 2021 22:13:50 +0100, Pierre Tardy
    <[email protected] <mailto:[email protected]>>    >>> wrote:
    >>>
    >>> > Thanks for the update.
    >>> >
    >>> > Some random thoughts...
    >>> >

>>> > You should probably leave the profiler open until you get the>>> performance

    >>> > spike.
    >>> > If you are inside the spike when starting, indeed, you won't
    be able    >>> to
    >>> > start profiler, but if it is started before the spike it for
    sure    >>> will
    >>> > detect exactly where the code is.
    >>>
    >>> I did have the profiler open in this latest case; as far as I
    could    >>> tell
    >>> it still didn't start recording until after the spike ended
    (there was    >>> no
    >>> progress information in the recorder line).
    >>>
    >>> The two major items showing up were
    >>>
    >>>    /buildbot/db/builds.py+91:getPrevSuccessfulBuild
    >>>    /buildbot/db/pool.py+190:__thd
    >>>
    >>> but I think they were recorded after the spike.
    >>>

>>> I am planning to activate more detailed logging in thepostgresql >>> server,>>> but have not done that yet (probably need to shut down andrestart

    >>> buildbot when I do).
    >>>
    >>>
    >>> BTW, I suspect that this issue can also cause trouble for builds
    whose
    >>> steps ends at the time the problem is occurring; I just noticed
    a task

>>> that is still running more than 4 hours after it started a stepthat

    >>> should have been killed after 20 minutes if it was hanging. It
    should
    >>> have
    >>> ended at about the time one of the hangs was occuring. And it is

>>> impossible to stop the task for some reason, even shutting downthe

    >>> worker
    >>> process did not work. AFAIK the only way to fix the issue is to
    shut    >>> the
    >>> buildbot manager down.
    >>>

>>> > statistic profiling will use timer interrupts which willpreempt >>> anything

    >>> > that is running, and make a call stack trace.
    >>> >
    >>> > Waiting for repro, if, from the db log, you manage to get the
    info of
    >>> > what

>>> > kind of db data that is, maybe we can narrow down the usual>>> suspects..

    >>> >
    >>> > If there are lots of short selects like you said, usually, you
    would
    >>> > have a
    >>> > back and forth from reactor thread to db thread, so it sounds
    weird.
    >>> > What can be leading to your behavior is that whatever is
    halting the
    >>> > processing, everything is queued up in between, and unqueued
    when it    >>> is
    >>> > finished, which could lead to spike of db actions in the end
    of the
    >>> > event.
    >>>

>>> The DB actions were going on for the entire 3 minutes that spike>>> lasted;>>> it is not a burst at either end, but a ~180 second longcontinuous

    >>> sequence (or barrage) of approximately 70-90000 transactions, if
    I am
    >>> interpreting the graph data correctly.
    >>>
    >>> > Regards
    >>> > Pierre
    >>> >
    >>> >

>>> > Le mar. 12 janv. 2021 à 21:49, Yngve N. Pettersen >>><[email protected] <mailto:[email protected]>> a

    >>> > écrit :
    >>> >
    >>> >> Hi again,
    >>> >>
    >>> >> A bit of an update.
    >>> >>
    >>> >> I have not been able to locate the issue using the profiler.
    >>> >>
    >>> >> It seems that when Buildbot gets into the problematic mode,
    then the

>>> >> profiler is not able to work at all. It only startscollecting >>> after the

    >>> >> locked mode is resolved.
    >>> >>
    >>> >> It does seem like the locked mode occurs when Buildbot is
    fetching    >>> a lot
    >>> >> of data from the DB and then spends a lot of time processing
    that    >>> data,
    >>> >> without yielding to other processing needs.
    >>> >>

>>> >> Looking at the monitoring of the server, it also appears that>>> buildbot

    >>> >> is

>>> >> fetching a lot of data. During the most recent instance, the>>> returned

    >>> >> tuples count in the graph for the server indicates three
    minutes    >>> of, on

>>> >> average 25000 tuples returned, with spikes to 80K and 100K,per >>> second.

    >>> >>
    >>> >> The number of open connections rose to 6 or 7, and the
    transaction    >>> count

>>> >> was 400-500 per second during the whole time (rolled back>>> transactions,

    >>> >> which I assume is just one or more selects).
    >>> >>
    >>> >> IMO this makes it look like, while requesting these data,
    Buildbot    >>> is
    >>> >> *synchronously* querying the DB and processing the returned
    data,    >>> not

>>> >> yielding. It might also be that it is requesting data moredata >>> than it>>> >> needs, and also requesting other data earlier than it isactually

    >>> >> needed.
    >>> >>
    >>> >>
    >>> >>
    >>> >> On Tue, 12 Jan 2021 12:48:40 +0100, Yngve N. Pettersen
    >>> >> <[email protected] <mailto:[email protected]>>
    >>> >>
    >>> >> wrote:
    >>> >>
    >>> >> > Hi,
    >>> >> >
    >>> >> > IIRC the only real processing in our system that might be
    heavy is
    >>> >> done
    >>> >> > via logobserver.LineConsumerLogObserver in a class (now)
    derived    >>> from
    >>> >> > ShellCommandNewStyle, so if that is the issue, and
    deferToThread    >>> is
    >>> >> the
    >>> >> > solution, then if it isn't already done, my suggestion
    would be to
    >>> >> > implement that inside the code handling the log observers.
    >>> >> >
    >>> >> > I've tested the profiler a little, but haven't seen any
    samples    >>> within

>>> >> > our code so far, just inside buildbot, quite a lot of logDB >>> actions,

    >>> >> > also some TLS activity.
    >>> >> >
    >>> >> > The performance issue for those pages seems to be a bit
    flaky; at
    >>> >> > present its not happening AFAICT
    >>> >> >

>>> >> > On Tue, 12 Jan 2021 10:59:42 +0100, Pierre Tardy >>><[email protected] <mailto:[email protected]>>

    >>> >> > wrote:
    >>> >> >
    >>> >> >> Hello,
    >>> >> >>
    >>> >> >> A lot of things happen between 2.7 and 2.10, although I
    don't see
    >>> >> >> anything
    >>> >> >> which could impact the performance that much. (maybe new
    reporter
    >>> >> >> framework, but really not convinced)
    >>> >> >> If you see that the db is underutilized this must be a
    classical
    >>> >> reactor
    >>> >> >> starvation.
    >>> >> >> With asynchronous systems like buildbot, you shouldn't do
    any    >>> heavy
    >>> >> >> computation in the main event loop thread, those must be
    done in    >>> a
    >>> >> >> thread
    >>> >> >> via deferToThread and co.
    >>> >> >>
    >>> >> >> Those are the common issues you can have with performance
    >>> >> >> independantly from upgrade regressions:
    >>> >> >>
    >>> >> >> 1) Custom steps:
    >>> >> >> A lot of time, we see people struggling with performance
    when    >>> they
    >>> >> just
    >>> >> >> have some custom step doing heavy computation that block
    the main
    >>> >> thread

>>> >> >> constantly, preventing all the very quick tasks to run in//.

    >>> >> >>
    >>> >> >> 2) too much logs

>>> >> >> In this case, there is not much to do beside reducingthe log

    >>> >> amount.
    >>> >> >> This
    >>> >> >> would be the time to switch to a multi-master setup, where
    you    >>> put 2
    >>> >> >> masters for builds, and one master for web UI.
    >>> >> >> You can put those in the same machine/VM, no problem, the
    only    >>> work
    >>> >> is
    >>> >> >> to
    >>> >> >> have separate processes that each have several event
    queues. You
    >>> can
    >>> >> use
    >>> >> >> docker-compose or kubernetes in order to more easily
    create such
    >>> >> >> deployment. We don't have readily useable for that, but
    several
    >>> >> people
    >>> >> >> have
    >>> >> >> done and documented it, for example
    >>> >> >> https://github.com/pop/buildbot-on-kubernetes
    >>> >> >>
    >>> >> >>
    >>> >> >> I have developed the buildbot profiler in order to quickly
    find
    >>> >> those.
    >>> >> >> You
    >>> >> >> just have to install it as a plugin and start a profile
    whenever    >>> the
    >>> >> >> buildbot feels slow.

>>> >> >> It is a statistical profiler, so it will not significantly>>> change the

    >>> >> >> actual performance so it is safe to run in production.
    >>> >> >>
    >>> >> >> https://pypi.org/project/buildbot-profiler/
    >>> >> >>
    >>> >> >>
    >>> >> >> Regards,
    >>> >> >> Pierre
    >>> >> >>
    >>> >> >>
    >>> >> >> Le mar. 12 janv. 2021 à 01:29, Yngve N. Pettersen
    >>> >> <[email protected] <mailto:[email protected]>> a
    >>> >> >> écrit :
    >>> >> >>
    >>> >> >>> Hello all,
    >>> >> >>>

>>> >> >>> We have just upgraded our buildbot system from 2.7 to2.10.

    >>> >> >>>
    >>> >> >>> However, I am noticing performance issues when loading
    these    >>> pages:
    >>> >> >>>
    >>> >> >>>   Builds->Builders
    >>> >> >>>   Builds->Workers
    >>> >> >>>   individual builds
    >>> >> >>>

>>> >> >>> Loading these can take several minutes, although thereare >>> periods

    >>> >> of
    >>> >> >>> immediate responses.
    >>> >> >>>
    >>> >> >>> What I am seeing on the buildbot manager machine is that
    the    >>> Python3
    >>> >> >>> process hits 90-100% for the entire period.
    >>> >> >>>
    >>> >> >>> The Python version is 3.6.9 running on Ubuntu 18.04
    >>> >> >>>
    >>> >> >>> As far as I can tell, the Postgresql database is mostly
    idle    >>> during
    >>> >> >>> this
    >>> >> >>> period. I did do a full vacuum a few hours ago, in case
    that    >>> was the
    >>> >> >>> issue.
    >>> >> >>>
    >>> >> >>> There are about 40 builders, and 30 workers in the
    system, only
    >>> >> about

>>> >> >>> 10-15 of these have a 10-20 builds for the past few days,>>> although

    >>> >> most
    >>> >> >>> of

>>> >> >>> these have active histories of 3000 builds (which domake me

    >>> wonder
    >>> >> if
    >>> >> >>> the
    >>> >> >>> problem could be a lack of limiting the DB queries, at
    present I
    >>> >> have
    >>> >> >>> not
    >>> >> >>> inspected the DB queries).
    >>> >> >>>
    >>> >> >>> The individual builds can have very large log files in
    the build
    >>> >> steps,
    >>> >> >>> in
    >>> >> >>> many cases tens of thousands of lines (we _are_ talking
    about a
    >>> >> >>> Chromium
    >>> >> >>> based project).
    >>> >> >>>
    >>> >> >>> Our changes in the builders and workers JS code are
    minimal (we    >>> are
    >>> >> >>> using
    >>> >> >>> a custom build of www-base), just using different
    information    >>> for
    >>> >> the
    >>> >> >>> build labels (build version number), and grouping the
    builders,
    >>> >> which

>>> >> >>> should not be causing any performance issues. (we havelarger

    >>> >> changes
    >>> >> >>> in

>>> >> >>> the individual builder view, where we include Git commit>>> messages,

    >>> >> and
    >>> >> >>> I
    >>> >> >>> have so far not seen any performance issues there)
    >>> >> >>>
    >>> >> >>> BTW: The line plots for build time and successes on
    builders    >>> seems
    >>> >> to
    >>> >> >>> be
    >>> >> >>> MIA. Not sure if that is an upstream issue, or due to
    something    >>> in
    >>> >> our
    >>> >> >>> www-base build.
    >>> >> >>>
    >>> >> >>> Do you have any suggestions for where to look for the
    cause of    >>> the
    >>> >> >>> problem?
    >>> >> >>>
    >>> >> >>>
    >>> >> >>> --
    >>> >> >>> Sincerely,
    >>> >> >>> Yngve N. Pettersen
    >>> >> >>> Vivaldi Technologies AS
    >>> >> >>> _______________________________________________
    >>> >> >>> users mailing list
    >>> >> >>> [email protected] <mailto:[email protected]>
    >>> >> >>> https://lists.buildbot.net/mailman/listinfo/users
    >>> >> >>>
    >>> >> >
    >>> >> >
    >>> >>
    >>> >>
    >>> >> --
    >>> >> Sincerely,
    >>> >> Yngve N. Pettersen
    >>> >> Vivaldi Technologies AS
    >>> >>
    >>>
    >>>
    >>> --
    >>> Sincerely,
    >>> Yngve N. Pettersen
    >>> Vivaldi Technologies AS
    >>> _______________________________________________
    >>> users mailing list
    >>> [email protected] <mailto:[email protected]>
    >>> https://lists.buildbot.net/mailman/listinfo/users
    >>>
    >
    >


    --
    Sincerely,
    Yngve N. Pettersen
    Vivaldi Technologies AS


_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users



--
Sincerely,
Yngve N. Pettersen
Vivaldi Technologies AS
_______________________________________________
users mailing list
[email protected]
https://lists.buildbot.net/mailman/listinfo/users

Re: [[email protected]] Performance issues in 2.10

Reply via email to