Hi Pierre,
As always, thanks for the reply.
Just to be perfectly clear, most of the recent troubles weren't with
buildbot itself. It's more my unfamiliarity with the Python ecosystem.
We are, for various reasons, building everything from source
distributions, so Docker or Ansible isn't really what we're doing. What
I have is a straight-up bash script. Not a great one, though.
Thanks for the ideas. Changing the name might work in the short run. Not
a good long-term solution for us, as we have a specific naming
convention to keep people from going crazy.
The problem in this case isn't that the worker won't start or complete
builds, so it isn't quite the usual problem. It's that we have a builder
that claims 2 builds are in progress. The most recent is acquiring locks
because the other builder on that worker is building. It's the build
before that one that claims to still be building, even though it's
obviously stalled.
The same builder picked up the lock 2 builds in a row. I seem to recall
you saying that's possible. In this case, we do need the output of the
screwed-up builder, so I really can't wait for another 3-day build.
Looks like it'll get moved to the alternate master for now.
It's good to know that the problem is probably in the Python rather than
in the database.
Some figures from the impending multi-master move:
master    workers  builders  schedulers
master 1      164       355          80
master 2       27        87          80
master 3        2         6          80
master 4        0         0         450
Masters 1 and 2 represent the two major divisions of labor that we have.
Master 3 contains those builders that people notice and complain about
most when they fail or are not present, even more than the UI. Master 4
contains only the UI and force schedulers. Since you said that the
schedulers behave regardless of how many masters they are on, I left
them on all the masters. At some point, I expect to split master 1 up
some. And I'll be providing guidance to those here who might want to set
up a semi-private master for a specific project.
People here expect to see the UI on a specific URL, so I'll end up
bringing up multi-master on its own, then reconfiguring to point the UI
master to where it ought to be.
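
For anyone following along, here is a minimal sketch of the settings a layout like this relies on, assuming a crossbar (WAMP) router and one shared database for all masters. The helper name, host names, realm, and port are placeholders made up for illustration, not our real values. The keys themselves (multiMaster, db, mq, www) are the ones the buildbot docs describe for multi-master.

```python
def multimaster_settings(db_url, router_url, realm, ui_port=None):
    """Return the master.cfg keys shared by every master in the cluster.

    Hypothetical helper: only the dictionary keys come from buildbot;
    the function and its parameters are illustrative.
    """
    settings = {
        # Tell this master that others share the same database.
        "multiMaster": True,
        # All masters must point at the same database.
        "db": {"db_url": db_url},
        # All masters talk to each other through the same WAMP router.
        "mq": {"type": "wamp", "router_url": router_url, "realm": realm},
    }
    if ui_port is not None:
        # Only the UI master serves the web interface.
        settings["www"] = {"port": ui_port}
    return settings


# Placeholder values, assumed for the example only.
DB_URL = "postgresql://buildbot@dbhost/buildbot"
ROUTER_URL = "ws://crossbar-host:8080/ws"
REALM = "buildbot"

build_master = multimaster_settings(DB_URL, ROUTER_URL, REALM)
ui_master = multimaster_settings(DB_URL, ROUTER_URL, REALM, ui_port=8010)
```

In a real master.cfg the result would be merged into the config dictionary, for example with c.update(multimaster_settings(...)), with workers, builders, and schedulers added per master as in the figures above.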
Neil
On 9/29/2016 3:54 AM, Pierre Tardy wrote:
Hi Neil,
Good to know your multimaster setup is nearly done!
Indeed, a scripted multimaster setup would be great!
A docker-compose setup or an Ansible playbook would be perfect, I think.
For the other part of the anecdote, I think we are still chasing the
same bug, right?
At some point, somehow, you got a worker that is in a bad state and
won't accept any more builds.
The idea of moving it to another master is indeed a workaround. Another
one is to change the name of the worker.
Obviously the best long-term option is to debug the problem, but it
doesn't look easy to reproduce or to debug :-(
Hacking the database? Naah... don't. Really.
It won't even help, as the corrupted state is most likely in the Python
objects.
Pierre
On Wed, 28 Sep 2016 at 22:33, Neil Gilmore <[email protected]> wrote:
Hi everyone,
Congrats on rc4.
More anecdotes from rc1. I got tangled up a bit trying to get
multi-master working. I'm still not sure why all the parts would build
one day, then not the next (in this case, it was setuptools). Nor why
crossbar requires libffi to be installed on one machine but not the
other. Nor why SQLAlchemy will be downloaded and installed automatically
but not psycopg2. These troubles seem to have straightened themselves
out, and I have multi-master buildbots in sandboxes on 2 different
machines. There's light at the end of the tunnel, I hope.
As a side note, Pierre, I ended up scripting the whole install/build/run
thing. That may have to do for a tutorial.
I got asked for help with a builder. Seems it was taking inordinately
long to do a build, and the user tried cancelling, forcing, etc. There
are 3 builders for this worker. 1 doesn't use locks, but the other 2 do.
It's pretty common for our workers to have one builder that doesn't lock
while the rest do.
The current situation is that the builder in question shows not 1, but
2 builds building. Sort of: the current build is shown as acquiring
locks. The older building build is clearly stalled.
The other builder for the worker is proceeding well (but its builds take
about 3 days). Obviously, it was able to get the lock. But it has
started another build after finishing the first one. So it appears that
it got the lock again before the original builder (unless there's
something else going on).
I also had a different worker's build stall, so I moved that worker to
our alternate master. Unfortunately, it's a trick that only works once.
If I move it back, it'll still be stalled. Is there any way to remove a
no longer active worker from the database? I tried once, but I messed it
up and had to start with an empty database. I didn't try again.
Neil Gilmore
grammatech.com <http://grammatech.com>
_______________________________________________
users mailing list
[email protected] <mailto:[email protected]>
https://lists.buildbot.net/mailman/listinfo/users