[Gluster-devel] Removing myself as maintainer

2019-03-10 Thread Nigel Babu
Hello folks,

This change has gone through, but I wanted to let folks here know as well. I'm 
removing myself as maintainer from everything to reflect that I will no longer 
be the primary point of contact for any of the components I used to own.

However, I will still be around and contributing as I get time and energy.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] Jenkins switched over to new builders for regression

2019-02-08 Thread Nigel Babu
All the RAX builders are now gone. We're running off AWS entirely now.
Please file an infra bug if you notice something odd. For future reference,
logs and cores will be available on https://logs.aws.gluster.org rather than
on individual build servers; going forward, this location should be printed
in the job logs.

On Fri, Feb 8, 2019 at 7:49 AM Nigel Babu  wrote:

> Hello,
>
> We've reached the half way mark in the migration and half our builders
> today are now running on AWS. I've turned off the RAX builders and have
> them try to be online only if the AWS builders cannot handle the number of
> jobs running at any given point.
>
> The new builders are named builder2xx.aws.gluster.org. If you notice an
> infra issue with them, please file a bug. I will be working on adding more
> AWS builders during the day today.
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Jenkins switched over to new builders for regression

2019-02-07 Thread Nigel Babu
Hello,

We've reached the halfway mark in the migration, and half our builders are
now running on AWS. I've turned off the RAX builders and configured them to
come online only if the AWS builders cannot handle the number of jobs running
at any given point.

The new builders are named builder2xx.aws.gluster.org. If you notice an
infra issue with them, please file a bug. I will be working on adding more
AWS builders during the day today.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Regression logs issue

2019-02-07 Thread Nigel Babu
Hello folks,

If you had a regression job fail in the last week, you will not find a log
for it. This is due to a mistake I made while deleting code: along with the
code that pushed logs to an internal HTTP server, I also deleted a line that
handled log creation. Apologies for the mistake. This has now been corrected
and the fix pushed to all regression nodes. Any future failures should have
logs attached as artifacts.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Tests for the GCS stack using the k8s framework

2019-01-03 Thread Nigel Babu
Hello,

Deepshikha and I have been working on understanding and using the k8s
framework for testing the GCS stack. With the help of the folks from
sig-storage, we've managed to write a sample test that needs to be run
against an already set-up k8s cluster with GCS installed on top[1]. This is
a temporary location for the tests and we'll move them into the
gluster-csi-driver repo[2] once some of the dependency issues[3] are sorted
out.

The upstream storage tests are being split out into a test suite[4] that
can be consumed out of tree by folks like us who are implementing a CSI
driver interface. When that happens, we should be able to continuously
validate against the standards set for the storage interface.

[1]: https://github.com/nigelbabu/gcs-test/
[2]: https://github.com/gluster/gluster-csi-driver/
[3]: https://github.com/gluster/gluster-csi-driver/issues/131
[4]:
https://github.com/kubernetes/kubernetes/tree/master/test/e2e/storage/testsuites
-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Infra Update for Nov and Dec

2018-12-19 Thread Nigel Babu
Hello folks,

The infra team has not been sending regular updates recently because we've
been caught up in several pieces of work that ran longer than our two-week
sprint cycles. This is a summary of what we've done since the last update.

* The Bugzilla updates are now done with a python script, and there's a
patch to handle a patch being abandoned and restored; it's pending a merge
and deploy after the holiday season.
* Smoke jobs for python linting and shell linting.
* Smoke jobs for 32-bit builds.

The big piece the infra team has been spending time on is identifying the
best way to write end-to-end testing for GCS (Gluster for Container
Storage). We started with the assumption that we want to use a test
framework that sticks as closely as possible to the upstream Kubernetes and
OpenShift Origin tests. We have taken a 3-pronged approach to this over the
last two months.

1. We want to use machines we have access to right now to verify that the
deployment scripts we publish work as we intend them to. To this end, we
created a job on Centos CI that runs the scripts in the gcs repository[1]
exactly the way we recommend anyone run them. We're running into a couple of
failures and Mrugesh is working on identifying and fixing them. We hope to
have this complete in the first week of January.
2. We want to use the upstream end-to-end test framework built on ginkgo and
gomega. The framework already uses the kubectl client to talk to a
Kubernetes cluster. We had a conversation with the upstream Storage SIG
developers yesterday that pointed us in the right direction, and we're very
close to having a first test. When the first test in the end-to-end
framework lands, we'll hook it up to the test run we have in (1). Deepshikha
and I are actively working on making this happen. We plan to have a proof of
concept in the second week of January and to write documentation and demos
for the GCS team.
3. We want to do testing that actively tries to break a production-sized
cluster and observe how our stack handles failures. There's a longer-term
plan for this, but the work is currently on hold until we get the first two
pieces running. It is also blocked on us having access to infrastructure
where we can make this happen. Mrugesh will lead this activity once the
other blockers are removed.

Once we have the first proof-of-concept test written, we will hand over
writing the tests to the GCS development team, and the infra team will then
move on to building out the infrastructure for running these new tests. We
will continue to work closely with the Kubernetes Storage SIG and the OKD
Infrastructure teams to avoid duplicating work.

[1]: https://ci.centos.org/view/Gluster/job/gluster_anteater_gcs/


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Short review.gluster.org outage in the next 15 mins

2018-11-05 Thread Nigel Babu
Hello folks,

We're going to restart Gerrit on review.gluster.org for a quick config
change in the next 15 minutes. Estimated outage: 5 minutes. I'll update this
thread when we're back online.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Centos CI automation Retrospective

2018-11-02 Thread Nigel Babu
Hello folks,

On Monday, I merged in the changes that allowed all the jobs in Centos CI
to be handled in an automated fashion. In the past, it depended on Infra
team members to review, merge, and apply the changes on Centos CI. I've now
changed that so that the individual job owners can do their own merges.

1. On sending a pull request, a travis-ci job will ensure the YAML is valid
JJB (a rough local equivalent is sketched after this list).
2. On merge, we'll apply the changes to ci.centos.org with travis-ci.
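
For local validation before sending a pull request, something along these
lines should work (a sketch that assumes jenkins-job-builder is installed
and the job definitions live under a jobs/ directory; the actual travis-ci
job may do more):

    pip install --user jenkins-job-builder
    # Render the YAML to Jenkins XML; this fails if the JJB definitions are invalid.
    jenkins-jobs test jobs/ -o /tmp/rendered-jobs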

We had a few issues when we did this change. This was expected, but it took
more time than I anticipated to fix all of them up.

Notably, the GD2 CI issues did not get fixed until today. This was because
the status context was not defined in the yaml file, but only in the UI.
Please avoid making changes only in the UI. I can now confirm that all jobs
are working exactly off their source yaml. Thanks to Kaushal and Madhu for
working with me on solving this issue. Apologies for the inconvenience
caused. If you have a pull request that did not seem to get CI to work,
please send an update with a cosmetic change; that should retrigger CI
correctly.

If you notice anything off, please file an infra bug and we'll be happy to
help.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Gluster Infra Update

2018-10-18 Thread Nigel Babu
Hello folks,

Here's the update from the last 2 weeks from the Infra team.

* Created an architecture document for Automated Upgrade Testing. This is
now done and is undergoing reviews. It is scheduled to be published on the
devel list as soon as we have a decent PoC.
* Finished part of the migration of the bugzilla handling scripts to
python[1]. Sanju discovered a bug[2], so it's been rolled back. We're going
to add the ability to handle an external tracker as well while we fix the
bug.
* Softserve's SSH key handling is better[3]. You no longer have to paste an
SSH key into softserve as long as you have that key on Github. Softserve
will pick up the key from Github and auto-populate that field for you.
* Thanks to Sheersha's work we have a CI job[4] for gluster-ansible-infra
now.
* We're decentralizing the responsibility for handling Centos CI jobs[5].

[1]:
https://github.com/gluster/glusterfs-patch-acceptance-tests/blob/master/github/handle_bugzilla.py
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1636455
[3]: https://github.com/gluster/softserve/pull/48
[4]: https://github.com/gluster/centosci/pull/23
[5]:
https://lists.gluster.org/pipermail/gluster-infra/2018-October/005155.html

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Infra Update for the last 2 weeks

2018-10-03 Thread Nigel Babu
Hello folks,

I meant to send this out on Monday, but it's been a busy few days.
* The infra pieces of distributed regression are now complete. A big shout
out to Deepshikha for driving this and to Ramky for his help in getting it
to completion.
* The GD2 container and CSI container builds work now. We still don't know
why they broke or why they started working again. We're tracking this in a
bug[1].
* Gluster-Infra now has a Sentry.io account, so we discover issues with
softserve or fstat very quickly and are able to debug them promptly.
* We're restarting our efforts to get a nightly Glusto job going and are
running into test failures. We're currently debugging them to separate
actual test failures from infra issues.
* The infra team has been assisting gluster-ansible on and off to help them
build out a set of tests. This has been going steadily and is now waiting on
the infra team to set up CI with the Centos CI team.
* From this sprint on, we're going to spend some time triaging the infra
bugs so they're assigned and in the correct state.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1626453

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] gluster-ansible: status of the project

2018-09-30 Thread Nigel Babu
On Sun, Sep 30, 2018 at 6:45 PM Sachidananda URS  wrote:

>
>
> On Sun, Sep 30, 2018 at 12:56 PM, Yaniv Kaul  wrote:
>
>>
>>
>> On Fri, Sep 28, 2018 at 2:33 PM Sachidananda URS  wrote:
>>
>>> Hi,
>>>
>>> gluster-ansible project is aimed at automating the deployment and
>>> maintenance of GlusterFS cluster.
>>>
>>> The project can be found at:
>>>
>>> * https://github.com/gluster/gluster-ansible
>>> * https://github.com/gluster/gluster-ansible-infra
>>> * https://github.com/gluster/gluster-ansible-features
>>> * https://github.com/gluster/gluster-ansible-maintenance
>>> * https://github.com/gluster/gluster-ansible-cluster
>>> * https://github.com/gluster/gluster-ansible-repositories
>>>
>>> We have the python bindings for GlusterD2 API, and can be found at:
>>>
>>> https://github.com/gluster/python-gluster-mgmt-client
>>>
>>> The goal is to use the python bindings in gluster_ansible module to make
>>> it work with GlusterD2.
>>>
>>> Current status of the project:
>>>
>>> * We have the initial working roles, packages are available at:
>>>- https://copr.fedorainfracloud.org/coprs/sac/gluster-ansible/builds/
>>>
>>> * The initial version supports:
>>>- End-to-end deployment of Gluster Hyperconverged Infrastructure.
>>>- GlusterFS volume management
>>>- GlusterFS brick setup
>>>- Packages and repository management
>>>
>>> * Autogeneration of python bindings for GlusterD2 is being worked by
>>> Sidharth (https://github.com/sidharthanup) and available at:
>>>   -
>>> https://github.com/sidharthanup/GD2_API/blob/master/testgen/glusterapi_README.md
>>>
>>> The GD2 API python project will be merged into
>>> python-gluster-mgmt-client.
>>>
>>> * Ansible modules (WIP):
>>>- New module: Facts module for self-heal and rebalance. Devyani is
>>> working on these modules.
>>>  https://github.com/ansible/ansible/pull/45997 - self-heal
>>>- Remove brick feature for gluster_ansible module:
>>>  https://github.com/ansible/ansible/pull/38269
>>>
>>
>> Is there any work planned for dynamic inventory (
>> https://docs.ansible.com/ansible/2.5/dev_guide/developing_inventory.html)
>> ?
>> Peers, bricks and volumes are all good candidates for dynamic inventory.
>>
>>>
>>>
>>> * Sheersha and Nigel are working on continuous integration, and PR is at:
>>>- https://github.com/gluster/gluster-ansible-infra/pull/29
>>>- https://github.com/gluster/gluster-ansible-infra/pull/26
>>>
>>> The CI work is in progress and will be integrated soon. Which will help
>>> us to keep the repository
>>> in stable condition.
>>>
>>
>> I recommend running ansible-lint in CI. For example:
>> [ykaul@ykaul gluster-ansible-infra]$ find . -name "*.yml" |xargs
>> ansible-lint
>> Syntax Error while loading YAML.
>>   did not find expected ',' or '}'
>>
>>
> I remember Nigel and Sheersha running ansible-lint (or yaml-lint) on the
> roles. I think they run
> on the tasks directory. But it is good idea to run on all the yamls.
>
>
>
>> The error appears to have been in
>> '/home/ykaul/github/gluster-ansible-infra/examples/backend_with_vdo.yml':
>> line 22, column 8, but may
>> be elsewhere in the file depending on the exact syntax problem.
>>
>> The offending line appears to be:
>>
>>- {vgname: 'vg_sdb', thinpoolname: 'foo_thinpool', thinpoolsize:
>> '100G', poolmetadatasize: '16G'
>>- {vgname: 'vg_sdc', thinpoolname: 'bar_thinpool', thinpoolsize:
>> '500G', poolmetadatasize: '16G'
>>^ here
>> This one looks easy to fix.  It seems that there is a value started
>> with a quote, and the YAML parser is expecting to see the line ended
>> with the same kind of quote.  For instance:
>>
>>
> Yeah the closing brace `}' is missing. This is fixed in upstream by PR:
> https://github.com/gluster/gluster-ansible-infra/pull/28
>

The testing and CI work is being done role by role with molecule[1].
Molecule runs both yamllint and ansible-lint. It's been catching lint
errors in the two roles we've worked with so far.

[1]: https://molecule.readthedocs.io/
-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Jenkins maintenance

2018-09-28 Thread Nigel Babu
Hello folks,

I did a quick unplanned Jenkins maintenance today to upgrade 3 plugins with
security issues in them. This is now complete. There was a brief period
where we did not start new jobs until Jenkins restarted. There should have
been no interruption to existing jobs and no jobs should have been canceled.
Please file a bug if you notice something wrong post-upgrade.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Freebsd builder upgrade to 10.4, maybe 11

2018-09-11 Thread Nigel Babu
On Tue, Sep 11, 2018 at 7:06 PM Michael Scherer  wrote:

> And... rescue mode is not working. So the server is down until
> Rackspace fix it.
>
> Can someone disable the freebsd smoke test, as I think our 2nd builder
> is not yet building fine ?
>


Disabled. Please do not merge any JJB review requests until this is fixed.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Proposal to change Gerrit -> Bugzilla updates

2018-09-11 Thread Nigel Babu
On Mon, Sep 10, 2018 at 7:08 PM Shyam Ranganathan 
wrote:

> My assumption here is that for each patch that mentions a BZ, an
> additional tracker would be added to the tracker list, right?
>

Correct.


>
> Further assumption (as I have not used trackers before) is that this
> would reduce noise as comments in the bug itself, right?
>

There are two goals: one is to see all the patches posted right at the top
of the bug. The reduction of noise is a good bonus to have.


>
> In the past we have reduced noise by not commenting on the bug (or
> github issue) every time the patch changes, so we get 2 comments per
> patch currently, with the above change we would just get one and that
> too as a terse external reference (see [1], based on my
> test/understanding).
>
> What we would lose is the commit details when the patch is merged in the
> BZ, as far as I can tell based on the changes below. These are useful
> and would like these to be retained in case they are not.
>

I'm okay to do that.


>
> > 2. When a patch is merged, only change state of the bug if needed. If
> > there is no state change, do not add an additional message. The external
> > tracker state should change reflecting the state of the review.
>
> I added a tracker to this bug [1], but not seeing the tracker state
> correctly reflected in BZ, is this work that needs to be done?
>

Huh. That's odd. This works much better with other trackers. I'll follow up
with the Bugzilla folks to see how to chase this down. Ideally it should
show the state change (though with no notification on the bug, as far as I
can understand). Given that you're in favor of keeping the notification, at
this point all that's extra is that we'll add a tracker for every new bug.


>
> > 3. Assign the bug to the committer. This has edge cases, but it's best
> > to at least handle the easy ones and then figure out edge cases later.
> > The experience is going to be better than what it is right now.
>
> Is the above a reference to just the "assigned to", or overall process?
> If overall can you elaborate a little more on why this would be better
> (I am not saying it is not, attempting to understand how you see it).
>

It's a reference just to the "assigned to" field. I don't yet know who has
their Gerrit emails mapped to their Bugzilla IDs. Automatic assignment will
only work if that mapping is complete, so when we start doing this, it will
fail for a number of users. We'll have to tune this over time to build up a
mapping of Gerrit email -> Bugzilla email.


>
> >
> > Please provide feedback/comments by end of day Friday. I plan to add
> > this activity to the next Infra team sprint that starts on Monday (Sep
> 17).
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1619423
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Moving Jenkins alerts to a new list

2018-09-10 Thread Nigel Babu
Hello,

In an effort to make the devel and maintainers lists less noisy, I'm going
to move all the Jenkins-related alerts to a new list. This does not apply to
the alerts sent out for new releases. It is part of a longer-term plan to
better monitor build failures in Centos CI and the nightly regression
pipeline. I will start the migration next Monday (17 Sep).

If you're interested in watching these alerts, please subscribe to
ci-results[1].

[1]: https://lists.gluster.org/mailman/listinfo/ci-results

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Proposal to change Gerrit -> Bugzilla updates

2018-09-10 Thread Nigel Babu
Hello folks,

We now have review.gluster.org as an external tracker on Bugzilla. Our
current automation when there is a bugzilla attached to a patch is as
follows:

1. When a new patchset has "Fixes: bz#1234" or "Updates: bz#1234", we will
post a comment to the bug with a link to the patch and change the status to
POST (see the sketch below).
2. When the patchset is merged, if the commit said "Fixes", we move the
status to MODIFIED.
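
For illustration, this is the kind of reference the automation keys off; a
trivial way to pick it up from a commit looks like this (a sketch only, not
the actual hook code):

    # Print any bug reference in the footer of the latest commit message.
    git log -1 --format=%B | grep -oE '(Fixes|Updates): bz#[0-9]+'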

I'd like to propose the following improvements:
1. Add the Gerrit URL as an external tracker to the bug.
2. When a patch is merged, only change state of the bug if needed. If there
is no state change, do not add an additional message. The external tracker
state should change reflecting the state of the review.
3. Assign the bug to the committer. This has edge cases, but it's best to
at least handle the easy ones and then figure out edge cases later. The
experience is going to be better than what it is right now.

Please provide feedback/comments by end of day Friday. I plan to add this
activity to the next Infra team sprint that starts on Monday (Sep 17).

--
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Unplanned Jenkins Restart

2018-08-24 Thread Nigel Babu
Oops, big note: Centos Regression jobs may have ended up canceled. Please
retry them.

On Fri, Aug 24, 2018 at 9:31 PM Nigel Babu  wrote:

> Hello,
>
> We've had to do an unplanned Jenkins restart. Jenkins was overloaded and
> not responding to any requests. There was a backlog of over 100 jobs as
> well. The restart seems to have fixed things up.
>
> More details in bug: https://bugzilla.redhat.com/show_bug.cgi?id=1622173
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Jenkins Restart

2018-08-24 Thread Nigel Babu
Hello,

We've had to do an unplanned Jenkins restart. Jenkins was overloaded and
not responding to any requests. There was a backlog of over 100 jobs as
well. The restart seems to have fixed things up.

More details in bug: https://bugzilla.redhat.com/show_bug.cgi?id=1622173

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Urgent Gerrit reboot today

2018-08-23 Thread Nigel Babu
Hello folks,

We're going to do an urgent reboot of the Gerrit server in the next hour or
so. For some reason, hot-adding RAM to this machine isn't working, so we're
rebooting to get the additional RAM in place. This is needed to prevent the
OOM kill problems we've been running into since last night.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Fwd: [Ci-users] Maintenance Window 22-Aug-2018 12:00PM UTC

2018-08-21 Thread Nigel Babu
Heads up: Centos CI will be undergoing maintenance tomorrow.

-- Forwarded message -
From: Brian Stinson 
Date: Tue, Aug 21, 2018 at 1:58 AM
Subject: [Ci-users] Maintenance Window 22-Aug-2018 12:00PM UTC
To: 



Hi All,

Due to some pending OS updates we will be rebooting machines in the
CentOS CI Infrastructure starting at 12:00 Noon UTC on Wednesday
22-Aug-2018

We expect this to take up to 2 hours as we reboot various machines in
the CI Infrastructure. During this period we'll hold the master Jenkins
queue for resubmission after we complete this window.

Let us know here or in #centos-devel on Freenode if you have any
questions or comments.

Cheers!

--
Brian Stinson
CentOS CI Infrastructure Team
___
Ci-users mailing list
ci-us...@centos.org
https://lists.centos.org/mailman/listinfo/ci-users


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Access to Docker Hub Gluster organization

2018-08-14 Thread Nigel Babu
On Tue, Aug 14, 2018 at 5:52 PM Humble Chirammal 
wrote:

>
>
> On Tue, Aug 14, 2018 at 2:09 PM, Nigel Babu  wrote:
>
>> Hello folks,
>>
>> Do we know who's the admin of the Gluster organization on Docker hub? I'd
>> like to be added to the org so I can set up nightly builds for all the
>> GCS-related containers.
>>
>> I admin this repo and I can add you to the team.
>
>
I found Kaushal, who added me to the team :)


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Master branch is closed

2018-08-13 Thread Nigel Babu
Oops, I apparently forgot to send out a note. Master has been open since ~7
am IST.

On Mon, Aug 13, 2018 at 4:25 PM Atin Mukherjee  wrote:

> Nigel,
>
> Now that mater branch is reopened, can you please revoke the commit access
> restrictions?
>
> On Mon, 6 Aug 2018 at 09:12, Nigel Babu  wrote:
>
>> Hello folks,
>>
>> Master branch is now closed. Only a few people have commit access now and
>> it's to be exclusively used to merge fixes to make master stable again.
>>
>>
>> --
>> nigelb
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
> --
> - Atin (atinm)
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] ASAN Builds!

2018-08-10 Thread Nigel Babu
Hello folks,

Thanks to Niels, we now have ASAN builds compiling and a flag for getting it
to work locally. The patch[1] is not merged yet, but I can trigger runs off
the patch for now. The first run has been kicked off[2].

[1]: https://review.gluster.org/c/glusterfs/+/20589/2
[2]: https://build.gluster.org/job/asan/66/console
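
If you want to try ASAN locally before the patch lands, a generic autotools
build with the sanitizer enabled looks roughly like this (a sketch only; the
patch itself adds a dedicated configure flag, which may differ from the
plain compiler-flag route shown here):

    # Assumption: GCC or clang with libasan available on the build machine.
    ./autogen.sh
    CFLAGS='-g -O1 -fsanitize=address -fno-omit-frame-pointer' \
    LDFLAGS='-fsanitize=address' ./configure
    make -j"$(nproc)"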

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Python components and test coverage

2018-08-10 Thread Nigel Babu
Hello folks,

We're currently in a transition to python3. Right now, there's a bug in one
piece of this transition code; I saw Nithya run into it yesterday. The
challenge is that none of our testing for the python2/python3 transition
catches this bug. Neither Pylint nor the ast-based testing that Kaleb
recommended catches it. The bug itself is trivial and would take 2 minutes
to fix; the real problem is that until we exercise almost all of these code
paths from both Python 3 and Python 2, we're not going to find subtle
breakages like this.
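
As a trivial illustration of the class of problem (not the actual bug), here
is code that is valid in both versions, so neither Pylint nor an ast-based
check will flag it, yet only the runtime behaviour differs:

    python2 -c 'print(5 / 2)'   # prints 2   (integer division)
    python3 -c 'print(5 / 2)'   # prints 2.5 (true division)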

As far as I know, the three pieces where we use Python are geo-rep,
glusterfind, and libgfapi-python. My question:
* Are there more places where we run python?
* What sort of automated test coverage do we have for these components
right now?
* What can the CI team do to help identify problems? We have both Centos7
and Fedora28 builders, so we can definitely help run tests specific to
python.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Gerrit Upgrade Retrospective

2018-08-10 Thread Nigel Babu
Hello folks,

This is a quick retrospective we (the Infra team) did for the Gerrit
upgrade from 2 days ago.

## Went Well
* We had a full backup to fall back to, and we did have to fall back on it.
* We had a good 4-hour window, so we had time to make mistakes and recover
from them.
* We had a good number of tests as part of our upgrade steps. This helped us
catch a problem with the serviceuser plugin; we deleted the plugin to
overcome it.

## Went Badly
* This document did not capture that the serviceuser plugin also needs to
be upgraded.
* We made a mistake where we started the upgrade in the backup rather than
the main folder. We need to change our backup workflow so that this doesn't
happen in the future. This is an incredibly easy mistake to make.
* Git clones did not work. This was not part of our testing.
* cgit shows no repos. This was also not part of our testing.

## Future Recommendations
* [DONE] Set up proper documentation for the Gerrit upgrade workflow.
* We need to ensure that the engineer doing the upgrade does a staging
upgrade at least once or perhaps even twice to ensure the steps are
absolutely accurate.
* Gerrit stage consumes our ansible playbooks, but the sooner we can switch
master to this, the better. It catches problems we've already solved in the
past and automated away.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Clang failures update

2018-08-10 Thread Nigel Babu
Hello folks,

Based on Yaniv's feedback, I've removed the deadcode.DeadStores checker. We
are left with 161 failures, and I'm setting the target at 140 for now. The
job will continue to be yellow until we fix at least 21 failures by 31 Aug;
that's about 7 issues per week.

If anyone wants me to change the goal posts for this one, please let me
know.


If you want to run this on your local Fedora 27 machine, it should work
fine. If you want to run this on a Fedora 28 machine, you'll need to do a
little bit of a hack. Search for PYTHONDEV_CPPFLAGS in configure.ac and add
this line right below the existing line:

PYTHONDEV_CPPFLAGS=`echo ${PYTHONDEV_CPPFLAGS} | sed -e 's/-fcf-protection//g'`

Fedora 28 has GCC 8.0 and clang 7.0; this mismatch is the root cause of the
failure, and in a future version this should work without the need for the
hack.
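
For reference, a local run is essentially the clang analyzer wrapped around
a normal build, something like the following (assuming the scan-build
wrapper from the clang tooling is installed; the Jenkins job may invoke
clang differently):

    # Run configure and the build under the clang static analyzer.
    ./autogen.sh && scan-build ./configure
    scan-build -o /tmp/clang-scan make -j"$(nproc)"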

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Spurious smoke failure in build rpms

2018-08-09 Thread Nigel Babu
Infra issue. Please file a bug.

On Thu, Aug 9, 2018 at 3:57 PM Pranith Kumar Karampuri 
wrote:

> https://build.gluster.org/job/devrpm-el7/10441/console
>
> 10:12:42 Wrote: /home/jenkins/root/workspace/devrpm-el7/extras/LinuxRPM/rpmbuild/SRPMS/glusterfs-4.2dev-0.240.git4657137.el7.src.rpm
> 10:12:42 mv rpmbuild/SRPMS/* .
> 10:12:44 INFO: mock.py version 1.4.11 starting (python version = 2.7.5)...
> 10:12:44 Start: init plugins
> 10:12:44 INFO: selinux disabled
> 10:12:44 Finish: init plugins
> 10:12:44 Start: run
> 10:12:44 INFO: Start(glusterfs-4.2dev-0.240.git4657137.el7.src.rpm)  Config(epel-7-x86_64)
> 10:12:44 Start: clean chroot
> 10:12:44 ERROR: Exception(glusterfs-4.2dev-0.240.git4657137.el7.src.rpm) Config(epel-7-x86_64) 0 minutes 0 seconds
>
>
> I am not sure why it is saying exception for the src.rpm and failing, does
> anyone know?
>
>
> --
> Pranith
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Post-upgrade issues

2018-08-08 Thread Nigel Babu
Hello folks,

We have two post-upgrade issues:

1. Jenkins jobs are failing because git clones fail. This is now fixed.
2. git.gluster.org shows no repos at the moment. I'm currently debugging
this.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Fwd: Gerrit downtime on Aug 8, 2016

2018-08-08 Thread Nigel Babu
On Wed, Aug 8, 2018 at 4:59 PM Yaniv Kaul  wrote:

>
> Nice, thanks!
> I'm trying out the new UI. Needs getting used to, I guess.
> Have we upgraded to NotesDB?
>

Yep! Account information is now completely in NoteDB and not in ReviewDB
(which is backed by PostgreSQL for us) anymore.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Master branch lock down status

2018-08-08 Thread Nigel Babu
On Wed, Aug 8, 2018 at 2:00 PM Ravishankar N  wrote:

>
> On 08/08/2018 05:07 AM, Shyam Ranganathan wrote:
> > 5) Current test failures
> > We still have the following tests failing and some without any RCA or
> > attention, (If something is incorrect, write back).
> >
> > ./tests/basic/afr/add-brick-self-heal.t (needs attention)
>  From the runs captured at https://review.gluster.org/#/c/20637/ , I saw
> that the latest runs where this particular .t failed were at
> https://build.gluster.org/job/line-coverage/415 and
> https://build.gluster.org/job/line-coverage/421/.
> In both of these runs, there are no gluster 'regression' logs available
> at https://build.gluster.org/job/line-coverage//artifact.
> I have raised BZ 1613721 for it.
>

We've fixed this for newer runs, but we can do nothing for older runs,
sadly.


>
> Also, Shyam was saying that in case of retries, the old (failure) logs
> get overwritten by the retries which are successful. Can we disable
> re-trying the .ts when they fail just for this lock down period alone so
> that we do have the logs?


Please don't apply a band-aid. Fix run-tests.sh so that the second run has a
-retry suffix attached to the log file name, or some such.
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Fwd: Gerrit downtime on Aug 8, 2016

2018-08-07 Thread Nigel Babu
Reminder, this upgrade is tomorrow.

-- Forwarded message -
From: Nigel Babu 
Date: Fri, Jul 27, 2018 at 5:28 PM
Subject: Gerrit downtime on Aug 8, 2016
To: gluster-devel 
Cc: gluster-infra , <
automated-test...@gluster.org>


Hello,

It's been a while since we upgraded Gerrit. We plan to do a full upgrade
and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
interface which brings significant frontend changes. You can take a look at
how this would look on the staging site[1].

## Outage Window
0330 EDT to 0730 EDT
0730 UTC to 1130 UTC
1300 IST to 1700 IST

The actual time needed for the upgrade is about an hour, but we want to keep
a larger window open to roll back in the event of any problems during the
upgrade.

-- 
nigelb


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] New Coverity Scan

2018-08-06 Thread Nigel Babu
Hello folks,

We've run a new Coverity run that was entirely automated. Current split of
Coverity issues:
High: 132
Medium: 241
Low: 83
Total: 456

We will be pushing a nightly build into scan.coverity.com via Jenkins. So,
you should be able to see updates to these numbers as you merge in fixes.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Master branch is closed

2018-08-05 Thread Nigel Babu
Hello folks,

Master branch is now closed. Only a few people have commit access now and
it's to be exclusively used to merge fixes to make master stable again.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Nigel Babu
On Thu, Aug 2, 2018 at 5:12 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Don't know, something to do with perf xlators I suppose. It's not
> repdroduced on my local system with brick-mux enabled as well. But it's
> happening on Xavis' system.
>
> Xavi,
> Could you try with the patch [1] and let me know whether it fixes the
> issue.
>
> [1] https://review.gluster.org/#/c/20619/1
>

If you cannot reproduce it on your laptop, why don't you request a machine
from softserve[1] and try it out?

[1]:
https://github.com/gluster/softserve/wiki/Running-Regressions-on-clean-Centos-7-machine

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] FreeBSD smoke test may fail for older changes, rebase needed

2018-08-02 Thread Nigel Babu
> That is fine with me. It is prepared for GlusterFS 5, so nothing needs
> to be done for that. Only for 4.1 and 3.12 FreeBSD needs to be disabled
> from the smoke job(s).
>
> I could not find the repo that contains the smoke job, otherwise I would
> have tried to send a PR.
>
> Niels
>

For future reference, any "production" job that's on build.gluster.org will
have a corresponding job on build-jobs[1] on review.gluster.org. This has
been announced in the past and non-CI team members have sent us patches and
new jobs. There may be some jobs that do not have a corresponding yml file;
this is most likely because they're WIP or not production-ready.

[1] http://git.gluster.org/cgit/build-jobs.git/

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] bug-1432542-mpx-restart-crash.t failures

2018-08-01 Thread Nigel Babu
Hi Shyam,

Amar and I sat down to debug this failure[1] this morning. There was a bit
of fun looking at the logs. It looked like the test restarted itself. The
first log entry is at 16:20:03. This test has a timeout of 400 seconds
which is around 16:26:43.

However, if you account for the fact that we log from the second step or so,
it looks like the test timed out and was restarted; the first log entry
being from a few steps in makes sense. I think your patch[2] to increase the
timeout to 800 seconds is the right way forward.

The last step before the timeout is this
[2018-07-30 16:26:29.160943]  : volume stop patchy-vol17 : SUCCESS
[2018-07-30 16:26:40.222688]  : volume delete patchy-vol17 : SUCCESS

There are 20 volumes, so it really needs at least a 90-second bump; I'm
estimating 30 seconds per volume to clean up. You probably want to add some
extra time so it passes on lcov as well. So right now the 800-second timeout
for cleanup looks good.

[1]: https://build.gluster.org/job/regression-test-burn-in/4051/
[2]: https://review.gluster.org/#/c/20568/2
-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] FreeBSD smoke test may fail for older changes, rebase needed

2018-07-30 Thread Nigel Babu
>
> The outcome is to get existing maintained release branches building and
> working on FreeBSD, would that be correct?
>
> If so I think we can use the cherry-picked version, the changes seem
> mostly straight forward, and it is possibly easier to maintain.
>
> Although, I have to ask, what is the downside of not taking it in at
> all? If it is just FreeBSD, then can we live with the same till release-
> is out?
>
> Finally, thanks for checking as the patch is not a simple bug-fix backport.
>
>

We also have the option of turning off FreeBSD builds for previous release
branches, if you choose not to take the patches into those branches.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Gerrit downtime on Aug 8, 2016

2018-07-28 Thread Nigel Babu
FYI: there is an issue with seeing diffs on staging. I've root-caused this
to a bug in our Apache configuration for Gerrit. It's trickier than I want
to handle at the moment, but I'm aware of the problem and have tested out a
fix. We'll fix it more permanently in Ansible on Monday; my current fix will
get overwritten by Ansible tonight :)

On Fri, Jul 27, 2018 at 5:28 PM Nigel Babu  wrote:

> Hello,
>
> It's been a while since we upgraded Gerrit. We plan to do a full upgrade
> and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
> interface which brings significant frontend changes. You can take a look at
> how this would look on the staging site[1].
>
> ## Outage Window
> 0330 EDT to 0730 EDT
> 0730 UTC to 1130 UTC
> 1300 IST to 1700 IST
>
> The actual time needed for the upgrade is about than hour, but we want to
> keep a larger window open to rollback in the event of any problems during
> the upgrade.
>
> --
> nigelb
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] [automated-testing] Gerrit downtime on Aug 8, 2016

2018-07-27 Thread Nigel Babu
Ah, apologies.

Staging URL: http://gerrit-stage.rht.gluster.org/

If you want to try out PolyGerrit, the new UI, click on the footer of the
page that says "Switch to new UI".

On Fri, Jul 27, 2018 at 5:46 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> The staging URL seems to be missing from the note
>>
>> On Fri, Jul 27, 2018 at 5:28 PM, Nigel Babu  wrote:
>> > Hello,
>> >
>> > It's been a while since we upgraded Gerrit. We plan to do a full
>> upgrade and
>> > move to 2.15.3. Among other changes, this brings in the new PolyGerrit
>> > interface which brings significant frontend changes. You can take a
>> look at
>> > how this would look on the staging site[1].
>> >
>> > ## Outage Window
>> > 0330 EDT to 0730 EDT
>> > 0730 UTC to 1130 UTC
>> > 1300 IST to 1700 IST
>> >
>> > The actual time needed for the upgrade is about than hour, but we want
>> to
>> > keep a larger window open to rollback in the event of any problems
>> during
>> > the upgrade.
>> >
>>
>
-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Gerrit downtime on Aug 8, 2016

2018-07-27 Thread Nigel Babu
Hello,

It's been a while since we upgraded Gerrit. We plan to do a full upgrade
and move to 2.15.3. Among other changes, this brings in the new PolyGerrit
interface which brings significant frontend changes. You can take a look at
how this would look on the staging site[1].

## Outage Window
0330 EDT to 0730 EDT
0730 UTC to 1130 UTC
1300 IST to 1700 IST

The actual time needed for the upgrade is about an hour, but we want to keep
a larger window open to roll back in the event of any problems during the
upgrade.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Release 5: Master branch health report (Week of 23rd July)

2018-07-25 Thread Nigel Babu
Replies inline

On Thu, Jul 26, 2018 at 1:48 AM Shyam Ranganathan 
wrote:

> On 07/24/2018 03:28 PM, Shyam Ranganathan wrote:
> > On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> >> 1) master branch health checks (weekly, till branching)
> >>   - Expect every Monday a status update on various tests runs
> >
> > See https://build.gluster.org/job/nightly-master/ for a report on
> > various nightly and periodic jobs on master.
> >
> > RED:
> > 1. Nightly regression
> > 2. Regression with multiplex (cores and test failures)
> > 3. line-coverage (cores and test failures)
>
> The failures for line coverage issues, are filed as the following BZs
> 1) Parent BZ for nightly line coverage failure:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608564
>
> 2) glusterd crash in test sdfs-sanity.t:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608566
>
> glusterd folks, request you to take a look to correct this.
>
> 3) bug-1432542-mpx-restart-crash.t times out consistently:
> https://bugzilla.redhat.com/show_bug.cgi?id=1608568
>
> @nigel is there a way to on-demand request lcov tests through gerrit? I
> am thinking of pushing a patch that increases the timeout and check if
> it solves the problem for this test as detailed in the bug.
>

You should have access to trigger the job from Jenkins. Does that work for
now?


>
> >
> > Calling out to contributors to take a look at various failures, and post
> > the same as bugs AND to the lists (so that duplication is avoided) to
> > get this to a GREEN status.
> >
> > GREEN:
> > 1. cpp-check
> > 2. RPM builds
> >
> > IGNORE (for now):
> > 1. clang scan (@nigel, this job requires clang warnings to be fixed to
> > go green, right?)
>

So there are two ways. Back when I first ran it, I set a limit on how many
clang failures we have; if we went above that number, the job would turn
yellow. The current threshold is 955 and we're at 1001. What would be useful
is for us to fix a few bugs a week and keep bumping this limit down.


> >
> > Shyam
> > ___
> > Gluster-devel mailing list
> > Gluster-devel@gluster.org
> > https://lists.gluster.org/mailman/listinfo/gluster-devel
> >
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
On Wed, Jul 25, 2018 at 6:51 PM Niels de Vos  wrote:

> We had someone working on starting/stopping Jenkins slaves in Rackspace
> on-demand. He since has left Red Hat and I do not think the infra team
> had a great interest in this either (with the move out of Rackspace).
>
> It can be deleted from my point of view.
>

FYI, stopping a cloud server does not mean we don't get charged for it. So
I don't know if it was a useful exercise to begin with.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
> So while cleaning thing up, I wonder if we can remove this one:
> https://github.com/gluster/jenkins-ssh-slaves-plugin
>
> We have just a fork, lagging from upstream and I am sure we do not use
> it.
>

Safe to delete. We're not using it for sure.


>
> The same goes for:
> https://github.com/gluster/devstack-plugins
>
> since I think openstack did change a lot, that seems like some internal
>  configuration for dev, I guess we can remove it ?
>

This one seems ahead of the original fork, but I'd say delete.


>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
>

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Github teams/repo cleanup

2018-07-25 Thread Nigel Babu
I think our team structure on Github has become unruly. I prefer that we
use teams only when we can demonstrate that there is a strong need. At the
moment, the gluster-maintainers and the glusterd2 projects have teams that
have a strong need. If any other repo has a strong need for teams, please
speak up. Otherwise, I suggest we delete the teams and add the relevant
people as collaborators on the project.

It should be safe to delete the gerrit-hooks repo; those are now GitHub
jobs. I'm not in favor of archiving the old projects if they're going to be
hidden from someone looking for them. If they just move to the end of the
listing, it's fine to archive.

On Fri, Jun 29, 2018 at 10:26 PM Michael Scherer 
wrote:

> Le vendredi 29 juin 2018 à 14:40 +0200, Michael Scherer a écrit :
> > Hi,
> >
> > So, after Gentoo hack, I started to look at all our teams on github,
> > and what access does everybody have, etc, etc
> >
> > And I have a few issues:
> > - we have old repositories that are no longer used
> > - we have team without description
> > - we have people without 2FA who are admins of some team
> > - github make this kind of audit really difficult without scripting
> > (and the API is not stable yet for teams)
> >
> > So I would propose the following rules, and apply them in 1 or 2
> > weeks
> > time.
> >
> > For projects:
> >
> > - archives all old projects, aka, ones that got no commit since 2
> > years, unless people give a reason for the project to stay
> > unarchived.
> > Being archived do not remove it, it just hide it by default and set
> > it
> > readonly. It can be reverted without trouble.
> >
> > See https://help.github.com/articles/archiving-a-github-repository/
> >
> > - remove project who never started ("vagrant" is one example, there
> > is
> > only one readme file).
> >
> > For teams:
> > - if you are admin of a team, you have to turn on 2FA on your
> > account.
> > - if you are admin of the github org, you have to turn 2FA.
> >
> > - if a team no longer have a purpose (for example, all repos got
> > archived or removed), it will be removed.
> >
> > - add a description in every team, that tell what kind of access does
> > it give.
> >
> >
> > This would permit to get a bit more clarity and security.
>
> So to get some perspective after writing a script to get the
> information, the repos I propose to archive:
>
> Older than 3 years, we have:
>
> - gmc-target
> - gmc
> - swiftkrbauth
> - devstack-plugins
> - forge
> - glupy
> - glusterfs-rackspace-regression-tester
> - jenkins-ssh-slaves-plugin
> - glusterfsiostat
>
>
> Older than 2 years, we have:
> - nagios-server-addons
> - gluster-nagios-common
> - gluster-nagios-addons
> - mod_proxy_gluster
> - gluster-tutorial
> - gerrit-hooks
> - distaf
> - libgfapi-java-io
>
> And to remove, because empty:
> - vagrant
> - bigdata
> - gluster-manila
>
>
> Once they are archived, I will take care of the code for finding teams
> to remove.
>
> --
> Michael Scherer
> Sysadmin, Community Infrastructure and Platform, OSAS
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Postmortem for Jenkins Outage on 20/07/18

2018-07-20 Thread Nigel Babu
Hello folks,

I had to take down Jenkins for some time today. The server ran out of space
and was silently ignoring Gerrit requests for new jobs. If you think one of
your jobs needed a smoke or regression run and it wasn't triggered, this is
the root cause. Please retrigger your jobs.

## Summary of Impact
Jenkins jobs were intermittently not triggered over the last couple of days.
At the moment, we do not have numbers on how many developers were affected.
The rotation rules we have in place mitigated this slightly every day, so
issues mostly surfaced around evening IST when we retrigger our regular
nightly jobs.

## Timeline of Events.
July 19 evening: I noticed that occasionally Jenkins would not trigger a job
for a push; this was on the build-jobs repo. I chalked it up to a signal
getting lost in the noise and decided to debug it later. I could trigger the
job manually, so I put it down as a thing to do in the morning. The next
morning, I found that jobs were getting triggered as they should and could
not notice anything untoward.

July 20 6:41 pm: Kotresh pinged me asking if there was a problem. I could
see the problem I noticed yesterday in his job. This time a manual trigger
did not work. Around the same time Raghavendra Gowdappa also hit the same
problem. I logged into the server and noticed that the Jenkins partition was
out of space.

July 20 7:40 pm: Jenkins is back online completely. A retrigger of the two
failing jobs has been successful.

## Root Cause
* Out of disk space on the Jenkins partition on build.gluster.org
* The bugzilla-post job did not delete old runs, and we had about 7000 of
them in there consuming about 20G of space.
* The clang-scan job consumes about 1G per run and we were storing about 30
days' worth of archives.

## Resolution
* All centos6-regression jobs are now deleted. We moved over to
centos7-regression a while ago.
* We now only store 7 days of archives for bugzilla-post and clang-scan jobs

## Future Recommendation
* Our monitoring did not alert us about the disk filling up on the Jenkins
node. Ideally, we should have gotten a warning when we were at least 90%
full so we could plan for additional capacity or look for mistakes in
patterns (a trivial sketch of such a check follows below).
* All jobs need to have a property that discards old runs, with a maximum of
90 days kept in case it's absolutely needed. This is currently not enforced
by CI, but we plan to enforce it in the future.
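
For the first recommendation, the missing check is as simple as something
like this (the path and threshold are illustrative only; the real alerting
belongs in our monitoring stack):

    # Warn when the partition holding Jenkins data crosses 90% usage.
    usage=$(df --output=pcent /var/lib/jenkins | tail -n1 | tr -dc '0-9')
    [ "$usage" -ge 90 ] && echo "WARNING: Jenkins partition is ${usage}% full"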

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] [FYI] GitHub connectivity issues

2018-07-20 Thread Nigel Babu
Hello folks,

Our infra also runs in the same network, so if you notice issues with our
services, they're most likely related to the same network problems described
below.

-- Forwarded message -
From: Fabian Arrotin 
Date: Fri, Jul 20, 2018 at 12:49 PM
Subject: [Ci-users] [FYI] GitHub connectivity issue
To: ci-us...@centos.org 


Hi,

Just to let all Projects using CI that our monitoring complained a lot
about "flapping" connectivity to some external nodes, including github.com.
After some investigations, it seems that there are some peering issues
(or routing) at Level3, but errors come and go.

We can't do anything but report internally and see if then error can be
reported "upstream" at the link provider level.

So this message is more about the fact that if your tests are having
issues with some external connectivity, that can be related to that issue.

-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 56BEC54E | twitter: @arrfab

___
Ci-users mailing list
ci-us...@centos.org
https://lists.centos.org/mailman/listinfo/ci-users


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Re-thinking gluster regression logging

2018-07-02 Thread Nigel Babu
Hello folks,

Deepshikha is working on getting the distributed-regression testing into
production. This is a good time to discuss how we log our regression runs.
We tend to go with the approach of "get as many logs as possible" and then
try to make sense of them when something fails.

In a setup where we distribute the tests to 10 machines, that means
fetching runs from 10 machines and trying to make sense of it. Granted, the
number of files will most likely remain the same since a successful test is
only run once, but a failed test is re-attempted two more times on
different machines. So we will now have duplicates.

I have a couple of suggestions and I'd like to see what people think.
1. We stop doing a tar of tars for the logs and just tar the
/var/log/glusterfs folder at the end of the run. That will probably achieve
better compression (see the sketch after this list).
2. We could stream the logs to a service like ELK that we host. This means
no more tarballs. It also lets us test any logging improvements we plan to
make for Gluster in one place.
3. I've been looking at Citellus[1] to write parsers that help us identify
critical problems. This could be a way for us to build a repo of parsers
that can identify common gluster issues.
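
A minimal sketch of what suggestion 1 amounts to on each test node (the
output naming is illustrative; BUILD_ID is the usual Jenkins variable):

    # One compressed archive per node, straight from the log directory.
    tar -czf "glusterfs-logs-$(hostname)-${BUILD_ID:-manual}.tar.gz" /var/log/glusterfs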

Perhaps our solution will be a mix of 2 and 3. Ideally, I'd like us to avoid
having to archive tarballs to debug regression issues in the future.

[1]: https://github.com/citellusorg/citellus

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Fwd: Clang-format: Update

2018-06-28 Thread Nigel Babu
Hello folks,

A while ago we talked about using clang-format for our codebase[1]. We
started doing several pieces of this work asynchronously. Here's an update
on the current state of affairs:

* Team agrees on a style and a config file representing the style.
This has been happening asynchronously on Github[2]. Amar, Xavi, and Jeff
-- Can we close out this discussion and have a config file in 2 weeks? If
anyone feels strongly about coding style, please participate in the
discussion now.

* Commit the coding style guide to the codebase and make changes in rfc.sh
to use it.
This is waiting on the first step. I can do it once the discussion is
finalized.

* gluster-ant commits a single large patch for the whole codebase with a
standard clang-format style.
This is waiting on the first two steps and should be trivial to accomplish.
I have access to the gluster-ant account and I can make the necessary
changes.

* Have the job ready to check the patch with the config file; on the server
side, this should be a voting job in smoke.
The server-side Jenkins job is now ready[3]. The client-side rfc.sh patch is
next, but merging that change will wait on the config file being ready (a
rough sketch of the check follows below).
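
As a sketch of what the rfc.sh/smoke check could boil down to once the
config file is in the repo (assuming a .clang-format file at the root; the
final integration may look different):

    # Re-format the files touched by the latest commit and fail if anything changes.
    files=$(git diff --name-only HEAD~1 -- '*.c' '*.h')
    [ -n "$files" ] && clang-format -i $files
    git diff --exit-code || { echo 'Please run clang-format on your changes.'; exit 1; }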

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1564149#c33
[2]: https://github.com/nigelbabu/clang-format-sample/
[3]: https://review.gluster.org/#/c/20418/

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] POC- Distributed regression testing framework

2018-06-25 Thread Nigel Babu
On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi  wrote:

>
>
> There are currently a few known issues:
>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
>>
>
> If I look at the activities involved with regression failures, this can
> wait.
>

Well, we can't debug the current failures without having the logs. So this
has to be fixed first.


>
>
>> * A few tests fail due to infra-related issues like geo-rep tests.
>>
>
> Please open bugs for this, so we can track them, and take it to closure.
>

These are failing due to infra reasons. Most likely subtle differences in
the setup of these nodes vs our normal nodes. We'll only be able to debug
them once we get the logs. I know the geo-rep ones are easy to fix. The
playbook for setting up geo-rep correctly just didn't make it over to the
playbook used for these images.


>
>
>> * Takes ~80 minutes with 7 distributed servers (targetting 60 minutes)
>>
>
> Time can change with more tests added, and also please plan to have number
> of server as 1 to n.
>

While n is configurable, it will be fixed to a single-digit
number for now. We will need to place *some* limitation somewhere or else
we'll end up not being able to control our cloud bills.


>
>
>> * We've only tested plain regressions. ASAN and Valgrind are currently
>> untested.
>>
>
> Great to have it running not 'per patch', but as nightly, or weekly to
> start with.
>

This is currently not targeted until we phase out current regressions.


>
>> Before bringing it into production, we'll run this job nightly and
>> watch it for a month to debug the other failures.
>>
>>
> I would say, bring it to production sooner, say 2 weeks, and also plan to
> have the current regression as is with a special command like 'run
> regression in-one-machine' in gerrit (or something similar) with voting
> rights, so we can fall back to this method if something is broken in
> parallel testing.
>
> I have seen that regardless of amount of time we put some scripts in
> testing, the day we move to production, some thing would be broken. So, let
> that happen earlier than later, so it would help next release branching
> out. Don't want to be stuck for branching due to infra failures.
>

Having two regression jobs that can vote is going to cause more confusion
than it's worth. There are a couple of intermittent memory issues with the
test script that we need to debug and fix before I'm comfortable in making
this job a voting job. We've worked around these problems right now, but
they still pop up now and again. The fact that things break often is not an
excuse to skip preventing avoidable failures. The one-month timeline was
chosen with all these factors taken into consideration. The 2-week timeline
is a no-go at this point.

When we are ready to make the switch, we won't be switching 100% of the
job. We'll start with a sliding scale so that we can monitor failures and
machine creation adequately.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Fedora builds and rawhide builds

2018-06-19 Thread Nigel Babu
Hello,

We ran into a problem where builds for F28 and above will not work in
CentOS 7 chroots. We caught this when F28 was rawhide but deemed it not yet
important enough to fix; however, recent developments have forced us to
make the switch. Our Fedora builds will also switch to using F28.

We have 10 new builders builder{40..49}.int.rht.gluster.org, all of which
run F28. These will be currently used for Fedora builds (they build with
libtirpc and rpcgen) and for the upcoming clang-format jobs.

Please let us know if you notice anything wrong the voting patterns for
smoke jobs.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Running regressions with GD2

2018-06-01 Thread Nigel Babu
Hello,

We're nearly at the 4.1 release. I think now is the time to decide when to
flip the switch to default to the GD2 server for all regressions, or to add
a nightly GD2 run alongside the current regression.

Can someone help with what tasks need to be done for this to be
accomplished and how the CI team can help? I'd like to set a deadline for
this, so the task list will help me pick a good date.

I'm happy to help in terms of Infra/CI in any way I can for this effort.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Cleaning up artifacts.ci.centos.org/gluster

2018-05-27 Thread Nigel Babu
Hello folks,

I'd like to propose that we clean up artifacts.ci.centos.org/gluster.
Here's my proposal:

1. Nightly folder will only have rpms from pre-release versions. That is,
I'll be deleting everything that's not 4.1 or 4.2.
2. Releases that are no longer actively supported will be deleted.

This makes them easier to browse and removes the clutter. I plan to make
this change next Monday. So please let me know if you have any problems
with this.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Reminder OUTAGE Today 0800 EDT / 1200 UTC / 1730 IST

2018-05-14 Thread Nigel Babu
Hello,

This is a reminder that we have an outage today during the community cage
outage window. The switches and routers will be getting updated and
rebooted. This will cause an outage for a short period of time.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Builds failing regularly in various jobs (Jenkins, Smoke, other)

2018-05-14 Thread Nigel Babu
Merging this in and deploying on builders based on Kotresh's +1 to unblock
builds and merges.

On Mon, May 14, 2018 at 9:49 AM, Nigel Babu <nig...@redhat.com> wrote:

> This is because of a new warning by liblvm2app. I have a hacky fix to the
> compilation process to get rid of the warning. Please review:
> https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/130
>
However, this will soon become more than just a warning. We should either
fix this or completely get rid of the bd xlator unless someone steps up to own it.
>
> On Mon, May 14, 2018 at 2:51 AM, Shyam Ranganathan <srang...@redhat.com>
> wrote:
>
>> Hi,
>>
>> The builds are failing due to the following,
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=1577668
>>
>> Nigel/Misc, can you take a stab at finding out what is new that is
>> causing these warnings to appear?
>>
>> Failing a quick resolution there, anyone from maintainers, could you
>> look at supressing the warnings or fixing them? (bd has no assigned
>> owners)
>>
>> This is important, as quite a few patches are not getting through.
>>
>> Shyam
>> P.S: The bug is marked as a release blocker as well!
>>
>
>
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Builds failing regularly in various jobs (Jenkins, Smoke, other)

2018-05-13 Thread Nigel Babu
This is because of a new warning by liblvm2app. I have a hacky fix to the
compilation process to get rid of the warning. Please review:
https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/130

However, this will soon become more than just a warning. We should either
fix this or completely get rid of the bd xlator unless someone steps up to own it.

On Mon, May 14, 2018 at 2:51 AM, Shyam Ranganathan 
wrote:

> Hi,
>
> The builds are failing due to the following,
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1577668
>
> Nigel/Misc, can you take a stab at finding out what is new that is
> causing these warnings to appear?
>
> Failing a quick resolution there, anyone from maintainers, could you
> look at supressing the warnings or fixing them? (bd has no assigned owners)
>
> This is important, as quite a few patches are not getting through.
>
> Shyam
> P.S: The bug is marked as a release blocker as well!
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Coding Standard: Automation

2018-04-23 Thread Nigel Babu
I hope I've correctly made the changes that Jeff recommended in the first
comment[1]. Xavi, I've not pulled in any of your suggestions yet, because
I figured you'd want to see the output and send suggestions.

Please send pull requests to the .clang-format file (and only that file)
for anything I've missed or anything you think needs changing. I'll do the
re-generation so we're not stuck with merge conflicts.

[1]:
https://github.com/nigelbabu/clang-format-sample/commit/733394939034ff9baaf579e7f327bdd078f204ef
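
For anyone who wants to experiment locally, the options being discussed in
this thread would translate into a .clang-format roughly like the sketch
below. This is only an illustration to play with; the actual agreed file is
whatever lands in the sample repo and, eventually, in the glusterfs tree:

    # Illustrative .clang-format sketch based on the options discussed; not the final file.
    BasedOnStyle: Mozilla
    IndentWidth: 4
    TabWidth: 8
    UseTab: Never
    ColumnLimit: 80
    PointerAlignment: Right
    AllowShortIfStatementsOnASingleLine: false
    AllowShortLoopsOnASingleLine: false
    AlignConsecutiveAssignments: true
    AlignConsecutiveDeclarations: true
    BreakBeforeBraces: Custom
    BraceWrapping:
      AfterFunction: true
      AfterControlStatement: false
      BeforeElse: false

You can run it over a file with something like
clang-format -style=file -i <some-file.c> to see how the result looks.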

On Mon, Apr 23, 2018 at 12:58 PM,  wrote:

> Planning to postpone this meeting, and idea is to work in more
> collaborated way off-line, instead of being in a meeting. we believe it
> would give everyone (those who didn't attend) too a fair chance to submit
> their opinion.
>
> For now, we will continue with Nigel's clang-format repo for this to
> experiment with different options. [https://github.com/nigelbabu/
> clang-format-sample]
>
> The plan on this is to go with a sample gluster file, which would have
> complex macros, a call with STACK_WIND/UNWIND function call. A switch case,
> a for loop, a do/while loop. Also a list_for_each loop. Have locked region.
> A sample 4-5 level depth if/else checks, etc.
>
> With this sample file, having the .clang-format decided as either Chromium
> or Mozilla as a base (with IndentSize set to 4 space), would be a good
> start. We will also make sure to have all the agreed points in bugzilla,
> and add it to clang-format file, and also regenerate the sample file. So,
> everyone gets an idea how the target file would look like. If everyone
> agrees, by the end of the week, we will have an agreement, so we can go
> ahead and make this possible before 4.1 release branching. (So, our
> backport efforts will be reduced drastically).
>
> -Amar
>
> Coding Standard: Automation
> BJ: https://bluejeans.com/205933580
> 
>
> We will talk and come to agreement on https://bugzilla.redhat.
> com/show_bug.cgi?id=1564149
> 
>
> It was agreed that we will go ahead with format change automation, so,
> goal of this meeting is to pick the right options.
>
> Goal is to get gluster's own `.clang-format` file. Once that file is
> agreed upon, we will go ahead and create a job for fixing the patches for
> format, and also fix the codebase to get the formats.
>
> Pre-work if you are interested, read about : https://clang.llvm.org/docs/
> ClangFormatStyleOptions.html
> 
>
> Also pick a gluster file which would pass through agreed format, so you
> can validate how it looks after formatting. Instead of waiting for this to
> happen, we can see is this good enough?
>
> Few things we mostly agree:
>
> !AllowShortIfStatementsOnASingleLine
> !AllowShortLoopsOnASingleLine
> BraceWrapping(!AfterControlStatement)
> BraceWrapping(AfterFunction)
> BraceWrapping(!BeforeElse)
> ColumnLimit(80)
> IndentWidth(4)
> PointerAlignment(PAS_Right)
> SpaceBeforeParens(SBPO_Always)
> TabWidth(8)
> UseTab(UT_Never)
> BinPackParameters=true
> AlignEscapedNewLinesLeft=false
> AlignConsecutiveDeclarations=true
> AlignConsecutiveAssignments=true
> AlwaysBreakAfterReturnType=true
>
> More options which we can discuss:
>
> !IndentCaseLabels
> SpaceBeforeParens=ControlStatements
>
>
>
> I propose two steps for preserving history:
>
> * The commit before the mass-format-change commit will be maintained as a
> separate branch. (No cost in space, but everyone clearly knows where to go
> for history when git blame points to the commit of mass changes.)
> * Similarly, to get the pre-2009 history (currently the 'historic' repo), I
> personally feel moving
> https://github.com/amarts/glusterfs/commits/git-based-history-from-historic
> as a separate branch in gluster/glusterfs would help. Again, today people
> have to switch repositories for this.
> *When*
> Mon Apr 23, 2018 6pm – 6:50pm India Standard Time
>
> *Who*
> •
> atumb...@redhat.com - organizer
> •
> j...@pl.atyp.us
> •
> nb...@redhat.com
> •
> srang...@redhat.com
> •
> gluster-devel@gluster.org
> •
> jaher...@redhat.com
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] trash.t failure

2018-04-17 Thread Nigel Babu
I've reverted the original patch entirely. Our policy is to either mark the
test as bad or revert the entire patch. This seems to have caused multiple
failures in the test system, so I've reverted the entire patch. Please
re-land the patch with any fixes as a fresh review.

On Wed, Apr 18, 2018 at 8:25 AM, Atin Mukherjee  wrote:

> commit d206fab73f6815c927a84171ee9361c9b31557b1
> Author: Kinglong Mee 
> Date:   Mon Apr 9 08:33:51 2018 -0400
>
> storage/posix: add pgfid in readdirp if needed
>
> Change-Id: I6745428fd9d4e402bf2cad52cee8ab46b7fd822f
> fixes: bz#1560319
> Signed-off-by: Kinglong Mee 
>
>
> The above commit has caused (thanks to Amar for bisect!) trash.t test in
> upstream CI to fail very frequently. As per fstat.gluster.org (refer :
> https://bit.ly/2qGcSP6) this test has failed 17 times in master branch in
> last 4 days. Given we're nearing GlusterFS 4.1 branching and there're few
> important patches blocked in the regression pipeline queue, I've sent a
> patch https://review.gluster.org/19894  to mark trash.t as bad for now as
> a temporary arrangement.
>
> I request Kinglong and the owner of trash feature to debug this issue and
> send a fix which can revert back my change.
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Regression with brick multiplex on demand

2018-04-17 Thread Nigel Babu
Hello folks,

In the past if you had a patch that was fixing a brick multiplex failure,
you couldn't test whether it actually fixed brick multiplex failures
easily. You had two options:

* Create a new review where you turn on brick multiplex via the code and
also apply your patch. Mark a -1 for this review and iterate until tests
passed.
* Merge the patch and pray.

Now, on any patch where you want brick multiplex triggered, just add the
comment "run brick-mux regression" and it will trigger a run and post
results to the review. This is a non-voting job, so it should not mess up
any votes. Please file a bug if it does.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Jenkins restart

2018-04-16 Thread Nigel Babu
Hello folks,

I've just restarted Jenkins for a security update to a plugin. There was
one running centos-regression job that I had to cancel.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] EC Stripe cache jobs

2018-04-11 Thread Nigel Babu
Hello,

We have a job that tries to turn on stripe cache and run EC tests. It looks
like we recently made the decision to turn on stripe cache by default. Is
this job needed anymore? It fails at the moment due to a merge conflict.


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Jenkins upgrade today

2018-04-10 Thread Nigel Babu
Hello folks,

There's a Jenkins security fix scheduled to be released today. This will
most likely happen in the morning EDT. The Jenkins team has not specified a
time. When we're ready for the upgrade, I'll cancel all running jobs and
re-trigger them at the end of the upgrade. The downtime should be less than
15 mins.

Please bear with us as we continue to ensure that build.gluster.org has the
latest security fixes.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Announcing Softserve- serve yourself a VM

2018-03-20 Thread Nigel Babu
Please file an issue for this:
https://github.com/gluster/softserve/issues/new

On Tue, Mar 20, 2018 at 1:57 PM, Sanju Rakonde <srako...@redhat.com> wrote:

> Hi Nigel,
>
> I have a suggestion here. It will be good if we have a option like request
> for extension of the VM duration, and the option will be automatically
> activated after 3 hours of usage of VM. If somebody is using the VM after 3
> hours and they feel like they need it for 2 more hours they will request to
> extend the duration by 1 more hour. It will save the time of engineering
> since if a machine is expired, one has to configure the machine and all
> other stuff from the beginning.
>
> Thanks,
> Sanju
>
> On Tue, Mar 13, 2018 at 12:37 PM, Nigel Babu <nig...@redhat.com> wrote:
>
>>
>> We’ve enabled certain limits for this application:
>>>>
>>>>1.
>>>>
>>>>Maximum allowance of 5 VM at a time across all the users. User have
>>>>to wait until a slot is available for them after 5 machines allocation.
>>>>2.
>>>>
>>>>User will get the requesting machines maximum upto 4 hours.
>>>>
>>>>
>>> IMHO ,max cap of 4 hours is not sufficient. Most of the times, the
>>> reason of loaning a machine is basically debug a race where we can't
>>> reproduce the failure locally what I have seen debugging such tests might
>>> take more than 4 hours. Imagine you had done some tweaking to the code and
>>> you're so close to understand the problem and then the machine expires,
>>> it's definitely not a happy feeling. What are the operational challenges if
>>> we have to make it for atleast 8 hours or max a day?
>>>
>>
>> The 4h cap was kept so that multiple people could have a chance to debug
>> their test failures on the same day. Pushing the cap to 8h means that if
>> you don't have a machine to loan when you start work one will not be
>> available until the next day. At this point, we'll not be increasing the
>> timeout. So far, we've had one person actually hit this. I'd like to see
>> more data points before we make an application level change.
>>
>> --
>> nigelb
>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>
>


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Accessing tarball of core fails with FORBIDDEN

2018-03-19 Thread Nigel Babu
As is the practice for any infra problems, please file a bug:
https://bugzilla.redhat.com/enter_bug.cgi?product=GlusterFS=project-infrastructure

On Mon, Mar 19, 2018 at 5:58 PM, Raghavendra Gowdappa 
wrote:

> Hi Nigel,
>
> I am not able to download the archive of core from:
> http://builder102.cloud.gluster.org/archived_builds/build-
> install-centos7-regression-387.tar.bz2
>
> bash-4.4$ wget http://builder102.cloud.gluster.org/archived_builds/build-
> install-centos7-regression-387.tar.bz2
> --2018-03-19 17:56:14--  http://builder102.cloud.gluste
> r.org/archived_builds/build-install-centos7-regression-387.tar.bz2
> Resolving builder102.cloud.gluster.org (builder102.cloud.gluster.org)...
> 192.237.253.99
> Connecting to builder102.cloud.gluster.org 
> (builder102.cloud.gluster.org)|192.237.253.99|:80...
> connected.
> HTTP request sent, awaiting response... 403 Forbidden
> 2018-03-19 17:56:15 ERROR 403: Forbidden.
>
> regards,
> Raghavendra
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Branching out Gluster docs

2018-03-17 Thread Nigel Babu
Hello folks,

Our docs need a significant facelift. Nithya has suggested that we branch
out the current docs into a branch called version-3 (or some such, please
let's not bikeshed about the name) and have the master branch track the 4.x
series. We will significantly change the documentation for the master branch
so that it has a better content flow as well as gives correct instructions
for working with GD2.

Among other things, we'll have to add a banner to the docs that highlights
which version you're looking at. This is not a problem, I can handle that.

Humble and Prashant, do you both agree this is a good idea? I'm happy to do
all the work needed to make this happen.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] gluster-ant is now admin on synced repos

2018-03-15 Thread Nigel Babu
Hello,

If there's a repo that's synced from Gerrit to Github, gluster-ant is now
an admin on those repos. This is so that when issues are closed via a commit
message, they are closed by the right user (the bot) rather than by the
infra person who set that repo up.

As always, please file a bug if you notice any problems.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] ./tests/basic/mount-nfs-auth.t spews out warnings

2018-03-14 Thread Nigel Babu
When the test works it takes less than 60 seconds. If it needs more than
200 seconds, that means there's an actual issue.

On Wed, Mar 14, 2018 at 10:16 AM, Raghavendra Gowdappa 
wrote:

> All,
>
> I was trying to debug a regression failure [1]. When I ran test locally on
> my laptop, I see some warnings as below:
>
> ++ gluster --mode=script --wignore volume get patchy nfs.mount-rmtab
> ++ xargs dirname
> ++ awk '/^nfs.mount-rmtab/{print $2}'
> dirname: missing operand
> Try 'dirname --help' for more information.
> + NFSDIR=
>
> To debug I ran the volume get cmds:
>
> [root@booradley glusterfs]# gluster volume get patchy nfs.mount-rmtab
> Option  Value
>
> --  -
>
> volume get option failed. Check the cli/glusterd log file for more details
>
> [root@booradley glusterfs]# gluster volume set patchy nfs.mount-rmtab
> testdir
> volume set: success
>
> [root@booradley glusterfs]# gluster volume get patchy nfs.mount-rmtab
> Option  Value
>
> --  -
>
> nfs.mount-rmtab testdir
>
>
> Does this mean the option value is not set properly in the script? Need
> your help in debugging this.
>
> @Nigel
> I noticed that test is timing out.
>
> *20:28:39* ./tests/basic/mount-nfs-auth.t timed out after 200 seconds
>
> Can this be infra issue where nfs was taking too much time to mount?
>
> [1] https://build.gluster.org/job/centos7-regression/316/console
>
> regards,
> Raghavendra
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Announcing Softserve- serve yourself a VM

2018-03-13 Thread Nigel Babu
> We’ve enabled certain limits for this application:
>>
>>1.
>>
>>Maximum allowance of 5 VM at a time across all the users. User have
>>to wait until a slot is available for them after 5 machines allocation.
>>2.
>>
>>User will get the requesting machines maximum upto 4 hours.
>>
>>
> IMHO ,max cap of 4 hours is not sufficient. Most of the times, the reason
> of loaning a machine is basically debug a race where we can't reproduce the
> failure locally what I have seen debugging such tests might take more than
> 4 hours. Imagine you had done some tweaking to the code and you're so close
> to understand the problem and then the machine expires, it's definitely not
> a happy feeling. What are the operational challenges if we have to make it
> for atleast 8 hours or max a day?
>

The 4h cap was kept so that multiple people could have a chance to debug
their test failures on the same day. Pushing the cap to 8h means that if
you don't have a machine to loan when you start work one will not be
available until the next day. At this point, we'll not be increasing the
timeout. So far, we've had one person actually hit this. I'd like to see
more data points before we make an application level change.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Please help test Gerrit 2.14

2018-03-04 Thread Nigel Babu
Hello,

It's that time again. We need to move up a Gerrit release. Staging has now
been upgraded to the latest version. Please help test it and give us
feedback on any issues you notice: https://gerrit-stage.rht.gluster.org/

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-infra] Continuous tests failure on Fedora RPM builds

2018-03-02 Thread Nigel Babu
This is now fixed. Shyam found the root cause. After a mock upgrade, mock
would wait for user confirmation because DNF wasn't installed on the system.
Given this was a CentOS machine, DNF wasn't readily available. I set the
config option dnf_warning=False and that fixed the failures. All previously
failed jobs were retried and should now be green. I also took the
opportunity to "upgrade" the Fedora buildroot to F27.

On Wed, Feb 28, 2018 at 8:00 PM, Amar Tumballi  wrote:

> Looks like the tests here are continuously failing:
> https://build.gluster.org/job/devrpm-fedora/
>
> It would be great if someone takes a look at it.
>
> --
> Amar Tumballi (amarts)
>
> ___
> Gluster-infra mailing list
> gluster-in...@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-infra
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] tests/bugs/rpc/bug-921072.t - fails almost all the times in mainline

2018-02-20 Thread Nigel Babu
Aha. Thanks Mohit. That was infra. Sorry about that. The first line in
/etc/hosts said

::1 localhost   localhost.localdomain   localhost6
localhost6.localdomain6

Once I removed it, the tests started running faster. I'll update my patch
to remove this particular test from the timeout fix.

On Wed, Feb 21, 2018 at 10:42 AM, Mohit Agrawal  wrote:

> Hi,
>
> I think as per the test logs, most of the time is taken while the test
> case is trying to mount a volume through nfs
> with the nolock option. I think the nfs team needs to check why it is
> taking more time.
>
> 
>
> sudo grep -i "mount_nfs" -A 1 d-backends-patchy.log
> [2018-02-21 04:55:56.466891]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 17 mount_nfs builder500.cloud.gluster.org:/patchy /mnt/nfs/0 nolock
> ++
> [2018-02-21 04:55:59.496667]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 18 Y force_umount /mnt/nfs/0 ++
> --
> [2018-02-21 04:56:24.943041]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 25 mount_nfs localhost:/patchy /mnt/nfs/0 nolock ++
> [2018-02-21 04:58:29.986140]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 26 Y force_umount /mnt/nfs/0 ++
> --
> [2018-02-21 04:58:54.419774]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 32 ! mount_nfs localhost:/patchy /mnt/nfs/0 nolock ++
> [2018-02-21 05:00:59.463965]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 33 gluster --mode=script --wignore volume reset patchy force
> ++
> --
> [2018-02-21 05:01:19.095653]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 39 ! mount_nfs localhost:/patchy /mnt/nfs/0 nolock ++
> [2018-02-21 05:03:24.150833]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 42 gluster --mode=script --wignore volume set patchy
> nfs.rpc-auth-reject 192.168.1.1 ++
> --
> [2018-02-21 05:03:44.550737]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 45 mount_nfs localhost:/patchy /mnt/nfs/0 nolock ++
> [2018-02-21 05:05:49.595573]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 46 Y force_umount /mnt/nfs/0 ++
> --
> [2018-02-21 05:06:15.297241]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 57 mount_nfs localhost:/patchy /mnt/nfs/0 nolock ++
> [2018-02-21 05:08:20.341869]:++ G_LOG:./tests/bugs/rpc/bug-921072.t:
> TEST: 58 Y force_umount /mnt/nfs/0 ++
>
> .
>
> Regards
> Mohit Agrawal
>
> On Wed, Feb 21, 2018 at 9:52 AM, Raghavendra Gowdappa  > wrote:
>
>> +Mohit.
>>
>> On Wed, Feb 21, 2018 at 7:47 AM, Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> *https://build.gluster.org/job/centos7-regression/15/consoleFull 
>>> 20:24:36* 
>>> [20:24:39] Running tests in file ./tests/bugs/rpc/bug-921072.t*20:27:56* 
>>> ./tests/bugs/rpc/bug-921072.t timed out after 200 seconds*20:27:56* 
>>> ./tests/bugs/rpc/bug-921072.t: bad status 124
>>>
>>> This is just one of the instances, but I have seen this test failing in 
>>> last 3-4 days at least 10 times.
>>>
>>> Unfortunately, it doesn't look like the regression actually passes in 
>>> mainline for any of the patches atm.
>>>
>>>
>>>
>>>
>>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] tests/bugs/rpc/bug-921072.t - fails almost all the times in mainline

2018-02-20 Thread Nigel Babu
The immediate cause of this failure is that we merged the timeout patch,
which gives each test 200 seconds to finish. This test and another one
take over 200 seconds on regression nodes.

I have a patch up to change the timeout
https://review.gluster.org/#/c/19605/1

However, tests/bugs/rpc/bug-921072.t taking 897 seconds is in itself an
abnormality and is worth looking into.
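
For context, the "bad status 124" in the log is the exit code GNU timeout
returns when it kills a command. The harness change is roughly along the
lines of the sketch below; this is illustrative, not the exact run-tests.sh
patch:

    # Sketch: run a single .t test under a 200-second timeout.
    run_one_test () {
        local t=$1
        timeout -k 30 200 prove -vf "$t"
        local status=$?
        if [ "$status" -eq 124 ]; then
            echo "$t timed out after 200 seconds"
        fi
        return $status
    }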

On Wed, Feb 21, 2018 at 7:47 AM, Atin Mukherjee  wrote:

>
>
> *https://build.gluster.org/job/centos7-regression/15/consoleFull 
> 20:24:36* 
> [20:24:39] Running tests in file ./tests/bugs/rpc/bug-921072.t*20:27:56* 
> ./tests/bugs/rpc/bug-921072.t timed out after 200 seconds*20:27:56* 
> ./tests/bugs/rpc/bug-921072.t: bad status 124
>
> This is just one of the instances, but I have seen this test failing in last 
> 3-4 days at least 10 times.
>
> Unfortunately, it doesn't look like the regression actually passes in 
> mainline for any of the patches atm.
>
>
>
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> http://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Infra machines update

2018-02-19 Thread Nigel Babu
Hello folks,

We're all out of Centos 6 nodes from today. I've just deleted the last of
them. We now run exclusively on Centos 7 nodes.

We've not received any negative feedback about the plan to retire NetBSD, so
I've disabled and removed all the NetBSD jobs and nodes as well.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-19 Thread Nigel Babu
On Mon, Feb 19, 2018 at 5:58 PM, Nithya Balachandran <nbala...@redhat.com>
wrote:

>
>
> On 19 February 2018 at 13:12, Atin Mukherjee <amukh...@redhat.com> wrote:
>
>>
>>
>> On Mon, Feb 19, 2018 at 8:53 AM, Nigel Babu <nig...@redhat.com> wrote:
>>
>>> Hello,
>>>
>>> As you all most likely know, we store the tarball of the binaries and
>>> core if there's a core during regression. Occasionally, we've introduced a
>>> bug in Gluster and this tar can take up a lot of space. This has happened
>>> recently with brick multiplex tests. The build-install tar takes up 25G,
>>> causing the machine to run out of space and continuously fail.
>>>
>>
>> AFAIK, we don't have a .t file in upstream regression suites where
>> hundreds of volumes are created. With that scale and brick multiplexing
>> enabled, I can understand the core being quite heavy and possibly
>> consuming that much space. FWIW, can we first try to
>> figure out which test was causing this crash and see if running gcore
>> after certain steps in the test leaves us with a similar size of
>> core file? IOW, have we actually seen such a huge core file generated
>> earlier? If not, what changed because of which we've started seeing this is
>> something to be investigated.
>>
>
> We also need to check if this is only the core file that is causing the
> increase in size or whether there is something else that is taking up a lot
> of space.
>
>
I don't disagree. However, there are two problems here. In the few cases
where we've had such a large build-install tarball,

1. The tar doesn't actually finish being created, so it's not even
something that can be untar'd. It would just error out.
2. All subsequent jobs on this node fail.

The only remaining option is to watch out for situations where the tar file
doesn't finish creation and highlight it (see the sketch below). Once we move
to chunked regressions, the nodes will not get re-used, so 2 won't be a problem.
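
A minimal sketch of what that highlighting could look like; the variable
names and paths here are placeholders, not the actual job script:

    # Sketch: flag and remove an incomplete archive instead of leaving a broken tarball behind.
    filename="build-install-$UNIQUE_ID.tar.bz2"
    if ! tar -cjf "$ARCHIVE_BASE/$filename" /build/install; then
        echo "WARNING: archiving /build/install failed (possibly out of disk space)"
        rm -f "$ARCHIVE_BASE/$filename"
        exit 1
    fi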

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Jenkins Issues this weekend and how we're solving them

2018-02-18 Thread Nigel Babu
Hello,

As you all most likely know, we store the tarball of the binaries and core
if there's a core during regression. Occasionally, we've introduced a bug
in Gluster and this tar can take up a lot of space. This has happened
recently with brick multiplex tests. The build-install tar takes up 25G,
causing the machine to run out of space and continuously fail.

I've made some changes this morning. Right after we create the tarball,
we'll delete all files in /archive that are greater than 1G. Please be
aware that this means all large files including the newly created tarball
will be deleted. You will have to work with the traceback on the Jenkins
job.
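
For reference, the cleanup amounts to something like the following; a
minimal sketch assuming the archives live directly under /archive, not the
exact job script:

    # Sketch: right after creating the tarball, drop anything in /archive larger than 1G.
    find /archive -type f -size +1G -print -delete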




-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] run-tests-in-vagrant

2018-02-15 Thread Nigel Babu
So we have a job that's unmaintained and unwatched. If nobody steps up to
own it in the next 2 weeks, I'll be deleting this job.

On Wed, Feb 14, 2018 at 4:49 PM, Niels de Vos <nde...@redhat.com> wrote:

> On Wed, Feb 14, 2018 at 11:15:23AM +0530, Nigel Babu wrote:
> > Hello,
> >
> > Centos CI has a run-tests-in-vagrant job. Do we continue to need this
> > anymore? It still runs master and 3.8. I don't see this job adding much
> > value at this point given we only look at results that are on
> > build.gluster.org. I'd like to use the extra capacity for other tests
> that
> > will run on centos-ci.
>
> The ./run-tests-in-vagrant.sh script is ideally what developers run
> before submitting their patches. In case it fails, we should fix it.
> Being able to run tests locally is something many of the new
> contributors want to do. Having a controlled setup for the testing can
> really help with getting new contributors onboard.
>
> Hmm, and the script/job definitely seems to be broken with at least two
> parts:
> - the Vagrant version on CentOS uses the old URL to get the box
> - 00-georep-verify-setup.t fails, but the result is marked as SUCCESS
>
> It seems we need to get better at watching the CI, or at least be able
> to receive and handle notifications...
>
> Thanks,
> Niels
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] build.gluster.org in shutdown mode

2018-02-14 Thread Nigel Babu
This upgrade is now complete and we're now running the latest version of
Jenkins.

On Thu, Feb 15, 2018 at 9:53 AM, Nigel Babu <nig...@redhat.com> wrote:

> Hello,
>
> I've just placed Jenkins in shutdown mode. No new jobs will be started for
> about an hour from now. I intend to upgrade Jenkins to pull in the latest
> security fixes.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] build.gluster.org in shutdown mode

2018-02-14 Thread Nigel Babu
Hello,

I've just placed Jenkins in shutdown mode. No new jobs will be started for
about an hour from now. I intend to upgrade Jenkins to pull in the latest
security fixes.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] run-tests-in-vagrant

2018-02-13 Thread Nigel Babu
Hello,

Centos CI has a run-tests-in-vagrant job. Do we continue to need this
anymore? It still runs master and 3.8. I don't see this job adding much
value at this point given we only look at results that are on
build.gluster.org. I'd like to use the extra capacity for other tests that
will run on centos-ci.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Migration of Centos CI jobs to it's own repo

2018-02-12 Thread Nigel Babu
Hello folks,

I'm trying to make the glusterfs-patch-acceptance-tests repo lighter by
really only having code that's needed to run regressions and build for
gluster. The Centos CI jobs, therefore need to move to it's own repo. As a
first step, I've created a new repo[1] for centos ci jobs.

The jobs are still running from the old repo, but as I migrate jobs to the
new repo, I'll delete their code from the old one. I expect this migration
to finish by the end of the month. Please get in touch if you have
questions or concerns.

[1]: https://github.com/gluster/centosci

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Replacing Centos 6 nodes with Centos 7

2018-02-01 Thread Nigel Babu
This seems to be working well so far. I noticed that this one doesn't vote
correctly. I've fixed this up and also fixed up all the jobs where the
voting wasn't accurate.

Overall regression time seems to have dropped to 3.5h now from 6h or
so. I attribute this to fewer slowdowns in some specific test cases and to
the SSD disks. I'm going to add one more Centos 7 machine to the pool today.

On Thu, Feb 1, 2018 at 9:26 AM, Nigel Babu <nig...@redhat.com> wrote:

> Hello folks,
>
> Today, I'm putting the first Centos 7 node in our regression pool.
>
> slave28.cloud.gluster.org -> Shutdown and removed
> builder100.cloud.gluster.org -> New Centos7 node (we'll be starting from
> 100 upwards)
>
> If this run goes well, we'll be replacing the nodes one by one with Centos
> 7. If you notice tests failing consistently on a Centos 7 node, please file
> a bug.
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Replacing Centos 6 nodes with Centos 7

2018-01-31 Thread Nigel Babu
Hello folks,

Today, I'm putting the first Centos 7 node in our regression pool.

slave28.cloud.gluster.org -> Shutdown and removed
builder100.cloud.gluster.org -> New Centos7 node (we'll be starting from
100 upwards)

If this run goes well, we'll be replacing the nodes one by one with Centos
7. If you notice tests failing consistently on a Centos 7 node, please file
a bug.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Planned Outage: supercolony.gluster.org on 2018-02-21

2018-01-31 Thread Nigel Babu
Hello folks,

We're going to be resizing supercolony.gluster.org on our cloud
provider. This will lead to a small outage of about 5 minutes. In the
event that something goes wrong in this process, we're taking a 2-hour
window for this outage.

Date: Feb 21
Server: supercolony.gluster.org
Time: 1000 to 1200 UTC / 1100 to 1300 CET / 1530 to 1730 IST
Services affected:
* gluster.org redirect
* lists.gluster.org (UI and mail server)
* planet.gluster.org

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Rawhide RPM builds failing

2018-01-24 Thread Nigel Babu
More details: https://build.gluster.org/job/rpm-rawhide/1182/

On Wed, Jan 24, 2018 at 2:03 PM, Niels de Vos <nde...@redhat.com> wrote:

> On Wed, Jan 24, 2018 at 09:14:51AM +0530, Nigel Babu wrote:
> > Hello folks,
> >
> > Our rawhide rpm builds seem to be failing with what looks like a specfile
> > issue. It's worth looking into this now before F28 is released in May.
>
> Do you have more details? The errors from a build.log from mock would
> help. Which .spec are you using, the one from the GlusterFS sources, or
> the one from Fedora?
>
> Please report it as a bug, either against Fedora/glusterfs or
> GlusterFS/build.
>
> Thanks!
> Niels
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Rawhide RPM builds failing

2018-01-23 Thread Nigel Babu
Hello folks,

Our rawhide rpm builds seem to be failing with what looks like a specfile
issue. It's worth looking into this now before F28 is released in May.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Infra-related Regression Failures and What We're Doing

2018-01-22 Thread Nigel Babu
Update: All the nodes that had problems with geo-rep are now fixed. Waiting
on the patch to be merged before we switch over to Centos 7. If things go
well, we'll replace nodes one by one as soon as we have one green on Centos
7.

On Mon, Jan 22, 2018 at 12:21 PM, Nigel Babu <nig...@redhat.com> wrote:

> Hello folks,
>
> As you may have noticed, we've had a lot of centos6-regression failures
> lately. The geo-replication failures are the new ones which particularly
> concern me. These failures have nothing to do with the test. The tests are
> exposing a problem in our infrastructure that we've carried around for a
> long time. Our machines are not clean machines that we automated. We setup
> automation on machines that were already created. At some point, we loaned
> machines for debugging. During this time, developers have inadvertently
> done 'make install' on the system to install onto system paths rather than
> into /build/install. This is what is causing the geo-replication tests to
> fail. I've tried cleaning the machines up several times with little to no
> success.
>
> Last week, we decided to take an aggressive path to fix this problem. We
> planned to replace all our problematic nodes with new Centos 7 nodes. This
> exposed more problems. We expected a specific type of machine from
> Rackspace. These are no longer offered. Thus, our automation fails on some
> steps. I've spent this weekend tweaking our automation so that it works
> on the new Rackspace machines and I'm down to just one test failure[1]. I
> have a patch up to fix this failure[2]. As soon as that patch is merged,
> we can push forward with Centos7 nodes. In 4.0, we're dropping support for
> Centos 6, so this decision makes more sense to do sooner than later.
>
> We'll not be lending machines anymore from production. We'll be creating
> new nodes which are a snapshots of an existing production node. This
> machine will be destroyed after use. This helps prevent this particular
> problem in the future. This also means that our machine capacity at all
> times is at 100 with very minimal wastage.
>
> [1]: https://build.gluster.org/job/cage-test/184/consoleText
> [2]: https://review.gluster.org/#/c/19262/
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Infra-related Regression Failures and What We're Doing

2018-01-21 Thread Nigel Babu
Hello folks,

As you may have noticed, we've had a lot of centos6-regression failures
lately. The geo-replication failures are the new ones which particularly
concern me. These failures have nothing to do with the test. The tests are
exposing a problem in our infrastructure that we've carried around for a
long time. Our machines are not clean machines that we automated. We setup
automation on machines that were already created. At some point, we loaned
machines for debugging. During this time, developers have inadvertently
done 'make install' on the system to install onto system paths rather than
into /build/install. This is what is causing the geo-replication tests to
fail. I've tried cleaning the machines up several times with little to no
success.

Last week, we decided to take an aggressive path to fix this problem. We
planned to replace all our problematic nodes with new Centos 7 nodes. This
exposed more problems. We expected a specific type of machine from
Rackspace. These are no longer offered. Thus, our automation fails on some
steps. I've spent this weekend tweaking our automation so that it works on
the new Rackspace machines and I'm down to just one test failure[1]. I have
a patch up to fix this failure[2]. As soon as that patch is merged, we can
push forward with Centos7 nodes. In 4.0, we're dropping support for Centos
6, so this decision makes more sense to do sooner than later.

We'll not be lending machines anymore from production. We'll be creating
new nodes which are a snapshots of an existing production node. This
machine will be destroyed after use. This helps prevent this particular
problem in the future. This also means that our machine capacity at all
times is at 100 with very minimal wastage.

[1]: https://build.gluster.org/job/cage-test/184/consoleText
[2]: https://review.gluster.org/#/c/19262/

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Please file a bug if you take a machine offline

2018-01-10 Thread Nigel Babu
Hello folks,

If you take a machine offline, please file a bug so that the machine can be
debugged and return to the pool.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Recent regression failures

2018-01-10 Thread Nigel Babu
Hello folks,

We may have been a little too quick to blame Meltdown for the Jenkins
failures yesterday. In any case, we've opened a ticket with our provider and
they're looking into the failures. I've looked at the last 90 jobs to
get a comprehensive number on the failures.

Total Jobs: 90
Failures: 62
Failure Percentage: 68.89%

I've analyzed the individual failures categorized them as well.

slave28.cloud.gluster.org failure: 9
Geo-replication failures: 12
Fops-during-migration.t: 4
Compilation failures: 3
durability-off.t failures: 7

These alone total to 35 failures. The slave28 failures were due to the
machine running out of disk space. We had a very large binary archived from
an experimental branch build failure. I've cleared that core out and this
is now fixed. The geo-replication failures were due to geo-rep tests
depending on root's .bashrc having the PATH variable modified. This was not
a standard setup and therefore didn't work on many machines. This has now
been fixed. The other 3 were transient failures either limited to a
particular review or a temporary bustage on master. The majority of the
recent failures had more to do with infra than to do with tests.

I'm therefore cautiously moving forward with the assumption that the impact
of the KPTI patch is minimal so far.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Moving Regressions to Centos 7

2017-12-20 Thread Nigel Babu
Hello folks,

We've been using Centos 6 for our regressions for a long time. I believe
it's time that we moved to Centos 7. It's causing us minor issues. For
example, tests run fine on the regression boxes but don't work on local
machines or vice-versa. Moving up gives us the ability to use newer
versions of tools as well.

If nobody has any disagreement, the plan is going to look like this:
* Bring up 10 Rackspace Centos 7 nodes.
* Test chunked regression runs on Rackspace Centos 7 nodes for one week.
* If all works well, kill off all the old nodes and switch all normal
regressions to Rackspace Centos 7 nodes.

I expect this process to be complete right around 2nd week of Jan. Please
let me know if there are concerns.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Emergency Jenkins Restart

2017-12-13 Thread Nigel Babu
Hello folks,

I'm going to be restarting Jenkins for an important security update. Any
running jobs will be canceled and retriggered.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Permission change for +2 votes on review.gluster.org

2017-12-07 Thread Nigel Babu
Hello folks,

We talked about this last week at the maintainer's meeting. We're going to
restrict +2 votes to people who can also submit the patch. This makes sure
that patches have actual maintainers giving +2. Everyone else will be able
to give a +1.

If this affects your project/component's development workflow, please
consider adding more maintainers/peers :)

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Need help figuring out the reason for test failure

2017-11-27 Thread Nigel Babu
Pranith,

Our logging has changed slightly. Please read my email titled "Changes in
handling logs from (centos) regressions and smoke" to gluster-devel and
gluster-infra.

On Tue, Nov 28, 2017 at 8:06 AM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> One of my patches(https://review.gluster.org/18857) is consistently
> leading to a failure for the test:
>
> tests/bugs/core/bug-1432542-mpx-restart-crash.t
>
> https://build.gluster.org/job/centos6-regression/7676/consoleFull
>
> Jeff/Atin,
> Do you know anything about these kinds of failures for this test?
>
> Nigel,
>Unfortunately I am not able to look at the logs because the logs
> location is not given properly (at least for me :-) )
>
> *11:41:14* 
> filename="${ARCHIVED_LOGS}/glusterfs-logs-${UNIQUE_ID}.tgz"*11:41:14* sudo -E 
> tar -czf "${ARCHIVE_BASE}/${filename}" /var/log/glusterfs 
> /var/log/messages*;*11:41:14* echo "Logs archived in 
> http://${SERVER}/${filename};
>
>
> Could you help me find what the location could be?
> --
> Pranith
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Tests failing on Centos 7

2017-11-27 Thread Nigel Babu
Hello folks,

I have an update on chunking. There's good news and bad. The first bit is
that we have a chunked regression job now. It splits the tests into 10 chunks
that are run in parallel. This chunking is quite simple at the moment and
doesn't try to be very smart (see the sketch below). The intelligent
splitting will come in once we're ready to go live.
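
To give an idea of how simple it is, the split is roughly equivalent to the
sketch below. This is an illustration of the approach, not the actual job
code:

    # Sketch: naive round-robin split of the .t test files into 10 chunks.
    i=0
    find tests -name '*.t' | sort | while read -r t; do
        echo "$t" >> "chunk$(( i % 10 )).list"
        i=$(( i + 1 ))
    done
    # Each chunk*.list is then handed to a separate builder and run in parallel.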

In the meanwhile, we've run into a few road blocks. The following tests do
not work on CentOS 7:

./tests/bugs/cli/bug-1169302.t
./tests/bugs/posix/bug-990028.t
./tests/bugs/glusterd/bug-1260185-donot-allow-detach-commit-unnecessarily.t
./tests/bugs/core/multiplex-limit-issue-151.t
./tests/basic/afr/split-brain-favorite-child-policy.t
./tests/bugs/core/bug-1432542-mpx-restart-crash.t

Can the maintainers for these components please take a look at these tests
and fix them to run on Centos 7? When we land chunked regressions, we'll
switch our entire build farm over to Centos 7. If you want a test
machine to reproduce the failure and debug, please file a bug requesting
one with your SSH public key attached.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Changes in handling logs from (centos) regressions and smoke

2017-11-20 Thread Nigel Babu
Hello folks,

We're making some changes in how we handle logs from Centos regression and
smoke tests. Instead of having them available via HTTP access to the node
itself, they will be available from the Jenkins job as artifacts.

For example:
Smoke job: https://build.gluster.org/job/smoke/38523/console
Logs: https://build.gluster.org/job/smoke/38523/artifact/ (link available
from the main page)

We clear out regression logs every 30 days, so if you can see a regression
on build.gluster.org, logs for that run should be available. This reduces the
need for space or HTTP access on our nodes and for a separate deletion
process.

We also archive builds and cores. These are still available the old-fashioned
way; however, I intend to change that in the next few weeks and centralize
them on a file server.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Jenkins restart

2017-11-19 Thread Nigel Babu
I noticed that Jenkins wasn't loading up this morning. Further debugging
showed a java heap size problem. I tried to debug it, but eventually just
restarted Jenkins. This means any running or newly triggered jobs were
stopped. Please re-trigger your jobs.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Jenkins restart this morning

2017-11-08 Thread Nigel Babu
Hello folks,

I had to do a quick Jenkins upgrade and restart this morning for an urgent
security fix. A few of our periodic jobs were cancelled, I'll re-trigger
them now.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Change #18681 has broken build on master

2017-11-08 Thread Nigel Babu
I can try to explain what happened. For instance, here's a git tree. Each
letter represents a commit.

A -> B -> C -> D -> E -> F (F is the HEAD of master. green builds)

Change X branched off at B
A -> B -> X (green builds)

Change Y branched off at D
A -> B -> C -> D -> Y (green builds)

Now suppose changes X and Y do not work together (for instance, change X
introduced a new parameter for a function), but they do not textually
conflict with each other. First, change Y lands.

So history now looks like this:

A -> B -> C -> D -> E -> F -> Y (green builds)

Now change X lands:

A -> B -> C -> D -> E -> F -> Y -> X (red builds)

The build is red because change Y uses a function whose signature changed in
change X. If this doesn't make sense, please have a look at this:
https://docs.openstack.org/infra/zuul/user/gating.html

Using a gating system is the most likely solution to our problem. Right
now, adding a gating solution without reducing how much time our tests take
is pointless.


-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Change #18681 has broken build on master

2017-11-07 Thread Nigel Babu
Landed:
https://github.com/gluster/glusterfs/commit/c3d7974e2be68f0fac8f54c9557d0f868e6be6c8

Please rebase your patches and re-trigger.

On Tue, Nov 7, 2017 at 5:23 PM, Nigel Babu <nig...@redhat.com> wrote:

> Rafi has a fix[1]. I'm going to make it skip regressions and land it
> directly.
>
> https://review.gluster.org/#/c/18680/
>
> On Tue, Nov 7, 2017 at 4:42 PM, Raghavendra Gowdappa <rgowd...@redhat.com>
> wrote:
>
>> Please check [1].
>>
>> Build on master branch on my laptop failed too:
>>
>> [raghu@unused server]$ make > /dev/null
>> server.c: In function 'init':
>> server.c:1205:9: error: too few arguments to function
>> 'rpcsvc_program_register'
>> In file included from server.h:17:0,
>>  from server.c:16:
>> ../../../../rpc/rpc-lib/src/rpcsvc.h:426:1: note: declared here
>> make[1]: *** [server.lo] Error 1
>> make: *** [all-recursive] Error 1
>>
>> The change was introduced by [2]. However, the puzzling thing is [2]
>> itself was built successfully and has passed all tests. Wondering how did
>> that happen.
>>
>> [1] https://build.gluster.org/job/centos6-regression/7281/console
>> [2] review.gluster.org/18681
>>
>> regards,
>> Raghavendra
>>
>
>
>
> --
> nigelb
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] Change #18681 has broken build on master

2017-11-07 Thread Nigel Babu
Rafi has a fix[1]. I'm going to make it skip regressions and land it
directly.

https://review.gluster.org/#/c/18680/

On Tue, Nov 7, 2017 at 4:42 PM, Raghavendra Gowdappa 
wrote:

> Please check [1].
>
> Build on master branch on my laptop failed too:
>
> [raghu@unused server]$ make > /dev/null
> server.c: In function 'init':
> server.c:1205:9: error: too few arguments to function
> 'rpcsvc_program_register'
> In file included from server.h:17:0,
>  from server.c:16:
> ../../../../rpc/rpc-lib/src/rpcsvc.h:426:1: note: declared here
> make[1]: *** [server.lo] Error 1
> make: *** [all-recursive] Error 1
>
> The change was introduced by [2]. However, the puzzling thing is [2]
> itself was built successfully and has passed all tests. Wondering how did
> that happen.
>
> [1] https://build.gluster.org/job/centos6-regression/7281/console
> [2] review.gluster.org/18681
>
> regards,
> Raghavendra
>



-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel

[Gluster-devel] Unplanned Gerrit Outage yesterday

2017-11-02 Thread Nigel Babu
Hello folks,

Yesterday, we had an unplanned Gerrit outage. We have now determined that
the machine rebooted for some reason. Michael is continuing
to debug what led to this issue. At this point, Gerrit does not start
automatically when the VM restarts.

We are currently testing a systemd unit file for Gerrit in staging. Once
that's in place, we can ensure that we start Gerrit automatically when we
restart the server.

Timeline of events (in CET):
16:29 - I receive an alert that Gerrit is down. This goes ignored because
we're still working on Jenkins.

18:25 - I notice the alerts as we're packing up for the day and start
Gerrit.

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
http://lists.gluster.org/mailman/listinfo/gluster-devel
