Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-25 Thread Shyam Ranganathan
Updates from tests:

Last 5 runs on 4.1 have passed.

Runs 55 and 58 failed on bug-1363721.t, whose fix was not merged before
those tests were kicked off, hence I am still considering them as passing.

Started 2 more runs on 4.1 [1], and will possibly start more during the
day, to call an all-clear on this blocker for the release.

The `master` branch still has issues, just FYI.

Shyam

[1] New patch to test mux only: https://review.gluster.org/#/c/20087/1
On 05/24/2018 03:08 PM, Shyam Ranganathan wrote:
> After various analyses and fixes, here is the current state:
> 
> - Reverted 3 patches aimed at proper cleanup sequence when a mux'd brick
> is detached [2]
> - Fixed a core within the same patch, for a lookup before brick is ready
> case
> - Fixed a replicate test case that was partly failing due to cleanup
> sequence and partly due to replicate issues [3]
> 
> Current status is that we are still tracking some failures using this
> [1] bug and running on-demand brick-mux regressions on Jenkins.
> 
> State of 4.0 release is almost even with 4.1, with the fixes/changes
> above, but it is not yet in the clear to continue with 4.1 release.
> 
> Request any help that can be provided to Mohit, Du and Milind who are
> currently looking at this actively.
> 
> Shyam
> 
> [1] Mux failure bug: https://bugzilla.redhat.com/show_bug.cgi?id=1582286
> 
> [2] Mux patches reverted: https://review.gluster.org/#/c/20060/4 (has 2
> other patches that are dependent on this)
> 
> [3] Replicate fix: https://review.gluster.org/#/c/20066/


Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-24 Thread Shyam Ranganathan
After various analyses and fixes, here is the current state:

- Reverted 3 patches aimed at proper cleanup sequence when a mux'd brick
is detached [2]
- Fixed a core within the same patch, for a lookup before brick is ready
case
- Fixed a replicate test case that was partly failing due to cleanup
sequence and partly due to replicate issues [3]

Current status is that we are still tracking some failures using this
[1] bug and running on-demand brick-mux regressions on Jenkins.

State of 4.0 release is almost even with 4.1, with the fixes/changes
above, but it is not yet in the clear to continue with 4.1 release.

Request any help that can be provided to Mohit, Du and Milind who are
currently looking at this actively.

Shyam

[1] Mux failure bug: https://bugzilla.redhat.com/show_bug.cgi?id=1582286

[2] Mux patches reverted: https://review.gluster.org/#/c/20060/4 (has 2
other patches that are dependent on this)

[3] Replicate fix: https://review.gluster.org/#/c/20066/


Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-16 Thread Shyam Ranganathan
On 05/16/2018 10:34 AM, Shyam Ranganathan wrote:
> Some further analysis based on what Mohit commented on the patch:
> 
> 1) gf_attach, used to kill a brick, is taking more time, causing timeouts
> in tests, mainly br-state-check.t, usually when there are back-to-back
> kill_brick calls in the test.

This turned out to be an invalid root cause. The failure seems to be the
clients (in this test case the bitrot daemon and the scrubber) actually
losing their connection to other bricks due to a ping timeout when one of
the bricks (the first brick) is terminated (or detached, in mux parlance).

The above should not happen, and hence points back to the glusterd and
brick daemon interaction causing some mayhem in this case.
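
For anyone triaging this locally, one quick check is to grep the bitrot
daemon and scrubber logs for the rpc ping-timeout disconnect message around
the time the first brick is detached (a rough sketch only; the log file
names and paths are assumptions and will differ on a regression node):

# look for ping-timeout driven disconnects in the bitrot daemon/scrubber logs (sketch)
grep -E 'has not responded in the last .* seconds, disconnecting' \
    /var/log/glusterfs/bitd.log /var/log/glusterfs/scrub.log
# then correlate the timestamps with the brick detach in the brick/glusterd logs
grep -iE 'detach|disconnect' /var/log/glusterfs/bricks/*.log /var/log/glusterfs/glusterd.log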

> 
> 2) Problem in ./tests/bugs/replicate/bug-1363721.t seems to be that
> kill_brick has not completed before an attach request, causing it to be
> a duplicate attach and hence dropped/ignored? (speculation)
> 
> Writing a test case to see if this is reproducible in that short case!

The modified version of the test case did not help.

I encountered a core which suggests that the detach and subsequent
attach of a brick may have races, and hence cause some disruption to the
test case. Will send a follow-on to this mail with the details.

> 
> The above replicate test seems to also have a different issue when it
> compares the md5sums towards the end of the test (as can be seen in the
> console logs), which seems to be unrelated to brick-mux (see:
> https://build.gluster.org/job/centos7-regression/853/console for an
> example). It would be nice if someone from the replicate team took a look
> at this one.

Ran the test case as-is on a local setup using the latest patch on
master. There is a failure in comparing md5sums across the bricks
towards the end of the test, and this happens quite regularly (I would
say 1 in 4 tries). The replicate team has been made aware of this, to
look into the problem further.
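
For reference, this is roughly how the test can be looped locally to catch
the intermittent md5sum mismatch (a sketch; it assumes a glusterfs source
tree with the build installed, run as root, with the .t driven through prove):

# from the top of the glusterfs source tree, loop the test and stop at the first failure
cd /path/to/glusterfs
for i in $(seq 1 10); do
    echo "=== iteration $i ==="
    prove -vf tests/bugs/replicate/bug-1363721.t || break
done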

> 
> 3) ./tests/bugs/index/bug-1559004-EMLINK-handling.t seems to be a
> timeout in most (if not all cases), stuck in the last iteration.

The added timeout seems to have helped this case.
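
For reference, that kind of bump is a one-line change at the top of the
test, along these lines (a sketch only; it assumes the per-test
SCRIPT_TIMEOUT override that other long-running tests in the tree use, and
1200 is just an illustrative value):

#!/bin/bash
# tests/bugs/index/bug-1559004-EMLINK-handling.t (top of file, sketch)
# give the test more headroom than the default harness timeout, since it
# has been seen taking 850-1000+ seconds even in non-mux regression runs
SCRIPT_TIMEOUT=1200

. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc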

> 
> I will be modifying the patch (discussed in this thread) to add more
> time for failures (1) and (3), and fire off a few more regressions, as I
> try to reproduce (2).
> 
> Shyam
> P.S.: If work is happening on these issues, request that the
> data/analysis be posted to the lists; it reduces rework!
> 
> On 05/15/2018 09:10 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> After the fix provided by Atin here [1] for the issue reported below, we
>> ran 7-8 runs of brick mux regressions against this fix, and we have had
>> about 1 in 3 runs successful (even those had some tests retried). The run links
>> are in the review at [1].
>>
>> The failures are as below, sorted in descending order of frequency.
>> Requesting respective component owners/peers to take a stab at root
>> causing these, as the current pass rate is not sufficient to qualify the
>> release (or master) as stable.
>>
>> 1) ./tests/bitrot/br-state-check.t (bitrot folks please take a look,
>> this has the maximum instances of failures, including a core in the run [2])
>>
>> 2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners
>> please note, there are some failures in GFID comparison that seem to be
>> outside of mux cases as well)
>>
>> 3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
>>
>> ./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
>> the SCRIPT timeout on this; if someone can confirm by looking at the runs
>> and failures, it would help determine the same)
>>
>> -- We can possibly wait to analyze things below this line as the
>> instance count is 2 or less --
>>
>> 4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>
>> ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
>> ./tests/bugs/quota/bug-1293601.t
>>
>> 5)  ./tests/bugs/distribute/bug-1161311.t
>> ./tests/bitrot/bug-1373520.t
>>
>> Thanks,
>> Shyam
>>
>> [1] Review containing the fix and the regression run links for logs:
>> https://review.gluster.org/#/c/20022/3
>>
>> [2] Test with core:
>> https://build.gluster.org/job/regression-on-demand-multiplex/20/
>> On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
>>> *** Calling out to Glusterd folks to take a look at this ASAP and
>>> provide a fix. ***
>>>
>>> Further to the mail sent yesterday, work done during my day with Johnny
>>> (RaghuB) points to a problem in the glusterd rpc port map having stale
>>> entries for certain bricks as the cause for connection failures when
>>> running in the multiplex mode.
>>>
>>> It seems like this problem has been partly addressed in this bug:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>>>
>>> What is occurring now is that glusterd retains older ports in its
>>> mapping table for bricks that have recently terminated. When a
>>> volume is stopped and restarted, this leads to connection failures from
>>> clients, as there are no listeners on the now-stale port.

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-16 Thread Shyam Ranganathan
Some further analysis based on what Mohit commented on the patch:

1) gf_attach, used to kill a brick, is taking more time, causing timeouts
in tests, mainly br-state-check.t, usually when there are back-to-back
kill_brick calls in the test.

2) Problem in ./tests/bugs/replicate/bug-1363721.t seems to be that
kill_brick has not completed before an attach request, causing it to be
a duplicate attach and hence dropped/ignored? (speculation)

Writing a test case to see if this is reproducible in that short case!

The above replicate test seems to also have a different issue when it
compares the md5sums towards the end of the test (as can be seen in the
console logs), which seems to be unrelated to brick-mux (see:
https://build.gluster.org/job/centos7-regression/853/console for an
example). It would be nice if someone from the replicate team took a look
at this one.

3) ./tests/bugs/index/bug-1559004-EMLINK-handling.t seems to be a
timeout in most (if not all cases), stuck in the last iteration.

I will be modifying the patch (discussed in this thread) to add more
time for failures (1) and (3), and fire off a few more regressions, as I
try to reproduce (2).

Shyam
P.S.: If work is happening on these issues, request that the
data/analysis be posted to the lists; it reduces rework!

On 05/15/2018 09:10 PM, Shyam Ranganathan wrote:
> Hi,
> 
> After the fix provided by Atin here [1] for the issue reported below, we
> ran 7-8 runs of brick mux regressions against this fix, and we have had
> about 1 in 3 runs successful (even those had some tests retried). The run links
> are in the review at [1].
> 
> The failures are as below, sorted in descending order of frequency.
> Requesting respective component owners/peers to take a stab at root
> causing these, as the current pass rate is not sufficient to qualify the
> release (or master) as stable.
> 
> 1) ./tests/bitrot/br-state-check.t (bitrot folks please take a look,
> this has the maximum instances of failures, including a core in the run [2])
> 
> 2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners
> please note, there are some failures in GFID comparison that seem to be
> outside of mux cases as well)
> 
> 3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
> 
> ./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
> the SCRIPT timeout on this; if someone can confirm by looking at the runs
> and failures, it would help determine the same)
> 
> -- We can possibly wait to analyze things below this line as the
> instance count is 2 or less --
> 
> 4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> 
> ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
> ./tests/bugs/quota/bug-1293601.t
> 
> 5)  ./tests/bugs/distribute/bug-1161311.t
> ./tests/bitrot/bug-1373520.t
> 
> Thanks,
> Shyam
> 
> [1] Review containing the fix and the regression run links for logs:
> https://review.gluster.org/#/c/20022/3
> 
> [2] Test with core:
> https://build.gluster.org/job/regression-on-demand-multiplex/20/
> On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
>> *** Calling out to Glusterd folks to take a look at this ASAP and
>> provide a fix. ***
>>
>> Further to the mail sent yesterday, work done during my day with Johnny
>> (RaghuB) points to a problem in the glusterd rpc port map having stale
>> entries for certain bricks as the cause for connection failures when
>> running in the multiplex mode.
>>
>> It seems like this problem has been partly addressed in this bug:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>>
>> What is occurring now is that glusterd retains older ports in its
>> mapping table for bricks that have recently terminated. When a
>> volume is stopped and restarted, this leads to connection failures from
>> clients, as there are no listeners on the now-stale port.
>>
>> Test case as in [1], when run on my F27 machine fails 1 in 5 times with
>> the said error.
>>
>> The above does narrow down failures in tests:
>> - lk-quorum.t
>> - br-state-check.t
>> - entry-self-heal.t
>> - bug-1363721.t (possibly)
>>
>> The failure can be seen in the client mount logs as the use of a wrong
>> port number in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
>> 6-patchy-client-2: changing port to 49156 (from 0)"; when there are
>> failures, the real port for the brick-mux process would be different.
>>
>> We also used gdb to inspect glusterd pmap registry and found that older
>> stale port map data is present (in function pmap_registry_search as
>> clients invoke a connection).
>>
>> Thanks,
>> Shyam
>>
>> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>>> Hi,
>>>
>>> Nigel pointed out that the nightly brick-mux tests are now failing for
>>> about 11 weeks and we do not have a clear run of the same.
>>>
>>> Spent some time on Friday collecting what tests failed and to an extent
>>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>>
>>> Asks: Whoever has cycles, please look into these failures ASAP, as these
>>> failing tests are blockers for the 4.1 release, and overall the state of
>>> master (and hence the 4.1 release branch) is not clean when these tests
>>> have been failing for over 11 weeks.

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-15 Thread Shyam Ranganathan
Hi,

After the fix provided by Atin here [1] for the issue reported below, we
ran 7-8 runs of brick mux regressions against this fix, and we have had
about 1 in 3 runs successful (even those had some tests retried). The run links
are in the review at [1].

The failures are as below, sorted in descending order of frequency.
Requesting respective component owners/peers to take a stab at root
causing these, as the current pass rate is not sufficient to qualify the
release (or master) as stable.

1) ./tests/bitrot/br-state-check.t (bitrot folks please take a look,
this has the maximum instances of failures, including a core in the run [2])

2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners
please note, there are some failures in GFID comparison that seem to be
outside of mux cases as well)

3) ./tests/bugs/distribute/bug-1543279.t (Distribute)

./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
the SCRIPT timeout on this; if someone can confirm by looking at the runs
and failures, it would help determine the same)

-- We can possibly wait to analyze things below this line as the
instance count is 2 or less --

4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t

./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
./tests/bugs/quota/bug-1293601.t

5)  ./tests/bugs/distribute/bug-1161311.t
./tests/bitrot/bug-1373520.t

Thanks,
Shyam

[1] Review containing the fix and the regression run links for logs:
https://review.gluster.org/#/c/20022/3

[2] Test with core:
https://build.gluster.org/job/regression-on-demand-multiplex/20/
On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
> *** Calling out to Glusterd folks to take a look at this ASAP and
> provide a fix. ***
> 
> Further to the mail sent yesterday, work done during my day with Johnny
> (RaghuB) points to a problem in the glusterd rpc port map having stale
> entries for certain bricks as the cause for connection failures when
> running in the multiplex mode.
> 
> It seems like this problem has been partly addressed in this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
> 
> What is occurring now is that glusterd retains older ports in its
> mapping table for bricks that have recently terminated. When a
> volume is stopped and restarted, this leads to connection failures from
> clients, as there are no listeners on the now-stale port.
> 
> Test case as in [1], when run on my F27 machine fails 1 in 5 times with
> the said error.
> 
> The above does narrow down failures in tests:
> - lk-quorum.t
> - br-state-check.t
> - entry-self-heal.t
> - bug-1363721.t (possibly)
> 
> The failure can be seen in the client mount logs as the use of a wrong
> port number in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
> 6-patchy-client-2: changing port to 49156 (from 0)"; when there are
> failures, the real port for the brick-mux process would be different.
> 
> We also used gdb to inspect glusterd pmap registry and found that older
> stale port map data is present (in function pmap_registry_search as
> clients invoke a connection).
> 
> Thanks,
> Shyam
> 
> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> Nigel pointed out that the nightly brick-mux tests are now failing for
>> about 11 weeks and we do not have a clear run of the same.
>>
>> Spent some time on Friday collecting what tests failed and to an extent
>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>
>> Asks: Whoever has cycles, please look into these failures ASAP, as these
>> failing tests are blockers for the 4.1 release, and overall the state of
>> master (and hence the 4.1 release branch) is not clean when these tests
>> have been failing for over 11 weeks.
>>
>> Most of the tests fail if run on a local setup as well, so debugging the
>> same should be easier than requiring the mux or regression setup; just
>> ensure that mux is turned on (either by default in the code base you are
>> testing, or by adding the line `TEST $CLI volume set all
>> cluster.brick-multiplex on` to the test case after any cleanup and after
>> starting glusterd).
>>
>> 1) A lot of test cases time out, of which, the following 2 have the most
>> failures, and hence possibly can help with the debugging of the root
>> cause faster. Request Glusterd and bitrot teams to look at this, as the
>> failures do not seem to be in replicate or client-side layers (at present).
>>
>> (number in brackets is # times this failed in the last 13 instances of
>> mux testing)
>> ./tests/basic/afr/entry-self-heal.t (4)
>> ./tests/bitrot/br-state-check.t (8)
>>
>> 2)
>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>
>> The above test constantly fails at this point:
>> 
>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>> of a volume
>> 16:46:28 not ok 25 , LINENUM:47
>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
On 05/14/2018 08:35 PM, Shyam Ranganathan wrote:
> Further to the mail below,
> 
> 1. Test bug-1559004-EMLINK-handling.t possibly just needs a larger
> script timeout in mux-based testing. I can see no errors in the 2-3
> times that it has failed, other than taking over 1000 seconds. Further
> investigation on normal non-mux regression also shows that this test
> takes 850-950 seconds to complete at times; I assume increasing the
> timeout will fix the failures due to this.
> 
> 2. We still need answers for the following
> - add-brick-and-validate-replicated-volume-options.t
> 
> Details on where it is failing are given in the mail below (point (2); it
> points to a possible glusterd issue again). This does not seem to
> correlate to the other glusterd stale port map information (as glusterd
> is restarted in this case), so we possibly need to narrow this down
> further. Help appreciated!

Looks like (2) above is fixed and has not recurred in the last 8+
runs; the fix is https://review.gluster.org/#/c/19924/

Can we get some more details on the fix, as to why the port mapper had a
stale port and for which brick? (Because glusterd is restarted in between
in this test, the stale port-map issue present in the other cases does
not apply, and hence I would like more data in the bug or here.)

> 
> Thanks,
> Shyam
> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> Nigel pointed out that the nightly brick-mux tests are now failing for
>> about 11 weeks and we do not have a clear run of the same.
>>
>> Spent some time on Friday collecting what tests failed and to an extent
>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>
>> Asks: Whoever has cycles, please look into these failures ASAP, as these
>> failing tests are blockers for the 4.1 release, and overall the state of
>> master (and hence the 4.1 release branch) is not clean when these tests
>> have been failing for over 11 weeks.
>>
>> Most of the tests fail if run on a local setup as well, so debugging the
>> same should be easier than requiring the mux or regression setup; just
>> ensure that mux is turned on (either by default in the code base you are
>> testing, or by adding the line `TEST $CLI volume set all
>> cluster.brick-multiplex on` to the test case after any cleanup and after
>> starting glusterd).
>>
>> 1) A lot of test cases time out, of which, the following 2 have the most
>> failures, and hence possibly can help with the debugging of the root
>> cause faster. Request Glusterd and bitrot teams to look at this, as the
>> failures do not seem to be in replicate or client-side layers (at present).
>>
>> (number in brackets is # times this failed in the last 13 instances of
>> mux testing)
>> ./tests/basic/afr/entry-self-heal.t (4)
>> ./tests/bitrot/br-state-check.t (8)
>>
>> 2)
>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>
>> The above test constantly fails at this point:
>> 
>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>> of a volume
>> 16:46:28 not ok 25 , LINENUM:47
>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>> 
>>
>> From the logs the failure is occurring from here:
>> 
>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>> 0-management: Failed to set extended attribute trusted.add-brick :
>> Transport endpoint is not connected [Transport endpoint is not connected]
>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>> add bricks
>> 
>>
>> This seems like the added brick is not accepting connections.
>>
>> 3) The following tests also show similar behaviour to (2), where the AFR
>> check for brick up fails after a timeout, as the brick is not accepting
>> connections.
>>
>> ./tests/bugs/replicate/bug-1363721.t (4)
>> ./tests/basic/afr/lk-quorum.t (5)
>>
>> I would suggest someone familiar with mux process and also brick muxing
>> look at these from the initialization/RPC/socket front, as these seem to
>> be bricks that do not show errors in the logs but are failing connections.
>>
>> As we find different root causes, we may want different bugs than the
>> one filed, please do so and post patches in an effort to move this forward.
>>
>> Thanks,
>> Shyam


Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
Further to the mail below,

1. Test bug-1559004-EMLINK-handling.t possibly just needs a larger
script timeout in mux-based testing. I can see no errors in the 2-3
times that it has failed, other than taking over 1000 seconds. Further
investigation on normal non-mux regression also shows that this test
takes 850-950 seconds to complete at times; I assume increasing the
timeout will fix the failures due to this.

2. We still need answers for the following
- add-brick-and-validate-replicated-volume-options.t

Details on where it is failing are given in the mail below (point (2); it
points to a possible glusterd issue again). This does not seem to
correlate to the other glusterd stale port map information (as glusterd
is restarted in this case), so we possibly need to narrow this down
further. Help appreciated!

Thanks,
Shyam
On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
> Hi,
> 
> Nigel pointed out that the nightly brick-mux tests are now failing for
> about 11 weeks and we do not have a clear run of the same.
> 
> Spent some time on Friday collecting what tests failed and to an extent
> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
> 
> Asks: Whoever has cycles, please look into these failures ASAP, as these
> failing tests are blockers for the 4.1 release, and overall the state of
> master (and hence the 4.1 release branch) is not clean when these tests
> have been failing for over 11 weeks.
> 
> Most of the tests fail if run on a local setup as well, so debugging the
> same should be easier than requiring the mux or regression setup; just
> ensure that mux is turned on (either by default in the code base you are
> testing, or by adding the line `TEST $CLI volume set all
> cluster.brick-multiplex on` to the test case after any cleanup and after
> starting glusterd).
> 
> 1) A lot of test cases time out, of which, the following 2 have the most
> failures, and hence possibly can help with the debugging of the root
> cause faster. Request Glusterd and bitrot teams to look at this, as the
> failures do not seem to be in replicate or client-side layers (at present).
> 
> (number in brackets is # times this failed in the last 13 instances of
> mux testing)
> ./tests/basic/afr/entry-self-heal.t (4)
> ./tests/bitrot/br-state-check.t (8)
> 
> 2)
> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
> 
> The above test constantly fails at this point:
> 
> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
> of a volume
> 16:46:28 not ok 25 , LINENUM:47
> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
> 
> 
> From the logs the failure is occurring from here:
> 
> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
> 0-management: Failed to set extended attribute trusted.add-brick :
> Transport endpoint is not connected [Transport endpoint is not connected]
> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
> add bricks
> 
> 
> This seems like the added brick is not accepting connections.
> 
> 3) The following tests also show similar behaviour to (2), where the AFR
> check for brick up fails after a timeout, as the brick is not accepting
> connections.
> 
> ./tests/bugs/replicate/bug-1363721.t (4)
> ./tests/basic/afr/lk-quorum.t (5)
> 
> I would suggest someone familiar with mux process and also brick muxing
> look at these from the initialization/RPC/socket front, as these seem to
> be bricks that do not show errors in the logs but are failing connections.
> 
> As we find different root causes, we may want different bugs than the
> one filed, please do so and post patches in an effort to move this forward.
> 
> Thanks,
> Shyam


Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
*** Calling out to Glusterd folks to take a look at this ASAP and
provide a fix. ***

Further to the mail sent yesterday, work done during my day with Johnny
(RaghuB) points to a problem in the glusterd rpc port map having stale
entries for certain bricks as the cause for connection failures when
running in the multiplex mode.

It seems like this problem has been partly addressed in this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=1545048

What is occurring now is that glusterd retains older ports in its
mapping table for bricks that have recently terminated. When a
volume is stopped and restarted, this leads to connection failures from
clients, as there are no listeners on the now-stale port.

Test case as in [1], when run on my F27 machine fails 1 in 5 times with
the said error.

The above does narrow down failures in tests:
- lk-quorum.t
- br-state-check.t
- entry-self-heal.t
- bug-1363721.t (possibly)

The failure can be seen in the client mount logs as the use of a wrong
port number in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
6-patchy-client-2: changing port to 49156 (from 0)"; when there are
failures, the real port for the brick-mux process would be different.
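
A quick way to confirm this from a failed run is to compare the port the
client is reconnecting to against what glusterd advertises and what the
mux'd brick process actually listens on (a sketch; the mount log file name
is an assumption, it depends on the mount point):

# port the client was told to reconnect to
grep 'changing port to' /var/log/glusterfs/mnt-patchy.log | tail -3
# port glusterd currently advertises for each brick of the volume
gluster volume status patchy | grep '^Brick'
# ports the brick-mux process is really listening on
ss -ltnp | grep glusterfsd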

We also used gdb to inspect glusterd pmap registry and found that older
stale port map data is present (in function pmap_registry_search as
clients invoke a connection).
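
For anyone who wants to repeat the gdb inspection, the sequence was roughly
the following (a sketch; it assumes glusterd debug symbols are available):

# attach to the running management daemon and watch port-map lookups (sketch)
gdb -p $(pidof glusterd)
(gdb) break pmap_registry_search
(gdb) continue
# trigger a client (re)connect, then inspect the lookup when the breakpoint hits
(gdb) info args
(gdb) bt
(gdb) finish   # the return value is the port that will be handed back to the client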

Thanks,
Shyam

On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
> Hi,
> 
> Nigel pointed out that the nightly brick-mux tests are now failing for
> about 11 weeks and we do not have a clear run of the same.
> 
> Spent some time on Friday collecting what tests failed and to an extent
> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
> 
> Asks: Whoever has cycles, please look into these failures ASAP, as these
> failing tests are blockers for the 4.1 release, and overall the state of
> master (and hence the 4.1 release branch) is not clean when these tests
> have been failing for over 11 weeks.
> 
> Most of the tests fail if run on a local setup as well, so debugging the
> same should be easier than requiring the mux or regression setup; just
> ensure that mux is turned on (either by default in the code base you are
> testing, or by adding the line `TEST $CLI volume set all
> cluster.brick-multiplex on` to the test case after any cleanup and after
> starting glusterd).
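
To make that concrete, a minimal prologue for a local mux run could look
like the following (a sketch only; it assumes the standard test includes, a
test placed two levels under tests/, and a root shell against an installed
build):

#!/bin/bash
# minimal .t prologue with brick multiplexing forced on (sketch)
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;

TEST glusterd
TEST pidof glusterd
# turn mux on right after glusterd is up, before any volumes are created
TEST $CLI volume set all cluster.brick-multiplex on

TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0,1,2}
TEST $CLI volume start $V0

# <test body goes here>

cleanup;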
> 
> 1) A lot of test cases time out, of which, the following 2 have the most
> failures, and hence possibly can help with the debugging of the root
> cause faster. Request Glusterd and bitrot teams to look at this, as the
> failures do not seem to be in replicate or client-side layers (at present).
> 
> (number in brackets is # times this failed in the last 13 instances of
> mux testing)
> ./tests/basic/afr/entry-self-heal.t (4)
> ./tests/bitrot/br-state-check.t (8)
> 
> 2)
> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
> 
> The above test constantly fails at this point:
> 
> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
> of a volume
> 16:46:28 not ok 25 , LINENUM:47
> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
> 
> 
> From the logs the failure is occurring from here:
> 
> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
> 0-management: Failed to set extended attribute trusted.add-brick :
> Transport endpoint is not connected [Transport endpoint is not connected]
> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
> add bricks
> 
> 
> This seems like the added brick is not accepting connections.
> 
> 3) The following tests also show similar behaviour to (2), where the AFR
> check for brick up fails after a timeout, as the brick is not accepting
> connections.
> 
> ./tests/bugs/replicate/bug-1363721.t (4)
> ./tests/basic/afr/lk-quorum.t (5)
> 
> I would suggest someone familiar with mux process and also brick muxing
> look at these from the initialization/RPC/socket front, as these seem to
> be bricks that do not show errors in the logs but are failing connections.
> 
> As we find different root causes, we may want different bugs than the
> one filed, please do so and post patches in an effort to move this forward.
> 
> Thanks,
> Shyam


mux-failure.t
Description: Perl program