Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-25 Thread Shyam Ranganathan
Updates from tests: Last 5 runs on 4.1 have passed. Runs 55 and 58 failed on bug-1363721.t, whose fix was not merged before the tests were kicked off, hence I am still considering them as passing. Started 2 more runs on 4.1 [1], and possibly more during the day, to call an all clear on this blocker for the

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-24 Thread Shyam Ranganathan
After various analyses and fixes, here is the current state:
- Reverted 3 patches aimed at a proper cleanup sequence when a mux'd brick is detached [2]
- Fixed a core within the same patch, for the case of a lookup before the brick is ready
- Fixed a replicate test case that was partly failing due to cleanup

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-16 Thread Shyam Ranganathan
On 05/16/2018 10:34 AM, Shyam Ranganathan wrote:
> Some further analysis based on what Mohit commented on the patch:
>
> 1) gf_attach used to kill a brick is taking more time, causing timeouts
> in tests, mainly br-state-check.t. Usually when there are back to back
> kill_bricks in the test.

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-16 Thread Shyam Ranganathan
Some further analysis based on what Mohit commented on the patch:
1) gf_attach used to kill a brick is taking more time, causing timeouts in tests, mainly br-state-check.t. Usually when there are back to back kill_bricks in the test.
2) Problem in ./tests/bugs/replicate/bug-1363721.t seems to be

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-15 Thread Shyam Ranganathan
Hi, After the fix provided by Atin here [1] for the issue reported below, we ran 7-8 runs of brick mux regressions against this fix, and we have had 1/3 runs successful (even those have some tests retried). The run links are in the review at [1]. The failures are as below, sorted in descending

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
On 05/14/2018 08:35 PM, Shyam Ranganathan wrote:
> Further to the mail below,
>
> 1. Test bug-1559004-EMLINK-handling.t possibly just needs a larger
> script timeout in mux based testing. I can see no errors in the 2-3
> times that it has failed, other than taking over 1000 seconds. Further
>

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
Further to the mail below, 1. Test bug-1559004-EMLINK-handling.t possibly just needs a larger script timeout in mux based testing. I can see no errors in the 2-3 times that it has failed, other than taking over 1000 seconds. Further investigation on normal non-mux regression also shows that this

Re: [Gluster-devel] Brick-Mux tests failing for over 11+ weeks

2018-05-14 Thread Shyam Ranganathan
*** Calling out to Glusterd folks to take a look at this ASAP and provide a fix. *** Further to the mail sent yesterday, work done during my day with Johnny (RaghuB) points to a problem in the glusterd rpc port map, which has stale entries for certain bricks, as the cause for connection failures when running