Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-06 Thread Nithya Balachandran
On 2 August 2018 at 05:46, Shyam Ranganathan  wrote:

> Below is a summary of failures over the last 7 days on the nightly
> health check jobs. This is one test per line, sorted in descending order
> of occurrence (IOW, most frequent failure is on top).
>
> The list includes spurious failures as well, IOW passed on a retry. This
> is because if we do not weed out the spurious errors, failures may
> persist and make it difficult to gauge the health of the branch.
>
> The numbers at the end of each test line are the Jenkins job numbers where
> the test failed. The job number ranges are as follows:
> - https://build.gluster.org/job/regression-test-burn-in/ ID: 4048 - 4053
> - https://build.gluster.org/job/line-coverage/ ID: 392 - 407
> - https://build.gluster.org/job/regression-test-with-multiplex/ ID: 811
> - 817
>
> So to get to job 4051 (say), use the link
> https://build.gluster.org/job/regression-test-burn-in/4051
>
> Atin has called out some folks for attention to specific tests; consider
> this a call-out to others as well: if you see a test against your component,
> help with root-causing and fixing it is needed.
>
> tests/bugs/core/bug-1432542-mpx-restart-crash.t, 4049, 4051, 4052, 405,
> 404, 403, 396, 392
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t, 811, 814, 817, 4050, 4053
>
> tests/bugs/bug-1368312.t, 815, 816, 811, 813, 403
>
> tests/bugs/distribute/bug-1122443.t, 4050, 407, 403, 815, 816
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t,
> 814, 816, 817, 812, 815
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t, 4049, 812, 814, 405, 392
>
> tests/bitrot/bug-1373520.t, 811, 816, 817, 813
>
> tests/bugs/ec/bug-1236065.t, 812, 813, 815
>
> tests/00-geo-rep/georep-basic-dr-rsync.t, 813, 4046
>
> tests/basic/ec/ec-1468261.t, 817, 812
>
> tests/bugs/glusterd/quorum-validation.t, 4049, 407
>
> tests/bugs/quota/bug-1293601.t, 811, 812
>
> tests/basic/afr/add-brick-self-heal.t, 407
>
> tests/basic/afr/granular-esh/replace-brick.t, 392
>
> tests/bugs/core/multiplex-limit-issue-151.t, 405
>
> tests/bugs/distribute/bug-1042725.t, 405
>

I think this was caused by a failure to clean up the mounts from the
previous test. It succeeds on retry.

[16:59:12] Running tests in file ./tests/bugs/distribute/bug-1042725.t
./tests/bugs/distribute/bug-1042725.t ..
1..16
Aborting.

/mnt/nfs/1 could not be deleted, here are the left over items
drwxr-xr-x. 2 root root 6 Jul 31 16:59 /d/backends
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/glusterfs/0
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/glusterfs/1
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/glusterfs/2
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/glusterfs/3
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/nfs/0
drwxr-xr-x. 2 root root 4096 Jul 31 16:59 /mnt/nfs/1

Please correct the problem and try again.


I don't think there is anything to be done for this one.
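For context, the leftover mounts listed above are exactly the kind of state a
pre-test cleanup has to clear before the next test starts. A minimal sketch of
such a step (hypothetical, not the framework's actual cleanup code; it assumes
the standard mount points used by the regression tests):

  # force-unmount any stale FUSE/NFS mounts left behind by a previous test
  for m in /mnt/glusterfs/* /mnt/nfs/*; do
      if mountpoint -q "$m"; then
          umount -f -l "$m"      # lazy + force unmount of the stale mount
      fi
      rm -rf "${m:?}"/*          # then clear any leftover entries under it
  done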



>
> tests/bugs/distribute/bug-1117851.t, 405
>
> tests/bugs/glusterd/rebalance-operations-in-single-node.t, 405
>
> tests/bugs/index/bug-1559004-EMLINK-handling.t, 405
>
> tests/bugs/replicate/bug-1386188-sbrain-fav-child.t, 4048
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t, 813
>
>
>
> Thanks,
> Shyam
>
>
> On 07/30/2018 03:21 PM, Shyam Ranganathan wrote:
> > On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> >> 1) master branch health checks (weekly, till branching)
> >>   - Expect every Monday a status update on various tests runs
> >
> > See https://build.gluster.org/job/nightly-master/ for a report on
> > various nightly and periodic jobs on master.
> >
> > RED:
> > 1. Nightly regression (3/6 failed)
> > - Tests that reported failure:
> > ./tests/00-geo-rep/georep-basic-dr-rsync.t
> > ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
> > ./tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> > ./tests/bugs/distribute/bug-1122443.t
> >
> > - Tests that needed a retry:
> > ./tests/00-geo-rep/georep-basic-dr-tarssh.t
> > ./tests/bugs/glusterd/quorum-validation.t
> >
> > 2. Regression with multiplex (cores and test failures)
> >
> > 3. line-coverage (cores and test failures)
> > - Tests that failed:
> > ./tests/bugs/core/bug-1432542-mpx-restart-crash.t (patch
> > https://review.gluster.org/20568 does not fix the timeout entirely, as
> > can be seen in this run,
> > https://build.gluster.org/job/line-coverage/401/consoleFull )
> >
> > Calling out to contributors to take a look at various failures, and post
> > the same as bugs AND to the lists (so that duplication is avoided) to
> > get this to a GREEN status.
> >
> > GREEN:
> > 1. cpp-check
> > 2. RPM builds
> >
> > IGNORE (for now):
> > 1. clang scan (@nigel, this job requires clang warnings to be 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-06 Thread Nithya Balachandran
On 6 August 2018 at 18:03, Nithya Balachandran  wrote:

>
>
> On 2 August 2018 at 05:46, Shyam Ranganathan  wrote:
>
>> Below is a summary of failures over the last 7 days on the nightly
>> health check jobs. This is one test per line, sorted in descending order
>> of occurrence (IOW, most frequent failure is on top).
>>
>> The list includes spurious failures as well, IOW passed on a retry. This
>> is because if we do not weed out the spurious errors, failures may
>> persist and make it difficult to gauge the health of the branch.
>>
>> The numbers at the end of each test line are the Jenkins job numbers where
>> the test failed. The job number ranges are as follows:
>> - https://build.gluster.org/job/regression-test-burn-in/ ID: 4048 - 4053
>> - https://build.gluster.org/job/line-coverage/ ID: 392 - 407
>> - https://build.gluster.org/job/regression-test-with-multiplex/ ID: 811
>> - 817
>>
>> So to get to job 4051 (say), use the link
>> https://build.gluster.org/job/regression-test-burn-in/4051
>>
>> Atin has called out some folks for attention to specific tests; consider
>> this a call-out to others as well: if you see a test against your component,
>> help with root-causing and fixing it is needed.
>>
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t, 4049, 4051, 4052, 405,
>> 404, 403, 396, 392
>>
>> tests/00-geo-rep/georep-basic-dr-tarssh.t, 811, 814, 817, 4050, 4053
>>
>> tests/bugs/bug-1368312.t, 815, 816, 811, 813, 403
>>
>> tests/bugs/distribute/bug-1122443.t, 4050, 407, 403, 815, 816
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t,
>> 814, 816, 817, 812, 815
>>
>> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t, 4049, 812, 814, 405, 392
>>
>> tests/bitrot/bug-1373520.t, 811, 816, 817, 813
>>
>> tests/bugs/ec/bug-1236065.t, 812, 813, 815
>>
>> tests/00-geo-rep/georep-basic-dr-rsync.t, 813, 4046
>>
>> tests/basic/ec/ec-1468261.t, 817, 812
>>
>> tests/bugs/glusterd/quorum-validation.t, 4049, 407
>>
>> tests/bugs/quota/bug-1293601.t, 811, 812
>>
>> tests/basic/afr/add-brick-self-heal.t, 407
>>
>> tests/basic/afr/granular-esh/replace-brick.t, 392
>>
>> tests/bugs/core/multiplex-limit-issue-151.t, 405
>>
>> tests/bugs/distribute/bug-1042725.t, 405
>>
>> tests/bugs/distribute/bug-1117851.t, 405
>>
>
> From the non-lcov vs lcov runs:
>
> Non-lcov:
>
> [nbalacha@myserver glusterfs]$ grep TEST mnt-glusterfs-0.log
> [2018-07-31 16:30:36.930726]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 72 create_files /mnt/glusterfs/0 ++
> [2018-07-31 16:31:47.649022]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 75 glusterfs --entry-timeout=0 --attribute-timeout=0 -s
> builder104.cloud.gluster.org --volfile-id patchy /mnt/glusterfs/1
> ++
> [2018-07-31 16:31:47.746734]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 77 move_files /mnt/glusterfs/0 ++
> [2018-07-31 16:31:47.783606]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 78 move_files /mnt/glusterfs/1 ++
> [2018-07-31 16:31:47.842878]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 85 done cat /mnt/glusterfs/0/status_0 ++
> [2018-07-31 16:33:14.849807]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 86 done cat /mnt/glusterfs/1/status_1 ++
> [2018-07-31 16:33:14.872184]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 88 Y force_umount /mnt/glusterfs/0 ++
> [2018-07-31 16:33:14.900334]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 89 Y force_umount /mnt/glusterfs/1 ++
> [2018-07-31 16:33:14.929238]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 90 glusterfs --entry-timeout=0 --attribute-timeout=0 -s
> builder104.cloud.gluster.org --volfile-id patchy /mnt/glusterfs/0
> ++
> [2018-07-31 16:33:15.027094]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 91 check_files /mnt/glusterfs/0 ++
> [2018-07-31 16:33:20.268030]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 93 gluster --mode=script --wignore volume stop patchy ++
> [2018-07-31 16:33:22.392247]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 94 Stopped volinfo_field patchy Status ++
> [2018-07-31 16:33:22.492175]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 96 gluster --mode=script --wignore volume delete patchy ++
> [2018-07-31 16:33:25.475566]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 97 ! gluster --mode=script --wignore volume info patchy ++
>
>
> Total time for the tests: *169* seconds
>
>
> Lcov:
>
> [nbalacha@myserver glusterfs]$ grep TEST mnt-glusterfs-0.log
> [2018-08-06 08:33:05.737012]:++ 
> G_LOG:./tests/bugs/distribute/bug-1117851.t:
> TEST: 72 create_files /mnt/glusterfs/0 ++
> [2018-08-06 08:34:29.133045]:++ 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-06 Thread Nithya Balachandran
On 2 August 2018 at 05:46, Shyam Ranganathan  wrote:

> Below is a summary of failures over the last 7 days on the nightly
> health check jobs. This is one test per line, sorted in descending order
> of occurrence (IOW, most frequent failure is on top).
>
> The list includes spurious failures as well, IOW passed on a retry. This
> is because if we do not weed out the spurious errors, failures may
> persist and make it difficult to gauge the health of the branch.
>
> The numbers at the end of each test line are the Jenkins job numbers where
> the test failed. The job number ranges are as follows:
> - https://build.gluster.org/job/regression-test-burn-in/ ID: 4048 - 4053
> - https://build.gluster.org/job/line-coverage/ ID: 392 - 407
> - https://build.gluster.org/job/regression-test-with-multiplex/ ID: 811
> - 817
>
> So to get to job 4051 (say), use the link
> https://build.gluster.org/job/regression-test-burn-in/4051
>
> Atin has called out some folks for attention to specific tests; consider
> this a call-out to others as well: if you see a test against your component,
> help with root-causing and fixing it is needed.
>
> tests/bugs/core/bug-1432542-mpx-restart-crash.t, 4049, 4051, 4052, 405,
> 404, 403, 396, 392
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t, 811, 814, 817, 4050, 4053
>
> tests/bugs/bug-1368312.t, 815, 816, 811, 813, 403
>
> tests/bugs/distribute/bug-1122443.t, 4050, 407, 403, 815, 816
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t,
> 814, 816, 817, 812, 815
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t, 4049, 812, 814, 405, 392
>
> tests/bitrot/bug-1373520.t, 811, 816, 817, 813
>
> tests/bugs/ec/bug-1236065.t, 812, 813, 815
>
> tests/00-geo-rep/georep-basic-dr-rsync.t, 813, 4046
>
> tests/basic/ec/ec-1468261.t, 817, 812
>
> tests/bugs/glusterd/quorum-validation.t, 4049, 407
>
> tests/bugs/quota/bug-1293601.t, 811, 812
>
> tests/basic/afr/add-brick-self-heal.t, 407
>
> tests/basic/afr/granular-esh/replace-brick.t, 392
>
> tests/bugs/core/multiplex-limit-issue-151.t, 405
>
> tests/bugs/distribute/bug-1042725.t, 405
>
> tests/bugs/distribute/bug-1117851.t, 405
>

From the non-lcov vs lcov runs:

Non-lcov:

[nbalacha@myserver glusterfs]$ grep TEST mnt-glusterfs-0.log
[2018-07-31 16:30:36.930726]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 72 create_files
/mnt/glusterfs/0 ++
[2018-07-31 16:31:47.649022]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 75 glusterfs
--entry-timeout=0 --attribute-timeout=0 -s builder104.cloud.gluster.org
--volfile-id patchy /mnt/glusterfs/1 ++
[2018-07-31 16:31:47.746734]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 77 move_files
/mnt/glusterfs/0 ++
[2018-07-31 16:31:47.783606]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 78 move_files
/mnt/glusterfs/1 ++
[2018-07-31 16:31:47.842878]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 85 done cat
/mnt/glusterfs/0/status_0 ++
[2018-07-31 16:33:14.849807]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 86 done cat
/mnt/glusterfs/1/status_1 ++
[2018-07-31 16:33:14.872184]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 88 Y force_umount
/mnt/glusterfs/0 ++
[2018-07-31 16:33:14.900334]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 89 Y force_umount
/mnt/glusterfs/1 ++
[2018-07-31 16:33:14.929238]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 90 glusterfs
--entry-timeout=0 --attribute-timeout=0 -s builder104.cloud.gluster.org
--volfile-id patchy /mnt/glusterfs/0 ++
[2018-07-31 16:33:15.027094]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 91 check_files
/mnt/glusterfs/0 ++
[2018-07-31 16:33:20.268030]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 93 gluster --mode=script
--wignore volume stop patchy ++
[2018-07-31 16:33:22.392247]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 94 Stopped volinfo_field
patchy Status ++
[2018-07-31 16:33:22.492175]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 96 gluster --mode=script
--wignore volume delete patchy ++
[2018-07-31 16:33:25.475566]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 97 ! gluster
--mode=script --wignore volume info patchy ++


Total time for the tests: *169* seconds
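For reference, the total above can be derived from the first and last G_LOG
timestamps in the mount log. A rough sketch, assuming GNU date and the
timestamp format shown above:

  # elapsed seconds between the first and the last G_LOG entry in the mount log
  grep G_LOG mnt-glusterfs-0.log | sed -n '1p;$p' \
    | awk -F'[][]' '{ sub(/\..*/, "", $2); print $2 }' \
    | xargs -I{} date -d "{}" +%s \
    | awk 'NR == 1 { start = $1 } END { print $1 - start, "seconds" }'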


Lcov:

[nbalacha@myserver glusterfs]$ grep TEST mnt-glusterfs-0.log
[2018-08-06 08:33:05.737012]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 72 create_files
/mnt/glusterfs/0 ++
[2018-08-06 08:34:29.133045]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 75 glusterfs
--entry-timeout=0 --attribute-timeout=0 -s builder100.cloud.gluster.org
--volfile-id patchy /mnt/glusterfs/1 ++
[2018-08-06 08:34:29.257888]:++
G_LOG:./tests/bugs/distribute/bug-1117851.t: TEST: 77 move_files

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-05 Thread Atin Mukherjee
On Mon, 6 Aug 2018 at 06:09, Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Mon, Aug 6, 2018 at 5:17 AM, Amye Scavarda  wrote:
> >
> >
> > On Sun, Aug 5, 2018 at 3:24 PM Shyam Ranganathan 
> > wrote:
> >>
> >> On 07/31/2018 07:16 AM, Shyam Ranganathan wrote:
> >> > On 07/30/2018 03:21 PM, Shyam Ranganathan wrote:
> >> >> On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> >> >>> 1) master branch health checks (weekly, till branching)
> >> >>>   - Expect every Monday a status update on various tests runs
> >> >> See https://build.gluster.org/job/nightly-master/ for a report on
> >> >> various nightly and periodic jobs on master.
> >> > Thinking aloud, we may have to stop merges to master to get these test
> >> > failures addressed at the earliest and to continue maintaining them
> >> > GREEN for the health of the branch.
> >> >
> >> > I would give the above a week, before we lockdown the branch to fix
> the
> >> > failures.
> >> >
> >> > Let's try and get line-coverage and nightly regression tests addressed
> >> > this week (leaving mux-regression open), and if addressed not lock the
> >> > branch down.
> >> >
> >>
> >> Health on master as of the last nightly run [4] is still the same.
> >>
> >> Potential patches that rectify the situation (as in [1]) are bunched in
> >> a patch [2] that Atin and myself have put through several regressions
> >> (mux, normal and line coverage) and these have also not passed.
> >>
> >> Till we rectify the situation we are locking down master branch commit
> >> rights to the following people, Amar, Atin, Shyam, Vijay.
> >>
> >> The intention is to stabilize master and not add more patches that may
> >> destabilize it.
> >>
> >> Test cases that are tracked as failures and need action are present here
> >> [3].
> >>
> >> @Nigel, request you to apply the commit rights change as you see this
> >> mail and let the list know regarding the same as well.
> >>
> >> Thanks,
> >> Shyam
> >>
> >> [1] Patches that address regression failures:
> >> https://review.gluster.org/#/q/starredby:srangana%2540redhat.com
> >>
> >> [2] Bunched up patch against which regressions were run:
> >> https://review.gluster.org/#/c/20637
> >>
> >> [3] Failing tests list:
> >>
> >>
> https://docs.google.com/spreadsheets/d/1IF9GhpKah4bto19RQLr0y_Kkw26E_-crKALHSaSjZMQ/edit?usp=sharing
> >>
> >> [4] Nightly run dashboard:
> https://build.gluster.org/job/nightly-master/
>
> >
> > Locking master is fine, this seems like there's been ample notice and
> > conversation.
> > Do we have test criteria to indicate when we're unlocking master? X
> amount
> > of tests passing, Y amount of bugs?
>
> The "till we rectify" might just include 3 days of the entire set of
> tests passing - thinking out loud here.


3 days = 3 nightly regressions isn’t enough, as most failures are spurious
in nature (IMHO). What Shyam and I are doing is retriggering various
regressions on top of patch [2]. We’re looking for at least 10 iterations
to go through without any test retries or failures.


>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
-- 
--Atin
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-05 Thread Sankarshan Mukhopadhyay
On Mon, Aug 6, 2018 at 5:17 AM, Amye Scavarda  wrote:
>
>
> On Sun, Aug 5, 2018 at 3:24 PM Shyam Ranganathan 
> wrote:
>>
>> On 07/31/2018 07:16 AM, Shyam Ranganathan wrote:
>> > On 07/30/2018 03:21 PM, Shyam Ranganathan wrote:
>> >> On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
>> >>> 1) master branch health checks (weekly, till branching)
>> >>>   - Expect every Monday a status update on various tests runs
>> >> See https://build.gluster.org/job/nightly-master/ for a report on
>> >> various nightly and periodic jobs on master.
>> > Thinking aloud, we may have to stop merges to master to get these test
>> > failures addressed at the earliest and to continue maintaining them
>> > GREEN for the health of the branch.
>> >
>> > I would give the above a week, before we lockdown the branch to fix the
>> > failures.
>> >
>> > Let's try and get line-coverage and nightly regression tests addressed
>> > this week (leaving mux-regression open), and if addressed not lock the
>> > branch down.
>> >
>>
>> Health on master as of the last nightly run [4] is still the same.
>>
>> Potential patches that rectify the situation (as in [1]) are bunched in
>> a patch [2] that Atin and myself have put through several regressions
>> (mux, normal and line coverage) and these have also not passed.
>>
>> Till we rectify the situation we are locking down master branch commit
>> rights to the following people, Amar, Atin, Shyam, Vijay.
>>
>> The intention is to stabilize master and not add more patches that may
>> destabilize it.
>>
>> Test cases that are tracked as failures and need action are present here
>> [3].
>>
>> @Nigel, request you to apply the commit rights change as you see this
>> mail and let the list know regarding the same as well.
>>
>> Thanks,
>> Shyam
>>
>> [1] Patches that address regression failures:
>> https://review.gluster.org/#/q/starredby:srangana%2540redhat.com
>>
>> [2] Bunched up patch against which regressions were run:
>> https://review.gluster.org/#/c/20637
>>
>> [3] Failing tests list:
>>
>> https://docs.google.com/spreadsheets/d/1IF9GhpKah4bto19RQLr0y_Kkw26E_-crKALHSaSjZMQ/edit?usp=sharing
>>
>> [4] Nightly run dashboard: https://build.gluster.org/job/nightly-master/

>
> Locking master is fine, this seems like there's been ample notice and
> conversation.
> Do we have test criteria to indicate when we're unlocking master? X amount
> of tests passing, Y amount of bugs?

The "till we rectify" might just include 3 days of the entire set of
tests passing - thinking out loud here.


-- 
sankarshan mukhopadhyay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-05 Thread Amye Scavarda
On Sun, Aug 5, 2018 at 3:24 PM Shyam Ranganathan 
wrote:

> On 07/31/2018 07:16 AM, Shyam Ranganathan wrote:
> > On 07/30/2018 03:21 PM, Shyam Ranganathan wrote:
> >> On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
> >>> 1) master branch health checks (weekly, till branching)
> >>>   - Expect every Monday a status update on various tests runs
> >> See https://build.gluster.org/job/nightly-master/ for a report on
> >> various nightly and periodic jobs on master.
> > Thinking aloud, we may have to stop merges to master to get these test
> > failures addressed at the earliest and to continue maintaining them
> > GREEN for the health of the branch.
> >
> > I would give the above a week, before we lockdown the branch to fix the
> > failures.
> >
> > Let's try and get line-coverage and nightly regression tests addressed
> > this week (leaving mux-regression open), and if addressed not lock the
> > branch down.
> >
>
> Health on master as of the last nightly run [4] is still the same.
>
> Potential patches that rectify the situation (as in [1]) are bunched in
> a patch [2] that Atin and myself have put through several regressions
> (mux, normal and line coverage) and these have also not passed.
>
> Till we rectify the situation we are locking down master branch commit
> rights to the following people, Amar, Atin, Shyam, Vijay.
>
> The intention is to stabilize master and not add more patches that may
> destabilize it.
>
> Test cases that are tracked as failures and need action are present here
> [3].
>
> @Nigel, request you to apply the commit rights change as you see this
> mail and let the list know regarding the same as well.
>
> Thanks,
> Shyam
>
> [1] Patches that address regression failures:
> https://review.gluster.org/#/q/starredby:srangana%2540redhat.com
>
> [2] Bunched up patch against which regressions were run:
> https://review.gluster.org/#/c/20637
>
> [3] Failing tests list:
>
> https://docs.google.com/spreadsheets/d/1IF9GhpKah4bto19RQLr0y_Kkw26E_-crKALHSaSjZMQ/edit?usp=sharing
>
> [4] Nightly run dashboard: https://build.gluster.org/job/nightly-master/
> ___
> maintainers mailing list
> maintain...@gluster.org
> https://lists.gluster.org/mailman/listinfo/maintainers
>

Locking master is fine, this seems like there's been ample notice and
conversation.
Do we have test criteria to indicate when we're unlocking master? X amount
of tests passing, Y amount of bugs?
- amye

-- 
Amye Scavarda | a...@redhat.com | Gluster Community Lead
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Nithya Balachandran
On 31 July 2018 at 22:11, Atin Mukherjee  wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
> 400 secs. Refer https://fstat.gluster.org/failure/209?state=2_
> date=2018-06-30_date=2018-07-31=all, specifically the latest
> report https://build.gluster.org/job/regression-test-burn-in/4051/
> consoleText . Wasn't timing out as frequently as it was till 12 July. But
> since 27 July, it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient enough (Mohit?)
>

One of the failed regression-test-burn-in runs was an actual failure, not a
timeout.
https://build.gluster.org/job/regression-test-burn-in/4049

The brick disconnects from glusterd:

[2018-07-27 16:28:42.882668] I [MSGID: 106005]
[glusterd-handler.c:6129:__glusterd_brick_rpc_notify] 0-management: Brick
builder103.cloud.gluster.org:/d/backends/vol01/brick0 has disconnected from
glusterd.
[2018-07-27 16:28:42.891031] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick
/d/backends/vol01/brick0 on port 49152
[2018-07-27 16:28:42.892379] I [MSGID: 106143]
[glusterd-pmap.c:397:pmap_registry_remove] 0-pmap: removing brick (null) on
port 49152
[2018-07-27 16:29:02.636027]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS
--attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org
--volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++


So the client cannot connect to the bricks after this as it never gets the
port info from glusterd. From mnt-glusterfs-vol20.log:

[2018-07-27 16:29:02.769947] I [MSGID: 114020] [client.c:2329:notify]
0-patchy-vol20-client-1: parent translators are ready, attempting connect
on transport
[2018-07-27 16:29:02.770677] E [MSGID: 114058]
[client-handshake.c:1518:client_query_portmap_cbk]
0-patchy-vol20-client-0: failed
to get the port number for remote subvolume. Please run 'gluster volume
status' on server to see if brick process is running.
[2018-07-27 16:29:02.770767] I [MSGID: 114018]
[client.c:2255:client_rpc_notify] 0-patchy-vol20-client-0: disconnected
from patchy-vol20-client-0. Client process will keep trying to connect to
glusterd until brick's port is available
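As the error message itself suggests, the state at that point can be confirmed
with standard commands, for example (volume name taken from the log above):

  gluster volume status patchy-vol20   # is the brick shown online, and on which port?
  pgrep -af glusterfsd                 # is the (multiplexed) brick process still running at all?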


From the brick logs:
[2018-07-27 16:28:34.729241] I [login.c:111:gf_auth] 0-auth/login: allowed
user names: 2b65c380-392e-459f-b722-c130aac29377
[2018-07-27 16:28:34.945474] I [MSGID: 115029]
[server-handshake.c:786:server_setvolume] 0-patchy-vol01-server: accepted
client from
CTX_ID:72dcd65e-2125-4a79-8331-48c0fe9abce7-GRAPH_ID:0-PID:8483-HOST:builder103.cloud.gluster.org-PC_NAME:patchy-vol06-client-2-RECON_NO:-0
(version: 4.2dev)
[2018-07-27 16:28:35.946588] I [MSGID: 101016]
[glusterfs3.h:739:dict_to_xdr] 0-dict: key 'glusterfs.xattrop_index_gfid'
is would not be sent on wire in future [Invalid argument]   <--- Last
Brick Log. It looks like the brick went down at this point.
[2018-07-27 16:29:02.636027]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 56 _GFS
--attribute-timeout=0 --entry-timeout=0 -s builder103.cloud.gluster.org
--volfile-id=patchy-vol20 /mnt/glusterfs/vol20 ++
[2018-07-27 16:29:12.021827]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 83 dd
if=/dev/zero of=/mnt/glusterfs/vol20/a_file bs=4k count=1 ++
[2018-07-27 16:29:12.039248]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 87 killall
-9 glusterd ++
[2018-07-27 16:29:17.073995]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 89 killall
-9 glusterfsd ++
[2018-07-27 16:29:22.096385]:++
G_LOG:./tests/bugs/core/bug-1432542-mpx-restart-crash.t: TEST: 95 glusterd
++
[2018-07-27 16:29:24.481555] I [MSGID: 100030] [glusterfsd.c:2728:main]
0-/build/install/sbin/glusterfsd: Started running
/build/install/sbin/glusterfsd version 4.2dev (args:
/build/install/sbin/glusterfsd -s builder103.cloud.gluster.org --volfile-id
patchy-vol01.builder103.cloud.gluster.org.d-backends-vol01-brick0 -p
/var/run/gluster/vols/patchy-vol01/builder103.cloud.gluster.org-d-backends-vol01-brick0.pid
-S /var/run/gluster/f4d6c8f7c3f85b18.socket --brick-name
/d/backends/vol01/brick0 -l
/var/log/glusterfs/bricks/d-backends-vol01-brick0.log --xlator-option
*-posix.glusterd-uuid=0db25f79-8880-4f2d-b1e8-584e751ff0b9 --process-name
brick --brick-port 49153 --xlator-option

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Milind Changire
On Fri, Aug 3, 2018 at 11:04 AM, Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

> On Thu, Aug 2, 2018 at 10:03 PM Pranith Kumar Karampuri <
> pkara...@redhat.com> wrote:
>
>> On Thu, Aug 2, 2018 at 7:19 PM Atin Mukherjee 
>> wrote:
>>
>>> New addition - tests/basic/volume.t - failed at least twice with an shd core.
>>>
>>> One such ref - https://build.gluster.org/job/centos7-regression/2058/
>>> console
>>>
>>
>> I will take a look.
>>
>
> The crash is happening inside libc and there are no line numbers to debug
> further. Is there any way to get symbols and line numbers even for that? We
> can then look for hints as to what could be going wrong. Let me try to
> re-create it on the machines I have in the meanwhile.
>
> (gdb) bt
> #0  0x7feae916bb4f in _IO_cleanup () from ./lib64/libc.so.6
> #1  0x7feae9127b8b in __run_exit_handlers () from ./lib64/libc.so.6
> #2  0x7feae9127c27 in exit () from ./lib64/libc.so.6
> #3  0x00408ba5 in cleanup_and_exit (signum=15) at
> /home/jenkins/root/workspace/centos7-regression/glusterfsd/
> src/glusterfsd.c:1570
> #4  0x0040a75f in glusterfs_sigwaiter (arg=0x7ffe6faa7540) at
> /home/jenkins/root/workspace/centos7-regression/glusterfsd/
> src/glusterfsd.c:2332
> #5  0x7feae9b27e25 in start_thread () from ./lib64/libpthread.so.0
> #6  0x7feae91ecbad in clone () from ./lib64/libc.so.6
>
You could install glibc-debuginfo and the other relevant debuginfo packages on
the system where you are trying to reproduce the issue. That will get you the
line numbers and symbols.
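A minimal sketch of what that looks like on a CentOS 7 builder (assuming
yum-utils and the debuginfo repositories are available; the core file path is
illustrative):

  # pull glibc debug symbols so the libc frames in the backtrace resolve
  sudo debuginfo-install -y glibc
  # re-open the core with gdb; 'bt' should now show symbols and line numbers for libc
  gdb /build/install/sbin/glusterfsd /path/to/core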


>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
>>> sankarshan.mukhopadh...@gmail.com> wrote:
>>>
 On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
  wrote:
 > I am facing different issue in softserve machines. The fuse mount
 itself is
 > failing.
 > I tried day before yesterday to debug geo-rep failures. I discussed
 with
 > Raghu,
 > but could not root cause it. So none of the tests were passing. It
 happened
 > on
 > both machine instances I tried.
 >

 Ugh! -infra team should have an issue to work with and resolve this.


 --
 sankarshan mukhopadhyay
 
 ___
 Gluster-devel mailing list
 Gluster-devel@gluster.org
 https://lists.gluster.org/mailman/listinfo/gluster-devel

>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
>>
>>
>> --
>> Pranith
>>
>
>
> --
> Pranith
>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>



-- 
Milind
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Raghavendra Gowdappa
On Fri, Aug 3, 2018 at 4:01 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Hi Du/Poornima,
>
> I was analysing bitrot and geo-rep failures and I suspect there is a bug
> in some perf xlator that was one of the causes. I was seeing the following
> behaviour in a few runs.
>
> 1. Geo-rep synced data to the slave. It creates an empty file and then rsync
> syncs the data. But the test does "stat --format "%F" " to confirm. If it's
> empty, it returns "regular empty file", else "regular file". I believe it
> kept getting "regular empty file" instead of "regular file" until the timeout.
>

https://review.gluster.org/20549 might be relevant.


> 2. The other behaviour is with bitrot, with brick-mux. If a file is deleted
> on the backend on one brick and a lookup is then done, which performance
> xlators need to be disabled so that the lookup/revalidate reaches the brick
> where the file was deleted? Earlier, disabling only md-cache used to work.
> Now it's failing intermittently.
>

You need to disable readdirplus in the entire stack. Refer to
https://lists.gluster.org/pipermail/gluster-users/2017-March/030148.html
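A rough sketch of the kind of settings that implies (option names are from
memory and should be double-checked against the thread above and `gluster
volume set help`; the volume name and <server> are placeholders from the test
setup):

  # turn off the xlators that force/serve readdirp for volume "patchy"
  gluster volume set patchy performance.readdir-ahead off
  gluster volume set patchy performance.force-readdirp off
  # and mount the client itself without readdirp
  glusterfs --volfile-server=<server> --volfile-id=patchy --use-readdirp=no /mnt/glusterfs/0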


> Are there any pending patches around these areas that need to be merged?
> If there are, they could be affecting other tests as well.
>
> Thanks,
> Kotresh HR
>
> On Fri, Aug 3, 2018 at 3:07 PM, Karthik Subrahmanya 
> wrote:
>
>>
>>
>> On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya 
>> wrote:
>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya 
>>> wrote:
>>>


 On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, 
 wrote:

> I just went through the nightly regression report of brick mux runs
> and here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
> after 400 secs. Refer https://fstat.gluster.org/fail
> ure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report https://build.gluster.org/job/
> regression-test-burn-in/4051/consoleText . Wasn't timing out as
> frequently as it was till 12 July. But since 27 July, it has timed out
> twice. Beginning to believe commit 
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
> has added the delay and now 400 secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-multiplex
> /814/console) -  Test fails only in brick-mux mode, AI on Atin to
> look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiple
> x/813/console) - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=
> 2018-06-30_date=2018-07-31=all. Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job
> /regression-test-with-multiplex/812/console) - Hasn't failed after 26
> July and earlier it was failing regularly. Did we fix this test through 
> any
> patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/job
> /regression-test-with-multiplex/811/console)  - Hasn't failed after
> 27 July and earlier it was failing regularly. Did we fix this test through
> any patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
> core, not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> =
> Fails for non-brick mux case too
> 
> 
> =
> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
> very often, with out brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
> . There's an email in gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be recent failures (
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Karthik Subrahmanya
On Fri, Aug 3, 2018 at 3:07 PM Karthik Subrahmanya 
wrote:

>
>
> On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya 
> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya 
>> wrote:
>>
>>>
>>>
>>> On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.


 =
 Fails only with brick-mux

 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer
 https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
 specifically the latest report
 https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
 . Wasn't timing out as frequently as it was till 12 July. But since 27
 July, it has timed out twice. Beginning to believe commit
 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/814/console)
 -  Test fails only in brick-mux mode, AI on Atin to look at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-multiplex/813/console)
 - Seems like failed just twice in last 30 days as per
 https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
 Need help from AFR team.

 tests/bugs/quota/bug-1293601.t (
 https://build.gluster.org/job/regression-test-with-multiplex/812/console)
 - Hasn't failed after 26 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

 tests/bitrot/bug-1373520.t - (
 https://build.gluster.org/job/regression-test-with-multiplex/811/console)
 - Hasn't failed after 27 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
 not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/806/console
 . Seems to be a glustershd crash. Need help from AFR folks.


 =
 Fails for non-brick mux case too

 =
 tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
 very often, with out brick mux as well. Refer
 https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
 . There's an email in gluster-devel and a BZ 1610240 for the same.

 tests/bugs/bug-1368312.t - Seems to be recent failures (
 https://build.gluster.org/job/regression-test-with-multiplex/815/console)
 - seems to be a new failure, however seen this for a non-brick-mux case too
 -
 https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
 . Need some eyes from AFR folks.

 tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to
 brick mux, have seen this failing at multiple default regression runs.
 Refer
 https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
 . We need help from geo-rep dev to root cause this earlier than later

 tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
 mux, have seen this failing at multiple default regression runs. Refer
 https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
 . We need help from geo-rep dev to root cause this earlier than later

 tests/bugs/glusterd/validating-server-quorum.t (
 https://build.gluster.org/job/regression-test-with-multiplex/810/console)
 - Fails for non-brick-mux cases too,
 https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
 .  Atin has a patch https://review.gluster.org/20584 which resolves it
 but patch is failing regression for a different test which is unrelated.

 tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
 (Ref -
 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Kotresh Hiremath Ravishankar
Hi Du/Poornima,

I was analysing bitrot and geo-rep failures and I suspect there is a bug in
some perf xlator that was one of the causes. I was seeing the following
behaviour in a few runs.

1. Geo-rep synced data to the slave. It creates an empty file and then rsync
syncs the data. But the test does "stat --format "%F" " to confirm. If it's
empty, it returns "regular empty file", else "regular file". I believe it kept
getting "regular empty file" instead of "regular file" until the timeout (see
the sketch after this list).

2. The other behaviour is with bitrot, with brick-mux. If a file is deleted on
the backend on one brick and a lookup is then done, which performance xlators
need to be disabled so that the lookup/revalidate reaches the brick where the
file was deleted? Earlier, disabling only md-cache used to work. Now it's
failing intermittently.
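To make item 1 concrete, here is a sketch of the kind of check involved, using
the EXPECT_WITHIN helper from the test framework (the path and timeout are
illustrative, not the actual test's values):

  file_type () {
      stat --format "%F" "$1"
  }
  # poll until rsync has written data and the type flips from
  # "regular empty file" to "regular file", or fail after the timeout
  EXPECT_WITHIN 120 "regular file" file_type /mnt/slave-mount/synced_file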

Are there any pending patches around these areas that need to be merged?
If there are, they could be affecting other tests as well.

Thanks,
Kotresh HR

On Fri, Aug 3, 2018 at 3:07 PM, Karthik Subrahmanya 
wrote:

>
>
> On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya 
> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya 
>> wrote:
>>
>>>
>>>
>>> On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.

 
 
 =
 Fails only with brick-mux
 
 
 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer https://fstat.gluster.org/failure/209?state=2_
 date=2018-06-30_date=2018-07-31=all, specifically the
 latest report https://build.gluster.org/job/
 regression-test-burn-in/4051/consoleText . Wasn't timing out as
 frequently as it was till 12 July. But since 27 July, it has timed out
 twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
 has added the delay and now 400 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref - https://build.gluster.org/job/regression-test-with-
 multiplex/814/console) -  Test fails only in brick-mux mode, AI on
 Atin to look at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-
 multiplex/813/console) - Seems like failed just twice in last 30 days
 as per https://fstat.gluster.org/failure/251?state=2_
 date=2018-06-30_date=2018-07-31=all. Need help from AFR
 team.

 tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
 job/regression-test-with-multiplex/812/console) - Hasn't failed after
 26 July and earlier it was failing regularly. Did we fix this test through
 any patch (Mohit?)

 tests/bitrot/bug-1373520.t - (https://build.gluster.org/
 job/regression-test-with-multiplex/811/console)  - Hasn't failed after
 27 July and earlier it was failing regularly. Did we fix this test through
 any patch (Mohit?)

 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
 core, not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref - https://build.gluster.org/job/
 regression-test-with-multiplex/806/console . Seems to be a glustershd
 crash. Need help from AFR folks.

 
 
 =
 Fails for non-brick mux case too
 
 
 =
 tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
 very often, with out brick mux as well. Refer
 https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
 . There's an email in gluster-devel and a BZ 1610240 for the same.

 tests/bugs/bug-1368312.t - Seems to be recent failures (
 https://build.gluster.org/job/regression-test-with-
 multiplex/815/console) - seems to be a new failure, however seen this
 for a non-brick-mux case too - https://build.gluster.org/job/
 regression-test-burn-in/4039/consoleText . Need some eyes from AFR
 folks.

 tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to
 brick mux, have seen this failing at multiple default regression runs.
 Refer 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Karthik Subrahmanya
On Fri, Aug 3, 2018 at 2:12 PM Karthik Subrahmanya 
wrote:

>
>
> On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya 
> wrote:
>
>>
>>
>> On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee, 
>> wrote:
>>
>>> I just went through the nightly regression report of brick mux runs and
>>> here's what I can summarize.
>>>
>>>
>>> =
>>> Fails only with brick-mux
>>>
>>> =
>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>>> 400 secs. Refer
>>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>>> specifically the latest report
>>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
>>> . Wasn't timing out as frequently as it was till 12 July. But since 27
>>> July, it has timed out twice. Beginning to believe commit
>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>>> secs isn't sufficient enough (Mohit?)
>>>
>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>> (Ref -
>>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>>
>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>> - Seems like failed just twice in last 30 days as per
>>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>>> Need help from AFR team.
>>>
>>> tests/bugs/quota/bug-1293601.t (
>>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>>> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
>>> fix this test through any patch (Mohit?)
>>>
>>> tests/bitrot/bug-1373520.t - (
>>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>>> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
>>> fix this test through any patch (Mohit?)
>>>
>>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>>> not sure if related to brick mux or not, so not sure if brick mux is
>>> culprit here or not. Ref -
>>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>>> . Seems to be a glustershd crash. Need help from AFR folks.
>>>
>>>
>>> =
>>> Fails for non-brick mux case too
>>>
>>> =
>>> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
>>> very often, with out brick mux as well. Refer
>>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
>>> . There's an email in gluster-devel and a BZ 1610240 for the same.
>>>
>>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>>> - seems to be a new failure, however seen this for a non-brick-mux case too
>>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>>> . Need some eyes from AFR folks.
>>>
>>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
>>> mux, have seen this failing at multiple default regression runs. Refer
>>> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
>>> . We need help from geo-rep dev to root cause this earlier than later
>>>
>>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>>> mux, have seen this failing at multiple default regression runs. Refer
>>> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
>>> . We need help from geo-rep dev to root cause this earlier than later
>>>
>>> tests/bugs/glusterd/validating-server-quorum.t (
>>> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>>> - Fails for non-brick-mux cases too,
>>> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
>>> .  Atin has a patch https://review.gluster.org/20584 which resolves it
>>> but patch is failing regression for a different test which is unrelated.
>>>
>>> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>>> (Ref -
>>> https://build.gluster.org/job/regression-test-with-multiplex/809/console)
>>> - fails for non brick mux case too -
>>> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText
>>> - Need some eyes from AFR 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Nithya Balachandran
On 31 July 2018 at 22:11, Atin Mukherjee  wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
> 400 secs. Refer https://fstat.gluster.org/failure/209?state=2_
> date=2018-06-30_date=2018-07-31=all, specifically the latest
> report https://build.gluster.org/job/regression-test-burn-in/4051/
> consoleText . Wasn't timing out as frequently as it was till 12 July. But
> since 27 July, it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/814/console) -  Test fails only in brick-mux mode, AI on Atin
> to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_
> date=2018-06-30_date=2018-07-31=all. Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
> job/regression-test-with-multiplex/812/console) - Hasn't failed after 26
> July and earlier it was failing regularly. Did we fix this test through any
> patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
> job/regression-test-with-multiplex/811/console)  - Hasn't failed after 27
> July and earlier it was failing regularly. Did we fix this test through any
> patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
> not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> =
> Fails for non-brick mux case too
> 
> 
> =
> tests/bugs/distribute/bug-1122443.t - Seems to be failing at my setup
> very often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email in gluster-devel and a BZ 1610240 for the same.
>

Not a spurious failure. This is a bug introduced by commit
7131de81f72dda0ef685ed60d0887c6e14289b8c. I have provided more details in
the other email thread around this.

regards,
Nithya


>
>
> tests/bugs/bug-1368312.t - Seems to be recent failures (
> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
> - seems to be a new failure, however seen this for a non-brick-mux case too
> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
> . Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2_
> date=2018-06-30_date=2018-07-31=all . We need help from
> geo-rep dev to root cause this earlier than later
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2_
> date=2018-06-30_date=2018-07-31=all . We need help from
> geo-rep dev to root cause this earlier than later
>
> tests/bugs/glusterd/validating-server-quorum.t (https://build.gluster.org/
> job/regression-test-with-multiplex/810/console) - Fails for non-brick-mux
> cases too, https://fstat.gluster.org/failure/580?state=2_
> date=2018-06-30_date=2018-07-31=all .  Atin has a patch
> https://review.gluster.org/20584 which resolves it but patch is failing
> regression for a different test which is unrelated.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/809/console) - fails for non brick mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
> Need some eyes from AFR folks.
>
> ___
> maintainers mailing list
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-03 Thread Karthik Subrahmanya
On Thu, Aug 2, 2018 at 11:00 PM Karthik Subrahmanya 
wrote:

>
>
> On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee,  wrote:
>
>> I just went through the nightly regression report of brick mux runs and
>> here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>> 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
>> it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>> not sure if related to brick mux or not, so not sure if brick mux is
>> culprit here or not. Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>> . Seems to be a glustershd crash. Need help from AFR folks.
>>
>>
>> =
>> Fails for non-brick mux case too
>>
>> =
>> tests/bugs/distribute/bug-1122443.t 0 Seems to be failing at my setup
>> very often, with out brick mux as well. Refer
>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
>> There's an email in gluster-devel and a BZ 1610240 for the same.
>>
>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>> - seems to be a new failure, however seen this for a non-brick-mux case too
>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>> . Need some eyes from AFR folks.
>>
>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/bugs/glusterd/validating-server-quorum.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>> - Fails for non-brick-mux cases too,
>> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
>> .  Atin has a patch https://review.gluster.org/20584 which resolves it
>> but patch is failing regression for a different test which is unrelated.
>>
>> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/809/console)
>> - fails for non brick mux case too -
>> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
>> Need some eyes from AFR folks.
>>
> I am looking at this. It is not reproducible locally. Trying to do this on
> softserve.
>

It is not failing on the softserve machine either, where the

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Pranith Kumar Karampuri
On Thu, Aug 2, 2018 at 10:03 PM Pranith Kumar Karampuri 
wrote:

>
>
> On Thu, Aug 2, 2018 at 7:19 PM Atin Mukherjee  wrote:
>
>> New addition - tests/basic/volume.t - failed at least twice with a shd core.
>>
>> One such ref -
>> https://build.gluster.org/job/centos7-regression/2058/console
>>
>
> I will take a look.
>

The crash is happening inside libc and there are no line numbers to debug
further. Is there any way to get symbols and line numbers even for that? That
would give us hints as to what could be going wrong. Let me try to re-create it
on the machines I have in the meantime.

(gdb) bt
#0  0x7feae916bb4f in _IO_cleanup () from ./lib64/libc.so.6
#1  0x7feae9127b8b in __run_exit_handlers () from ./lib64/libc.so.6
#2  0x7feae9127c27 in exit () from ./lib64/libc.so.6
#3  0x00408ba5 in cleanup_and_exit (signum=15) at
/home/jenkins/root/workspace/centos7-regression/glusterfsd/src/glusterfsd.c:1570
#4  0x0040a75f in glusterfs_sigwaiter (arg=0x7ffe6faa7540) at
/home/jenkins/root/workspace/centos7-regression/glusterfsd/src/glusterfsd.c:2332
#5  0x7feae9b27e25 in start_thread () from ./lib64/libpthread.so.0
#6  0x7feae91ecbad in clone () from ./lib64/libc.so.6
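
For reference, one way to pull the libc symbols in on the builder would be
something like this (a sketch only; it assumes the CentOS 7 debuginfo
repositories are reachable from the node, and the binary/core paths are
illustrative):

    # install debug symbols for glibc so the libc frames resolve to file:line
    yum install -y yum-utils
    debuginfo-install -y glibc

    # re-open the core against the matching binary and dump a full backtrace
    gdb /build/install/sbin/glusterfsd /path/to/core -ex 'bt full' -ex 'quit'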


>
>>
>>
>> On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
>> sankarshan.mukhopadh...@gmail.com> wrote:
>>
>>> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>>>  wrote:
>>> > I am facing different issue in softserve machines. The fuse mount
>>> itself is
>>> > failing.
>>> > I tried day before yesterday to debug geo-rep failures. I discussed
>>> with
>>> > Raghu,
>>> > but could not root cause it. So none of the tests were passing. It
>>> happened
>>> > on
>>> > both machine instances I tried.
>>> >
>>>
>>> Ugh! -infra team should have an issue to work with and resolve this.
>>>
>>>
>>> --
>>> sankarshan mukhopadhyay
>>> 
>>> ___
>>> Gluster-devel mailing list
>>> Gluster-devel@gluster.org
>>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>>
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
>
>
> --
> Pranith
>


-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Karthik Subrahmanya
On Tue 31 Jul, 2018, 10:17 PM Atin Mukherjee,  wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
>
> =
> Fails only with brick-mux
>
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400
> secs. Refer
> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
> it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
> Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (
> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (
> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>
> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
> not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/806/console
> . Seems to be a glustershd crash. Need help from AFR folks.
>
>
> =
> Fails for non-brick mux case too
>
> =
> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup very
> often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email in gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be recent failures (
> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
> - seems to be a new failure, however seen this for a non-brick-mux case too
> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
> . Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from geo-rep dev to root cause this earlier than later
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from geo-rep dev to root cause this earlier than later
>
> tests/bugs/glusterd/validating-server-quorum.t (
> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
> - Fails for non-brick-mux cases too,
> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
> .  Atin has a patch https://review.gluster.org/20584 which resolves it
> but patch is failing regression for a different test which is unrelated.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/809/console)
> - fails for non brick mux case too -
> https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
> Need some eyes from AFR folks.
>
I am looking at this. It is not reproducible locally. Trying to do this on
soft serve.

Regards,
Karthik

> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
I have attached the logs in the bug: https://bugzilla.redhat.com/show_bug.cgi?id=1611635


On Thu, 2 Aug 2018, 22:21 Raghavendra Gowdappa,  wrote:

>
>
> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>> I am facing different issue in softserve machines. The fuse mount itself
>> is failing.
>> I tried day before yesterday to debug geo-rep failures. I discussed with
>> Raghu,
>> but could not root cause it.
>>
>
> Where can I find the complete client logs for this?
>
> So none of the tests were passing. It happened on
>> both machine instances I tried.
>>
>> 
>> [2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
>> event 6 on graph 0
>> [2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
>> 0-fuse: mount status is 0
>> [2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
>> 0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
>> fopen_keep_cache automatically.
>> [2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
>> 0-glusterfs-fuse: writev() result 40/40
>> [2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
>> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
>> 7.22
>> [2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync]
>> 0-fuse: switched to graph 0
>> [2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
>> meta-autoload
>> [2018-07-31 10:41:49.289743] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from meta-autoload to master
>> [2018-07-31 10:41:49.289787] T [MSGID: 0]
>> [io-stats.c:2788:io_stats_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master to master-md-cache
>> [2018-07-31 10:41:49.289833] T [MSGID: 0]
>> [md-cache.c:513:mdc_inode_iatt_get] 0-md-cache: mdc_inode_ctx_get failed
>> (----0001)
>> [2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
>> to master-open-behind
>> [2018-07-31 10:41:49.289946] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-open-behind to master-quick-read
>> [2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
>> master-quick-read to master-io-cache
>> [2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
>> to master-readdir-ahead
>> [2018-07-31 10:41:49.290034] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-readdir-ahead to master-read-ahead
>> [2018-07-31 10:41:49.290052] T [MSGID: 0]
>> [defaults.c:2716:default_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-read-ahead to master-write-behind
>> [2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
>> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
>> master-write-behind to master-dht
>> [2018-07-31 10:41:49.290156] D [MSGID: 0]
>> [dht-common.c:3674:dht_do_fresh_lookup] 0-master-dht: /: no subvolume in
>> layout for path, checking on all the subvols to see if it is a directory
>> [2018-07-31 10:41:49.290180] D [MSGID: 0]
>> [dht-common.c:3688:dht_do_fresh_lookup] 0-master-dht: /: Found null hashed
>> subvol. Calling lookup on all nodes.
>> [2018-07-31 10:41:49.290199] T [MSGID: 0]
>> [dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-dht to master-replicate-0
>> [2018-07-31 10:41:49.290245] I [MSGID: 108006]
>> [afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
>> [2018-07-31 10:41:49.290291] D [MSGID: 0]
>> [afr-common.c:3212:afr_discover] 0-stack-trace: stack-address:
>> 0x7f36e4001058, master-replicate-0 returned -1 error: Transport endpoint is
>> not conne
>> cted [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290323] D [MSGID: 0]
>> [dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
>> master-replicate-0 returned error [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290350] T [MSGID: 0]
>> [dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
>> 0x7f36e4001058, winding from master-dht to master-replicate-1
>> [2018-07-31 10:41:49.290381] I [MSGID: 108006]
>> [afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
>> [2018-07-31 10:41:49.290403] D [MSGID: 0]
>> [afr-common.c:3212:afr_discover] 0-stack-trace: stack-address:
>> 0x7f36e4001058, master-replicate-1 returned -1 error: Transport endpoint is
>> not connected [Transport endpoint is not connected]
>> [2018-07-31 10:41:49.290427] D [MSGID: 0]
>> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Raghavendra Gowdappa
On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> I am facing different issue in softserve machines. The fuse mount itself
> is failing.
> I tried day before yesterday to debug geo-rep failures. I discussed with
> Raghu,
> but could not root cause it.
>

Where can I find the complete client logs for this?

So none of the tests were passing. It happened on
> both machine instances I tried.
>
> 
> [2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
> event 6 on graph 0
> [2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
> 0-fuse: mount status is 0
> [2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
> 0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
> fopen_keep_cache automatically.
> [2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
> 0-glusterfs-fuse: writev() result 40/40
> [2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
> 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
> 7.22
> [2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync]
> 0-fuse: switched to graph 0
> [2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
> meta-autoload
> [2018-07-31 10:41:49.289743] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from meta-autoload to
> master
> [2018-07-31 10:41:49.289787] T [MSGID: 0] [io-stats.c:2788:io_stats_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master to
> master-md-cache
> [2018-07-31 10:41:49.289833] T [MSGID: 0] [md-cache.c:513:mdc_inode_iatt_get]
> 0-md-cache: mdc_inode_ctx_get failed (----
> 0001)
> [2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
> to master-open-behind
> [2018-07-31 10:41:49.289946] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-open-behind to master-quick-read
> [2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-quick-read to master-io-cache
> [2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
> to master-readdir-ahead
> [2018-07-31 10:41:49.290034] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-readdir-ahead to master-read-ahead
> [2018-07-31 10:41:49.290052] T [MSGID: 0] [defaults.c:2716:default_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-read-ahead to master-write-behind
> [2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from
> master-write-behind to master-dht
> [2018-07-31 10:41:49.290156] D [MSGID: 0] 
> [dht-common.c:3674:dht_do_fresh_lookup]
> 0-master-dht: /: no subvolume in layout for path, checking on all the
> subvols to see if it is a directory
> [2018-07-31 10:41:49.290180] D [MSGID: 0] 
> [dht-common.c:3688:dht_do_fresh_lookup]
> 0-master-dht: /: Found null hashed subvol. Calling lookup on all nodes.
> [2018-07-31 10:41:49.290199] T [MSGID: 0] 
> [dht-common.c:3695:dht_do_fresh_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-dht to
> master-replicate-0
> [2018-07-31 10:41:49.290245] I [MSGID: 108006]
> [afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
> [2018-07-31 10:41:49.290291] D [MSGID: 0] [afr-common.c:3212:afr_discover]
> 0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-0 returned
> -1 error: Transport endpoint is not conne
> cted [Transport endpoint is not connected]
> [2018-07-31 10:41:49.290323] D [MSGID: 0] 
> [dht-common.c:1391:dht_lookup_dir_cbk]
> 0-master-dht: lookup of / on master-replicate-0 returned error [Transport
> endpoint is not connected]
> [2018-07-31 10:41:49.290350] T [MSGID: 0] 
> [dht-common.c:3695:dht_do_fresh_lookup]
> 0-stack-trace: stack-address: 0x7f36e4001058, winding from master-dht to
> master-replicate-1
> [2018-07-31 10:41:49.290381] I [MSGID: 108006]
> [afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
> [2018-07-31 10:41:49.290403] D [MSGID: 0] [afr-common.c:3212:afr_discover]
> 0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-1 returned
> -1 error: Transport endpoint is not connected [Transport endpoint is not
> connected]
> [2018-07-31 10:41:49.290427] D [MSGID: 0] 
> [dht-common.c:1391:dht_lookup_dir_cbk]
> 0-master-dht: lookup of / on master-replicate-1 returned error [Transport
> endpoint is not connected]
> [2018-07-31 10:41:49.290452] D [MSGID: 0] 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Pranith Kumar Karampuri
On Thu, Aug 2, 2018 at 7:19 PM Atin Mukherjee  wrote:

> New addition - tests/basic/volume.t - failed at least twice with a shd core.
>
> One such ref -
> https://build.gluster.org/job/centos7-regression/2058/console
>

I will take a look.


>
>
> On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
> sankarshan.mukhopadh...@gmail.com> wrote:
>
>> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>>  wrote:
>> > I am facing different issue in softserve machines. The fuse mount
>> itself is
>> > failing.
>> > I tried day before yesterday to debug geo-rep failures. I discussed with
>> > Raghu,
>> > but could not root cause it. So none of the tests were passing. It
>> happened
>> > on
>> > both machine instances I tried.
>> >
>>
>> Ugh! -infra team should have an issue to work with and resolve this.
>>
>>
>> --
>> sankarshan mukhopadhyay
>> 
>> ___
>> Gluster-devel mailing list
>> Gluster-devel@gluster.org
>> https://lists.gluster.org/mailman/listinfo/gluster-devel
>>
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel



-- 
Pranith
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Atin Mukherjee
New addition - tests/basic/volume.t - failed at least twice with a shd core.

One such ref - https://build.gluster.org/job/centos7-regression/2058/console
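
For whoever picks this up, a quick first pass over such a glustershd core would
be something along these lines (a sketch; the binary and core paths are
illustrative, not taken from the job):

    # dump the backtrace of every thread in the core to see where shd died
    gdb /build/install/sbin/glusterfs /path/to/core -ex 'thread apply all bt' -ex 'quit'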


On Thu, Aug 2, 2018 at 6:28 PM Sankarshan Mukhopadhyay <
sankarshan.mukhopadh...@gmail.com> wrote:

> On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
>  wrote:
> > I am facing different issue in softserve machines. The fuse mount itself
> is
> > failing.
> > I tried day before yesterday to debug geo-rep failures. I discussed with
> > Raghu,
> > but could not root cause it. So none of the tests were passing. It
> happened
> > on
> > both machine instances I tried.
> >
>
> Ugh! -infra team should have an issue to work with and resolve this.
>
>
> --
> sankarshan mukhopadhyay
> 
> ___
> Gluster-devel mailing list
> Gluster-devel@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-devel
>
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Sankarshan Mukhopadhyay
On Thu, Aug 2, 2018 at 5:48 PM, Kotresh Hiremath Ravishankar
 wrote:
> I am facing different issue in softserve machines. The fuse mount itself is
> failing.
> I tried day before yesterday to debug geo-rep failures. I discussed with
> Raghu,
> but could not root cause it. So none of the tests were passing. It happened
> on
> both machine instances I tried.
>

Ugh! -infra team should have an issue to work with and resolve this.


-- 
sankarshan mukhopadhyay

___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 1:42 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee  > wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
>> khire...@redhat.com> wrote:
>>
>>>
>>>
>>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>>> wrote:
>>>
 On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
 wrote:

>
>
> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
> wrote:
>
>> I just went through the nightly regression report of brick mux runs
>> and here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
>> after 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
>> . Wasn't timing out as frequently as it was till 12 July. But since 27
>> July, it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did 
>> we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did 
>> we
>> fix this test through any patch (Mohit?)
>>
>
> I see this has failed in day before yesterday's regression run as well
> (and I could reproduce it locally with brick mux enabled). The test fails
> in healing a file within a particular time period.
>
> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* 
> FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1
>
> Need EC dev's help here.
>

 I'm not sure where the problem is exactly. I've seen that when the test
 fails, self-heal is attempting to heal the file, but when the file is
 accessed, an Input/Output error is returned, aborting heal. I've checked
 that a heal is attempted every time the file is accessed, but it fails
 always. This error seems to come from bit-rot stub xlator.

 When in this situation, if I stop and start the volume, self-heal
 immediately heals the files. It seems like a stale state that is kept by
 the stub xlator, preventing the file from being healed.

 Adding bit-rot maintainers for help on this one.

>>>
>>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>>> and its hardlink are deleted from that brick and a lookup is done
>>> on the file, it cleans up the marker on getting ENOENT. This is part of
>>> recovery steps, and only md-cache is disabled during the process.
>>> Is there any other perf xlators that needs to be disabled for this
>>> scenario to expect a lookup/revalidate on the brick where
>>> the back end file is deleted?
>>>
>>
>> But the same test doesn't fail with brick multiplexing not enabled. Do we
>> know why?
>>
> Don't know, something to do with perf xlators I suppose. It's not
> reproduced on my local system even with brick-mux enabled. But it's
> happening on Xavi's system.
>
> Xavi,
> Could you try with the patch [1] and let me know whether it fixes the
> issue.
>

With the additional performance xlators disabled, it still happens.

The only thing that I've observed is that if I add a sleep just before
stopping the volume, the test seems to always pass. Maybe there are some
background updates going on? (ec does background updates, but I'm not sure
how this can be related to the Input/Output error when accessing the brick
file).
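
Roughly the kind of change I tried, as a sketch only (it assumes the stock
include.rc helpers $CLI and $V0, and that the stop in the .t is the usual CLI
call; the rest of the test is unchanged):

    # give ec's background (post-op) updates a moment to settle before the stop
    sleep 3
    TEST $CLI volume stop $V0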

Xavi


> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
I am facing a different issue on the softserve machines: the fuse mount itself
is failing.
I tried the day before yesterday to debug the geo-rep failures and discussed it
with Raghu, but could not root cause it. So none of the tests were passing. It
happened on both machine instances I tried.
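
Before going through the trace below, the first sanity check on such a machine
would be whether any bricks are up at all; a minimal check, assuming the volume
is the geo-rep master volume named 'master' (as the 0-master-* prefixes in the
log suggest):

    # are the brick processes running and their ports listening?
    gluster volume status master
    # and the volume layout itself
    gluster volume info master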


[2018-07-31 10:41:49.288117] D [fuse-bridge.c:5407:notify] 0-fuse: got
event 6 on graph 0
[2018-07-31 10:41:49.289427] D [fuse-bridge.c:4990:fuse_get_mount_status]
0-fuse: mount status is 0
[2018-07-31 10:41:49.289555] D [fuse-bridge.c:4256:fuse_init]
0-glusterfs-fuse: Detected support for FUSE_AUTO_INVAL_DATA. Enabling
fopen_keep_cache automatically.
[2018-07-31 10:41:49.289591] T [fuse-bridge.c:278:send_fuse_iov]
0-glusterfs-fuse: writev() result 40/40
[2018-07-31 10:41:49.289610] I [fuse-bridge.c:4314:fuse_init]
0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel
7.22
[2018-07-31 10:41:49.289627] I [fuse-bridge.c:4948:fuse_graph_sync] 0-fuse:
switched to graph 0
[2018-07-31 10:41:49.289696] T [MSGID: 0] [syncop.c:1261:syncop_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from fuse to
meta-autoload
[2018-07-31 10:41:49.289743] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from meta-autoload to
master
[2018-07-31 10:41:49.289787] T [MSGID: 0] [io-stats.c:2788:io_stats_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master to
master-md-cache
[2018-07-31 10:41:49.289833] T [MSGID: 0]
[md-cache.c:513:mdc_inode_iatt_get] 0-md-cache: mdc_inode_ctx_get failed
(----0001)
[2018-07-31 10:41:49.289923] T [MSGID: 0] [md-cache.c:1200:mdc_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master-md-cache
to master-open-behind
[2018-07-31 10:41:49.289946] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-open-behind to master-quick-read
[2018-07-31 10:41:49.289973] T [MSGID: 0] [quick-read.c:556:qr_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-quick-read to master-io-cache
[2018-07-31 10:41:49.290002] T [MSGID: 0] [io-cache.c:298:ioc_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from master-io-cache
to master-readdir-ahead
[2018-07-31 10:41:49.290034] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-readdir-ahead to master-read-ahead
[2018-07-31 10:41:49.290052] T [MSGID: 0] [defaults.c:2716:default_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-read-ahead to master-write-behind
[2018-07-31 10:41:49.290077] T [MSGID: 0] [write-behind.c:2439:wb_lookup]
0-stack-trace: stack-address: 0x7f36e4001058, winding from
master-write-behind to master-dht
[2018-07-31 10:41:49.290156] D [MSGID: 0]
[dht-common.c:3674:dht_do_fresh_lookup] 0-master-dht: /: no subvolume in
layout for path, checking on all the subvols to see if it is a directory
[2018-07-31 10:41:49.290180] D [MSGID: 0]
[dht-common.c:3688:dht_do_fresh_lookup] 0-master-dht: /: Found null hashed
subvol. Calling lookup on all nodes.
[2018-07-31 10:41:49.290199] T [MSGID: 0]
[dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
0x7f36e4001058, winding from master-dht to master-replicate-0
[2018-07-31 10:41:49.290245] I [MSGID: 108006]
[afr-common.c:5582:afr_local_init] 0-master-replicate-0: no subvolumes up
[2018-07-31 10:41:49.290291] D [MSGID: 0] [afr-common.c:3212:afr_discover]
0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-0 returned
-1 error: Transport endpoint is not conne
cted [Transport endpoint is not connected]
[2018-07-31 10:41:49.290323] D [MSGID: 0]
[dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
master-replicate-0 returned error [Transport endpoint is not connected]
[2018-07-31 10:41:49.290350] T [MSGID: 0]
[dht-common.c:3695:dht_do_fresh_lookup] 0-stack-trace: stack-address:
0x7f36e4001058, winding from master-dht to master-replicate-1
[2018-07-31 10:41:49.290381] I [MSGID: 108006]
[afr-common.c:5582:afr_local_init] 0-master-replicate-1: no subvolumes up
[2018-07-31 10:41:49.290403] D [MSGID: 0] [afr-common.c:3212:afr_discover]
0-stack-trace: stack-address: 0x7f36e4001058, master-replicate-1 returned
-1 error: Transport endpoint is not connected [Transport endpoint is not
connected]
[2018-07-31 10:41:49.290427] D [MSGID: 0]
[dht-common.c:1391:dht_lookup_dir_cbk] 0-master-dht: lookup of / on
master-replicate-1 returned error [Transport endpoint is not connected]
[2018-07-31 10:41:49.290452] D [MSGID: 0]
[dht-common.c:1574:dht_lookup_dir_cbk] 0-stack-trace: stack-address:
0x7f36e4001058, master-dht returned -1 error: Transport endpoint is not
connected [Transport endpoint is not connected]
[2018-07-31 10:41:49.290477] D [MSGID: 0]
[write-behind.c:2393:wb_lookup_cbk] 0-stack-trace: stack-address:
0x7f36e4001058, master-write-behind returned -1 error: Transport endpoint

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Nigel Babu
On Thu, Aug 2, 2018 at 5:12 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

> Don't know, something to do with perf xlators I suppose. It's not
> reproduced on my local system even with brick-mux enabled. But it's
> happening on Xavi's system.
>
> Xavi,
> Could you try with the patch [1] and let me know whether it fixes the
> issue.
>
> [1] https://review.gluster.org/#/c/20619/1
>

If you cannot reproduce it on your laptop, why don't you request a machine
from softserve[1] and try it out?

[1]:
https://github.com/gluster/softserve/wiki/Running-Regressions-on-clean-Centos-7-machine

-- 
nigelb
___
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 5:05 PM, Atin Mukherjee 
wrote:

>
>
> On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>> wrote:
>>
>>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>>> wrote:
>>>


 On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
 wrote:

> I just went through the nightly regression report of brick mux runs
> and here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
> after 400 secs. Refer https://fstat.gluster.org/
> failure/209?state=2_date=2018-06-30_date=2018-
> 07-31=all, specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
> . Wasn't timing out as frequently as it was till 12 July. But since 27
> July, it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
> 400 secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-
> multiplex/814/console) -  Test fails only in brick-mux mode, AI on
> Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-
> multiplex/813/console) - Seems like failed just twice in last 30 days
> as per https://fstat.gluster.org/failure/251?state=2_
> date=2018-06-30_date=2018-07-31=all. Need help from AFR
> team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
> job/regression-test-with-multiplex/812/console) - Hasn't failed after
> 26 July and earlier it was failing regularly. Did we fix this test through
> any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
> job/regression-test-with-multiplex/811/console)  - Hasn't failed
> after 27 July and earlier it was failing regularly. Did we fix this test
> through any patch (Mohit?)
>

 I see this has failed in day before yesterday's regression run as well
 (and I could reproduce it locally with brick mux enabled). The test fails
 in healing a file within a particular time period.

 *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
 COMMAND: 512 path_size /d/backends/patchy5/FILE1

 Need EC dev's help here.

>>>
>>> I'm not sure where the problem is exactly. I've seen that when the test
>>> fails, self-heal is attempting to heal the file, but when the file is
>>> accessed, an Input/Output error is returned, aborting heal. I've checked
>>> that a heal is attempted every time the file is accessed, but it fails
>>> always. This error seems to come from bit-rot stub xlator.
>>>
>>> When in this situation, if I stop and start the volume, self-heal
>>> immediately heals the files. It seems like a stale state that is kept by
>>> the stub xlator, preventing the file from being healed.
>>>
>>> Adding bit-rot maintainers for help on this one.
>>>
>>
>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>> and its hardlink are deleted from that brick and a lookup is done
>> on the file, it cleans up the marker on getting ENOENT. This is part of
>> recovery steps, and only md-cache is disabled during the process.
>> Is there any other perf xlators that needs to be disabled for this
>> scenario to expect a lookup/revalidate on the brick where
>> the back end file is deleted?
>>
>
> But the same test doesn't fail with brick multiplexing not enabled. Do we
> know why?
>
Don't know, something to do with perf xlators I suppose. It's not
reproduced on my local system even with brick-mux enabled. But it's
happening on Xavi's system.

Xavi,
Could you try with the patch [1] and let me know whether it fixes the issue.

[1] https://review.gluster.org/#/c/20619/1

>
>
>>
>>> Xavi
>>>
>>>
>>>

> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
> core, not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Atin Mukherjee
On Thu, Aug 2, 2018 at 4:37 PM Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
> wrote:
>
>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.


 =
 Fails only with brick-mux

 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer
 https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
 specifically the latest report
 https://build.gluster.org/job/regression-test-burn-in/4051/consoleText
 . Wasn't timing out as frequently as it was till 12 July. But since 27
 July, it has timed out twice. Beginning to believe commit
 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/814/console)
 -  Test fails only in brick-mux mode, AI on Atin to look at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-multiplex/813/console)
 - Seems like failed just twice in last 30 days as per
 https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
 Need help from AFR team.

 tests/bugs/quota/bug-1293601.t (
 https://build.gluster.org/job/regression-test-with-multiplex/812/console)
 - Hasn't failed after 26 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

 tests/bitrot/bug-1373520.t - (
 https://build.gluster.org/job/regression-test-with-multiplex/811/console)
 - Hasn't failed after 27 July and earlier it was failing regularly. Did we
 fix this test through any patch (Mohit?)

>>>
>>> I see this has failed in day before yesterday's regression run as well
>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>> in healing a file within a particular time period.
>>>
>>> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
>>> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>
>>> Need EC dev's help here.
>>>
>>
>> I'm not sure where the problem is exactly. I've seen that when the test
>> fails, self-heal is attempting to heal the file, but when the file is
>> accessed, an Input/Output error is returned, aborting heal. I've checked
>> that a heal is attempted every time the file is accessed, but it fails
>> always. This error seems to come from bit-rot stub xlator.
>>
>> When in this situation, if I stop and start the volume, self-heal
>> immediately heals the files. It seems like a stale state that is kept by
>> the stub xlator, preventing the file from being healed.
>>
>> Adding bit-rot maintainers for help on this one.
>>
>
> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
> and its hardlink are deleted from that brick and a lookup is done
> on the file, it cleans up the marker on getting ENOENT. This is part of
> recovery steps, and only md-cache is disabled during the process.
> Is there any other perf xlators that needs to be disabled for this
> scenario to expect a lookup/revalidate on the brick where
> the back end file is deleted?
>

But the same test doesn't fail with brick multiplexing not enabled. Do we
know why?


>
>> Xavi
>>
>>
>>
>>>
 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
 not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref -
 https://build.gluster.org/job/regression-test-with-multiplex/806/console
 . Seems to be a glustershd crash. Need help from AFR folks.


 =
 Fails for non-brick mux case too

 =
 tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup
 very often, without brick mux as well. Refer
 https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
 . There's an email in gluster-devel and a BZ 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 4:50 PM, Amar Tumballi  wrote:

>
>
> On Thu, Aug 2, 2018 at 4:37 PM, Kotresh Hiremath Ravishankar <
> khire...@redhat.com> wrote:
>
>>
>>
>> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
>> wrote:
>>
>>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>>> wrote:
>>>


 On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
 wrote:

> I just went through the nightly regression report of brick mux runs
> and here's what I can summarize.
>
> 
> 
> =
> Fails only with brick-mux
> 
> 
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even
> after 400 secs. Refer https://fstat.gluster.org/fail
> ure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report https://build.gluster.org/job/
> regression-test-burn-in/4051/consoleText . Wasn't timing out as
> frequently as it was till 12 July. But since 27 July, it has timed out
> twice. Beginning to believe commit 
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
> has added the delay and now 400 secs isn't sufficient enough (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref - https://build.gluster.org/job/regression-test-with-multiplex
> /814/console) -  Test fails only in brick-mux mode, AI on Atin to
> look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiple
> x/813/console) - Seems like failed just twice in last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=201
> 8-06-30_date=2018-07-31=all. Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job
> /regression-test-with-multiplex/812/console) - Hasn't failed after 26
> July and earlier it was failing regularly. Did we fix this test through 
> any
> patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (https://build.gluster.org/job
> /regression-test-with-multiplex/811/console)  - Hasn't failed after
> 27 July and earlier it was failing regularly. Did we fix this test through
> any patch (Mohit?)
>

 I see this has failed in day before yesterday's regression run as well
 (and I could reproduce it locally with brick mux enabled). The test fails
 in healing a file within a particular time period.

 *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
 COMMAND: 512 path_size /d/backends/patchy5/FILE1

 Need EC dev's help here.

>>>
>>> I'm not sure where the problem is exactly. I've seen that when the test
>>> fails, self-heal is attempting to heal the file, but when the file is
>>> accessed, an Input/Output error is returned, aborting heal. I've checked
>>> that a heal is attempted every time the file is accessed, but it fails
>>> always. This error seems to come from bit-rot stub xlator.
>>>
>>> When in this situation, if I stop and start the volume, self-heal
>>> immediately heals the files. It seems like a stale state that is kept by
>>> the stub xlator, preventing the file from being healed.
>>>
>>> Adding bit-rot maintainers for help on this one.
>>>
>>
>> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
>> and its hardlink are deleted from that brick and a lookup is done
>> on the file, it cleans up the marker on getting ENOENT. This is part of
>> recovery steps, and only md-cache is disabled during the process.
>> Is there any other perf xlators that needs to be disabled for this
>> scenario to expect a lookup/revalidate on the brick where
>> the back end file is deleted?
>>
>
> Can you make sure there are no perf xlators in bitrot stack while doing
> it? That may not be a good idea to keep it for internal 'validations'.
>

Ok, sending the patch in some time.

>
>
>>
>>> Xavi
>>>
>>>
>>>

> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
> core, not sure if related to brick mux or not, so not sure if brick mux is
> culprit here or not. Ref - https://build.gluster.org/job/
> regression-test-with-multiplex/806/console . Seems to be a glustershd
> crash. Need help from AFR folks.
>
> 
> 
> =
> Fails for non-brick mux case too
> 
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Amar Tumballi
On Thu, Aug 2, 2018 at 4:37 PM, Kotresh Hiremath Ravishankar <
khire...@redhat.com> wrote:

>
>
> On Thu, Aug 2, 2018 at 3:49 PM, Xavi Hernandez 
> wrote:
>
>> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee 
>> wrote:
>>
>>>
>>>
>>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>>> wrote:
>>>
 I just went through the nightly regression report of brick mux runs and
 here's what I can summarize.

 
 
 =
 Fails only with brick-mux
 
 
 =
 tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
 400 secs. Refer https://fstat.gluster.org/fail
 ure/209?state=2_date=2018-06-30_date=2018-07-31=all,
 specifically the latest report https://build.gluster.org/job/
 regression-test-burn-in/4051/consoleText . Wasn't timing out as
 frequently as it was till 12 July. But since 27 July, it has timed out
 twice. Beginning to believe commit 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
 has added the delay and now 400 secs isn't sufficient enough (Mohit?)

 tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
 (Ref - https://build.gluster.org/job/regression-test-with-multiplex
 /814/console) -  Test fails only in brick-mux mode, AI on Atin to look
 at and get back.

 tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
 https://build.gluster.org/job/regression-test-with-multiple
 x/813/console) - Seems like failed just twice in last 30 days as per
 https://fstat.gluster.org/failure/251?state=2_date=
 2018-06-30_date=2018-07-31=all. Need help from AFR team.

 tests/bugs/quota/bug-1293601.t (https://build.gluster.org/job
 /regression-test-with-multiplex/812/console) - Hasn't failed after 26
 July and earlier it was failing regularly. Did we fix this test through any
 patch (Mohit?)

 tests/bitrot/bug-1373520.t - (https://build.gluster.org/job
 /regression-test-with-multiplex/811/console)  - Hasn't failed after 27
 July and earlier it was failing regularly. Did we fix this test through any
 patch (Mohit?)

>>>
>>> I see this has failed in day before yesterday's regression run as well
>>> (and I could reproduce it locally with brick mux enabled). The test fails
>>> in healing a file within a particular time period.
>>>
>>> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
>>> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>>
>>> Need EC dev's help here.
>>>
>>
>> I'm not sure where the problem is exactly. I've seen that when the test
>> fails, self-heal is attempting to heal the file, but when the file is
>> accessed, an Input/Output error is returned, aborting heal. I've checked
>> that a heal is attempted every time the file is accessed, but it fails
>> always. This error seems to come from bit-rot stub xlator.
>>
>> When in this situation, if I stop and start the volume, self-heal
>> immediately heals the files. It seems like a stale state that is kept by
>> the stub xlator, preventing the file from being healed.
>>
>> Adding bit-rot maintainers for help on this one.
>>
>
> Bitrot-stub marks the file as corrupted in inode_ctx. But when the file
> and its hardlink are deleted from that brick and a lookup is done
> on the file, it cleans up the marker on getting ENOENT. This is part of
> recovery steps, and only md-cache is disabled during the process.
> Is there any other perf xlators that needs to be disabled for this
> scenario to expect a lookup/revalidate on the brick where
> the back end file is deleted?
>

Can you make sure there are no perf xlators in the bitrot stack while doing it?
It may not be a good idea to keep them there for internal 'validations'.
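
To spell that out, keeping the perf xlators out of the picture during the
recovery/validation window would mean something like the following on the CLI
(these are the standard performance options; which of them actually matter for
this scenario is exactly the open question above):

    # turn the client-side performance xlators off for the recovery window
    gluster volume set <volname> performance.quick-read off
    gluster volume set <volname> performance.io-cache off
    gluster volume set <volname> performance.read-ahead off
    gluster volume set <volname> performance.readdir-ahead off
    gluster volume set <volname> performance.open-behind off
    gluster volume set <volname> performance.write-behind off
    gluster volume set <volname> performance.stat-prefetch off   # md-cache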


>
>> Xavi
>>
>>
>>
>>>
 tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a
 core, not sure if related to brick mux or not, so not sure if brick mux is
 culprit here or not. Ref - https://build.gluster.org/job/
 regression-test-with-multiplex/806/console . Seems to be a glustershd
 crash. Need help from AFR folks.

 
 
 =
 Fails for non-brick mux case too
 
 
 =
 tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup
 very often, without brick mux as well. Refer
 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Kotresh Hiremath Ravishankar
On Thu, Aug 2, 2018 at 11:43 AM, Xavi Hernandez 
wrote:

> On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee  wrote:
>
>>
>>
>> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
>> wrote:
>>
>>> I just went through the nightly regression report of brick mux runs and
>>> here's what I can summarize.
>>>
>>> 
>>> 
>>> =
>>> Fails only with brick-mux
>>> 
>>> 
>>> =
>>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>>> 400 secs. Refer https://fstat.gluster.org/failure/209?state=2_
>>> date=2018-06-30_date=2018-07-31=all, specifically the latest
>>> report https://build.gluster.org/job/regression-test-burn-in/4051/
>>> consoleText . Wasn't timing out as frequently as it was till 12 July.
>>> But since 27 July, it has timed out twice. Beginning to believe commit
>>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now
>>> 400 secs isn't sufficient enough (Mohit?)
>>>
>>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>>> (Ref - https://build.gluster.org/job/regression-test-with-
>>> multiplex/814/console) -  Test fails only in brick-mux mode, AI on Atin
>>> to look at and get back.
>>>
>>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>>> - Seems like failed just twice in last 30 days as per
>>> https://fstat.gluster.org/failure/251?state=2_
>>> date=2018-06-30_date=2018-07-31=all. Need help from AFR team.
>>>
>>> tests/bugs/quota/bug-1293601.t (https://build.gluster.org/
>>> job/regression-test-with-multiplex/812/console) - Hasn't failed after
>>> 26 July and earlier it was failing regularly. Did we fix this test through
>>> any patch (Mohit?)
>>>
>>> tests/bitrot/bug-1373520.t - (https://build.gluster.org/
>>> job/regression-test-with-multiplex/811/console)  - Hasn't failed after
>>> 27 July and earlier it was failing regularly. Did we fix this test through
>>> any patch (Mohit?)
>>>
>>
>> I see this has failed in day before yesterday's regression run as well
>> (and I could reproduce it locally with brick mux enabled). The test fails
>> in healing a file within a particular time period.
>>
>> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
>> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>>
>> Need EC dev's help here.
>>
>
> I'll investigate this.
>
>
>>
>>
>>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>>> not sure if related to brick mux or not, so not sure if brick mux is
>>> culprit here or not. Ref - https://build.gluster.org/job/
>>> regression-test-with-multiplex/806/console . Seems to be a glustershd
>>> crash. Need help from AFR folks.
>>>
>>> 
>>> 
>>> =
>>> Fails for non-brick mux case too
>>> 
>>> 
>>> =
>>> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup
>>> very often, without brick mux as well. Refer
>>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText
>>> . There's an email in gluster-devel and a BZ 1610240 for the same.
>>>
>>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>>> - seems to be a new failure, however seen this for a non-brick-mux case too
>>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>>> . Need some eyes from AFR folks.
>>>
>>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to
>>> brick mux, have seen this failing at multiple default regression runs.
>>> Refer https://fstat.gluster.org/failure/392?state=2_
>>> date=2018-06-30_date=2018-07-31=all . We need help from
>>> geo-rep dev to root cause this earlier than later
>>>
>>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>>> mux, have seen this failing at multiple default regression runs. Refer
>>> https://fstat.gluster.org/failure/393?state=2_
>>> date=2018-06-30_date=2018-07-31=all . We need help from
>>> geo-rep dev to root cause this earlier than later
>>>
>>
I have posted the patch [1] for the above two. This should handle connection
time outs without any logs. But I still see a strange behaviour now and then
where one of the workers doesn't get started at all. I am debugging that
with the instrumentation patch [2]. I am not hitting that on this

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-02 Thread Xavi Hernandez
On Thu, Aug 2, 2018 at 6:14 AM Atin Mukherjee  wrote:

>
>
> On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee 
> wrote:
>
>> I just went through the nightly regression report of brick mux runs and
>> here's what I can summarize.
>>
>>
>> =
>> Fails only with brick-mux
>>
>> =
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>> 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
>> it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient enough (Mohit?)
>>
>> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>> (Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
>> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>>
>> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
>> - Seems like failed just twice in last 30 days as per
>> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
>> Need help from AFR team.
>>
>> tests/bugs/quota/bug-1293601.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
>> - Hasn't failed after 26 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>> tests/bitrot/bug-1373520.t - (
>> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
>> - Hasn't failed after 27 July and earlier it was failing regularly. Did we
>> fix this test through any patch (Mohit?)
>>
>
> I see this has failed in day before yesterday's regression run as well
> (and I could reproduce it locally with brick mux enabled). The test fails
> in healing a file within a particular time period.
>
> *15:55:19* not ok 25 Got "0" instead of "512", LINENUM:55*15:55:19* FAILED 
> COMMAND: 512 path_size /d/backends/patchy5/FILE1
>
> Need EC dev's help here.
>

I'll investigate this.


>
>
>> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core,
>> not sure if related to brick mux or not, so not sure if brick mux is
>> culprit here or not. Ref -
>> https://build.gluster.org/job/regression-test-with-multiplex/806/console
>> . Seems to be a glustershd crash. Need help from AFR folks.
>>
>>
>> =
>> Fails for non-brick mux case too
>>
>> =
>> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup
>> very often, without brick mux as well. Refer
>> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
>> There's an email in gluster-devel and a BZ 1610240 for the same.
>>
>> tests/bugs/bug-1368312.t - Seems to be recent failures (
>> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
>> - seems to be a new failure, however seen this for a non-brick-mux case too
>> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
>> . Need some eyes from AFR folks.
>>
>> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
>> mux, have seen this failing at multiple default regression runs. Refer
>> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
>> . We need help from geo-rep dev to root cause this earlier than later
>>
>> tests/bugs/glusterd/validating-server-quorum.t (
>> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
>> - Fails for non-brick-mux cases too,
>> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
>> .  Atin has a patch https://review.gluster.org/20584 which resolves it
>> but patch is failing regression for a different test which is unrelated.
>>
>> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-01 Thread Atin Mukherjee
On Tue, Jul 31, 2018 at 10:11 PM Atin Mukherjee  wrote:

> I just went through the nightly regression report of brick mux runs and
> here's what I can summarize.
>
>
> =
> Fails only with brick-mux
>
> =
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400
> secs. Refer
> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
> Wasn't timing out as frequently as it was till 12 July. But since 27 July,
> it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient (Mohit?)
>
> tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/814/console)
> -  Test fails only in brick-mux mode, AI on Atin to look at and get back.
>
> tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
> https://build.gluster.org/job/regression-test-with-multiplex/813/console)
> - Seems like it failed just twice in the last 30 days as per
> https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
> Need help from AFR team.
>
> tests/bugs/quota/bug-1293601.t (
> https://build.gluster.org/job/regression-test-with-multiplex/812/console)
> - Hasn't failed since 26 July, though earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>
> tests/bitrot/bug-1373520.t - (
> https://build.gluster.org/job/regression-test-with-multiplex/811/console)
> - Hasn't failed since 27 July, though earlier it was failing regularly. Did we
> fix this test through any patch (Mohit?)
>

I see this has failed in the day before yesterday's regression run as well
(and I could reproduce it locally with brick mux enabled). The test fails
to heal a file within the expected time period.

15:55:19 not ok 25 Got "0" instead of "512", LINENUM:55
15:55:19 FAILED COMMAND: 512 path_size /d/backends/patchy5/FILE1

Need EC dev's help here.
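
For context, the check that fails here is presumably the usual
EXPECT_WITHIN-style assertion on the healed file's size. A minimal sketch of
its assumed shape is below; the helper and variable names (path_size,
$HEAL_TIMEOUT, and the $B0=/d/backends, $V0=patchy conventions) are
assumptions based on the failure output and the common harness includes, not
a quote from the test itself:

    # Sketch only: retry until the file on the sixth brick reports its full
    # 512 bytes, or give up once the timeout expires. "Got 0 instead of 512"
    # means the size never caught up within that window.
    EXPECT_WITHIN $HEAL_TIMEOUT "512" path_size $B0/${V0}5/FILE1

So the question is why the heal (or the size it reports) does not converge
within that window when brick mux is enabled.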


> tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
> not sure whether brick mux is the culprit here. Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/806/console
> . Seems to be a glustershd crash. Need help from AFR folks.
>
>
> =
> Fails for non-brick mux case too
>
> =
> tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup very
> often, without brick mux as well. Refer
> https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
> There's an email in gluster-devel and a BZ 1610240 for the same.
>
> tests/bugs/bug-1368312.t - Seems to be a recent failure (
> https://build.gluster.org/job/regression-test-with-multiplex/815/console)
> - however it has been seen for a non-brick-mux case too
> - https://build.gluster.org/job/regression-test-burn-in/4039/consoleText
> . Need some eyes from AFR folks.
>
> tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from the geo-rep devs to root cause this sooner rather than later.
>
> tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
> mux, have seen this failing at multiple default regression runs. Refer
> https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
> . We need help from the geo-rep devs to root cause this sooner rather than later.
>
> tests/bugs/glusterd/validating-server-quorum.t (
> https://build.gluster.org/job/regression-test-with-multiplex/810/console)
> - Fails for non-brick-mux cases too,
> https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
> . Atin has a patch https://review.gluster.org/20584 which resolves it,
> but the patch is failing regression on a different, unrelated test.
>
> tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
> (Ref -
> https://build.gluster.org/job/regression-test-with-multiplex/809/console)
> - fails for the non-brick-mux case too -
> 

Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-01 Thread Sankarshan Mukhopadhyay
On Thu, Aug 2, 2018 at 12:19 AM, Shyam Ranganathan  wrote:
> On 07/31/2018 12:41 PM, Atin Mukherjee wrote:
>> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
>> 400 secs. Refer
>> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
>> specifically the latest report
>> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
>> Wasn't timing out as frequently as it was till 12 July. But since 27
>> July, it has timed out twice. Beginning to believe commit
>> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
>> secs isn't sufficient (Mohit?)
>
> The above test is the one that is causing line coverage to fail as well
> (mostly, say 50% of the time).
>
> I did have this patch up to increase timeouts and also ran a few rounds
> of tests, but results are mixed. It passes when run first, and later
> errors out in other places (although not timing out).
>
> See: https://review.gluster.org/#/c/20568/2 for the changes and test run
> details.
>

If I may ask - why are we always exploring the "increase timeout" part
of this? I understand that some tests may take longer, but 400s is
already a non-trivial amount of time. What other, more efficient
approaches could we explore?

> The failure of this test in regression-test-burn-in run #4051 is strange
> again; it looks like the test completed within the stipulated time, but
> restarted again after cleanup_func was invoked.
>
> Digging a little further, the manner in which cleanup_func and traps are
> used in this test seems *interesting* and may need a closer look to
> identify possible issues here.
>
> @Mohit, request you to take a look at the line coverage failures as
> well, as you handle the failures in this test.


-- 
sankarshan mukhopadhyay



Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-01 Thread Shyam Ranganathan
On 07/31/2018 12:41 PM, Atin Mukherjee wrote:
> tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after
> 400 secs. Refer
> https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
> specifically the latest report
> https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
> Wasn't timing out as frequently as it was till 12 July. But since 27
> July, it has timed out twice. Beginning to believe commit
> 9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
> secs isn't sufficient (Mohit?)

The above test is the one that is causing line coverage to fail as well
(roughly 50% of the time).

I did have this patch up to increase the timeouts and also ran a few rounds
of tests, but the results are mixed. It passes when run first, and later
errors out in other places (although it does not time out).

See: https://review.gluster.org/#/c/20568/2 for the changes and test run
details.
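
(For reference, and assuming run-tests.sh still honours a per-test
SCRIPT_TIMEOUT override, the timeout bump itself is a one-line change at the
top of the .t file. The value below is illustrative; the actual number is in
the review above.)

    #!/bin/bash
    # Sketch only: raise this test's time budget above the default 400 seconds.
    SCRIPT_TIMEOUT=800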

The failure of this test in regression-test-burn-in run #4051 is strange
again; it looks like the test completed within the stipulated time, but
restarted again after cleanup_func was invoked.

Digging a little further, the manner in which cleanup_func and traps are
used in this test seems *interesting* and may need a closer look to
identify possible issues here.
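
To make the concern concrete, the shape in question is roughly the one below.
This is an illustrative sketch, not the actual test code: a function
registered via trap on EXIT fires after the last TAP line has been printed,
so heavy teardown work can make the run look as though the test started up
again.

    #!/bin/bash
    # Illustrative only: the trap-based cleanup pattern being questioned.
    cleanup_func () {
        # Runs on every exit path, after the harness has already seen what
        # looks like the end of the test.
        echo "cleanup_func: removing scratch area"
        rm -rf /tmp/mpx-demo-scratch
    }
    trap cleanup_func EXIT

    mkdir -p /tmp/mpx-demo-scratch
    echo "ok 1 test body ran"
    # A normal exit (or a signal) now triggers cleanup_func one more time.

If the actual test layers a trap like this on top of the harness's own
cleanup, the ordering of the two is worth a closer look.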

@Mohit, request you to take a look at the line coverage failures as
well, as you handle the failures in this test.

Thanks,
Shyam


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-08-01 Thread Shyam Ranganathan
On 08/01/2018 12:13 AM, Sankarshan Mukhopadhyay wrote:
>> Thinking aloud, we may have to stop merges to master to get these test
>> failures addressed at the earliest and to continue maintaining them
>> GREEN for the health of the branch.
>>
>> I would give the above a week, before we lockdown the branch to fix the
>> failures.
>>
> Is 1 week a sufficient estimate to address the issues?
> 

Branching is Aug 20th, so I would say an Aug 6th lockdown decision is
already a little late; also, once we get this going it should be
possible to maintain health going forward. So taking a blocking stance
at this juncture is probably for the best.

Having said that, I am also proposing that we get CentOS 7 regressions and
lcov GREEN by this time, giving mux a week more to get its stability in
place. This is because I believe mux may take a bit longer than the other
two (IOW, addressing the sufficiency concern raised above).


Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-07-31 Thread Sankarshan Mukhopadhyay
On Tue, Jul 31, 2018 at 4:46 PM, Shyam Ranganathan  wrote:
> On 07/30/2018 03:21 PM, Shyam Ranganathan wrote:
>> On 07/24/2018 03:12 PM, Shyam Ranganathan wrote:
>>> 1) master branch health checks (weekly, till branching)
>>>   - Expect every Monday a status update on various tests runs
>>
>> See https://build.gluster.org/job/nightly-master/ for a report on
>> various nightly and periodic jobs on master.
>

This doesn't look like how things are expected to be.

> Thinking aloud, we may have to stop merges to master to get these test
> failures addressed at the earliest and to continue maintaining them
> GREEN for the health of the branch.
>
> I would give the above a week, before we lockdown the branch to fix the
> failures.
>

Is 1 week a sufficient estimate to address the issues?

> Let's try and get line-coverage and nightly regression tests addressed
> this week (leaving mux-regression open), and if addressed not lock the
> branch down.
>
>>
>> RED:
>> 1. Nightly regression (3/6 failed)
>> - Tests that reported failure:
>> ./tests/00-geo-rep/georep-basic-dr-rsync.t
>> ./tests/bugs/core/bug-1432542-mpx-restart-crash.t
>> ./tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
>> ./tests/bugs/distribute/bug-1122443.t
>>
>> - Tests that needed a retry:
>> ./tests/00-geo-rep/georep-basic-dr-tarssh.t
>> ./tests/bugs/glusterd/quorum-validation.t
>>
>> 2. Regression with multiplex (cores and test failures)
>>
>> 3. line-coverage (cores and test failures)
>> - Tests that failed:
>> ./tests/bugs/core/bug-1432542-mpx-restart-crash.t (patch
>> https://review.gluster.org/20568 does not fix the timeout entirely, as
>> can be seen in this run,
>> https://build.gluster.org/job/line-coverage/401/consoleFull )
>>
>> Calling out to contributors to take a look at various failures, and post
>> the same as bugs AND to the lists (so that duplication is avoided) to
>> get this to a GREEN status.
>>
>> GREEN:
>> 1. cpp-check
>> 2. RPM builds
>>
>> IGNORE (for now):
>> 1. clang scan (@nigel, this job requires clang warnings to be fixed to
>> go green, right?)
>>
>> Shyam
-- 
sankarshan mukhopadhyay



Re: [Gluster-devel] [Gluster-Maintainers] Release 5: Master branch health report (Week of 30th July)

2018-07-31 Thread Atin Mukherjee
I just went through the nightly regression report of brick mux runs and
here's what I can summarize.

=
Fails only with brick-mux
=
tests/bugs/core/bug-1432542-mpx-restart-crash.t - Times out even after 400
secs. Refer
https://fstat.gluster.org/failure/209?state=2_date=2018-06-30_date=2018-07-31=all,
specifically the latest report
https://build.gluster.org/job/regression-test-burn-in/4051/consoleText .
Wasn't timing out as frequently as it was till 12 July. But since 27 July,
it has timed out twice. Beginning to believe commit
9400b6f2c8aa219a493961e0ab9770b7f12e80d2 has added the delay and now 400
secs isn't sufficient (Mohit?)
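
For whoever picks this up, one way to confirm the suspicion locally is to
time the test on that commit and on its parent. This is a sketch; it assumes
run-tests.sh accepts individual test paths and that glusterfs is rebuilt and
reinstalled between the two checkouts, with brick mux enabled in both runs:

    # Time the test on the suspect commit ...
    git checkout 9400b6f2c8aa219a493961e0ab9770b7f12e80d2
    # (rebuild and reinstall glusterfs here)
    time ./run-tests.sh tests/bugs/core/bug-1432542-mpx-restart-crash.t

    # ... and on its parent, then compare the wall-clock times.
    git checkout 9400b6f2c8aa219a493961e0ab9770b7f12e80d2^
    # (rebuild and reinstall glusterfs here)
    time ./run-tests.sh tests/bugs/core/bug-1432542-mpx-restart-crash.t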

tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (Ref
- https://build.gluster.org/job/regression-test-with-multiplex/814/console)
-  Test fails only in brick-mux mode, AI on Atin to look at and get back.

tests/bugs/replicate/bug-1433571-undo-pending-only-on-up-bricks.t (
https://build.gluster.org/job/regression-test-with-multiplex/813/console) -
Seems like it failed just twice in the last 30 days as per
https://fstat.gluster.org/failure/251?state=2_date=2018-06-30_date=2018-07-31=all.
Need help from AFR team.

tests/bugs/quota/bug-1293601.t (
https://build.gluster.org/job/regression-test-with-multiplex/812/console) -
Hasn't failed since 26 July, though earlier it was failing regularly. Did we
fix this test through any patch (Mohit?)

tests/bitrot/bug-1373520.t - (
https://build.gluster.org/job/regression-test-with-multiplex/811/console)
- Hasn't failed since 27 July, though earlier it was failing regularly. Did we
fix this test through any patch (Mohit?)

tests/bugs/glusterd/remove-brick-testcases.t - Failed once with a core;
not sure whether brick mux is the culprit here. Ref -
https://build.gluster.org/job/regression-test-with-multiplex/806/console .
Seems to be a glustershd crash. Need help from AFR folks.

=
Fails for non-brick mux case too
=
tests/bugs/distribute/bug-1122443.t - Seems to be failing on my setup very
often, without brick mux as well. Refer
https://build.gluster.org/job/regression-test-burn-in/4050/consoleText .
There's an email in gluster-devel and a BZ 1610240 for the same.

tests/bugs/bug-1368312.t - Seems to be a recent failure (
https://build.gluster.org/job/regression-test-with-multiplex/815/console) -
however it has been seen for a non-brick-mux case too -
https://build.gluster.org/job/regression-test-burn-in/4039/consoleText .
Need some eyes from AFR folks.

tests/00-geo-rep/georep-basic-dr-tarssh.t - this isn't specific to brick
mux, have seen this failing at multiple default regression runs. Refer
https://fstat.gluster.org/failure/392?state=2_date=2018-06-30_date=2018-07-31=all
. We need help from the geo-rep devs to root cause this sooner rather than later.

tests/00-geo-rep/georep-basic-dr-rsync.t - this isn't specific to brick
mux, have seen this failing at multiple default regression runs. Refer
https://fstat.gluster.org/failure/393?state=2_date=2018-06-30_date=2018-07-31=all
. We need help from the geo-rep devs to root cause this sooner rather than later.

tests/bugs/glusterd/validating-server-quorum.t (
https://build.gluster.org/job/regression-test-with-multiplex/810/console) -
Fails for non-brick-mux cases too,
https://fstat.gluster.org/failure/580?state=2_date=2018-06-30_date=2018-07-31=all
. Atin has a patch https://review.gluster.org/20584 which resolves it, but
the patch is failing regression on a different, unrelated test.

tests/bugs/replicate/bug-1586020-mark-dirty-for-entry-txn-on-quorum-failure.t
(Ref -
https://build.gluster.org/job/regression-test-with-multiplex/809/console) -
fails for the non-brick-mux case too -
https://build.gluster.org/job/regression-test-burn-in/4049/consoleText -
Need some eyes from AFR folks.