Re: [Gluster-devel] test failure reports for last 15 days

2019-04-15 Thread Xavi Hernandez
On Mon, Apr 15, 2019 at 11:08 AM Pranith Kumar Karampuri <
pkara...@redhat.com> wrote:

>
>
> On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez 
>> wrote:
>>
>>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
>>> wrote:
>>>
 And now for last 15 days:


 https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10

 ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
 https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
 failing in brick mux post 5th April
 ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
 needs analysis.

>>>
>>> I've identified the problem here, but not the cause yet. There's a stale
>>> inodelk acquired by a process that is already dead, which causes inodelk
>>> requests from self-heal and other processes to block.
>>>
>>> The reason why it seemed to block in random places is that all commands
>>> are executed with the working directory pointing to a gluster directory
>>> which needs healing after the initial tests. Because of the stale inodelk,
>>> when any application tries to open a file in the working directory, it's
>>> blocked.
>>>
>>> I'll investigate what causes this.
>>>
>>
>> I think I've found the problem. This is a fragment of the brick log that
>> includes script steps, connections and disconnections of brick 0, and lock
>> requests to the problematic lock:
>>
>> [2019-04-11 08:22:20.381398]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan
>> /d/backends/patchy2 ++
>> [2019-04-11 08:22:22.532646]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan
>> /d/backends/patchy3 ++
>> [2019-04-11 08:22:23.709655] I [MSGID: 115029]
>> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
>> from
>> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
>> (version: 7dev) with subvol /d/backends/patchy1
>> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
>> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
>> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out]
>> 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
>> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:24.628478]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++
>> [2019-04-11 08:22:26.097092]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o
>> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
>> [2019-04-11 08:22:26.333740]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++
>> [2019-04-11 08:22:27.718963] I [MSGID: 115029]
>> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
>> from
>> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
>> (version: 7dev) with subvol /d/backends/patchy1
>> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
>> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
>> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
>> 8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
>> by 68580998b47f on 0x7f0ed0029190
>> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out]
>> 8-patchy-locks: [Invalid argument] Locker = {Pid=29885,
>> lk-owner=68580998b47f,
>> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
>> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
>> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
>> patchy-disperse-0, start=0, len=0, pid=0}
>>
>> This is a fragment of the client log:
>>
>> [2019-04-11 08:22:20.381398]:++
>> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy 

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-15 Thread Pranith Kumar Karampuri
On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez  wrote:

> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
>> wrote:
>>
>>> And now for last 15 days:
>>>
>>>
>>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>>
>>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>> failing in brick mux post 5th April
>>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>>> needs analysis.
>>>
>>
>> I've identified the problem here, but not the cause yet. There's a stale
>> inodelk acquired by a process that is already dead, which causes inodelk
>> requests from self-heal and other processes to block.
>>
>> The reason why it seemed to block in random places is that all commands
>> are executed with the working directory pointing to a gluster directory
>> which needs healing after the initial tests. Because of the stale inodelk,
>> when any application tries to open a file in the working directory, it's
>> blocked.
>>
>> I'll investigate what causes this.
>>
>
> I think I've found the problem. This is a fragment of the brick log that
> includes script steps, connections and disconnections of brick 0, and lock
> requests to the problematic lock:
>
> [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
> [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++
> [2019-04-11 08:22:23.709655] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 68 5 online_brick_count ++
> [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
> 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
> [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 71 ec_test_make ++
> [2019-04-11 08:22:27.718963] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
> 8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
> by 68580998b47f on 0x7f0ed0029190
> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
>
> This is a fragment of the client log:
>
> [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
> [2019-04-11 08:22:20.675938] I [MSGID: 114018]
> [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
> patchy-client-1. Client 

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-11 Thread FNU Raghavendra Manjunath
While analysing the logs of the runs where uss.t failed, I made the following
observations.

1) In the first iteration of uss.t, the time difference between the first
test of the .t file and the last test of the .t file is just within 1
minute.

But I think it is the cleanup sequence that is taking most of the time. One
of the reasons I suspect this is that we don't see the brick process shutdown
message in the logs.
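
As a sanity check, something along these lines could confirm it (a rough
sketch only, assuming the standard regression log location visible in the
brick log further below; "received signum" is the shutdown message that
cleanup_and_exit normally prints):

# did the bricks from this iteration ever log a shutdown?
grep 'received signum' /var/log/glusterfs/bricks/*.log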


2) The 2nd iteration of uss.t (run because the 1st iteration failed with a
timeout) fails because something in the cleanup sequence of the previous
iteration did not complete.

The volume start command itself fails in the 2nd iteration. Because of that,
the remaining tests also fail.
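
A quick way to verify that would be to check, right before the 2nd iteration
starts, whether anything from the 1st iteration is still around (just a
sketch, assuming the backend layout shown in these logs):

# any brick or self-heal processes left over from the previous iteration?
pgrep -af glusterfsd
pgrep -af glusterfs
# was the export directory wiped, or recreated without its metadata?
ls -ld /d/backends/1/patchy_snap_mnt /d/backends/1/patchy_snap_mnt/.glusterfs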

This is from cmd_history.log

uster.org:/d/backends/2/patchy_snap_mnt
builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt
++
[2019-04-10 19:54:09.145086]  : volume create patchy
builder202.int.aws.gluster.org:/d/backends/1/patchy_snap_mnt
builder202.int.aws.gluster.org:/d/backends/2/patchy_snap_mnt
builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt : SUCCESS
[2019-04-10 19:54:09.156221]:++ G_LOG:./tests/basic/uss.t: TEST: 39
gluster --mode=script --wignore volume set patchy nfs.disable false
++
[2019-04-10 19:54:09.265138]  : volume set patchy nfs.disable false :
SUCCESS
[2019-04-10 19:54:09.274386]:++ G_LOG:./tests/basic/uss.t: TEST: 42
gluster --mode=script --wignore volume start patchy ++
[2019-04-10 19:54:09.565086]  : volume start patchy : FAILED : Commit
failed on localhost. Please check log file for details.
[2019-04-10 19:54:09.572753]:++ G_LOG:./tests/basic/uss.t: TEST: 44
_GFS --attribute-timeout=0 --entry-timeout=0 --volfile-server=
builder202.int.aws.gluster.org --volfile-id=patchy /mnt/glusterfs/0
++


And this is from the brick log, showing an issue with the export directory
not being set up properly.

[2019-04-10 19:54:09.544476] I [MSGID: 100030] [glusterfsd.c:2857:main]
0-/build/install/sbin/glusterfsd: Started running
/build/install/sbin/glusterfsd version 7dev (args:
/build/install/sbin/glusterfsd -s buil
der202.int.aws.gluster.org --volfile-id
patchy.builder202.int.aws.gluster.org.d-backends-1-patchy_snap_mnt -p
/var/run/gluster/vols/patchy/builder202.int.aws.gluster.org-d-backends-1-patchy_snap_mnt.pid
-S /var/
run/gluster/7ac65190b72da80a.socket --brick-name
/d/backends/1/patchy_snap_mnt -l
/var/log/glusterfs/bricks/d-backends-1-patchy_snap_mnt.log --xlator-option
*-posix.glusterd-uuid=695c060d-74d3-440e-8cdb-327ec297
f2d2 --process-name brick --brick-port 49152 --xlator-option
patchy-server.listen-port=49152)
[2019-04-10 19:54:09.549394] I [socket.c:962:__socket_server_bind]
0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9
[2019-04-10 19:54:09.553190] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 1
[2019-04-10 19:54:09.553209] I [MSGID: 101190]
[event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread
with index 0
[2019-04-10 19:54:09.556932] I
[rpcsvc.c:2694:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured
rpc.outstanding-rpc-limit with value 64
[2019-04-10 19:54:09.557859] E [MSGID: 138001] [index.c:2392:init]
0-patchy-index: Failed to find parent dir
(/d/backends/1/patchy_snap_mnt/.glusterfs) of index basepath
/d/backends/1/patchy_snap_mnt/.glusterfs/
indices. [No such file or directory] => (.glusterfs is absent)
[2019-04-10 19:54:09.557884] E [MSGID: 101019] [xlator.c:629:xlator_init]
0-patchy-index: Initialization of volume 'patchy-index' failed, review your
volfile again
[2019-04-10 19:54:09.557892] E [MSGID: 101066]
[graph.c:409:glusterfs_graph_init] 0-patchy-index: initializing translator
failed
[2019-04-10 19:54:09.557900] E [MSGID: 101176]
[graph.c:772:glusterfs_graph_activate] 0-graph: init failed
[2019-04-10 19:54:09.564154] I [io-stats.c:4033:fini] 0-patchy-io-stats:
io-stats translator unloaded
[2019-04-10 19:54:09.564748] W [glusterfsd.c:1592:cleanup_and_exit]
(-->/build/install/sbin/glusterfsd(mgmt_getspec_cbk+0x806) [0x411f32]
-->/build/install/sbin/glusterfsd(glusterfs_process_volfp+0x272) [0x40b9b
9] -->/build/install/sbin/glusterfsd(cleanup_and_exit+0x88) [0x4093a5] )
0-: received signum (-1), shutting down


And this is from the cmd_history.log file of the 2nd iteration of uss.t, from
another Jenkins run.

[2019-04-10 15:35:51.927343]:++ G_LOG:./tests/basic/uss.t: TEST: 39
gluster --mode=script --wignore volume set patchy nfs.disable false
++
[2019-04-10 15:35:52.038072]  : volume set patchy nfs.disable false :
SUCCESS
[2019-04-10 15:35:52.057582]:++ G_LOG:./tests/basic/uss.t: TEST: 42
gluster --mode=script --wignore volume start patchy ++
[2019-04-10 15:35:52.104288]  : volume start patchy : FAILED : Failed to
find brick directory /d/backends/1/patchy_snap_mnt for volume patchy.
Reason : No such file or directory => (export 

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-11 Thread Xavi Hernandez
On Thu, Apr 11, 2019 at 11:28 AM Xavi Hernandez  wrote:

> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez 
> wrote:
>
>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
>> wrote:
>>
>>> And now for last 15 days:
>>>
>>>
>>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>>
>>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>>> failing in brick mux post 5th April
>>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>>> needs analysis.
>>>
>>
>> I've identified the problem here, but not the cause yet. There's a stale
>> inodelk acquired by a process that is already dead, which causes inodelk
>> requests from self-heal and other processes to block.
>>
>> The reason why it seemed to block in random places is that all commands
>> are executed with the working directory pointing to a gluster directory
>> which needs healing after the initial tests. Because of the stale inodelk,
>> when any application tries to open a file in the working directory, it's
>> blocked.
>>
>> I'll investigate what causes this.
>>
>
> I think I've found the problem. This is a fragment of the brick log that
> includes script steps, connections and disconnections of brick 0, and lock
> requests to the problematic lock:
>
> [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
> [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++
> [2019-04-11 08:22:23.709655] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 68 5 online_brick_count ++
> [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
> 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
> [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 71 ec_test_make ++
> [2019-04-11 08:22:27.718963] I [MSGID: 115029]
> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
> from
> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
> (version: 7dev) with subvol /d/backends/patchy1
> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
> 8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
> by 68580998b47f on 0x7f0ed0029190
> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
> [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f,
> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
> patchy-disperse-0, start=0, len=0, pid=0}
>
> This is a fragment of the client log:
>
> [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
> TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
> [2019-04-11 08:22:20.675938] I [MSGID: 114018]
> [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
> patchy-client-1. Client 

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-11 Thread Xavi Hernandez
On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez  wrote:

> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee 
> wrote:
>
>> And now for last 15 days:
>>
>>
>> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>>
>> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
>> failing in brick mux post 5th April
>> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux,
>> needs analysis.
>>
>
> I've identified the problem here, but not the cause yet. There's a stale
> inodelk acquired by a process that is already dead, which causes inodelk
> requests from self-heal and other processes to block.
>
> The reason why it seemed to block in random places is that all commands
> are executed with the working directory pointing to a gluster directory
> which needs healing after the initial tests. Because of the stale inodelk,
> when any application tries to open a file in the working directory, it's
> blocked.
>
> I'll investigate what causes this.
>

I think I've found the problem. This is a fragment of the brick log that
includes script steps, connections and disconnections of brick 0, and lock
requests to the problematic lock:

[2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
[2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++
[2019-04-11 08:22:23.709655] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks:
[GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2,
Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 68 5 online_brick_count ++
[2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o
2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++
[2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 71 ec_test_make ++
[2019-04-11 08:22:27.718963] I [MSGID: 115029]
[server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client
from
CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3
(version: 7dev) with subvol /d/backends/patchy1
[2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks:
[REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}
[2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock]
8-patchy-locks:  Matching lock not found for unlock 0-9223372036854775807,
by 68580998b47f on 0x7f0ed0029190
[2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks:
[Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f,
Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3,
Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil),
path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain:
patchy-disperse-0, start=0, len=0, pid=0}

This is a fragment of the client log:

[2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t:
TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++
[2019-04-11 08:22:20.675938] I [MSGID: 114018]
[client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from
patchy-client-1. Client process will keep trying to connect to glusterd
until brick's port is available
[2019-04-11 08:22:21.674772] W [MSGID: 122035]
[ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation
with some 

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-10 Thread FNU Raghavendra Manjunath
On Wed, Apr 10, 2019 at 9:59 AM Atin Mukherjee  wrote:

> And now for last 15 days:
>
> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>
> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
> failing in brick mux post 5th April
>

The above patch has been sent to fix the failure with brick mux enabled.


> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux, needs
> analysis.
> ./tests/basic/uss.t 15  ==> happens in both brick mux and non
> brick mux runs, test just simply times out. Needs urgent analysis.
>

Nothing has changed in snapview-server and snapview-client recently.
Looking into it.
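
(For reference, the check is essentially along these lines; a sketch only,
assuming the usual xlators/features/ layout of the glusterfs source tree and
an arbitrary cut-off date:)

git log --oneline --since=2019-02-01 -- \
    xlators/features/snapview-server xlators/features/snapview-client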

> ./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through
> https://review.gluster.org/#/c/22508/ , patch merged today.
> ./tests/basic/volfile-sanity.t  8  ==> Some race, though this succeeds
> in second attempt every time.
>
> There're plenty more with 5 instances of failure from many tests. We need
> all maintainers/owners to look through these failures and fix them, we
> certainly don't want to get into a stage where master is unstable and we
> have to lock down the merges till all these failures are resolved. So
> please help.
>
> (Please note fstat stats show up the retries as failures too which in a
> way is right)
>
>
> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee 
> wrote:
>
>> [1] captures the test failures report since last 30 days and we'd need
>> volunteers/component owners to see why the number of failures are so high
>> against few tests.
>>
>> [1]
>> https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&branch=all
>>

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-10 Thread Xavi Hernandez
On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee  wrote:

> And now for last 15 days:
>
> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>
> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
> failing in brick mux post 5th April
> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux, needs
> analysis.
>

I've identified the problem here, but not the cause yet. There's a stale
inodelk acquired by a process that is already dead, which causes inodelk
requests from self-heal and other processes to block.
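
One way to see the stale lock from the brick side is a statedump (a rough
sketch, assuming the default statedump path used by the regression setup):

# dump brick state and look for granted inodelks whose client no longer exists
gluster volume statedump patchy
grep -i inodelk /var/run/gluster/*dump*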

The reason why it seemed to block in random places is that all commands are
executed with the working directory pointing to a gluster directory which
needs healing after the initial tests. Because of the stale inodelk, when
any application tries to open a file in the working directory, it's blocked.

I'll investigate what causes this.

Xavi

> ./tests/basic/uss.t 15  ==> happens in both brick mux and non
> brick mux runs, test just simply times out. Needs urgent analysis.
> ./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through
> https://review.gluster.org/#/c/22508/ , patch merged today.
> ./tests/basic/volfile-sanity.t  8  ==> Some race, though this succeeds
> in second attempt every time.
>
> There're plenty more with 5 instances of failure from many tests. We need
> all maintainers/owners to look through these failures and fix them, we
> certainly don't want to get into a stage where master is unstable and we
> have to lock down the merges till all these failures are resolved. So
> please help.
>
> (Please note fstat stats show up the retries as failures too which in a
> way is right)
>
>
> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee 
> wrote:
>
>> [1] captures the test failures report since last 30 days and we'd need
>> volunteers/component owners to see why the number of failures are so high
>> against few tests.
>>
>> [1]
>> https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&branch=all
>>

Re: [Gluster-devel] test failure reports for last 15 days

2019-04-10 Thread Amar Tumballi Suryanarayan
Thanks for the summary Atin.

On Wed, Apr 10, 2019 at 7:30 PM Atin Mukherjee  wrote:

> And now for last 15 days:
>
> https://fstat.gluster.org/summary?start_date=2019-03-25&end_date=2019-04-10
>
> ./tests/bitrot/bug-1373520.t 18  ==> Fixed through
> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this
> failing in brick mux post 5th April
> ./tests/bugs/ec/bug-1236065.t 17  ==> happens only in brick mux, needs
> analysis.
> ./tests/basic/uss.t 15  ==> happens in both brick mux and non
> brick mux runs, test just simply times out. Needs urgent analysis.
> ./tests/basic/ec/ec-fix-openfd.t 13  ==> Fixed through
> https://review.gluster.org/#/c/22508/ , patch merged today.
> ./tests/basic/volfile-sanity.t  8  ==> Some race, though this succeeds
> in second attempt every time.
>
>
Could volfile-sanity.t be failing because of the 'hang' in uss.t? It is
possible, as volfile-sanity.t runs after uss.t in regressions. I checked
volfile-sanity.t and it does call 'cleanup' at the beginning, but I'm not
sure whether there is any lingering state that could have caused these
failures.
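
If it is leakage from uss.t, something like the following at the start of
volfile-sanity.t should show it (just a sketch, not anything authoritative):

pgrep -af gluster        # leftover brick/shd/client processes
mount | grep -i gluster  # leftover fuse mounts under /mnt/glusterfs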


> There're plenty more with 5 instances of failure from many tests. We need
> all maintainers/owners to look through these failures and fix them, we
> certainly don't want to get into a stage where master is unstable and we
> have to lock down the merges till all these failures are resolved. So
> please help.
>
> (Please note fstat stats show up the retries as failures too which in a
> way is right)
>
>
> On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee 
> wrote:
>
>> [1] captures the test failures report since last 30 days and we'd need
>> volunteers/component owners to see why the number of failures are so high
>> against few tests.
>>
>> [1]
>> https://fstat.gluster.org/summary?start_date=2019-01-26&end_date=2019-02-25&branch=all
>>



-- 
Amar Tumballi (amarts)