Re: [Gluster-devel] test failure reports for last 15 days
On Mon, Apr 15, 2019 at 11:08 AM Pranith Kumar Karampuri < pkara...@redhat.com> wrote: > > > On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez > wrote: > >> On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez >> wrote: >> >>> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee >>> wrote: >>> And now for last 15 days: https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 ./tests/bitrot/bug-1373520.t 18 ==> Fixed through https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this failing in brick mux post 5th April ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, needs analysis. >>> >>> I've identified the problem here, but not the cause yet. There's a stale >>> inodelk acquired by a process that is already dead, which causes inodelk >>> requests from self-heal and other processes to block. >>> >>> The reason why it seemed to block in random places is that all commands >>> are executed with the working directory pointing to a gluster directory >>> which needs healing after the initial tests. Because of the stale inodelk, >>> when any application tries to open a file in the working directory, it's >>> blocked. >>> >>> I'll investigate what causes this. >>> >> >> I think I've found the problem. This is a fragment of the brick log that >> includes script steps, connections and disconnections of brick 0, and lock >> requests to the problematic lock: >> >> [2019-04-11 08:22:20.381398]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan >> /d/backends/patchy2 ++ >> [2019-04-11 08:22:22.532646]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan >> /d/backends/patchy3 ++ >> [2019-04-11 08:22:23.709655] I [MSGID: 115029] >> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client >> from >> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 >> (version: 7dev) with subvol /d/backends/patchy1 >> [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: >> [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f, >> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, >> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), >> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: >> patchy-disperse-0, start=0, len=0, pid=0} >> [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] >> 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f, >> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, >> Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), >> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: >> patchy-disperse-0, start=0, len=0, pid=0} >> [2019-04-11 08:22:24.628478]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++ >> [2019-04-11 08:22:26.097092]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o >> 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++ >> [2019-04-11 08:22:26.333740]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++ >> [2019-04-11 08:22:27.718963] I [MSGID: 115029] >> [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client >> from >> CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 >> (version: 7dev) with subvol /d/backends/patchy1 >> [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: >> [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f, >> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, >> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), >> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: >> patchy-disperse-0, start=0, len=0, pid=0} >> [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] >> 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, >> by 68580998b47f on 0x7f0ed0029190 >> [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] >> 8-patchy-locks: [Invalid argument] Locker = {Pid=29885, >> lk-owner=68580998b47f, >> Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, >> Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), >> path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: >> patchy-disperse-0, start=0, len=0, pid=0} >> >> This is a fragment of the client log: >> >> [2019-04-11 08:22:20.381398]:++ >> G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy
Re: [Gluster-devel] test failure reports for last 15 days
On Thu, Apr 11, 2019 at 2:59 PM Xavi Hernandez wrote: > On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez > wrote: > >> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee >> wrote: >> >>> And now for last 15 days: >>> >>> >>> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 >>> >>> ./tests/bitrot/bug-1373520.t 18 ==> Fixed through >>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this >>> failing in brick mux post 5th April >>> ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, >>> needs analysis. >>> >> >> I've identified the problem here, but not the cause yet. There's a stale >> inodelk acquired by a process that is already dead, which causes inodelk >> requests from self-heal and other processes to block. >> >> The reason why it seemed to block in random places is that all commands >> are executed with the working directory pointing to a gluster directory >> which needs healing after the initial tests. Because of the stale inodelk, >> when any application tries to open a file in the working directory, it's >> blocked. >> >> I'll investigate what causes this. >> > > I think I've found the problem. This is a fragment of the brick log that > includes script steps, connections and disconnections of brick 0, and lock > requests to the problematic lock: > > [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ > [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++ > [2019-04-11 08:22:23.709655] I [MSGID: 115029] > [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client > from > CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 > (version: 7dev) with subvol /d/backends/patchy1 > [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: > [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, > Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks: > [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, > Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 68 5 online_brick_count ++ > [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o > 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++ > [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 71 ec_test_make ++ > [2019-04-11 08:22:27.718963] I [MSGID: 115029] > [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client > from > CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 > (version: 7dev) with subvol /d/backends/patchy1 > [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: > [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, > Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] > 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, > by 68580998b47f on 0x7f0ed0029190 > [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks: > [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, > Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: > patchy-disperse-0, start=0, len=0, pid=0} > > This is a fragment of the client log: > > [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ > [2019-04-11 08:22:20.675938] I [MSGID: 114018] > [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from > patchy-client-1. Client
Re: [Gluster-devel] test failure reports for last 15 days
While analysing the logs of the runs where uss.t failed made following observations. 1) In the first iteration of uss.t, the time difference between the first test of the .t file and the last test of the .t file is just within 1 minute. But, I think it is the cleanup sequence which is taking more time. One of the reasons I guess this is happening is, we dont see the brick process shutting down message in the logs. 2) In the 2nd iteration of uss.t (because 1st iteration failed because of timeout) it fails because something has not been completed in the cleanup sequence of the previous iteration. The volume start command itself fails in the 2nd iteration. Because of that the remaining tests also fail This is from cmd_history.log uster.org:/d/backends/2/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt ++ [2019-04-10 19:54:09.145086] : volume create patchy builder202.int.aws.gluster.org:/d/backends/1/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/2/patchy_snap_mnt builder202.int.aws.gluster.org:/d/backends/3/patchy_snap_mnt : SUCCESS [2019-04-10 19:54:09.156221]:++ G_LOG:./tests/basic/uss.t: TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable false ++ [2019-04-10 19:54:09.265138] : volume set patchy nfs.disable false : SUCCESS [2019-04-10 19:54:09.274386]:++ G_LOG:./tests/basic/uss.t: TEST: 42 gluster --mode=script --wignore volume start patchy ++ [2019-04-10 19:54:09.565086] : volume start patchy : FAILED : Commit failed on localhost. Please check log file for details. [2019-04-10 19:54:09.572753]:++ G_LOG:./tests/basic/uss.t: TEST: 44 _GFS --attribute-timeout=0 --entry-timeout=0 --volfile-server= builder202.int.aws.gluster.org --volfile-id=patchy /mnt/glusterfs/0 ++ And this is from the brick showing some issue with the export directory not being present properly. [2019-04-10 19:54:09.544476] I [MSGID: 100030] [glusterfsd.c:2857:main] 0-/build/install/sbin/glusterfsd: Started running /build/install/sbin/glusterfsd version 7dev (args: /build/install/sbin/glusterfsd -s buil der202.int.aws.gluster.org --volfile-id patchy.builder202.int.aws.gluster.org.d-backends-1-patchy_snap_mnt -p /var/run/gluster/vols/patchy/builder202.int.aws.gluster.org-d-backends-1-patchy_snap_mnt.pid -S /var/ run/gluster/7ac65190b72da80a.socket --brick-name /d/backends/1/patchy_snap_mnt -l /var/log/glusterfs/bricks/d-backends-1-patchy_snap_mnt.log --xlator-option *-posix.glusterd-uuid=695c060d-74d3-440e-8cdb-327ec297 f2d2 --process-name brick --brick-port 49152 --xlator-option patchy-server.listen-port=49152) [2019-04-10 19:54:09.549394] I [socket.c:962:__socket_server_bind] 0-socket.glusterfsd: closing (AF_UNIX) reuse check socket 9 [2019-04-10 19:54:09.553190] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1 [2019-04-10 19:54:09.553209] I [MSGID: 101190] [event-epoll.c:680:event_dispatch_epoll_worker] 0-epoll: Started thread with index 0 [2019-04-10 19:54:09.556932] I [rpcsvc.c:2694:rpcsvc_set_outstanding_rpc_limit] 0-rpc-service: Configured rpc.outstanding-rpc-limit with value 64 [2019-04-10 19:54:09.557859] E [MSGID: 138001] [index.c:2392:init] 0-patchy-index: Failed to find parent dir (/d/backends/1/patchy_snap_mnt/.glusterfs) of index basepath /d/backends/1/patchy_snap_mnt/.glusterfs/ indices. [No such file or directory]> (.glusterfs is absent) [2019-04-10 19:54:09.557884] E [MSGID: 101019] [xlator.c:629:xlator_init] 0-patchy-index: Initialization of volume 'patchy-index' failed, review your volfile again [2019-04-10 19:54:09.557892] E [MSGID: 101066] [graph.c:409:glusterfs_graph_init] 0-patchy-index: initializing translator failed [2019-04-10 19:54:09.557900] E [MSGID: 101176] [graph.c:772:glusterfs_graph_activate] 0-graph: init failed [2019-04-10 19:54:09.564154] I [io-stats.c:4033:fini] 0-patchy-io-stats: io-stats translator unloaded [2019-04-10 19:54:09.564748] W [glusterfsd.c:1592:cleanup_and_exit] (-->/build/install/sbin/glusterfsd(mgmt_getspec_cbk+0x806) [0x411f32] -->/build/install/sbin/glusterfsd(glusterfs_process_volfp+0x272) [0x40b9b 9] -->/build/install/sbin/glusterfsd(cleanup_and_exit+0x88) [0x4093a5] ) 0-: received signum (-1), shutting down And this is from the cmd_history.log file of the 2nd iteration uss.t from another jenkins run of uss.t [2019-04-10 15:35:51.927343]:++ G_LOG:./tests/basic/uss.t: TEST: 39 gluster --mode=script --wignore volume set patchy nfs.disable false ++ [2019-04-10 15:35:52.038072] : volume set patchy nfs.disable false : SUCCESS [2019-04-10 15:35:52.057582]:++ G_LOG:./tests/basic/uss.t: TEST: 42 gluster --mode=script --wignore volume start patchy ++ [2019-04-10 15:35:52.104288] : volume start patchy : FAILED : Failed to find brick directory /d/backends/1/patchy_snap_mnt for volume patchy. Reason : No such file or directory => (export
Re: [Gluster-devel] test failure reports for last 15 days
On Thu, Apr 11, 2019 at 11:28 AM Xavi Hernandez wrote: > On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez > wrote: > >> On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee >> wrote: >> >>> And now for last 15 days: >>> >>> >>> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 >>> >>> ./tests/bitrot/bug-1373520.t 18 ==> Fixed through >>> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this >>> failing in brick mux post 5th April >>> ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, >>> needs analysis. >>> >> >> I've identified the problem here, but not the cause yet. There's a stale >> inodelk acquired by a process that is already dead, which causes inodelk >> requests from self-heal and other processes to block. >> >> The reason why it seemed to block in random places is that all commands >> are executed with the working directory pointing to a gluster directory >> which needs healing after the initial tests. Because of the stale inodelk, >> when any application tries to open a file in the working directory, it's >> blocked. >> >> I'll investigate what causes this. >> > > I think I've found the problem. This is a fragment of the brick log that > includes script steps, connections and disconnections of brick 0, and lock > requests to the problematic lock: > > [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ > [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++ > [2019-04-11 08:22:23.709655] I [MSGID: 115029] > [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client > from > CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 > (version: 7dev) with subvol /d/backends/patchy1 > [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: > [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, > Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks: > [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, > Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 68 5 online_brick_count ++ > [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o > 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++ > [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 71 ec_test_make ++ > [2019-04-11 08:22:27.718963] I [MSGID: 115029] > [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client > from > CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 > (version: 7dev) with subvol /d/backends/patchy1 > [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: > [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, > Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: > patchy-disperse-0, start=0, len=0, pid=0} > [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] > 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, > by 68580998b47f on 0x7f0ed0029190 > [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks: > [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f, > Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, > Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), > path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: > patchy-disperse-0, start=0, len=0, pid=0} > > This is a fragment of the client log: > > [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: > TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ > [2019-04-11 08:22:20.675938] I [MSGID: 114018] > [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from > patchy-client-1. Client
Re: [Gluster-devel] test failure reports for last 15 days
On Wed, Apr 10, 2019 at 7:25 PM Xavi Hernandez wrote: > On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee > wrote: > >> And now for last 15 days: >> >> >> https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 >> >> ./tests/bitrot/bug-1373520.t 18 ==> Fixed through >> https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this >> failing in brick mux post 5th April >> ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, >> needs analysis. >> > > I've identified the problem here, but not the cause yet. There's a stale > inodelk acquired by a process that is already dead, which causes inodelk > requests from self-heal and other processes to block. > > The reason why it seemed to block in random places is that all commands > are executed with the working directory pointing to a gluster directory > which needs healing after the initial tests. Because of the stale inodelk, > when any application tries to open a file in the working directory, it's > blocked. > > I'll investigate what causes this. > I think I've found the problem. This is a fragment of the brick log that includes script steps, connections and disconnections of brick 0, and lock requests to the problematic lock: [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ [2019-04-11 08:22:22.532646]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 67 kill_brick patchy jahernan /d/backends/patchy3 ++ [2019-04-11 08:22:23.709655] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2 (version: 7dev) with subvol /d/backends/patchy1 [2019-04-11 08:22:23.792204] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29710, lk-owner=68580998b47f, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0} [2019-04-11 08:22:23.792299] I [common.c:285:pl_trace_out] 8-patchy-locks: [GRANTED] Locker = {Pid=29710, lk-owner=68580998b47f, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-2, Frame=18676} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=WRITE, domain: patchy-disperse-0, start=0, len=0, pid=0} [2019-04-11 08:22:24.628478]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 68 5 online_brick_count ++ [2019-04-11 08:22:26.097092]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 70 rm -f 0.o 10.o 11.o 12.o 13.o 14.o 15.o 16.o 17.o 18.o 19.o 1.o 2.o 3.o 4.o 5.o 6.o 7.o 8.o 9.o ++ [2019-04-11 08:22:26.333740]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 71 ec_test_make ++ [2019-04-11 08:22:27.718963] I [MSGID: 115029] [server-handshake.c:550:server_setvolume] 0-patchy-server: accepted client from CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3 (version: 7dev) with subvol /d/backends/patchy1 [2019-04-11 08:22:27.801416] I [common.c:234:pl_trace_in] 8-patchy-locks: [REQUEST] Locker = {Pid=29885, lk-owner=68580998b47f, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0} [2019-04-11 08:22:27.801434] E [inodelk.c:513:__inode_unlock_lock] 8-patchy-locks: Matching lock not found for unlock 0-9223372036854775807, by 68580998b47f on 0x7f0ed0029190 [2019-04-11 08:22:27.801446] I [common.c:285:pl_trace_out] 8-patchy-locks: [Invalid argument] Locker = {Pid=29885, lk-owner=68580998b47f, Client=CTX_ID:1c2952c2-e90f-4631-8712-170b8c05aa6e-GRAPH_ID:0-PID:28900-HOST:jahernan-PC_NAME:patchy-client-1-RECON_NO:-3, Frame=19233} Lockee = {gfid=35743386-b7c2-41c9-aafd-6b13de216704, fd=(nil), path=/test} Lock = {lock=INODELK, cmd=SETLK, type=UNLOCK, domain: patchy-disperse-0, start=0, len=0, pid=0} This is a fragment of the client log: [2019-04-11 08:22:20.381398]:++ G_LOG:tests/bugs/ec/bug-1236065.t: TEST: 66 kill_brick patchy jahernan /d/backends/patchy2 ++ [2019-04-11 08:22:20.675938] I [MSGID: 114018] [client.c:2333:client_rpc_notify] 0-patchy-client-1: disconnected from patchy-client-1. Client process will keep trying to connect to glusterd until brick's port is available [2019-04-11 08:22:21.674772] W [MSGID: 122035] [ec-common.c:654:ec_child_select] 0-patchy-disperse-0: Executing operation with some
Re: [Gluster-devel] test failure reports for last 15 days
On Wed, Apr 10, 2019 at 9:59 AM Atin Mukherjee wrote: > And now for last 15 days: > > https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 > > ./tests/bitrot/bug-1373520.t 18 ==> Fixed through > https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this > failing in brick mux post 5th April > The above patch has been sent to fix the failure with brick mux enabled. > ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, needs > analysis. > ./tests/basic/uss.t 15 ==> happens in both brick mux and non > brick mux runs, test just simply times out. Needs urgent analysis. > Nothing has changed in snapview-server and snapview-client recently. Looking into it. ./tests/basic/ec/ec-fix-openfd.t 13 ==> Fixed through > https://review.gluster.org/#/c/22508/ , patch merged today. > ./tests/basic/volfile-sanity.t 8 ==> Some race, though this succeeds > in second attempt every time. > > There're plenty more with 5 instances of failure from many tests. We need > all maintainers/owners to look through these failures and fix them, we > certainly don't want to get into a stage where master is unstable and we > have to lock down the merges till all these failures are resolved. So > please help. > > (Please note fstat stats show up the retries as failures too which in a > way is right) > > > On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee > wrote: > >> [1] captures the test failures report since last 30 days and we'd need >> volunteers/component owners to see why the number of failures are so high >> against few tests. >> >> [1] >> https://fstat.gluster.org/summary?start_date=2019-01-26_date=2019-02-25=all >> > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] test failure reports for last 15 days
On Wed, Apr 10, 2019 at 4:01 PM Atin Mukherjee wrote: > And now for last 15 days: > > https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 > > ./tests/bitrot/bug-1373520.t 18 ==> Fixed through > https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this > failing in brick mux post 5th April > ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, needs > analysis. > I've identified the problem here, but not the cause yet. There's a stale inodelk acquired by a process that is already dead, which causes inodelk requests from self-heal and other processes to block. The reason why it seemed to block in random places is that all commands are executed with the working directory pointing to a gluster directory which needs healing after the initial tests. Because of the stale inodelk, when any application tries to open a file in the working directory, it's blocked. I'll investigate what causes this. Xavi ./tests/basic/uss.t 15 ==> happens in both brick mux and non > brick mux runs, test just simply times out. Needs urgent analysis. > ./tests/basic/ec/ec-fix-openfd.t 13 ==> Fixed through > https://review.gluster.org/#/c/22508/ , patch merged today. > ./tests/basic/volfile-sanity.t 8 ==> Some race, though this succeeds > in second attempt every time. > > There're plenty more with 5 instances of failure from many tests. We need > all maintainers/owners to look through these failures and fix them, we > certainly don't want to get into a stage where master is unstable and we > have to lock down the merges till all these failures are resolved. So > please help. > > (Please note fstat stats show up the retries as failures too which in a > way is right) > > > On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee > wrote: > >> [1] captures the test failures report since last 30 days and we'd need >> volunteers/component owners to see why the number of failures are so high >> against few tests. >> >> [1] >> https://fstat.gluster.org/summary?start_date=2019-01-26_date=2019-02-25=all >> > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel ___ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel
Re: [Gluster-devel] test failure reports for last 15 days
Thanks for the summary Atin. On Wed, Apr 10, 2019 at 7:30 PM Atin Mukherjee wrote: > And now for last 15 days: > > https://fstat.gluster.org/summary?start_date=2019-03-25_date=2019-04-10 > > ./tests/bitrot/bug-1373520.t 18 ==> Fixed through > https://review.gluster.org/#/c/glusterfs/+/22481/, I don't see this > failing in brick mux post 5th April > ./tests/bugs/ec/bug-1236065.t 17 ==> happens only in brick mux, needs > analysis. > ./tests/basic/uss.t 15 ==> happens in both brick mux and non > brick mux runs, test just simply times out. Needs urgent analysis. > ./tests/basic/ec/ec-fix-openfd.t 13 ==> Fixed through > https://review.gluster.org/#/c/22508/ , patch merged today. > ./tests/basic/volfile-sanity.t 8 ==> Some race, though this succeeds > in second attempt every time. > > Can volfile-sanity.t be failing because of the 'hang' in uss.t ? It is possible as volfile-sanity.t runs after uss.t in regressions. I checked volfile-sanity.t, but it has 'cleanup' at the beginning, but not sure if there are any lingering things which caused these failures. > There're plenty more with 5 instances of failure from many tests. We need > all maintainers/owners to look through these failures and fix them, we > certainly don't want to get into a stage where master is unstable and we > have to lock down the merges till all these failures are resolved. So > please help. > > (Please note fstat stats show up the retries as failures too which in a > way is right) > > > On Tue, Feb 26, 2019 at 5:27 PM Atin Mukherjee > wrote: > >> [1] captures the test failures report since last 30 days and we'd need >> volunteers/component owners to see why the number of failures are so high >> against few tests. >> >> [1] >> https://fstat.gluster.org/summary?start_date=2019-01-26_date=2019-02-25=all >> > ___ > Gluster-devel mailing list > Gluster-devel@gluster.org > https://lists.gluster.org/mailman/listinfo/gluster-devel -- Amar Tumballi (amarts) ___ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel