Re: [Ocfs2-users] Disk access hang
Hello, thanks for the hints. Obviously, you won't be surprised: no tcpdump :) As soon as I get another hang (they happen quite often; if not, I'll cause one on purpose) I'll post all the information you suggested gathering.

Thanks, kind regards,
G.

On Fri, Mar 12, 2010 at 4:01 AM, Sunil Mushran sunil.mush...@oracle.com wrote:
[quoted text trimmed; the message appears in full elsewhere in this digest]
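When the next hang comes, something along the following lines would produce the tcpdump Sunil mentions. A minimal sketch, assuming the cluster interconnect is on eth0 and that o2net uses port 7777; both are assumptions, so check the interface and the ip_port values in /etc/ocfs2/cluster.conf first:

# Capture full o2net packets to a file until interrupted (run as root).
tcpdump -i eth0 -s 0 -w /var/tmp/o2net.pcap port 7777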
Re: [Ocfs2-users] Disk access hang
Hello, I looked for the information you requested.

1) The eviction message was on all nodes. Playing with grep I noticed that on some nodes it appeared twice, with different numbers in parentheses:

Mar 4 04:10:22 node05 kernel: (22595,1):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:23 node05 kernel: (22328,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:35 node07 kernel: (6900,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:35 node07 kernel: (6892,0):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471

2) The recovery master message appeared on one node; here is the log from that time. Please note that node10 (hostname) is Node 3 (ocfs2 settings).

Mar 4 04:09:51 node10 kernel: o2net: connection to node node08 (num 9) at 192.168.1.8: has been idle for 30.0 seconds, shutting it down.
Mar 4 04:09:51 node10 kernel: (0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1267672161.718025 now 1267672191.723171 dr 1267672161.718019 adv 1267672161.718025:1267672161.718026 func (a6c57cb2:502) 1267552114.706439:1267552114.706441)
Mar 4 04:09:51 node10 kernel: o2net: no longer connected to node node08 (num 9) at 192.168.1.8:
Mar 4 04:10:21 node10 kernel: (30475,0):o2net_connect_expired:1659 ERROR: no connection established with node 9 after 30.0 seconds, giving up and returning errors.
Mar 4 04:10:23 node10 kernel: (30740,1):o2dlm_eviction_cb:258 o2dlm has evicted node 9 from group B2F5C3291557493B99AE7326AF8B7471
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:839 B2F5C3291557493B99AE7326AF8B7471:$RECOVERY: at least one node (9) to recover before lock mastery can begin
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_get_lock_resource:873 B2F5C3291557493B99AE7326AF8B7471: recovery map is not empty, but must master $RECOVERY lock now
Mar 4 04:10:23 node10 kernel: (30772,0):dlm_do_recovery:524 (30772) Node 3 is the Recovery Master for the Dead Node 9 for Domain B2F5C3291557493B99AE7326AF8B7471

And the log doesn't contain anything else until the morning. Instead, another node contains the following:

Mar 4 04:10:29 node05 kernel: (1861,1):ocfs2_replay_journal:1224 Recovering node 9 from slot 7 on device (152,0)

But the ocfs2 disk was unavailable anyway. Any other hints?

Regards,
G.

On Wed, Mar 10, 2010 at 8:56 PM, Sunil Mushran sunil.mush...@oracle.com wrote:
[quoted text trimmed; the message appears in full elsewhere in this digest]
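For completeness, the kind of grep pass Gabriele describes can be run across all nodes in one go. A rough sketch, assuming kernel messages land in /var/log/messages on each node and that ssh access between the nodes is available (both assumptions; adjust the hostnames and log path):

# Pull the eviction/recovery messages from every node's kernel log.
for h in node05 node07 node10; do
    echo "== $h =="
    ssh "$h" "grep -E 'o2dlm_eviction_cb|Recovery Master|ocfs2_replay_journal' /var/log/messages"
done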
Re: [Ocfs2-users] Disk access hang
Other than the messages, the only other persistent information that can be used for debugging is a tcpdump. And I will be surprised if you have that.

Some other useful files are:

# cat /sys/kernel/debug/o2dlm/DOMAIN/dlm_state
# cat /sys/kernel/debug/ocfs2/UUID/fs_state

These are synthetic files and thus not persistent, but they allow one to monitor the state(s). Say the recovery master is waiting for a message from a node during recovery; the state file will indicate that.

It is interesting that you see the replay_journal on one node. That means the dlm recovery completed. That node was then able to take an exclusive lock on the superblock lock and replay the journal. The others should have followed.

Sunil

Gabriele Alberti wrote:
[quoted text trimmed; the message appears in full elsewhere in this digest]
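Since the two debugfs files above are synthetic and vanish with the mount, one way to keep a persistent record while waiting for a hang is to snapshot them periodically. A rough sketch, assuming debugfs is mounted at /sys/kernel/debug; DOMAIN and UUID are placeholders to substitute with the real domain name and filesystem UUID:

# Snapshot the o2dlm and ocfs2 state every 10 seconds into timestamped files.
while true; do
    t=$(date +%s)
    cat /sys/kernel/debug/o2dlm/DOMAIN/dlm_state > /var/tmp/dlm_state.$t
    cat /sys/kernel/debug/ocfs2/UUID/fs_state > /var/tmp/fs_state.$t
    sleep 10
done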
Re: [Ocfs2-users] Disk access hang
Were the first set of messages on all nodes? On that node at least the o2hb node-down event fired. It should have fired on all nodes. This is the dlm eviction message.

If they all fired, then look for a node to have a message that reads "Node x is the Recovery Master for the Dead Node y". That shows a node was elected to run the dlm recovery. That has to complete before the journal is replayed ("Recovering node x from slot y on device").

I did a quick scan of the patches since 2.6.28. There are a lot of them. I did not see any fixes in this area.

git log --oneline --no-merges v2.6.28..HEAD fs/ocfs2

Sunil

Gabriele Alberti wrote:

Hello, I have a weird behavior in my ocfs2 cluster. I have a few nodes accessing a shared device, and everything works fine until one node crashes for whatever reason. When this happens, the ocfs2 filesystem hangs and it seems impossible to access it until I bring down all the nodes but one. I have a (commented) log of what happened a few nights ago, when a node shut itself down because of a fan failure. In order to avoid uncontrolled re-joins to the cluster, my nodes stay off when they go down for a reason. The log is available at http://pastebin.com/gDg577hH

Is this the expected behavior? I thought that when one node fails, the rest of the world should go on working after the timeout (I used the default values for the timeouts). Here are my versions:

# modinfo ocfs2
filename:       /lib/modules/2.6.28.9/kernel/fs/ocfs2/ocfs2.ko
author:         Oracle
license:        GPL
description:    OCFS2 1.5.0
version:        1.5.0
vermagic:       2.6.28.9 SMP mod_unload modversions PENTIUM4 4KSTACKS
depends:        jbd2,ocfs2_stackglue,ocfs2_nodemanager
srcversion:     FEA8BA1FCC9D61DAAF32077

Best regards,
G.
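The three-stage check Sunil describes (eviction fired, a recovery master was elected, the journal was replayed) can be run mechanically on each node. A sketch, assuming kernel messages end up in /var/log/messages (the path and rotation scheme vary by distribution):

# Count how many times each recovery stage appears in this node's log.
for pat in 'o2dlm has evicted node' \
           'Recovery Master for the Dead Node' \
           'Recovering node .* from slot'; do
    printf '%-40s %s\n' "$pat" "$(grep -c "$pat" /var/log/messages)"
done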