Re: [Lustre-discuss] HA problem with Lustre 2.2

2013-04-01 Thread Adrian Ulrich

> Mar 28 16:05:56 mds1 kernel: LustreError: 11-0: an error occurred while
> communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
> -19

I suppose that's the NID of a 'dead' OST?

How was the filesystem formatted? Did you specify --mgsnode and --failnode?
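
One quick way to see which NIDs the MDS is actually trying for a target is to pull them out of the messages file. Below is a rough sketch in Python, assuming the log line shapes quoted in the original post; the function names and regular expressions are my own and only cover those two message types. The sample lines are the ones from the original post, with the mail line wrapping undone.

```python
import re
from collections import defaultdict

# "o8->lustre-OST0004-osc-MDT@10.70.10.43@o2ib:28/4" -> (target, NID)
RPC_TARGET = re.compile(r"o\d+->(\S+?)@(\d+(?:\.\d+){3}@\w+):")
# "communicating with 192.168.1.44@o2ib." -> NID
PEER_ERROR = re.compile(r"communicating with (\d+(?:\.\d+){3}@\w+)")


def nids_by_target(lines):
    """Map each import name to the sorted NIDs its expired requests were sent to."""
    tried = defaultdict(set)
    for line in lines:
        m = RPC_TARGET.search(line)
        if m:
            tried[m.group(1)].add(m.group(2))
    return {target: sorted(nids) for target, nids in tried.items()}


def error_peers(lines):
    """Return the sorted NIDs named in 'error occurred while communicating' lines."""
    return sorted({m.group(1) for line in lines for m in PEER_ERROR.finditer(line)})


SAMPLE = [
    "Mar 28 16:03:01 mds1 kernel: Lustre: "
    "3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has "
    "failed due to network error: [sent 1364466781/real 1364466781] "
    "req@880c58e6d000 x1430230346733886/t0(0) "
    "o8->lustre-OST0004-osc-MDT@10.70.10.43@o2ib:28/4 lens 400/544 e 0 to 1 "
    "dl 1364466787 ref 1 fl Rpc:XN/0/ rc 0/-1",
    "Mar 28 16:04:16 mds1 kernel: LustreError: 11-0: an error occurred while "
    "communicating with 192.168.1.44@o2ib. The ost_connect operation failed "
    "with -19",
    "Mar 28 16:07:27 mds1 kernel: Lustre: "
    "3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has "
    "timed out for sent delay: [sent 1364467031/real 0] "
    "req@880c596f2800 x1430230346733966/t0(0) "
    "o8->lustre-OST0004-osc-MDT@10.70.10.45@o2ib:28/4 lens 400/544 e 0 to 1 "
    "dl 1364467047 ref 2 fl Rpc:XN/0/ rc 0/-1",
]

if __name__ == "__main__":
    print(nids_by_target(SAMPLE))
    print(error_peers(SAMPLE))
```

On the quoted lines, the expired requests for lustre-OST0004 name 10.70.10.43@o2ib and 10.70.10.45@o2ib, while the error lines name 192.168.1.44@o2ib, so it is worth comparing those against the NIDs the OST was formatted with.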

-- 
 RFC 1925:
   (11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] HA problem with Lustre 2.2

2013-04-01 Thread Dilip Sathaye
Dear All,

We have a Lustre 2.3 setup on CentOS 6.3, with corosync for HA. Basic
Lustre operation works. While testing HA, we manually migrated some OSTs or
manually put an OSS into standby mode. The OSTs migrate and get mounted
on the other server properly, and crm_mon shows the relocated resources. But
the Lustre filesystem hangs at the client end, and "lfs df" hangs as well.
Some of the messages we see in the messages file on the MDS are given below.
We need help with this.

---
Mar 28 16:01:30 mds1 lrmd: [5405]: info: rsc:mdt:11: monitor
Mar 28 16:02:36 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The obd_ping operation failed with
-107
Mar 28 16:02:36 mds1 kernel: LustreError: Skipped 3 previous similar
messages
Mar 28 16:02:36 mds1 kernel: Lustre: lustre-OST0004-osc-MDT: Connection
to lustre-OST0004 (at 192.168.1.44@o2ib) was lost; in progress operations
using this service will wait for recovery to complete
Mar 28 16:02:41 mds1 kernel: LustreError:
5983:0:(mgc_request.c:1375:mgc_apply_recover_logs()) mgc: cannot find uuid
by nid 192.168.1.43@o2ib
Mar 28 16:02:41 mds1 kernel: Lustre:
5983:0:(mgc_request.c:1534:mgc_process_recover_log()) Process recover log
lustre-mdtir error -2
Mar 28 16:03:01 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has
failed due to network error: [sent 1364466781/real 1364466781]
req@880c58e6d000 x1430230346733886/t0(0)
o8->lustre-OST0004-osc-MDT@10.70.10.43@o2ib:28/4 lens 400/544 e 0 to 1
dl 1364466787 ref 1 fl Rpc:XN/0/ rc 0/-1
Mar 28 16:03:01 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 14 previous
similar messages
Mar 28 16:04:16 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
-19
Mar 28 16:04:16 mds1 kernel: LustreError: Skipped 1 previous similar message
Mar 28 16:04:41 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has
failed due to network error: [sent 1364466881/real 1364466881]
req@880c58c70800 x1430230346733918/t0(0)
o8->lustre-OST0004-osc-MDT@10.70.10.43@o2ib:28/4 lens 400/544 e 0 to 1
dl 1364466892 ref 1 fl Rpc:XN/0/ rc 0/-1
Mar 28 16:04:41 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 2 previous
similar messages
Mar 28 16:05:32 mds1 cib[5403]: info: cib_stats: Processed 14
operations (714.00us average, 0% utilization) in the last 10min
Mar 28 16:05:56 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
-19
Mar 28 16:07:27 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has
timed out for sent delay: [sent 1364467031/real 0]
req@880c596f2800 x1430230346733966/t0(0)
o8->lustre-OST0004-osc-MDT@10.70.10.45@o2ib:28/4
lens 400/544 e 0 to 1 dl 1364467047 ref 2 fl Rpc:XN/0/ rc 0/-1
Mar 28 16:07:27 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) Skipped 4 previous
similar messages
Mar 28 16:07:36 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
-19
Mar 28 16:07:46 mds1 lrmd: [5405]: info: rsc:ost6:17: monitor
Mar 28 16:09:16 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
-19
Mar 28 16:11:21 mds1 kernel: LustreError: 11-0: an error occurred while
communicating with 192.168.1.44@o2ib. The ost_connect operation failed with
-19
Mar 28 16:11:46 mds1 kernel: Lustre:
3050:0:(client.c:1917:ptlrpc_expire_one_request()) @@@ Request  sent has
failed due to network error: [sent 1364467306/real 1364467306]
req@880c58dc9000 x1430230346734053/t0(0)
o8->lustre-OST0004-osc-MDT@10.70.10.43@o2ib:28/4 lens 400/544 e 0 to 1
dl 1364467337 ref 1 fl Rpc:XN/0/ rc 0/-1
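
For anyone reading the log above: the negative numbers at the end of the LustreError lines are kernel errno values. Here is a minimal sketch in Python for decoding them, assuming standard Linux errno numbering. The table holds only the three codes that appear in this log, and the function name and regular expression are illustrative, not part of any Lustre tool.

```python
import re

# Linux errno values seen in the log above (hardcoded so the result does
# not depend on the platform the script runs on).
LINUX_ERRNO = {
    2: ("ENOENT", "No such file or directory"),
    19: ("ENODEV", "No such device"),
    107: ("ENOTCONN", "Transport endpoint is not connected"),
}

# Matches "... operation failed with -19" and "... error -2".
ERRNO_IN_LOG = re.compile(r"(?:failed with|error)\s+-(\d+)")


def decode(line):
    """Return (code, name, description) for the errno in a log line, or None."""
    m = ERRNO_IN_LOG.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return code, name, text


if __name__ == "__main__":
    samples = [
        "The obd_ping operation failed with -107",
        "The ost_connect operation failed with -19",
        "Process recover log lustre-mdtir error -2",
    ]
    for s in samples:
        print(s, "->", decode(s))
```

Read this way, the obd_ping failure is ENOTCONN, the repeated ost_connect failures are ENODEV, and the recover log error is ENOENT.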


Thanks and Regards
Dilip