[Lustre-discuss] 1.8 client loses contact to 1.6 router

2012-02-03 Thread Michael Kluge
Hi list,

we have a 1.6.7 fs running which still works nicely. One node exports this FS
(via 10GE) to another cluster that has some 1.8.5 patchless clients. At some
point (randomly, I think) these clients mark the router as down (lctl
show_route). It is always a different client, and usually a few clients each
week do this. Although we configured the clients to ping the router again
from time to time, the route never comes back on its own. On these clients I
can still ping the IP of the router, but lctl ping gives me an Input/Output
error. If I do something like:

lctl --net o2ib set_route 172.30.128.241@tcp1 down
sleep 45
lctl --net o2ib del_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib add_route 172.30.128.241@tcp1
sleep 45
lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back. Sometimes the client works again afterwards, but
sometimes the clients issue an "unexpected aliveness of peer .." message and
need a reboot.
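
For reference, the recovery sequence above could be wrapped in a small script. This is only a sketch: the net, router NID, and 45-second pause are taken from the commands in this post, and the function just prints the steps so they can be reviewed before piping the output to sh on an affected client.

```shell
#!/bin/sh
# Sketch: emit the route-recovery sequence from this post so it can be
# reviewed (and then piped to sh) on an affected client.
# NET/ROUTER come from the commands above; adjust for your site.
NET=o2ib
ROUTER=172.30.128.241@tcp1

emit_route_recovery() {
    for action in "set_route $ROUTER down" "del_route $ROUTER" \
                  "add_route $ROUTER" "set_route $ROUTER up"; do
        printf 'lctl --net %s %s\nsleep 45\n' "$NET" "$action"
    done
}

emit_route_recovery
```

Running `emit_route_recovery | sh` on an affected client would execute the same four lctl calls with the same pauses.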

I looked around and could not find a note on whether 1.8 clients and 1.6
routers will work together as expected. Does anyone have experience with this
kind of setup, or an idea for further debugging?


Regards, Michael

modprobe.d/lustre.conf on the 1.8.5 clients
--8<--
options lnet networks=tcp1(eth0)
options lnet routes=o2ib 172.30.128.241@tcp1;
options lnet dead_router_check_interval=60 router_ping_timeout=30
--8<--



-- 

Dr.-Ing. Michael Kluge

Technische Universität Dresden
Center for Information Services and
High Performance Computing (ZIH)
D-01062 Dresden
Germany

Contact:
Willersbau, Room A 208
Phone:  (+49) 351 463-34217
Fax:(+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:http://www.tu-dresden.de/zih
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] issue with the OSS

2012-02-03 Thread Prithu Tiwari
Hi All,
We have a Lustre installation where 2 OSS nodes are in HA mode. It was
found that one of them was STONITHed.
/var/log/messages showed the following errors before it was STONITHed:

=
Feb  1 10:35:57 oss5 heartbeat: [8336]: WARN: Gmain_timeout_dispatch:
Dispatch function for memory stats took too long to execute: 870 ms (> 100
ms) (GSource: 0x1e6c62a8)
Feb  1 10:36:00 oss5 kernel: LustreError:
27684:0:(filter_io_26.c:669:filter_commitrw_write()) error starting
transaction: rc = -30
Feb  1 10:36:03 oss5 kernel: LustreError:
15913:0:(filter_io_26.c:669:filter_commitrw_write()) error starting
transaction: rc = -30
Feb  1 10:36:08 oss5 kernel: LustreError:
12380:0:(filter_io_26.c:669:filter_commitrw_write()) error starting
transaction: rc = -30
[... the same filter_commitrw_write() error repeats every few seconds, from
many different PIDs, through Feb  1 10:37:50, where the excerpt is cut off ...]

Re: [Lustre-discuss] issue with the OSS

2012-02-03 Thread Johann Lombardi
On Sat, Feb 04, 2012 at 12:41:08AM +0530, Prithu Tiwari wrote:
> 27684:0:(filter_io_26.c:669:filter_commitrw_write()) error starting
> transaction: rc = -30

#define EROFS   30  /* Read-only file system */

This means that the backend filesystem has been remounted read-only. There is 
likely an error message from ldiskfs earlier in the log.
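
More generally, the rc values in these Lustre messages are negated errno codes, so they can be decoded without hunting through errno.h. A small sketch (assuming a python3 interpreter is available on the node; the helper name is my own):

```shell
# The rc values in Lustre log messages are negated errno codes.
# decode_rc turns one back into its human-readable message
# (assumes python3 is available on the node).
decode_rc() {
    python3 -c "import os; print(os.strerror(-($1)))"
}

decode_rc -30   # -> Read-only file system
```

After confirming EROFS, the trigger is usually visible just before the first failure, e.g. via `grep -i ldiskfs /var/log/messages`.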

Cheers,
Johann
-- 
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com


Re: [Lustre-discuss] Lustre 1.8.7 - Setup prototype in Research field - STUCK !

2012-02-03 Thread Cliff White
You should download from the Whamcloud download site, for a start:
http://downloads.whamcloud.com/public/lustre/
Typically, the Lustre server does nothing but run Lustre. For that reason
there is generally little risk in using our current version on the server
platforms. If your clients require a particular kernel version, you would
have to build only the lustre-client and lustre-client-modules packages.

I would recommend
http://wiki.whamcloud.com/display/PUB/Getting+started+with+Lustre
as a useful resource.
cliffw
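
To sketch what a client-only build against a specific kernel looks like (the tarball name, kernel-devel package, and source path below are illustrative assumptions, not the exact steps Cliff describes; `--disable-server` and `make rpms` are the usual 1.8-era build knobs), the function prints the steps for review rather than running them:

```shell
#!/bin/sh
# Sketch: print the client-only RPM build steps for a specific kernel.
# KVER and all paths/names are illustrative; adjust to your environment.
KVER=2.6.18-128.el5

print_client_build() {
    cat <<EOF
yum install -y kernel-devel-$KVER
tar xzf lustre-1.8.7.tar.gz
cd lustre-1.8.7
./configure --disable-server --with-linux=/usr/src/kernels/$KVER-x86_64
make rpms
EOF
}

print_client_build
```

The resulting RPMs would include lustre-client and lustre-client-modules built against that exact kernel, which is what matters for patchless clients.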


On Fri, Feb 3, 2012 at 2:30 PM, Charles Cummings ccummi...@harthosp.org wrote:

  Hello Everyone,

 Being the local crafty, busy admin for a neuroscience research branch,
 Lustre seems the only way to go; however, I'm a bit stuck and need some
 thoughtful guidance.

 My goal is to set up a virtual OS environment which is a replica of our
 direct-attached storage head node running SLES 11.0 x86_64 (kernel
 2.6.27.19-5 default #1 SMP)
 and our (2) Dell blade clusters running CentOS 5.3 x86_64 (kernel
 2.6.18-128.el5 #1 SMP),
 which I now have running as a) an MDS on SLES 11 with the same kernel,
 b) an OSS on SLES 11 with the same kernel, and c) a CentOS 5.3 x86_64
 client with the same kernel,
 and then get Lustre running across it.

 The trouble began when I was informed that the Lustre RPM kernel numbers
 MUST match the OS kernel number EXACTLY, due to modprobe errors and mount
 errors on the client,
 and some known messages on the servers after the RPM installs.

 My only direct access to Oracle Lustre downloads is through another person
 with an Oracle ID who's not very willing to help, i.e. this route is
 painful.

 So to explain why I'm stuck:

 a) access to Oracle downloads is not easy
 b) there is so much risk in altering kernels, given all the applications
 and the stability of the environment, that you could literally trash the
 server and spend days recovering, in addition to it being the main
 storage/resource for research
 c) even after looking, I can't seem to find Lustre RPMs that match my
 kernel environment specifically, i.e. SLES 11 AND CentOS 5.3
 d) I've never created RPMs for a specific kernel version, and that would
 be a deep dive into new territory and, frankly, another gamble

 Given these constraints, what's the least painful and least risky way to
 get Lustre working in this prototype, which will then lend itself to
 production (equally painlessly)? Help!
 Cliff, I could use some details on how specifically Whamcloud can fit this
 scenario. And thanks for all the enlightenment.


 thanks for your help
 Charles




-- 
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com