[Lustre-discuss] 1.8 client loses contact to 1.6 router
Hi list,

we have a 1.6.7 FS running which still works nicely. One node exports this FS (via 10GE) to another cluster that has some 1.8.5 patchless clients. These clients at some point (randomly, I think) mark the router as down (lctl show_route). It is always a different client, and usually a few clients each week do this. Although we configured the clients to ping the router again from time to time, the route never comes back. On these clients I can still ping the IP of the router, but lctl ping gives me an Input/Output error. If I do something like:

  lctl --net o2ib set_route 172.30.128.241@tcp1 down
  sleep 45
  lctl --net o2ib del_route 172.30.128.241@tcp1
  sleep 45
  lctl --net o2ib add_route 172.30.128.241@tcp1
  sleep 45
  lctl --net o2ib set_route 172.30.128.241@tcp1 up

the route comes back. Sometimes the client works again, but sometimes the clients report an "unexpected aliveness of peer" and need a reboot. I looked around and could not find a note on whether 1.8 clients and 1.6 routers will work together as expected. Has anyone experience with this kind of setup, or an idea for further debugging?

Regards, Michael

modprobe.d/lustre.conf on the 1.8.5 clients:
--8<--
options lnet networks=tcp1(eth0)
options lnet routes=o2ib 172.30.128.241@tcp1;
options lnet dead_router_check_interval=60 router_ping_timeout=30
--8<--

--
Dr.-Ing. Michael Kluge
Technische Universität Dresden
Center for Information Services and High Performance Computing (ZIH)
D-01062 Dresden, Germany

Contact: Willersbau, Room A 208
Phone: (+49) 351 463-34217
Fax:   (+49) 351 463-37773
e-mail: michael.kl...@tu-dresden.de
WWW:   http://www.tu-dresden.de/zih

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
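[Editor's note: the manual recovery sequence Michael describes can be wrapped in a small shell function so it is easy to run on an affected client. This is only a sketch of the commands from the post above; the network name and gateway NID are arguments, and lctl must be on the PATH.]

```shell
#!/bin/sh
# Sketch of the manual route-recovery sequence from the post above:
# take the route down, delete it, re-add it, and mark it up again,
# pausing between steps. Usage:
#   reset_route o2ib 172.30.128.241@tcp1
reset_route() {
    net=$1   # remote LNET network, e.g. o2ib
    gw=$2    # gateway NID, e.g. 172.30.128.241@tcp1
    lctl --net "$net" set_route "$gw" down
    sleep 45
    lctl --net "$net" del_route "$gw"
    sleep 45
    lctl --net "$net" add_route "$gw"
    sleep 45
    lctl --net "$net" set_route "$gw" up
    # show the resulting route state
    lctl show_route
}
```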
[Lustre-discuss] issue with the OSS
Hi All,

We have a Lustre installation where 2 OSS nodes are in HA mode. It was found that one was STONITHed. /var/log/messages showed the following errors before it was STONITHed:

Feb 1 10:35:57 oss5 heartbeat: [8336]: WARN: Gmain_timeout_dispatch: Dispatch function for memory stats took too long to execute: 870 ms (> 100 ms) (GSource: 0x1e6c62a8)
Feb 1 10:36:00 oss5 kernel: LustreError: 27684:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:03 oss5 kernel: LustreError: 15913:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:08 oss5 kernel: LustreError: 12380:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:09 oss5 kernel: LustreError: 12261:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:10 oss5 kernel: LustreError: 9713:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:12 oss5 kernel: LustreError: 4114:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:13 oss5 kernel: LustreError: 4092:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:15 oss5 kernel: LustreError: 12398:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:17 oss5 kernel: LustreError: 12283:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:18 oss5 kernel: LustreError: 12325:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:19 oss5 kernel: LustreError: 9752:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:19 oss5 kernel: LustreError: 23057:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:22 oss5 kernel: LustreError: 12428:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:22 oss5 kernel: LustreError: 9679:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:27 oss5 kernel: LustreError: 9686:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:28 oss5 kernel: LustreError: 12385:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:33 oss5 kernel: LustreError: 27687:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:35 oss5 kernel: LustreError: 12264:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:40 oss5 kernel: LustreError: 9784:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:43 oss5 kernel: LustreError: 23117:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:48 oss5 kernel: LustreError: 12265:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:52 oss5 kernel: LustreError: 4103:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:36:57 oss5 kernel: LustreError: 12415:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:02 oss5 kernel: LustreError: 23132:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:05 oss5 kernel: LustreError: 23100:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:07 oss5 kernel: LustreError: 9714:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:07 oss5 kernel: LustreError: 12429:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:07 oss5 kernel: LustreError: 4090:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:10 oss5 kernel: LustreError: 9773:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:14 oss5 kernel: LustreError: 9781:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:19 oss5 kernel: LustreError: 9752:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:25 oss5 kernel: LustreError: 23082:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:32 oss5 kernel: LustreError: 15927:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:41 oss5 kernel: LustreError: 9761:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30
Feb 1 10:37:50 oss5 kernel: LustreError:
Re: [Lustre-discuss] issue with the OSS
On Sat, Feb 04, 2012 at 12:41:08AM +0530, Prithu Tiwari wrote:
> 27684:0:(filter_io_26.c:669:filter_commitrw_write()) error starting transaction: rc = -30

#define EROFS 30 /* Read-only file system */

This means that the backend filesystem has been remounted read-only. There is likely an error message from ldiskfs earlier in the log.

Cheers,
Johann

--
Johann Lombardi
Whamcloud, Inc.
www.whamcloud.com
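[Editor's note: Johann's errno lookup can be reproduced without digging through kernel headers. Lustre returns negative errno values, so rc = -30 corresponds to errno 30; Python's errno module maps the number to its symbolic name and message. A minimal sketch:]

```shell
# Decode a Lustre rc value: Lustre returns negative errno codes, so
# rc = -30 is errno 30. Python's errno/os modules translate the number.
python3 -c 'import errno, os; print(errno.errorcode[30], "-", os.strerror(30))'
# -> EROFS - Read-only file system (on Linux)
```

To find the ldiskfs error Johann mentions, search the same log for earlier messages, e.g. grep -i ldiskfs /var/log/messages.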
Re: [Lustre-discuss] Lustre 1.8.7 - Setup prototype in Research field - STUCK !
You should download from the Whamcloud download site, for a start:
http://downloads.whamcloud.com/public/lustre/

Typically, the Lustre server does nothing but run Lustre. For that reason there is generally little risk in using our current version on the server platforms. If your clients require a particular kernel version, you would have to build the lustre-client and lustre-client-modules packages only. I would recommend http://wiki.whamcloud.com/display/PUB/Getting+started+with+Lustre as a useful resource.

cliffw

On Fri, Feb 3, 2012 at 2:30 PM, Charles Cummings ccummi...@harthosp.org wrote:

> Hello Everyone,
>
> Being the local crafty busy admin for a neuroscience research branch, Lustre seems the only way to go; however, I'm a bit stuck and need some thoughtful guidance.
>
> My goal is to set up a virtual OS environment which is a replica of our direct-attached storage head node running SLES 11.0 x86 64 (kernel 2.6.27.19-5 default #1 SMP) and our (2) Dell blade clusters running CentOS 5.3 x86 64 (kernel 2.6.18-128.el5 #1 SMP), which I now have running as a) a SLES 11 same-kernel MDS, b) a SLES 11 same-kernel OSS and c) a CentOS 5.3 x86 64 same-kernel client, and then get Lustre running across it.
>
> The trouble began when I was informed that the Lustre RPM kernel numbers MUST match the OS kernel number EXACTLY, due to modprobe errors and mount errors on the client, and some known messages on the servers after the RPM installs. My only direct access to Oracle Lustre downloads is through another person with an Oracle ID who's not very willing to help, i.e. this route is painful.
>
> So to explain why I'm stuck:
> a) access to Oracle downloads is not easy
> b) there is so much risk in altering kernels, given all the applications and the stability of the environment; you could literally trash the server and spend days recovering, in addition to it being the main storage / resource for research
> c) I can't seem to find, after looking, Lustre RPMs that match my kernel environment specifically, i.e. the SLES 11 AND CentOS 5.3 kernels
> d) I've never created RPMs for a specific kernel version, and that would be a deep dive into new territory and frankly another gamble
>
> What's the least painful and least risky way to get Lustre working in this prototype, which will then lend itself to production (equally least painful), given these statements? Help!
>
> Cliff, I could use some details on how specifically Whamcloud can fit this scenario, and thanks for all the enlightenment.
>
> thanks for your help
> Charles

--
cliffw
Support Guy
WhamCloud, Inc.
www.whamcloud.com
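[Editor's note: for point (d), building client-only RPMs against an exact kernel is less of a deep dive than it sounds. A hedged sketch follows; the configure flags shown were the usual ones for Lustre 1.8-era source trees, and the kernel-source path is a placeholder that must point at the prepared source tree for the exact running kernel (e.g. 2.6.18-128.el5 on the CentOS clients):]

```shell
#!/bin/sh
# Sketch: build client-only Lustre RPMs from an unpacked source tree,
# matched to a specific kernel. Run from the top of the Lustre source.
# The default KERNEL_SRC path is an assumption; adjust for your distro.
build_client_rpms() {
    KERNEL_SRC=${1:-/usr/src/kernels/$(uname -r)}
    ./configure --disable-server --with-linux="$KERNEL_SRC" &&
    make rpms   # emits lustre-client / lustre-client-modules packages
}
```

The point is that only the client packages need to match the client kernel; the servers can run a stock supported kernel with prebuilt server RPMs.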