Re: [Lustre-discuss] LustreError: server_bulk_callback
On Sep 24, 2008 17:22 -0600, Nathan Dauchy wrote:
> We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running
> 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network transport.
> We had multiple failovers recently (possibly due to hardware problems, but
> no root cause yet) and managed to get things back again to what I _thought_
> was a normal state. However, in the system log we are seeing many
> server_bulk_callback error messages at the rate of ~6 per second.
> Interestingly, they only come from one HA pair of OSS nodes:
>
> Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError: 20694:0:(events.c:361:server_bulk_callback()) event type 4, status -103, desc 81019fce6000
> Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError: 20694:0:(events.c:361:server_bulk_callback()) event type 2, status -103, desc 81019fce6000
> Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError: 27257:0:(events.c:361:server_bulk_callback()) event type 4, status -103, desc 8101b52b8000
> Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError: 27257:0:(events.c:361:server_bulk_callback()) event type 2, status -103, desc 8101b52b8000
>
> Can anyone direct me to documentation to decipher these messages? What does
> server_bulk_callback do, and does status -103 indicate a severe problem for
> event types 2 and 4?

All Lustre error numbers are from /usr/include/asm/errno.h. In this case,
-103 = -ECONNABORTED. My guess would be some kind of networking issue being
hit by LNET, because that isn't an error used by the Lustre filesystem
itself.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
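[For readers hitting similar messages: the numeric status can be decoded without digging through the kernel headers. A small sketch, not Lustre-specific:]

```shell
# "status -103" is a negative errno; the symbolic names live in
# /usr/include/asm-generic/errno.h on Linux. A quick lookup that does
# not require the kernel headers to be installed:
python3 -c 'import errno, os; print(errno.errorcode[103], "-", os.strerror(103))'
# On Linux this prints: ECONNABORTED - Software caused connection abort
```

The same approach works for any negative status value seen in a LustreError line.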
Re: [Lustre-discuss] lustre-ldiskfs
On Sep 26, 2008 10:26 +0530, Chirag Raval wrote:
> When I am installing lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
> I get the following error. Can someone please tell me what could be wrong?
> I am installing it on CentOS 4.5.
>
> # rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
> error: open of HTMLHEADTITLEError/TITLE/HEADBODY failed: No such file or directory
> error: open of An failed: No such file or directory
> error: open of error failed: No such file or directory
> error: open of occurred failed: No such file or directory
> error: open of while failed: No such file or directory
> error: open of processing failed: No such file or directory
> error: open of your failed: No such file or directory
> error: open of request.p failed: No such file or directory
> error: open of Reference failed: No such file or directory
> error: open of /BODY/HTML failed: No such file or directory

You downloaded and are trying to install a web page (which itself appears to
report that you had an error downloading the RPM).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
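[A quick check for this failure mode before running rpm at all: real RPM packages begin with the 4-byte lead magic ed ab ee db, while a saved error page begins with HTML text. A sketch using a placeholder filename, simulating the bad download:]

```shell
# Simulate the failed download: the "rpm" is actually an HTML error page
# (the filename here is just a placeholder for the downloaded file).
printf '<HTML><HEAD><TITLE>Error</TITLE>...' > lustre-ldiskfs.rpm.download

# A genuine RPM starts with the 4-byte lead magic "ed ab ee db";
# inspect what was actually fetched before running rpm -ivh:
head -c 4 lustre-ldiskfs.rpm.download | od -An -tx1
# shows 3c 48 54 4d ("<HTM") instead of ed ab ee db - not an RPM.
```

Where available, `file <pkg>.rpm` does the same job, reporting "HTML document text" instead of an RPM package type.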
Re: [Lustre-discuss] lustre-ldiskfs
I ran into this problem myself when Sun's convoluted download system took
over hosting the Lustre packages. When I tried to 'wget' the package, I
forgot that Sun makes you log in, and thus you download an HTML error page
in place of the RPM. You will need to download to your local machine and
then upload to the cluster; no command-line download was possible. If anyone
knows how to get around this, let me know.

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985

On Sep 26, 2008, at 6:39 AM, Andreas Dilger wrote:
> On Sep 26, 2008 10:26 +0530, Chirag Raval wrote:
>> When I am installing lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
>> I get the following error. Can someone please help me what can be wrong?
>> I am installing it on CentOS 4.5.
>>
>> # rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
>> error: open of HTMLHEADTITLEError/TITLE/HEADBODY failed: No such file or directory
>> [...]
>
> You downloaded and are trying to install a web page (which itself appears
> to report that you had an error downloading the RPM).
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
[Lustre-discuss] l_getgroups: no such user
We are getting a bunch of:

  l_getgroups: no such user ##

in our log files on the MDS. We keep our /etc/passwd and /etc/group in sync
with the clusters that mount it. Only one visualization workstation has
users who are not in its list. The problem is I don't see any files owned by
those users on the filesystem:

  find . -uid #

finds nothing. Does Lustre check if a user just cd's to that directory? Or
is it for any user that logs in? Is it safe to ignore these messages for
non-cluster users?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985
Re: [Lustre-discuss] l_getgroups: no such user
On Fri, 2008-09-26 at 13:37 -0400, Brock Palen wrote:
> Is it safe to ignore these messages for non-cluster users?

If you don't need supplementary groups, you can just set the upcall to NONE.
If you do need supplementary groups, then you really do need to unify and
universally distribute the passwd/group database to all of the clients and
MDSes.

b.
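[A sketch of both options as I recall the Lustre 1.6 interface; the proc path and the filesystem name "testfs" are assumptions to verify against your own MDS:]

```shell
# Option 1: no supplementary groups needed - disable the upcall on the MDS.
# (Path and default value are from the 1.6 proc interface; check yours.)
cat /proc/fs/lustre/mds/testfs-MDT0000/group_upcall   # default: /usr/sbin/l_getgroups
echo NONE > /proc/fs/lustre/mds/testfs-MDT0000/group_upcall

# Option 2: supplementary groups needed - keep the upcall, and make sure
# every client and MDS resolves the same users, e.g. spot-check a user:
getent passwd someuser   # should return an identical entry on all nodes
```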
Re: [Lustre-discuss] lustre-1.6.5.1 kernel panic
On Sep 26, 2008 15:14 +0100, Wojciech Turek wrote:
> We had another kernel panic, this time on the MDS server. Since we use the
> Lustre-patched kernel downloaded from the Sun website, we would like to ask
> if anyone else has seen such a problem while moving from 1.6.4.3 to 1.6.5.1
> on RHEL4 x86_64:
>
> slab: cache size-1620 error: slabs_full accounting error
> slab: cache size-1620 error: slabs_full accounting error
> slab: cache size-1620 error: slabs_full accounting error

I've never seen these errors before - I didn't even know a size-1620 slab
existed.

> Unable to handle kernel paging request at 303a383a303a
> RIP: 801623c4{s_show+62}
> PML4 0
> Oops: [1] SMP
> CPU 3
> Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U)
> lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) obdclass(U)
> lnet(U) lvfs(U) libcfs(U) sg(U) dell_rbu(U) autofs4(U) i2c_nforce2(U)
> i2c_amd756(U) i2c_isa(U) i2c_amd8111(U) i2c_i801(U) i2c_core(U)
> qlgc_vnic(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_core(U) ib_mthca(U)
> ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) rdma_ucm(U) ib_sdp(U)
> rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) md5(U) ipv6(U)
> cpufreq_powersave(U) mptctl(U) dm_mirror(U) dm_round_robin(U)
> dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U) joydev(U) button(U)
> battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) ib_ipath(U)
> ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U)
> ata_piix(U) libata(U) ext3(U) jbd(U) tg3(U) s2io(U) qla2400(U) qla2xxx(U)
> scsi_transport_fc(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) mptsas(U)
> mptscsi(U) mptbase(U) megaraid_sas(U) e1000(U) bnx2(U) sd_mod(U)
> scsi_mod(U)
> Pid: 15733, comm: collectl Not tainted 2.6.9-67.0.7.EL_lustre.1.6.5.1smp
> RIP: 0010:[801623c4] 801623c4{s_show+62}
> RSP: 0018:010117989e68 EFLAGS: 00010006
> RAX: 80329f7a RBX: 0100cffa5580 RCX: 0100cffa5501
> RDX: 0004 RSI: 303a383a303a RDI: 0100cffa56e8
> RBP: 80329f7a R08: fffd R09: R10: R11:
> R12: R13: 1000 R14: 01004c636500 R15: 0024
> FS: 002a9630ee80() GS:8048e880() knlGS:
> CS: 0010 DS: ES:
> CR0: 80050033 CR2: 303a383a303a CR3: cfb24000 CR4: 06e0
> Process collectl (pid: 15733, threadinfo 010117988000, task 010127176030)
> Stack: 0009 0100cffa5580 01004c636500 1000 0f0d 80196c1a
> Call Trace: 80196c1a{seq_read+445} 80178c28{vfs_read+207}
> 80178e84{sys_read+69} 8011022a{system_call+126}
> Code: 48 8b 06 0f 18 08 48 8d 83 18 01 00 00 48 39 c6 74 2e 8b 93
> RIP 801623c4{s_show+62} RSP 010117989e68
> CR2: 303a383a303a
> 0Kernel panic - not syncing: Oops
>
> Thanks,
> Wojciech
>
> Wojciech Turek wrote:
>> Hi,
>> I upgraded our test Lustre file system to the latest 1.6.5.1 version
>> available from the Sun website. I have one OSS with one OST, and one MDS
>> with combined MGS and MDT. Both servers are running RHEL4 x86_64 and the
>> 2.6.9-67.0.7.EL_lustre.1.6.5.1smp kernel; the interconnect is InfiniBand
>> and I am using the ib modules provided with Lustre. When I mount the
>> filesystem and then start writing to it, the OSS crashes with a kernel
>> panic, see log below:
>>
>> Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 17398: it was inactive for 200s
>> Lustre: 0:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack for process 17397
>> ll_ost_io_92 D 0002 0 17397 1 17398 17396 (L-TLB)
>> 0101156bf538 0046 0101956bf616 801ece0f ff0010776340
>> 01010e14c6c0 00010001 010113f90030 0012b585
>> Call Trace: ll_ost_io_82 D 01012ab79400 0 17387 1 17388 17386 (L-TLB)
>> 01011c252d88 0046 a000288c 010115b213c0 0246
>> 0100cf851c00 01012bafa940 0002 01010f71f030 0814
>> Call Trace: a000288c{:scsi_mod:scsi_done+0} 801ece0f{vsnprintf+1406}
>> 8024f658{elv_next_request+238} a0007df8{:scsi_mod:scsi_request_fn+1100}
>> 8030cc1f{__down+147} 80133804{default_wake_function+0}
>> a067b484{:ko2iblnd:kiblnd_init_tx_msg+308} 8030e2f6{io_schedule+38}
>> 80179e24{__wait_on_buffer+125} 80179caa{bh_wake_function+0}
>> 80179caa{bh_wake_function+0} a07cad2b{:ldiskfs:ldiskfs_mb_init_cache+635}
>> 8030e73d{__down_failed+53} a06c6670{:lquota:filter_quota_check+0}
>> a0843acf{:obdfilter:.text.lock.filter_io_26+35}
[Lustre-discuss] How to change default stripe count
I suspect that tunefs.lustre is used to change the stripe count for an
existing file system from the Lustre default to some other value, but I'm
not sure whether I do this on the MDT or on each OST device.

  tunefs.lustre -fsname=[NAME] param lov.stripe_count=4 ??

Mike
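[For what it's worth, my recollection from the 1.6 operations manual (worth verifying against your version) is that the default stripe count is a filesystem-wide parameter applied on the MDT, set either live from the MGS node with lctl conf_param or offline on the unmounted MDT with tunefs.lustre. Note the parameter name has no underscore. "testfs" and "/dev/sdb" below are placeholders:]

```shell
# While the filesystem is running, on the MGS node:
lctl conf_param testfs-MDT0000.lov.stripecount=4

# Or offline, run against the unmounted MDT device:
tunefs.lustre --param="lov.stripecount=4" /dev/sdb
```

Either way the setting lives with the MDT, not with the individual OSTs.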