Re: [Lustre-discuss] LustreError: server_bulk_callback

2008-09-26 Thread Andreas Dilger
On Sep 24, 2008  17:22 -0600, Nathan Dauchy wrote:
 We have 4 OSS nodes and 2 MDS nodes configured in HA pairs, running
 2.6.18-53.1.14.el5_lustre.1.6.5smp, and using the o2ib network
 transport.  We had multiple failovers recently (possibly due to hardware
 problems, but no root cause yet) and managed to get things back again to
 what I _thought_ was a normal state.
 
 However, in the system log we are seeing many server_bulk_callback
 error messages at the rate of ~6 per second.  Interestingly, they only
 come from one HA pair of OSS nodes:
 
 Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError:
 20694:0:(events.c:361:server_bulk_callback()) event type 4, status -103,
 desc 81019fce6000
 Sep 24 23:03:14 lfs-oss-0-3 kernel: LustreError:
 20694:0:(events.c:361:server_bulk_callback()) event type 2, status -103,
 desc 81019fce6000
 Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError:
 27257:0:(events.c:361:server_bulk_callback()) event type 4, status -103,
 desc 8101b52b8000
 Sep 24 23:03:16 lfs-oss-0-2 kernel: LustreError:
 27257:0:(events.c:361:server_bulk_callback()) event type 2, status -103,
 desc 8101b52b8000
 
 Can anyone direct me to documentation to decipher these messages?
 What does server_bulk_callback do, and does status -103 indicate a
 severe problem for event types 2 and 4?

All Lustre error numbers are from /usr/include/asm/errno.h.  In this
case, -103 = -ECONNABORTED.  My guess would be some kind of networking
issue being hit by LNET, because that isn't an error used by the Lustre
filesystem itself.
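
A quick way to map a status value like that back to its errno name (the
exact header path varies by distro; on newer headers it may be
asm-generic/errno.h rather than asm/errno.h):

grep -w 103 /usr/include/asm/errno.h /usr/include/asm-generic/errno.h 2>/dev/null
# typically prints something like:
#   #define ECONNABORTED   103   /* Software caused connection abort */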

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre-ldiskfs

2008-09-26 Thread Andreas Dilger
On Sep 26, 2008  10:26 +0530, Chirag Raval wrote:
 When I am installing the
 lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
 
 I get the following error.
 
  
 
 Can someone please help me figure out what could be wrong? I am installing
 it on CentOS 4.5.
 
 # rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
 
 error: open of <HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY> failed: No such
 file or directory
 
 error: open of An failed: No such file or directory
 error: open of error failed: No such file or directory
 error: open of occurred failed: No such file or directory
 error: open of while failed: No such file or directory
 error: open of processing failed: No such file or directory
 error: open of your failed: No such file or directory
 error: open of request.p failed: No such file or directory
 error: open of Reference failed: No such file or directory
 error: open of </BODY></HTML> failed: No such file or directory

You downloaded and are trying to install a web page (which itself appears
to report that you had an error downloading the RPM).
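
A quick sanity check before installing is to see what the downloaded file
actually contains, for example:

file lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
# a real package should be reported as an RPM; a failed download will
# show up as "HTML document text" or similar
head -c 200 lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm
# shows the leading HTML if it is really an error page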

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre-ldiskfs

2008-09-26 Thread Brock Palen
I ran into this problem myself when Sun's convoluted download system
took over hosting the Lustre packages.
When I tried to 'wget' the package, I forgot that Sun makes you log in,
so you end up downloading an HTML error page in place of the RPM.

You will need to download to your local machine and then upload to the
cluster; no command-line download seemed possible.  If anyone knows how
to get around this, let me know.
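
One possible workaround (untested here, and it depends on how the Sun
download site handles sessions) is to log in once from a browser, export
the session cookies to a file, and hand them to wget or curl:

# assumes cookies.txt was exported from a browser session that is
# already logged in; the URL below is only a placeholder
wget --load-cookies cookies.txt 'https://.../lustre-ldiskfs-...rpm'
# or:
curl -L -b cookies.txt -o lustre-ldiskfs.rpm 'https://.../lustre-ldiskfs-...rpm'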

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



On Sep 26, 2008, at 6:39 AM, Andreas Dilger wrote:
 On Sep 26, 2008  10:26 +0530, Chirag Raval wrote:
 When I am installing the
 lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre.1.6.5.1smp.i686.rpm

 I get the following error.



 Can someone please help me what can be wrong I am installing it on  
 CentOS
 4.5

 # rpm -ivh lustre-ldiskfs-3.0.4-2.6.9_67.0.7.EL_lustre. 
 1.6.5.1smp.i686.rpm

 error: open of <HTML><HEAD><TITLE>Error</TITLE></HEAD><BODY> failed: No such
 file or directory

 error: open of An failed: No such file or directory
 error: open of error failed: No such file or directory
 error: open of occurred failed: No such file or directory
 error: open of while failed: No such file or directory
 error: open of processing failed: No such file or directory
 error: open of your failed: No such file or directory
 error: open of request.p failed: No such file or directory
 error: open of Reference failed: No such file or directory
 error: open of </BODY></HTML> failed: No such file or directory

 You downloaded and are trying to install a web page (which itself  
 appears
 to report that you had an error downloading the RPM).

 Cheers, Andreas
 --
 Andreas Dilger
 Sr. Staff Engineer, Lustre Group
 Sun Microsystems of Canada, Inc.

 ___
 Lustre-discuss mailing list
 Lustre-discuss@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-discuss



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


[Lustre-discuss] l_getgroups: no such user

2008-09-26 Thread Brock Palen
We are getting a bunch of:

l_getgroups: no such user ##

in our log files on the mds.
We keep our /etc/passwd and /etc/group in sync with the clusters
that mount it.  Only one visualization workstation has users who are
not in its list.

The problem is that I don't see any files owned by those users on the
filesystem:

find . -uid #

finds nothing.
Does Lustre check the user whenever someone just cd's into a directory?
Or is it for any user that logs in?
Is it safe to ignore these messages for non-cluster users?


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] l_getgroups: no such user

2008-09-26 Thread Brian J. Murrell
On Fri, 2008-09-26 at 13:37 -0400, Brock Palen wrote:
 Is it safe to ignore these messages for non cluster users?

If you don't need supplementary groups, you can just set the upcall to
NONE.  If you do need supplementary groups, then you really do need to
unify and universally distribute the passwd/group database to all of the
clients and MDSes.
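
For reference, in Lustre 1.6 the upcall is a per-MDS setting under /proc
(path quoted from memory, so verify it on your MDS before relying on it):

# on the MDS, disable the supplementary group upcall entirely:
echo NONE > /proc/fs/lustre/mds/<fsname>-MDT0000/group_upcall
# the default points at the helper producing these log messages:
cat /proc/fs/lustre/mds/<fsname>-MDT0000/group_upcall
#   /usr/sbin/l_getgroups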

b.



signature.asc
Description: This is a digitally signed message part
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] lustre-1.6.5.1 kernel panic

2008-09-26 Thread Andreas Dilger
On Sep 26, 2008  15:14 +0100, Wojciech Turek wrote:
 We had another kernel panic, this time on the MDS server. Since we use the
 Lustre-patched kernel downloaded from the Sun website, we would like to ask
 if anyone else has seen such a problem while moving from 1.6.4.3 to
 1.6.5.1 on RHEL4 x86_64.
 
 
 slab: cache size-1620 error: slabs_full accounting error
 slab: cache size-1620 error: slabs_full accounting error
 slab: cache size-1620 error: slabs_full accounting error

I've never seen these errors before - I didn't even know a size-1620
slab existed.
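
If you want to see which of the generic size-N caches actually exist on
that node, they are all listed in /proc/slabinfo, e.g.:

grep '^size-' /proc/slabinfo        # all generic caches
grep '^size-1620 ' /proc/slabinfo   # just the one from the error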

 Unable to handle kernel paging request at 303a383a303a RIP:
 801623c4{s_show+62}
 PML4 0
 Oops:  [1] SMP
 CPU 3
 Modules linked in: mds(U) fsfilt_ldiskfs(U) mgs(U) mgc(U) ldiskfs(U) 
 lustre(U) lov(U) mdc(U) lquota(U) osc(U) ko2iblnd(U) ptlrpc(U) 
 obdclass(U) lnet(U) lvfs(U) libcfs(U) sg(U) dell_rbu(U) autofs4(U) 
 i2c_nforce2(U) i2c_amd756(U) i2c_isa(U) i2c_amd8111(U) i2c_i801(U) 
 i2c_core(U) qlgc_vnic(U) iw_cxgb3(U) cxgb3(U) mlx4_ib(U) mlx4_core(U) 
 ib_mthca(U) ipmi_devintf(U) ipmi_si(U) ipmi_msghandler(U) rdma_ucm(U) 
 ib_sdp(U) rdma_cm(U) iw_cm(U) ib_addr(U) ib_ipoib(U) md5(U) ipv6(U) 
 cpufreq_powersave(U) mptctl(U) dm_mirror(U) dm_round_robin(U) 
 dm_multipath(U) dm_mod(U) sr_mod(U) usb_storage(U) joydev(U) button(U) 
 battery(U) ac(U) uhci_hcd(U) ehci_hcd(U) hw_random(U) ib_ipath(U) 
 ib_umad(U) ib_ucm(U) ib_uverbs(U) ib_cm(U) ib_sa(U) ib_mad(U) ib_core(U) 
 ata_piix(U) libata(U) ext3(U) jbd(U) tg3(U) s2io(U) qla2400(U) 
 qla2xxx(U) scsi_transport_fc(U) nfs(U) nfs_acl(U) lockd(U) sunrpc(U) 
 mptsas(U) mptscsi(U) mptbase(U) megaraid_sas(U) e1000(U) bnx2(U) 
 sd_mod(U) scsi_mod(U)
 Pid: 15733, comm: collectl Not tainted 2.6.9-67.0.7.EL_lustre.1.6.5.1smp
 RIP: 0010:[801623c4] 801623c4{s_show+62}
 RSP: 0018:010117989e68  EFLAGS: 00010006
 RAX: 80329f7a RBX: 0100cffa5580 RCX: 0100cffa5501
 RDX: 0004 RSI: 303a383a303a RDI: 0100cffa56e8
 RBP: 80329f7a R08: fffd R09: 
 R10:  R11:  R12: 
 R13: 1000 R14: 01004c636500 R15: 0024
 FS:  002a9630ee80() GS:8048e880() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2: 303a383a303a CR3: cfb24000 CR4: 06e0
 Process collectl (pid: 15733, threadinfo 010117988000, task 
 010127176030)
 Stack:   0009 0100cffa5580
01004c636500  1000 0f0d
 80196c1a
 Call Trace:80196c1a{seq_read+445} 80178c28{vfs_read+207}
80178e84{sys_read+69} 8011022a{system_call+126}
 
 
 Code: 48 8b 06 0f 18 08 48 8d 83 18 01 00 00 48 39 c6 74 2e 8b 93
 RIP 801623c4{s_show+62} RSP 010117989e68
 CR2: 303a383a303a
  <0>Kernel panic - not syncing: Oops
 
 Thanks,
 
 Wojciech
 
 Wojciech Turek wrote:
  Hi,
 
  I upgraded our test Lustre file system to the latest 1.6.5.1 version
  available from the Sun website.
  I have one OSS with one OST, and one MDS with a combined MGS and MDT.
  Both servers are running RHEL4 x86_64 and the
  2.6.9-67.0.7.EL_lustre.1.6.5.1smp kernel; the interconnect is InfiniBand
  and I am using the IB modules provided with Lustre.
  When I mount the filesystem and then start writing to it, the OSS crashes
  with a kernel panic; see the log below:
 
 
  Lustre: 0:0:(watchdog.c:130:lcw_cb()) Watchdog triggered for pid 17398: 
  it was inactive for 200s
  Lustre: 0:0:(linux-debug.c:167:libcfs_debug_dumpstack()) showing stack 
  for process 17397
  ll_ost_io_92  D 0002 0 17397  1 17398 17396 
  (L-TLB)
  0101156bf538 0046 0101956bf616 801ece0f
  ff0010776340 01010e14c6c0 00010001
 010113f90030 0012b585
  Call Trace:ll_ost_io_82  D 01012ab79400 0 17387  1 
  17388 17386 (L-TLB)
  01011c252d88 0046 a000288c 010115b213c0
 0246 0100cf851c00 01012bafa940 0002
 01010f71f030 0814
  Call Trace:a000288c{:scsi_mod:scsi_done+0} 
  801ece0f{vsnprintf+1406} 8024f658{elv_next_request+238}
 a0007df8{:scsi_mod:scsi_request_fn+1100}
 8030cc1f{__down+147}
 80133804{default_wake_function+0} 
  a067b484{:ko2iblnd:kiblnd_init_tx_msg+308}
 8030e2f6{io_schedule+38} 
  80179e24{__wait_on_buffer+125}
 80179caa{bh_wake_function+0} 
  80179caa{bh_wake_function+0}
 a07cad2b{:ldiskfs:ldiskfs_mb_init_cache+635}
 8030e73d{__down_failed+53} 
  a06c6670{:lquota:filter_quota_check+0}
 a0843acf{:obdfilter:.text.lock.filter_io_26+35}
 

[Lustre-discuss] How to change default stripe count

2008-09-26 Thread Mike Feuerstein
I suspect that tunefs.lustre is used to change the stripe count for an
existing file system from the Lustre default to some other value, but I'm
not sure if I do this at the MDT or on each OST device.

 

tunefs.lustre -fsname=[NAME] param lov.stripe_count=4 ??
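
For comparison, the forms I believe the 1.6 manual describes (parameter
and option names from memory, so please double-check; "lustre" and the
device path below are only examples):

# filesystem-wide default stripe count, run on the MGS:
lctl conf_param lustre-MDT0000.lov.stripecount=4

# or set as a permanent parameter on the MDT device:
tunefs.lustre --param lov.stripecount=4 /dev/<mdt_device>

# per-directory default, run from any client:
lfs setstripe -c 4 /mnt/lustre/somedir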

 

Mike 

 

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss