Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-14 Thread Adrian Ulrich
> The journal will prevent inconsistencies in the filesystem in case of a crash.
> It cannot prevent corruption of the on-disk data, inconsistencies caused by
> cache enabled on the disks or in a RAID controller, software bugs, memory
> corruption, bad cables, etc.

The OSS is part of a 'Snowbird' installation, so the RAID/Disk part should be 
fine.
I hope that we 'just' hit a small software bug :-/


> That is why it is still a good idea for users to run e2fsck periodically on a 
> filesystem.

Ok, we will keep this in mind (e2fsck was surprisingly fast anyway!)


Regards,
 Adrian

-- 
 RFC 1925:
   (11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.



Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-14 Thread Andreas Dilger
On 2010-08-14, at 2:28, Adrian Ulrich wrote:
>> - the on-disk structure of the object directory for this OST is corrupted.
>>  Run "e2fsck -fp /dev/{ostdev}" on the unmounted OST filesystem.
> 
> e2fsck fixed it: The OST has now been running for 40 minutes without problems:
> 
> But shouldn't the journal of ext3/ldiskfs make running e2fsck unnecessary?

The journal will prevent inconsistencies in the filesystem in case of a crash. 
It cannot prevent corruption of the on-disk data, inconsistencies caused by 
cache enabled on the disks or in a RAID controller, software bugs, memory 
corruption, bad cables, etc. 

That is why it is still a good idea for users to run e2fsck periodically on a 
filesystem. If you are using LVM, there is an lvcheck script I wrote that can 
check a filesystem snapshot on a running system; otherwise you should do it 
whenever the opportunity arises. 
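
If you want to do roughly the same thing by hand on an LVM-backed OST, the manual
equivalent is something like this (VG/LV names and the snapshot size are only
placeholders):

  oss> lvcreate -s -n ost_check -L 1G /dev/vgost/ost0005
  oss> e2fsck -fn /dev/vgost/ost_check      # read-only check of the snapshot
  oss> lvremove -f /dev/vgost/ost_check

A clean read-only check of the snapshot is a reasonable indication that the live
filesystem is fine; if it reports errors, schedule a real e2fsck of the unmounted
OST.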

Cheers, Andreas


Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-14 Thread Adrian Ulrich

> - the on-disk structure of the object directory for this OST is corrupted.
>   Run "e2fsck -fp /dev/{ostdev}" on the unmounted OST filesystem.

e2fsck fixed it: The OST has now been running for 40 minutes without problems:

e2fsck 1.41.6.sun1 (30-May-2009)
lustre1-OST0005: recovering journal
lustre1-OST0005 has been mounted 72 times without being checked, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Directory inode 440696867, block 493, offset 0: directory corrupted
Salvage? yes

Directory inode 440696853, block 517, offset 0: directory corrupted
Salvage? yes

Directory inode 440696842, block 560, offset 0: directory corrupted
Salvage? yes

Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Unattached inode 17769156
Connect to /lost+found? yes

Inode 17769156 ref count is 2, should be 1.  Fix? yes

Unattached zero-length inode 17883901.  Clear? yes

Pass 5: Checking group summary information

lustre1-OST0005: ***** FILE SYSTEM WAS MODIFIED *****
lustre1-OST0005: 44279/488382464 files (15.4% non-contiguous), 
280329314/1953524992 blocks



But shouldn't the journal of ext3/ldiskfs make running e2fsck unnecessary?


Have a nice weekend and thanks a lot for the fast reply!

Regards,
 Adrian




Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Andreas Dilger
On 2010-08-13, at 12:29, Adrian Ulrich wrote:
> Pid: 11833, comm: ll_ost_creat_00 Tainted: G  
> 2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
> :ldiskfs:ldiskfs_find_entry+0x1d4/0x5c0
> [] :ldiskfs:ldiskfs_lookup+0x53/0x290
> [] __lookup_hash+0x10b/0x130
> [] lookup_one_len+0x53/0x61
> [] :obdfilter:filter_fid2dentry+0x42d/0x730
> [] :obdfilter:filter_statfs+0x273/0x350
> [] :obdfilter:filter_parent_lock+0x20/0x220
> [] :obdfilter:filter_precreate+0x843/0x19e0
> [] :obdfilter:filter_create+0x10b9/0x15e0
> [] :ost:ost_handle+0x131d/0x5a70

Two possibilities I can see:
- The MDS sent a very large create request.  Compare the values from:
  mds> lctl get_param osc.*.prealloc_*
  oss> lctl get_param obdfilter.*.last_id

  and see if they match (concrete example below).  If last_id is growing quickly,
  the thread is busy precreating many objects for some reason.  If this OST has a
  much higher prealloc_last_id on the MDS, something is bad in the MDS lov_objids
  file.

- The on-disk structure of the object directory for this OST is corrupted.
  Run "e2fsck -fp /dev/{ostdev}" on the unmounted OST filesystem.
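
For this OST that would be roughly the following (the exact parameter names can
differ a bit between versions, so double-check them on your 1.8.1.1; the device
names here are just the ones from your logs):

  mds> lctl get_param osc.lustre1-OST0005-osc.prealloc_next_id
  mds> lctl get_param osc.lustre1-OST0005-osc.prealloc_last_id
  oss> lctl get_param obdfilter.lustre1-OST0005.last_id
  oss> sleep 10; lctl get_param obdfilter.lustre1-OST0005.last_id   # growing fast?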


Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Adrian Ulrich
Hi Alexey,

> In general a soft lockup isn't an error, just a notice that some operation took
> too long (more than 10s, I think).
> The attached soft lockup says the OST is busy creating objects after the
> MDS<->OST reconnect.

Yes, I know that a soft lockup doesn't mean that I hit a bug, but having 
ll_ost_creat_* wasting 100% CPU doesn't seem normal.

> I think you have disks that are too busy or an overloaded node.

Disk %busy is < 5% for all attached disks.
The OST is doing almost nothing (there are a few read()s, that's all).


> If you have slow disks, a client can be disconnected before its request is
> processed, and that request then blocks reconnects from that client.

The recovery of the clients seems to be OK: all clients can read/write data from
the OST, but there is something wrong between the MDS and OST0005.

But this might just be a side effect of the ll_ost_creat_* issue :-/

Regards,
 Adrian


-- 
 RFC 1925:
   (11) Every old idea will be proposed again with a different name and
a different presentation, regardless of whether it works.



Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Alexey Lyashkov
Adrian,

In general a soft lockup isn't an error, just a notice that some operation took
too long (more than 10s, I think).
The attached soft lockup says the OST is busy creating objects after the
MDS<->OST reconnect. I think you have disks that are too busy or an overloaded
node.
To avoid that soft lockup you can reduce the maximum number of precreated
objects via procfs (if my memory is correct).
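
I don't remember the exact parameter name, but on the MDS it should be something
like:

  mds> lctl set_param osc.lustre1-OST0005-osc.create_count=32

(32 is only an example value; check which create/prealloc tunables your 1.8.1.1
actually exposes before changing anything).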

Why the clients have reconnects is a different question, but it can be related
to the soft lockup.
If you have slow disks, a client can be disconnected before its request is
processed, and that request then blocks reconnects from that client.

On Aug 13, 2010, at 21:29, Adrian Ulrich wrote:
> [...]

Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Adrian Ulrich
Hi Alexey,

> llog_reader is a tool to read the configuration llog; if you want to decode a
> debug log, you should use lctl df $file > $output

Oh, sorry for mixing this up.

'lctl df' doesn't show much new stuff:

0001:0400:0:1281721514.362102:0:13008:0:(ldlm_lib.c:541:target_handle_reconnect())
 lustre1-OST0005:
 9f880a5e-2331-07a8-8611-d6e3102f466e reconnecting
0001:0400:0:1281721514.362107:0:13008:0:(ldlm_lib.c:835:target_handle_connect())
 lustre1-OST0005:
refuse reconnection from 9f880a5e-2331-07a8-8611-d6e3102f4...@10.201.48.12@o2ib 
to 0x8101c93b6000; still busy with 1 active RPCs
0001:0002:
0001:0400:7:1281721525.880767:0:11822:0:(ldlm_lib.c:541:target_handle_reconnect())
 lustre1-OST0005:
lustre1-mdtlov_UUID reconnecting


> Please post the soft-lockup report. One possibility: the MDS asks that OST to
> create too many objects, or the OST has too many reconnects.

LustreError: 12972:0:(ldlm_lib.c:1863:target_send_reply_msg()) Skipped 71 
previous similar messages
BUG: soft lockup - CPU#4 stuck for 59s! [ll_ost_creat_00:11833]
CPU 4:
Modules linked in: obdfilter(U) fsfilt_ldiskfs(U) ost(U) mgc(U) ldiskfs(U) 
crc16(U) lustre(U) lov(U) mdc(U) lquota(U) osc(U) ksocklnd(U) ko2iblnd(U)
 ptlrpc(U) obdclass(U) lnet(U) lvfs(U) libcfs(U) raid456(U) xor(U) raid1(U) 
netconsole(U) lockd(U) sunrpc(U) rdma_ucm(U) qlgc_vnic(U) ib_sdp(U) rdma_cm
(U) iw_cm(U) ib_addr(U) ib_ipoib(U) ipoib_helper(U) ib_cm(U) ib_sa(U) ipv6(U) 
xfrm_nalgo(U) crypto_api(U) ib_uverbs(U) ib_umad(U) iw_nes(U) iw_cxgb3(U) 
cxgb3(U) ib_ipath(U) ib_mthca(U) mptctl(U) dm_mirror(U) dm_multipath(U) 
scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) button(U) battery(U) 
asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) mlx4_ib(U) 
ib_mad(U) ib_core(U) joydev(U) sr_mod(U) cdrom(U) sg(U) tpm_infineon(U) 
tpm(U) tpm_bios(U) mlx4_core(U) i5000_edac(U) edac_mc(U) i2c_i801(U) pcspkr(U) 
e1000e(U) i2c_core(U) serio_raw(U) dm_raid45(U) dm_message(U) 
dm_region_hash(U) dm_log(U) dm_mod(U) dm_mem_cache(U) usb_storage(U) ahci(U) 
ata_piix(U) libata(U) mptsas(U) scsi_transport_sas(U) mptfc(U) 
scsi_transport_fc(U) mptspi(U) mptscsih(U) mptbase(U) scsi_transport_spi(U) 
shpchp(U) aacraid(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) 
ohci_hcd(U) ehci_hcd(U)
Pid: 11833, comm: ll_ost_creat_00 Tainted: G  
2.6.18-128.7.1.el5_lustre.1.8.1.1 #1
RIP: 0010:[]  [] 
:ldiskfs:ldiskfs_find_entry+0x1d4/0x5c0
RSP: 0018:8101e715d500  EFLAGS: 0202
RAX:  RBX: 0008 RCX: 0c91a9f1
RDX: 8101e8893800 RSI: 8101e715d4e8 RDI: 81010773e838
RBP: 0002 R08: 81017bd9cff8 R09: 81017bd9c000
R10: 810216dfb000 R11: 4c6578dc R12: 81017d41e6d0
R13: 80063b4c R14: 8101e715d5b8 R15: 80014fae
FS:  2ab1d8b97220() GS:81021fc74bc0() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 09c9c178 CR3: 00201000 CR4: 06e0

Call Trace:
 [] vsnprintf+0x559/0x59e
 [] cache_alloc_refill+0x106/0x186
 [] :ldiskfs:ldiskfs_lookup+0x53/0x290
 [] __lookup_hash+0x10b/0x130
 [] lookup_one_len+0x53/0x61
 [] :obdfilter:filter_fid2dentry+0x42d/0x730
 [] :obdfilter:filter_statfs+0x273/0x350
 [] __down_trylock+0x44/0x4e
 [] :obdfilter:filter_parent_lock+0x20/0x220
 [] :obdfilter:filter_precreate+0x843/0x19e0
 [] :lnet:lnet_ni_send+0x93/0xd0
 [] dput+0x23/0x10a
 [] :obdfilter:filter_create+0x10b9/0x15e0
 [] :lnet:LNetPut+0x702/0x800
 [] :ptlrpc:ptl_send_buf+0x3f3/0x5b0
 [] :ptlrpc:lustre_msg_add_version+0x34/0x110
 [] :ptlrpc:ptlrpc_send_reply+0x5c8/0x5e0
 [] :ptlrpc:lustre_pack_reply+0x29/0xb0
 [] :ost:ost_handle+0x131d/0x5a70
 [] vsnprintf+0x559/0x59e
 [] :libcfs:libcfs_debug_vmsg2+0x6e8/0x990
 [] __next_cpu+0x19/0x28
 [] find_busiest_group+0x20d/0x621
 [] :ptlrpc:lustre_msg_get_conn_cnt+0x35/0xf0
 [] enqueue_task+0x41/0x56
 [] :ptlrpc:ptlrpc_check_req+0x1d/0x110
 [] :ptlrpc:ptlrpc_server_handle_request+0xa97/0x1170
 [] lock_timer_base+0x1b/0x3c
 [] __wake_up_common+0x3e/0x68
 [] :ptlrpc:ptlrpc_main+0x1218/0x13e0
 [] default_wake_function+0x0/0xe
 [] child_rip+0xa/0x11
 [] :ptlrpc:ptlrpc_main+0x0/0x13e0
 [] child_rip+0x0/0x11


BTW: we are running 1.8.1.1 (with the RHEL kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1)

Regards,
 Adrian


Re: [Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Alexey Lyashkov

On Aug 13, 2010, at 20:49, Adrian Ulrich wrote:

> Hi,
> 
> Since a few hours we have a problem with one of our OSTs:
> 
> One (and only one) ll_ost_create_ process on one of the OSTs
> seems to go crazy and uses 100% CPU.
> 
> Rebooting the OST + MDS didn't help and there isn't much
> going on on the filesystem itself:
> 
> - /proc/fs/lustre/ost/OSS/ost_create/stats is almost 'static'
> - iostat shows almost no usage
> - ib traffic is < 100 kb/s
> 
> 
> The MDS logs this every ~3 minutes:
> Aug 13 19:11:14 mds1 kernel: LustreError: 11-0: an error occurred while 
> communicating with 10.201.62...@o2ib. The ost_connect operation failed with 
> -16
> ..and later:
> Aug 13 19:17:16 mds1 kernel: LustreError: 
> 10253:0:(osc_create.c:390:osc_create()) lustre1-OST0005-osc: oscc recovery 
> failed: -110
> Aug 13 19:17:16 mds1 kernel: LustreError: 
> 10253:0:(lov_obd.c:1129:lov_clear_orphans()) error in orphan recovery on OST 
> idx 5/32: rc = -110
> Aug 13 19:17:16 mds1 kernel: LustreError: 
> 10253:0:(mds_lov.c:1022:__mds_lov_synchronize()) lustre1-OST0005_UUID failed 
> at mds_lov_clear_orphans: -110
> Aug 13 19:17:16 mds1 kernel: LustreError: 
> 10253:0:(mds_lov.c:1031:__mds_lov_synchronize()) lustre1-OST0005_UUID sync 
> failed -110, deactivating
> Aug 13 19:17:54 mds1 kernel: Lustre: 
> 6544:0:(import.c:508:import_select_connection()) lustre1-OST0005-osc: tried 
> all connections, increasing latency to 51s
> 
-110 = -ETIMEDOUT; the operation didn't finish before the deadline, or it is a network problem.

> oops! (lustre1-OST0005 is hosted on the OSS with the crazy ll_ost_create 
> process)
ll_ost_create works on destroying old created objects, I think.



> 
> On the affected OSS we get
> Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005: 
> refuse reconnection from lustre1-mdtlov_u...@10.201.62.11@o2ib to 
> 0x8102164d0200; still busy with 2 active RPCs
> 
> 
> $ llog_reader lustre-log.1281718692.11833 shows:
llog_reader is a tool to read the configuration llog; if you want to decode a
debug log, you should use lctl df $file > $output

> 
> And we get tons of soft-cpu lockups :-/
> 
> Any ideas?
Please post the soft-lockup report. One possibility: the MDS asks that OST to
create too many objects, or the OST has too many reconnects.

> 
> 
> Regards,
> Adrian
> 
> 



[Lustre-discuss] ll_ost_creat_* goes berserk (100% cpu used - OST disabled)

2010-08-13 Thread Adrian Ulrich
Hi,

Since a few hours we have a problem with one of our OSTs:

One (and only one) ll_ost_create_ process on one of the OSTs
seems to go crazy and uses 100% CPU.

Rebooting the OST + MDS didn't help and there isn't much
going on on the filesystem itself:

 - /proc/fs/lustre/ost/OSS/ost_create/stats is almost 'static'
 - iostat shows almost no usage
 - ib traffic is < 100 kb/s
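
Checked roughly like this (give or take the exact commands):

  watch -n 5 cat /proc/fs/lustre/ost/OSS/ost_create/stats
  iostat -x 5
  top     # ll_ost_creat_00 stays pinned at 100% CPU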


The MDS logs this every ~3 minutes:
 Aug 13 19:11:14 mds1 kernel: LustreError: 11-0: an error occurred while 
communicating with 10.201.62...@o2ib. The ost_connect operation failed with -16
..and later:
 Aug 13 19:17:16 mds1 kernel: LustreError: 
10253:0:(osc_create.c:390:osc_create()) lustre1-OST0005-osc: oscc recovery 
failed: -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 
10253:0:(lov_obd.c:1129:lov_clear_orphans()) error in orphan recovery on OST 
idx 5/32: rc = -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 
10253:0:(mds_lov.c:1022:__mds_lov_synchronize()) lustre1-OST0005_UUID failed at 
mds_lov_clear_orphans: -110
 Aug 13 19:17:16 mds1 kernel: LustreError: 
10253:0:(mds_lov.c:1031:__mds_lov_synchronize()) lustre1-OST0005_UUID sync 
failed -110, deactivating
 Aug 13 19:17:54 mds1 kernel: Lustre: 
6544:0:(import.c:508:import_select_connection()) lustre1-OST0005-osc: tried all 
connections, increasing latency to 51s

oops! (lustre1-OST0005 is hosted on the OSS with the crazy ll_ost_create 
process)

On the affected OSS we get
 Lustre: 11764:0:(ldlm_lib.c:835:target_handle_connect()) lustre1-OST0005: 
refuse reconnection from lustre1-mdtlov_u...@10.201.62.11@o2ib to 
0x8102164d0200; still busy with 2 active RPCs


$ llog_reader lustre-log.1281718692.11833 shows:
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32511 of 284875 not set
Bit 0 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -32510 of 284875 not set
Bit -1 of 284875 not set
Bit 0 of 284875 not set
Segmentation fault <-- *ouch*


And we get tons of soft-cpu lockups :-/

Any ideas?


Regards,
 Adrian

