Re: [Lustre-discuss] client randomly evicted
Robin Humble wrote:
> On Thu, May 15, 2008 at 08:23:20AM -0400, Aaron Knister wrote:
>> Ah! That would make a lot of sense. echoing 0 to statahead_count doesn't
>> really do anything other than hang my session. Thanks!
>
> I think the hang echo'ing into /proc is another bug, but yeah, deal
> with the big ones first :-)
>
Ah! A little issue with that; will fix it soon.

Regards!
--
Fan Yong

> cheers,
> robin
>
>> -Aaron
>>
>> On May 15, 2008, at 4:36 AM, Yong Fan wrote:
>>
>>> Robin Humble wrote:
>>>> On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>>>>> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>>>>> Some more information that might be helpful. There is a particular code
>>>>>> that one of our users runs. Personally after the trouble this code has
>>>>>> caused us we'd like to hand him a calculator and disable his accounts but
>>>>>> sadly that's not an option. Since the time of the hang, there is what seems
>>>>>> to be one process associated with lustre that is running as the userid of
>>>>>> the problem user - "ll_sa_15530". A trace of this process in its current
>>>>>> state shows this -
>>>>>>
>>>>>> Is this a problem with the lustre readahead code? If so would this fix it?
>>>>>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>>>>>>
>>>>> Yes, this appears to be a statahead problem. There were fixes added to
>>>>> 1.6.5 that should resolve the problems seen with statahead. In the meantime
>>>>> I'd recommend disabling it as you suggest above.
>>>>>
>>>> we're seeing the same problem.
>>>>
>>>> I think the workaround should be:
>>>>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
>>>> ??
>>>>
>>>> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--
>>>>
>>> Sure.
>>> "/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
>>> "/proc/fs/lustre/llite/*/statahead_max" is the switch to
>>> enable/disable directory statahead.
>>>
>>> Regards!
>>> --
>>> Fan Yong
>>>
>>>> cheers,
>>>> robin
>>>>
>>>> ps. sorry I've been too busy this week to look at the llite_lloop stuff.
>>
>> Aaron Knister
>> Systems Administrator
>> Center for Research on Environment and Water
>>
>> (301) 595-7000
>> [EMAIL PROTECTED]
Re: [Lustre-discuss] client randomly evicted
On Thu, May 15, 2008 at 08:23:20AM -0400, Aaron Knister wrote:
> Ah! That would make a lot of sense. echoing 0 to statahead_count doesn't
> really do anything other than hang my session. Thanks!

I think the hang echo'ing into /proc is another bug, but yeah, deal
with the big ones first :-)

cheers,
robin

> -Aaron
>
> On May 15, 2008, at 4:36 AM, Yong Fan wrote:
>
>> Robin Humble wrote:
>>> On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>>>> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>>>> Some more information that might be helpful. There is a particular code
>>>>> that one of our users runs. Personally after the trouble this code has
>>>>> caused us we'd like to hand him a calculator and disable his accounts but
>>>>> sadly that's not an option. Since the time of the hang, there is what seems
>>>>> to be one process associated with lustre that is running as the userid of
>>>>> the problem user - "ll_sa_15530". A trace of this process in its current
>>>>> state shows this -
>>>>>
>>>>> Is this a problem with the lustre readahead code? If so would this fix it?
>>>>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>>>>>
>>>> Yes, this appears to be a statahead problem. There were fixes added to
>>>> 1.6.5 that should resolve the problems seen with statahead. In the meantime
>>>> I'd recommend disabling it as you suggest above.
>>>>
>>> we're seeing the same problem.
>>>
>>> I think the workaround should be:
>>>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
>>> ??
>>>
>>> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--
>>>
>> Sure.
>> "/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
>> "/proc/fs/lustre/llite/*/statahead_max" is the switch to
>> enable/disable directory statahead.
>>
>> Regards!
>> --
>> Fan Yong
>>
>>> cheers,
>>> robin
>>>
>>> ps. sorry I've been too busy this week to look at the llite_lloop stuff.
>
> Aaron Knister
> Systems Administrator
> Center for Research on Environment and Water
>
> (301) 595-7000
> [EMAIL PROTECTED]
Re: [Lustre-discuss] client randomly evicted
Ah! That would make a lot of sense. echoing 0 to statahead_count doesn't
really do anything other than hang my session. Thanks!

-Aaron

On May 15, 2008, at 4:36 AM, Yong Fan wrote:

> Robin Humble wrote:
>> On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>>> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>>> Some more information that might be helpful. There is a particular code
>>>> that one of our users runs. Personally after the trouble this code has
>>>> caused us we'd like to hand him a calculator and disable his accounts but
>>>> sadly that's not an option. Since the time of the hang, there is what seems
>>>> to be one process associated with lustre that is running as the userid of
>>>> the problem user - "ll_sa_15530". A trace of this process in its current
>>>> state shows this -
>>>>
>>>> Is this a problem with the lustre readahead code? If so would this fix it?
>>>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>>>>
>>> Yes, this appears to be a statahead problem. There were fixes added to
>>> 1.6.5 that should resolve the problems seen with statahead. In the meantime
>>> I'd recommend disabling it as you suggest above.
>>>
>> we're seeing the same problem.
>>
>> I think the workaround should be:
>>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
>> ??
>>
>> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--
>>
> Sure.
> "/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
> "/proc/fs/lustre/llite/*/statahead_max" is the switch to
> enable/disable directory statahead.
>
> Regards!
> --
> Fan Yong
>
>> cheers,
>> robin
>>
>> ps. sorry I've been too busy this week to look at the llite_lloop stuff.

Aaron Knister
Systems Administrator
Center for Research on Environment and Water

(301) 595-7000
[EMAIL PROTECTED]
Re: [Lustre-discuss] client randomly evicted
Robin Humble wrote:
> On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
>
>> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>>
>>> Some more information that might be helpful. There is a particular code
>>> that one of our users runs. Personally after the trouble this code has
>>> caused us we'd like to hand him a calculator and disable his accounts but
>>> sadly that's not an option. Since the time of the hang, there is what seems
>>> to be one process associated with lustre that is running as the userid of
>>> the problem user - "ll_sa_15530". A trace of this process in its current
>>> state shows this -
>>>
>>> Is this a problem with the lustre readahead code? If so would this fix it?
>>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>>>
>> Yes, this appears to be a statahead problem. There were fixes added to
>> 1.6.5 that should resolve the problems seen with statahead. In the meantime
>> I'd recommend disabling it as you suggest above.
>>
> we're seeing the same problem.
>
> I think the workaround should be:
>   echo 0 > /proc/fs/lustre/llite/*/statahead_max
> ??
>
> /proc/fs/lustre/llite/*/statahead_count is -r--r--r--
>
Sure.
"/proc/fs/lustre/llite/*/statahead_count" is a statistics variable.
"/proc/fs/lustre/llite/*/statahead_max" is the switch to
enable/disable directory statahead.

Regards!
--
Fan Yong

> cheers,
> robin
>
> ps. sorry I've been too busy this week to look at the llite_lloop stuff.
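For anyone who wants to script this workaround: a minimal sketch, assuming a 1.6.x client with root access. The loop shape and the re-enable value of 32 are assumptions on my part (32 is only a commonly cited default), not something confirmed in this thread:

  # show the current values; statahead_count is read-only statistics,
  # statahead_max is the actual on/off switch
  grep . /proc/fs/lustre/llite/*/statahead_max /proc/fs/lustre/llite/*/statahead_count

  # disable directory statahead for every Lustre mount on this client
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 0 > "$f"
  done

  # re-enable later by writing back the value noted above
  # (32 is assumed here; verify your own client's default first)
  for f in /proc/fs/lustre/llite/*/statahead_max; do
      echo 32 > "$f"
  done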
Re: [Lustre-discuss] client randomly evicted
On Fri, May 02, 2008 at 03:16:31PM -0700, Andreas Dilger wrote:
> On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
>> Some more information that might be helpful. There is a particular code
>> that one of our users runs. Personally after the trouble this code has
>> caused us we'd like to hand him a calculator and disable his accounts but
>> sadly that's not an option. Since the time of the hang, there is what seems
>> to be one process associated with lustre that is running as the userid of
>> the problem user - "ll_sa_15530". A trace of this process in its current
>> state shows this -
>>
>> Is this a problem with the lustre readahead code? If so would this fix it?
>> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"
>
> Yes, this appears to be a statahead problem. There were fixes added to
> 1.6.5 that should resolve the problems seen with statahead. In the meantime
> I'd recommend disabling it as you suggest above.

we're seeing the same problem.

I think the workaround should be:
  echo 0 > /proc/fs/lustre/llite/*/statahead_max
??

/proc/fs/lustre/llite/*/statahead_count is -r--r--r--

cheers,
robin

ps. sorry I've been too busy this week to look at the llite_lloop stuff.
Re: [Lustre-discuss] client randomly evicted
On Apr 30, 2008 11:40 -0400, Aaron Knister wrote:
> Some more information that might be helpful. There is a particular code
> that one of our users runs. Personally after the trouble this code has
> caused us we'd like to hand him a calculator and disable his accounts but
> sadly that's not an option. Since the time of the hang, there is what seems
> to be one process associated with lustre that is running as the userid of
> the problem user - "ll_sa_15530". A trace of this process in its current
> state shows this -
>
> Is this a problem with the lustre readahead code? If so would this fix it?
> "echo 0 > /proc/fs/lustre/llite/*/statahead_count"

Yes, this appears to be a statahead problem. There were fixes added to
1.6.5 that should resolve the problems seen with statahead. In the meantime
I'd recommend disabling it as you suggest above.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
Re: [Lustre-discuss] client randomly evicted
Some more information that might be helpful. There is a particular code
that one of our users runs. Personally after the trouble this code has
caused us we'd like to hand him a calculator and disable his accounts but
sadly that's not an option. Since the time of the hang, there is what seems
to be one process associated with lustre that is running as the userid of
the problem user - "ll_sa_15530". A trace of this process in its current
state shows this -

Apr 30 11:29:30 cola10 kernel: ll_sa_15530  S  0  15531  1  17700  18228 (L-TLB)
Apr 30 11:29:30 cola10 kernel: 810116c31c10 0046 81013e7747a0 80087d0e
Apr 30 11:29:30 cola10 kernel: 0007 81003a76b040 81012f11f0c0 000fcb5175eba398
Apr 30 11:29:30 cola10 kernel: 1407 81003a76b228 0001 0068
Apr 30 11:29:30 cola10 kernel: Call Trace:
Apr 30 11:29:30 cola10 kernel: [] enqueue_task+0x41/0x56
Apr 30 11:29:30 cola10 kernel: [] :ptlrpc:ldlm_prep_enqueue_req+0x1b4/0x2e0
Apr 30 11:29:30 cola10 kernel: [] :mdc:mdc_req_avail+0x6c/0xf0
Apr 30 11:29:30 cola10 kernel: [] :mdc:mdc_enter_request+0x145/0x1e0
Apr 30 11:29:30 cola10 kernel: [] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [] :mdc:mdc_intent_lookup_pack+0xd0/0xf0
Apr 30 11:29:30 cola10 kernel: [] :mdc:mdc_intent_getattr_async+0x214/0x420
Apr 30 11:29:30 cola10 kernel: [] :lustre:ll_i2gids+0x5d/0x150
Apr 30 11:29:30 cola10 kernel: [] :lustre:ll_statahead_thread+0xf75/0x1810
Apr 30 11:29:30 cola10 kernel: [] default_wake_function+0x0/0xe
Apr 30 11:29:30 cola10 kernel: [] child_rip+0xa/0x11
Apr 30 11:29:30 cola10 kernel: [] :lustre:ll_statahead_thread+0x0/0x1810
Apr 30 11:29:30 cola10 kernel: [] child_rip+0x0/0x11

Is this a problem with the lustre readahead code? If so would this fix it?
"echo 0 > /proc/fs/lustre/llite/*/statahead_count"

Thank you so much for all your help.

-Aaron

On Apr 30, 2008, at 11:16 AM, Aaron S. Knister wrote:

> I have a lustre client that was randomly evicted early this morning. The
> errors from the dmesg are below. It's running infiniband. There were no
> infiniband errors that I could tell, and all the mds/mgs and oss's said was
> "haven't heard from client xyz in 2277 seconds. Evicting". The client has
> halfway come back and now shows this -
>
> [EMAIL PROTECTED]:~ $ lfs df -h
> UUID                 bytes   Used  Available  Use%  Mounted on
> data-MDT_UUID        87.5G   6.4G      81.1G    7%  /data[MDT:0]
> data-OST_UUID         5.4T   4.9T     439.6G   92%  /data[OST:0]
> data-OST0001_UUID : inactive device
> data-OST0002_UUID : inactive device
> data-OST0003_UUID : inactive device
> data-OST0004_UUID : inactive device
> data-OST0005_UUID : inactive device
> data-OST0006_UUID : inactive device
> data-OST0007_UUID : inactive device
> data-OST0008_UUID : inactive device
> data-OST0009_UUID : inactive device
> filesystem summary:   5.4T   4.9T     439.6G   92%  /data
>
> so it's reconnected to one of 10 osts. I tried to do an lctl --device
> {device} reconnect and it said "Error: Operation in progress". I have no
> idea what went wrong and I'm confident a reboot would fix it but I'd like
> to avoid it if possible. Thanks in advance.
>
> LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_statfs operation failed with -107
> Lustre: data-MDT-mdc-81013037b800: Connection to service data-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
> LustreError: 167-0: This client was evicted by data-MDT; in progress operations using this service will fail.
> LustreError: 22345:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -5
> LustreError: 22396:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717113/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22396:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22454:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717114/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22454:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22463:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717115/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22463:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22734:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717138/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
> LustreError: 22734:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
> LustreError: 22736:0:(client.c:519:ptlrpc_import_dela
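Before disabling statahead it can help to confirm that a stuck statahead thread like the one traced above is still around. A small sketch, assuming only the ll_sa_* kernel-thread naming seen in the trace and a node where sysrq is enabled (sysrq-t dumps every task's stack into the kernel log and is noisy on busy machines):

  # look for lingering Lustre statahead kernel threads
  ps -e -o pid,user,stat,comm | grep '[l]l_sa_'

  # re-capture their kernel stacks via sysrq-t (output lands in the kernel log,
  # as in the trace shown above)
  echo t > /proc/sysrq-trigger
  dmesg | grep -A 20 'll_sa_'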
[Lustre-discuss] client randomly evicted
I have a lustre client that was randomly evicted early this morning. The
errors from the dmesg are below. It's running infiniband. There were no
infiniband errors that I could tell, and all the mds/mgs and oss's said was
"haven't heard from client xyz in 2277 seconds. Evicting". The client has
halfway come back and now shows this -

[EMAIL PROTECTED]:~ $ lfs df -h
UUID                 bytes   Used  Available  Use%  Mounted on
data-MDT_UUID        87.5G   6.4G      81.1G    7%  /data[MDT:0]
data-OST_UUID         5.4T   4.9T     439.6G   92%  /data[OST:0]
data-OST0001_UUID : inactive device
data-OST0002_UUID : inactive device
data-OST0003_UUID : inactive device
data-OST0004_UUID : inactive device
data-OST0005_UUID : inactive device
data-OST0006_UUID : inactive device
data-OST0007_UUID : inactive device
data-OST0008_UUID : inactive device
data-OST0009_UUID : inactive device
filesystem summary:   5.4T   4.9T     439.6G   92%  /data

so it's reconnected to one of 10 osts. I tried to do an lctl --device
{device} reconnect and it said "Error: Operation in progress". I have no
idea what went wrong and I'm confident a reboot would fix it but I'd like
to avoid it if possible. Thanks in advance.

LustreError: 11-0: an error occurred while communicating with [EMAIL PROTECTED] The mds_statfs operation failed with -107
Lustre: data-MDT-mdc-81013037b800: Connection to service data-MDT via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by data-MDT; in progress operations using this service will fail.
LustreError: 22345:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -5
LustreError: 22396:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717113/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22396:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22454:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717114/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22454:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22463:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717115/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22463:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22734:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717138/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22734:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22736:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717139/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22736:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22912:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717140/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22912:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717143/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22971:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 22971:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 2 previous similar messages
LustreError: 23781:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717144/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23781:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23796:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717156/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717157/t0 o41->[EMAIL PROTECTED]@o2ib:12 lens 128/272 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) mdc_statfs fails: rc = -108
LustreError: 23827:0:(llite_lib.c:1508:ll_statfs_internal()) Skipped 1 previous similar message
LustreError: 22346:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID [EMAIL PROTECTED] x81717169/t0 o35->[EMAIL PROTECTED]@o2ib:12 lens 296/896 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 22346:0:(file.c:97:ll_close_inode_openhandle()) inode 21601226 mdc close failed: rc = -108
Lustre: data-MDT-mdc-81013037b800: Connection restored to service data-MDT u
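For anyone hitting the same "inactive device" state: a hedged sketch of inspecting and reactivating the client-side OSC devices with lctl. The device number below is illustrative only and should be read from your own lctl dl output; whether activate/recover helps depends on why the imports went inactive in the first place:

  # list local Lustre devices; each OST above has a matching osc device here
  lctl dl

  # try to reactivate one OSC, using the device number from the first column of lctl dl
  lctl --device 7 activate

  # or ask that device's import to run recovery again
  lctl --device 7 recover

  # then re-check what the client sees
  lfs df -h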