[Lustre-discuss] Lustre recovery when clients go away

2008-07-31 Thread Brock Palen
One of our OSSes died with a panic last night.  Between when it was
found (no failover) and restarted, two clients had also died (nodes
crashed by user OOM).

Because of this, the OSTs are now looking for 626 clients to recover
when only 624 are up.  The 624 recover in about 15 minutes, but
the OSTs on that OSS hang waiting for the last two, which are dead and
not coming back.  Note that the MDS reports only 624 clients.

Is there a way to tell the OSTs to go ahead and evict those two
clients and finish recovering?  Also, time remaining has been 0
since it was booted.  How long will the OSTs wait before they let
operations continue?

Is there any rule for speeding up recovery?  The OSS that crashed sees
very little CPU/disk/network traffic while recovery is going on, so
any way to speed it up, even if it results in a higher load, would be
great to know.

status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794



Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
[EMAIL PROTECTED]
(734)936-1985



___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] Lustre recovery when clients go away

2008-07-31 Thread Andreas Dilger
On Jul 31, 2008  10:30 -0400, Brock Palen wrote:
> One of our OSSes died with a panic last night.  Between when it was
> found (no failover) and restarted, two clients had also died (nodes
> crashed by user OOM).
>
> Because of this, the OSTs are now looking for 626 clients to recover
> when only 624 are up.  The 624 recover in about 15 minutes, but
> the OSTs on that OSS hang waiting for the last two, which are dead and
> not coming back.  Note that the MDS reports only 624 clients.
>
> Is there a way to tell the OSTs to go ahead and evict those two
> clients and finish recovering?  Also, time remaining has been 0
> since it was booted.  How long will the OSTs wait before they let
> operations continue?
>
> Is there any rule for speeding up recovery?  The OSS that crashed sees
> very little CPU/disk/network traffic while recovery is going on, so
> any way to speed it up, even if it results in a higher load, would be
> great to know.
>
> status: RECOVERING
> recovery_start: 1217509142
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 175342162
> status: RECOVERING
> recovery_start: 1217509144
> time remaining: 0
> connected_clients: 624/626
> completed_clients: 624/626
> replayed_requests: 0/??
> queued_requests: 0
> next_transno: 193097794

The recovery should time out after about 5 minutes (with default 100s
timeouts).  The recovery goes as fast as clients connect and submit RPCs
for replay.  In the case where all clients connect then recovery is
finished as soon as all clients report completion.

Are you saying the system is still stuck in recovery after more than 5
or 10 minutes?
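
One way to answer that from the posted output is to compare
recovery_start with the current time.  The following is only a sketch
in Python: it parses the "key: value" text exactly as it appears in
this thread and does not read anything from a live server.  The
function names and the 300-second default window are assumptions made
for illustration (the window is taken from the "about 5 minutes"
figure above, not from any Lustre constant).

```python
import time


def parse_recovery_status(text):
    """Parse 'key: value' lines from one recovery status dump into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon
            fields[key.strip()] = value.strip()
    return fields


def summarize(fields, now=None, window=300):
    """Report missing clients and whether recovery has outlived `window`.

    `window` is a hypothetical threshold in seconds (assumed: 300, i.e.
    "about 5 minutes"); adjust it to match the configured timeouts.
    """
    now = time.time() if now is None else now
    connected, expected = (int(n) for n in fields["connected_clients"].split("/"))
    elapsed = int(now) - int(fields["recovery_start"])
    return {
        "status": fields["status"],
        "missing_clients": expected - connected,
        "elapsed_seconds": elapsed,
        "past_window": fields["status"] == "RECOVERING" and elapsed > window,
    }


# First status block from the original post.
SAMPLE = """status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162"""

# Pretend the check runs 15 minutes after recovery started.
report = summarize(parse_recovery_status(SAMPLE), now=1217509142 + 900)
print(report)
```

If past_window comes back true while missing_clients is still nonzero,
that matches the situation described in the original post: recovery
has not timed out even though the absent clients will never reconnect.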

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
