On Jul 31, 2008 10:30 -0400, Brock Palen wrote:
One of our OSSes died with a panic last night. Between the time it was
found (no failover) and the time it was restarted, two clients also
died (nodes crashed by user OOM).
Because of this, the OSTs are now looking for 626 clients to recover
when only 624 are up. The 624 recover in about 15 minutes, but
the OSTs on that OSS hang waiting for the last two, which are dead and
not coming back. Note that the MDS reports only 624 clients.
Is there a way to tell the OSTs to go ahead and evict those two
clients and finish recovering? Also, time remaining has been 0
since it was booted. How long will the OSTs wait before they let
operations continue?
Is there any rule for speeding up recovery? The OSS that crashed sees
very little CPU, disk, or network activity while recovery is going on,
so any way to speed it up, even if it results in a higher load, would
be great to know.
status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162
status: RECOVERING
recovery_start: 1217509144
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 193097794
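
For anyone watching output like the above, a small script can make the two
numbers that matter explicit: how many clients are still missing and how long
recovery has been running. This is only a rough sketch in Python, assuming the
status text has the "key: value" layout shown above and that recovery_start is
a Unix epoch timestamp. The function names parse_recovery_status and summarize
are made up for illustration and are not part of Lustre.

```python
import time


def parse_recovery_status(text):
    """Parse 'key: value' lines (as pasted above) into a dict of strings."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            status[key.strip()] = value.strip()
    return status


def summarize(status, now):
    """Report missing clients and seconds elapsed since recovery_start.

    Assumes recovery_start is seconds since the Unix epoch and that
    connected_clients has the form 'connected/expected'.
    """
    connected, expected = (int(n) for n in status["connected_clients"].split("/"))
    return {
        "recovering": status["status"] == "RECOVERING",
        "missing_clients": expected - connected,
        "elapsed_seconds": now - int(status["recovery_start"]),
    }


# First status block from the message above, copied verbatim.
SAMPLE = """status: RECOVERING
recovery_start: 1217509142
time remaining: 0
connected_clients: 624/626
completed_clients: 624/626
replayed_requests: 0/??
queued_requests: 0
next_transno: 175342162"""

if __name__ == "__main__":
    print(summarize(parse_recovery_status(SAMPLE), int(time.time())))
```

The elapsed time can then be compared against whatever recovery timeout is
expected on the system; the sketch itself does not know or assume that value.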
The recovery should time out after about 5 minutes (with default 100s
timeouts). The recovery goes as fast as clients connect and submit RPCs
for replay. If all clients connect, recovery finishes as soon as
all of them report completion.
Are you saying the system is still stuck in recovery after more than 5
or 10 minutes?
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.