On 6/12/11 6:18 PM, "Jim Klimov" <jimkli...@cos.ru> wrote:

>2011-06-12 23:57, Richard Elling wrote:
>>
>> How long should it wait? Before you answer, read through the thread:
>>      http://lists.illumos.org/pipermail/developer/2011-April/001996.html
>> Then add your comments :-)
>>   -- richard
>
>But the point of my previous comment was that, according
>to the original poster, after a while his disk did get
>marked as "faulted" or "offlined". IF this happened
>during the system's initial uptime, but it froze anyway,
>it is a problem.
>
>What I do not know is if he rebooted the box within the
>5 minutes set aside for the timeout, or if some other
>processes gave up during the 5 minutes of no IO and
>effectively hung the system.
>
>If it is somehow the latter - that the inaccessible drive
>hung the system (or led to a hang) past any set IO retry
>timeouts - that is a bug, I think.
>

Here's the timeline:

- The Intel X25-M was marked "FAULTED" Monday evening, 6pm. This was not
detected by NexentaStor.
- The storage system performance diminished at 9am the next morning, with
intermittent spikes in the system load of the VMs hosted on the unit.
- By 11am, the Nexenta interface and console were unresponsive and the
virtual machines dependent on the underlying storage stalled completely.
- At 12pm, I gained physical access to the server, but I could not acquire
console access (shell or otherwise). I did see the FMA error output on the
screen indicating the actual device FAULT time.
- I powered the system off, removed the Intel X25-M, and powered back on.
The VMs picked up where they left off and the system stabilized.

The total impact to end-users was 3 hours of either poor performance or
straight downtime. 

-- 
Edmund White
ewwh...@mac.com


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss