Richard Elling wrote:
> [what usually concerns me is that the software people spec'ing device
> drivers don't seem to have much training in control systems, which is
> what is being designed]

Or they try to develop safety-critical systems on a "best effort" basis
instead of first developing a clear and verifiable idea of what is
required for correct functioning.

> 
> The feedback loop is troublesome because there is usually at least one
> queue, perhaps 3 queues between the host and the media.  At each
> queue, iops can be reordered.  

And that's evil... A former colleague studied, for his master's thesis,
how much reordering could be done while still preserving correctness,
and it was notable how easily one could mess up!
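
To make the hazard concrete, here is a toy sketch of my own (not from
the thesis) of how a plain elevator sort, reordering purely by block
number, can push a commit record out ahead of the data it describes.
Any reordering policy has to honour that kind of dependency:

   #include <stdio.h>
   #include <stdlib.h>

   /* Toy I/O request: the commit record at block 100 must not reach
    * the media before the data it points to at block 9000. */
   struct iop { long blkno; const char *what; };

   static int by_blkno(const void *a, const void *b) {
       const struct iop *x = a, *y = b;
       return (x->blkno > y->blkno) - (x->blkno < y->blkno);
   }

   int main(void) {
       struct iop q[] = {
           { 9000, "data block"    },  /* issued first                 */
           {  100, "commit record" },  /* issued second, depends on it */
       };
       /* A naive elevator sort reorders purely by block number... */
       qsort(q, 2, sizeof (q[0]), by_blkno);
       /* ...and now the commit record would hit the platter first. */
       for (int i = 0; i < 2; i++)
           printf("%ld  %s\n", q[i].blkno, q[i].what);
       return (0);
   }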


>               As Sommerfeld points out, we see the
> same sort of thing in IP networks, but two things bother me about that:
> 
>     1. latencies for disk seeks, rotations, and cache hits look very different
>        from random IP network latencies.  For example: a TNF trace I
>        recently examined for an IDE disk (no queues which reorder)
>        running a single thread read workload showed the following data:
>            block   size   latency (ms)
>           ----------------------------
>            446464    48   1.18
>           7180944    16  13.82   (long seek?)
>           7181072   112   3.65   (some rotation?)
>           7181184   112   2.16
>           7181296    16   0.53   (track cache?)
>            446512    16   0.57   (track cache?)
> 
>        This same system using a SATA disk might look very
>        different, because there are 2 additional queues at
>        work, and (I expect) NCQ.  OK, so the easy way around
>        this is to build in a substantial guard band... no
>        problem, but if you get above about a second, then
>        you aren't much different than the B_FAILFAST solution
>        even though...

Fortunately, latencies grow without bound past N*, the saturation point,
so one can distinguish overloads (insanely bad latency and response time)
from normal mismanagement (a single order of magnitude or so, base 10 ;-))
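
As a sketch of what I mean (the service time and utilizations below are
made-up numbers, and the M/M/1-ish model is only for illustration), the
response time R = S / (1 - U) blows up as utilization U approaches
saturation, which is why an overloaded device is easy to tell apart
from a merely busy one:

   #include <stdio.h>

   /* Illustration only: response time in a simple M/M/1-style model,
    * R = S / (1 - U), with a made-up 5 ms service time. */
   int main(void) {
       double S = 5.0;  /* service time, ms (invented) */
       double u[] = { 0.50, 0.80, 0.90, 0.95, 0.99 };
       for (int i = 0; i < 5; i++)
           printf("U = %.2f  ->  R = %6.1f ms\n", u[i], S / (1.0 - u[i]));
       return (0);
   }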

> 
>     2. The algorithm *must* be computationally efficient.
>        We are looking down the tunnel at I/O systems that can
>        deliver on the order of 5 Million iops.  We really won't
>        have many (any?) spare cycles to play with.

OK: I make it two comparisons and a subtract at the decision point, plus
a lot of precalculation done in user space over time.  Very similar to
the IBM mainframe experience with goal-directed management.
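
Something like the following, where all the names and numbers are mine
and purely illustrative: the per-I/O cost is one subtract and two
comparisons, and the bounds table is refreshed from user space at
leisure, from the latency history:

   #include <stdint.h>

   /* Bounds precalculated in user space from recent latency history
    * (hypothetical names, not an existing interface). */
   struct lat_bounds {
       uint64_t slow_ns;       /* above this: merely slow             */
       uint64_t saturated_ns;  /* above this: treat as overload/fault */
   };

   enum verdict { IO_OK, IO_SLOW, IO_SATURATED };

   /* Hot path: one subtract, two comparisons. */
   static enum verdict
   classify_io(uint64_t now_ns, uint64_t start_ns, const struct lat_bounds *b)
   {
       uint64_t elapsed = now_ns - start_ns;   /* the subtract  */
       if (elapsed > b->saturated_ns)          /* comparison #1 */
           return (IO_SATURATED);
       if (elapsed > b->slow_ns)               /* comparison #2 */
           return (IO_SLOW);
       return (IO_OK);
   }

   int main(void) {
       /* Invented bounds: 50 ms "slow", 1 s "saturated". */
       struct lat_bounds b = { 50000000ULL, 1000000000ULL };
       return (classify_io(2000000000ULL, 0, &b) == IO_SATURATED ? 0 : 1);
   }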

> 
>>   The second is for resource management, where one throttles
>> disk-hog projects when one discovers latency growing without
>> bound on disk saturation, and the third is in case of a fault
>> other than the above.
>>   
> 
> 
> Resource management is difficult when you cannot directly attribute
> physical I/O to a process.

Agreed: we may need a way to associate logical I/Os with the
project which authored them. 
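
As a purely hypothetical sketch of what that association might look like
(none of these names are real interfaces): tag each logical I/O with the
Solaris project that authored it when it enters the top of the stack,
carry the tag through every queue, and bill the eventual physical I/O
back to that project so a throttle has something to act on:

   #include <stdint.h>

   /* Hypothetical, illustrative only: a project tag carried with each
    * logical I/O from the top of the stack down through the queues. */
   typedef struct logical_io {
       int32_t   lio_projid;    /* issuing project (projid_t on Solaris) */
       uint64_t  lio_offset;    /* byte offset                           */
       uint64_t  lio_size;      /* bytes                                 */
       uint64_t  lio_start_ns;  /* issue time, for latency accounting    */
   } logical_io_t;

   /* Per-project accounting a throttle could consult. */
   typedef struct proj_io_stats {
       uint64_t  pis_inflight;      /* I/Os currently queued */
       uint64_t  pis_bytes_issued;  /* running total         */
   } proj_io_stats_t;

   static void
   proj_account_issue(proj_io_stats_t *ps, const logical_io_t *lio)
   {
       ps->pis_inflight++;
       ps->pis_bytes_issued += lio->lio_size;
   }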

> 
>>   For the latter to work well, I'd like to see the resource management
>> and fast/slow mirror adaptation be something one turns on explicitly,
>> because then when FMA discovered that you in fact have a fast/slow
>> mirror or a Dr. Evil program saturating the array, the "fix"
>> could be to notify the sysadmin that they had a problem and
>> to suggest built-in tools to ameliorate it.
> 
> 
> Agree 100%.
> 
>>  
>> Ian Collins writes:  
>>
>>> One solution (again, to be used with a remote mirror) is the three 
>>> way mirror.  If two devices are local and one remote, data is safe 
>>> once the two local writes return.  I guess the issue then changes 
>>> from "is my data safe" to "how safe is my data".  I would be 
>>> reluctant to deploy a remote mirror device without local redundancy, 
>>> so this probably won't be an uncommon setup.  There would have to be 
>>> an acceptable window of risk when local data isn't replicated.
>>>     
>>
>>
>>   And in this case too, I'd prefer the sysadmin provide the information
>> to ZFS about what she wants, and have the system adapt to it, and
>> report how big the risk window is.
>>
>>   This would effectively change the FMA behavior, you understand, so 
>> as to have it report failures to complete the local writes in time t0 
>> and remote in time t1, much as the resource management or fast/slow 
>> cases would
>> need to be visible to FMA.
>>   
> 
> 
> I think this can be reasonably accomplished within the scope of FMA.
> Perhaps we should pick that up on fm-discuss?
> 
> But I think the bigger problem is that unless you can solve for the general
> case, you *will* get nailed.  I might even argue that we need a way for
> storage devices to notify hosts of their characteristics, which would
> require protocol adoption and would take years to implement.

Fortunately, the critical metric, latency, is easy to measure.  Noisy!
Indeed, very noisy, but easy to measure for specific cases, as noted
above.  The general case you describe below is indeed harder; I suspect
we may need to statically annotate certain devices with information
about their critical behavior...
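
For the specific case, the measurement side is just a timestamp pair per
I/O and some smoothing; something like the following is what I have in
mind (the 1/8 weight is a made-up constant, not tuned):

   #include <stdint.h>

   /* Smooth the noisy per-I/O latency samples with an exponentially
    * weighted moving average so the decision logic sees a stable
    * estimate rather than raw noise. */
   struct dev_lat {
       uint64_t avg_ns;   /* smoothed latency estimate */
   };

   static void
   dev_lat_sample(struct dev_lat *dl, uint64_t start_ns, uint64_t done_ns)
   {
       uint64_t sample = done_ns - start_ns;

       if (dl->avg_ns == 0)
           dl->avg_ns = sample;                          /* first sample */
       else
           dl->avg_ns = dl->avg_ns - dl->avg_ns / 8 + sample / 8;
   }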


> Consider two scenarios:
> 
>     Case 1. Fully redundant storage array with active/active controllers.
>        A failed controller should cause the system to recover on the
>        surviving controller.  I have some lab test data for this sort of
>        thing, and some popular arrays can take on the order of a minute to
>        complete the failure detection and reconfiguration.  You don't
>        want to degrade the vdev when this happens, you just want to
>        wait until the array is again ready for use (this works ok today.)
>        I would further argue that no "disk failure prediction" code would
>        be useful for this case.
> 
>     Case 2.  Power on test.  I had a bruise (no scar :-) once from an
>        integrated product we were designing
>           http://docs.sun.com/app/docs/coll/cluster280-3
>        which had a server (or two) and RAID array (or two).  If you build
>        such a system from scratch, then it will fail a power-on test: when
>        the rack containing these systems was powered on, the time required
>        for the RAID array to boot was longer than the time required for
>        the server to boot *and* time out its probes of the array.  The
>        result was that the volume manager would declare the disks bad and
>        system administration intervention was required to regain access
>        to the data in the array.  Since this was an integrated product,
>        we solved it by inducing a delay loop in the server boot cycle to
>        slow down the server.  Was it the best possible solution?  No, but
>        it was the only solution that met our other design constraints.
> 
> In both of these cases, the solutions imply that multi-minute timeouts are
> required to maintain a stable system.  For 101-level insight into this
> sort of problem, see the Sun BluePrint article (an oldie but goodie):
> http://www.sun.com/blueprints/1101/clstrcomplex.pdf
> 
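
Agreed that the timeouts have to be generous.  What I have in mind is a
two-stage policy, where a quick B_FAILFAST-style retry handles the
ordinary case and only a multi-minute guard band escalates to declaring
a fault; the numbers below are invented for illustration, not taken
from your lab data:

   #include <stdint.h>

   /* Hypothetical two-stage timeout policy; thresholds are invented. */
   #define RETRY_AFTER_SEC     2   /* quick B_FAILFAST-style retry        */
   #define DEGRADE_AFTER_SEC 180   /* covers a ~1 minute array controller */
                                   /* failover with plenty of margin      */

   enum io_action { IO_WAIT, IO_RETRY_OTHER_PATH, IO_DECLARE_FAULT };

   static enum io_action
   io_timeout_policy(uint64_t waited_sec)
   {
       if (waited_sec >= DEGRADE_AFTER_SEC)
           return (IO_DECLARE_FAULT);      /* let FMA take over      */
       if (waited_sec >= RETRY_AFTER_SEC)
           return (IO_RETRY_OTHER_PATH);   /* try the surviving path */
       return (IO_WAIT);                   /* normal case: patience  */
   }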

--dave
-- 
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
[EMAIL PROTECTED]                 |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
