Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote:
 2. The algorithm *must* be computationally efficient.
We are looking down the tunnel at I/O systems that can
deliver on the order of 5 Million iops.  We really won't
have many (any?) spare cycles to play with.

If you pick the constants carefully (powers of two) you can do the TCP
RTT + variance estimation using only a handful of shifts, adds, and
subtracts.
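
For the record, the integer trick looks roughly like this (untested
sketch -- the names and the <<3 / <<2 scaling are just the textbook TCP
ones, not code from any existing driver):

#include <stdint.h>

/*
 * Smoothed latency kept scaled by 8, mean deviation by 4, so the gains
 * of 1/8 and 1/4 reduce to shifts.  Per-sample cost is a few shifts,
 * adds, and subtracts -- no divides, no floating point.
 */
struct lat_est {
        int64_t srtt;           /* smoothed latency << 3 */
        int64_t rttvar;         /* mean deviation << 2 */
};

static void
lat_update(struct lat_est *e, int64_t sample)
{
        int64_t delta = sample - (e->srtt >> 3);

        e->srtt += delta;               /* srtt = 7/8*srtt + 1/8*sample */
        if (delta < 0)
                delta = -delta;
        delta -= (e->rttvar >> 2);
        e->rttvar += delta;             /* rttvar = 3/4*rttvar + 1/4*|err| */
}

static int64_t
lat_deadline(const struct lat_est *e)
{
        /* analogous to TCP's RTO = srtt + 4*rttvar */
        return ((e->srtt >> 3) + e->rttvar);
}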

 In both of these cases, the solutions imply multi-minute timeouts are
 required to maintain a stable system.  

Again, there are different uses for timeouts:
 1) how long should we wait on an ordinary request before deciding to
try plan B and go elsewhere (a la B_FAILFAST)
 2) how long should we wait (while trying all alternatives) before
declaring an overall failure and giving up.

The RTT estimation approach is really only suitable for the former,
where you have some alternatives available (retransmission in the case
of TCP; trying another disk in the case of mirrors, etc.,).  

When you've tried all the alternatives and nobody's responding, there's
no substitute for just retrying for a long time.
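
To make the two timeouts concrete, the structure I have in mind is
roughly the following (untested toy; issue_one() is a hypothetical
stand-in for "send the request down one path with a deadline"):

#include <stdbool.h>
#include <stdint.h>

typedef bool (*issue_fn_t)(int path, int64_t deadline_ms);

/*
 * Timeout 1 (attempt_ms) decides when to try plan B and go to another
 * path; timeout 2 (overall_ms) decides when to give up entirely, after
 * all alternatives have been retried.
 */
static bool
do_request(issue_fn_t issue_one, int npaths,
    int64_t attempt_ms, int64_t overall_ms)
{
        int64_t spent = 0;
        int path = 0;

        while (spent < overall_ms) {
                if (issue_one(path, attempt_ms))
                        return (true);          /* done */
                spent += attempt_ms;            /* that path timed out */
                path = (path + 1) % npaths;     /* B_FAILFAST-ish: go elsewhere */
        }
        return (false);                         /* overall failure */
}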

- Bill




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Bill Sommerfeld
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote:

 It's sort of like network QoS, but not quite, because: 
 
   (a) you don't know exactly how big the ``pipe'' is, only
   approximately, 

In an IP network, end nodes generally know no more than the pipe size of
the first hop -- and in some cases (such as true CSMA networks like
classical Ethernet or wireless) only have an upper bound on the pipe
size.

Beyond that, they can only estimate the characteristics of the rest of
the network by observing its behavior - all they get is end-to-end
latency, and *maybe* a 'congestion observed' mark set by an intermediate
system.

   (c) all the fabrics are lossless, so while there are queues which
   undesirably fill up during congestion, these queues never drop
   ``packets'' but instead exert back-pressure all the way up to
   the top of the stack.

hmm.  I don't think the back pressure makes it all the way up to zfs
(the top of the block storage stack) except as added latency.  

(on the other hand, if it did, zfs could schedule around it both for
reads and writes, avoiding pouring more work on already-congested
paths..)
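
For what it's worth, the sort of scheduling-around I'm imagining is
nothing fancier than "send the next read to the child with the least
expected queueing delay".  Purely hypothetical sketch; the struct and
fields below are invented and bear no resemblance to the actual vdev
code:

#include <stdint.h>

struct child_path {
        int64_t smoothed_lat_us;        /* smoothed per-path latency estimate */
        int     inflight;               /* I/Os currently outstanding */
};

/* pick the mirror child with the smallest expected queueing delay */
static int
pick_read_child(const struct child_path *c, int nchildren)
{
        int best = 0;
        int64_t best_cost = INT64_MAX;

        for (int i = 0; i < nchildren; i++) {
                int64_t cost = c[i].smoothed_lat_us * (c[i].inflight + 1);
                if (cost < best_cost) {
                        best_cost = cost;
                        best = i;
                }
        }
        return (best);
}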

 I'm surprised we survive as well as we do without disk QoS.  Are the
 storage vendors already doing it somehow?

I bet that, as with networking, overprovisioning the hardware and
running at lower average utilization is often cheaper in practice than
running close to the edge and spending a lot of expensive expert time
monitoring performance and tweaking QoS parameters.



Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-02 Thread Miles Nordin
 bs == Bill Sommerfeld [EMAIL PROTECTED] writes:

bs In an ip network, end nodes generally know no more than the
bs pipe size of the first hop -- and in some cases (such as true
bs CSMA networks like classical ethernet or wireless) only have
bs an upper bound on the pipe size.

yeah, but the most complicated and well-studied queueing disciplines
(everything implemented in ALTQ, and I think everything implemented by
the two different Cisco queueing frameworks: the process-switched CBQ
one and the diffserv-like cat6500 ASIC-switched one) are (a) hop-by-hop,
so the algorithm under discussion only applies to a single hop and a
single transmit queue, never to a whole path, and (b) built on the
assumption of a unidirectional link of known fixed size, not a broadcast
link or token ring or anything like that.

For wireless they are not using the fancy algorithms.  They're doing
really primitive things like ``unsolicited grants''---basically just
TDMA channels.

I wouldn't think of ECN as part of QoS exactly, because it separates
so cleanly from your choice of queue discipline.

bs hmm.  I don't think the back pressure makes it all the way up
bs to zfs

I guess I was thinking of the lossless fabrics, which might change
some of the assumptions that went into designing the schedulers used
for IP QoS.  For example, most of the IP QoS systems divide the usual
one-big-queue into many smaller queues.  A ``classifier'' picks some
packets as pink ones and some as blue, and assigns them to the
corresponding queues, always at the tail.  The ``scheduler'' then
decides from which queue to take the next packet.  The primitive QoS
in Ethernet chips might give you 4 queues that are either
strict-priority or weighted-round-robin.  Link-sharing schedulers like
CBQ or HFSC make a hierarchy of queues where, to the extent that
they're work-conserving, child queues borrow unused transmission slots
from their ancestors.  WFQ uses a flat set of 256 hash-bucket queues
and just tries to separate one job from another.
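
A toy of that classifier/scheduler split, just to show the shape
(untested, all names invented, two classes and weighted round robin):

#include <stdbool.h>
#include <stddef.h>

#define NCLASSES        2       /* "pink" and "blue" */
#define QDEPTH          64

struct work {
        bool    low_priority;   /* e.g. scrub vs. application I/O */
};

struct class_queue {
        struct work     *item[QDEPTH];
        int             head, tail, len;
        int             weight;         /* WRR weight */
        int             credit;         /* slots left this round */
};

static int
classify(const struct work *w)
{
        return (w->low_priority ? 1 : 0);
}

static int
enqueue(struct class_queue *q, struct work *w)
{
        struct class_queue *cq = &q[classify(w)];

        if (cq->len == QDEPTH)
                return (-1);            /* that class's queue is full */
        cq->item[cq->tail] = w;
        cq->tail = (cq->tail + 1) % QDEPTH;
        cq->len++;
        return (0);
}

static struct work *
dequeue_wrr(struct class_queue *q)
{
        for (int pass = 0; pass < 2; pass++) {
                for (int i = 0; i < NCLASSES; i++) {
                        if (q[i].len == 0 || q[i].credit == 0)
                                continue;
                        q[i].credit--;
                        struct work *w = q[i].item[q[i].head];
                        q[i].head = (q[i].head + 1) % QDEPTH;
                        q[i].len--;
                        return (w);
                }
                for (int i = 0; i < NCLASSES; i++)
                        q[i].credit = q[i].weight;      /* new round */
        }
        return (NULL);          /* everything empty */
}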

But no matter which of those you choose, within each of the smaller
queues you get an orthogonal choice of RED or FIFO.  There's no such
choice for queues in storage networks, because there is no packet
dropping.
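
For anyone who hasn't stared at RED, the per-packet decision is roughly
the following.  (Simplified and untested, with made-up thresholds; the
real algorithm also spaces drops out by counting packets since the last
drop.)

#include <stdbool.h>
#include <stdlib.h>

#define RED_MIN_TH      5       /* start dropping above this average */
#define RED_MAX_TH      15      /* drop everything above this average */
#define RED_MAX_P       0.1     /* peak drop probability at max_th */
#define RED_WEIGHT      0.002   /* EWMA weight for the average */

static double red_avg;          /* smoothed queue length */

static bool
red_should_drop(int qlen)
{
        red_avg += RED_WEIGHT * (qlen - red_avg);

        if (red_avg < RED_MIN_TH)
                return (false);         /* plain FIFO behaviour */
        if (red_avg >= RED_MAX_TH)
                return (true);          /* forced drop */

        /* linear ramp between the two thresholds */
        double p = RED_MAX_P *
            (red_avg - RED_MIN_TH) / (RED_MAX_TH - RED_MIN_TH);
        return ((double)rand() / RAND_MAX < p);
}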

This complicates the implementation of the upper queueing discipline:
what happens when one of the small queues fills up?  How can you push
up the stack, ``I will not accept another CDB if I would classify it as
a pink CDB, because the pink queue is full, but I will still accept
blue CDBs''?  Needing to express this destroys the modularity of the IP
QoS model.  We can only say ``block---no more CDBs accepted,'' but that
defeats the whole purpose of the QoS.  So how do you say ``no more pink
CDBs''?  With normal hop-by-hop QoS, I don't think you can.

This inexpressibility of ``no more pink CDBs'' is the same reason
enterprise Ethernet switches never actually use the gigabit Ethernet
``flow control'' mechanism.  Yeah, they negotiate flow control and
obey received flow control signals, but they never _assert_ a flow
control signal, at least not for normal output-queue congestion,
because this would also block reception of packets that would get
switched to uncongested output ports.  Proper enterprise switches
assert flow control only for rare pathological cases like backplane
saturation or cheap oversubscribed line cards.  No matter what
overzealous powerpoint monkeys claim, CEE/FCoE is _not_ going to use
``pause frames.''

I guess you're right that some of the ``queues'' in storage are sort
of arbitrarily sized, like the write queue, which could take up the
whole buffer cache, so back pressure might not be the right way to
imagine it.




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-01 Thread Robert Milkowski
Hello Miles,

Sunday, August 31, 2008, 8:03:45 PM, you wrote:

 dc == David Collier-Brown [EMAIL PROTECTED] writes:

MN dc one discovers latency growing without bound on disk
MN dc saturation,

MN yeah, ZFS needs the same thing just for scrub.

MN I guess if the disks don't let you tag commands with priorities, then
MN you have to run them at slightly below max throughput in order to QoS
MN them.

MN It's sort of like network QoS, but not quite, because: 

MN   (a) you don't know exactly how big the ``pipe'' is, only
MN   approximately, 

MN   (b) you're not QoS'ing half of a bidirectional link---you get
MN   instant feedback of how long it took to ``send'' each ``packet''
MN   that you don't get with network QoS, and

MN   (c) all the fabrics are lossless, so while there are queues which
MN   undesirably fill up during congestion, these queues never drop
MN   ``packets'' but instead exert back-pressure all the way up to
MN   the top of the stack.

MN I'm surprised we survive as well as we do without disk QoS.  Are the
MN storage vendors already doing it somehow?

I don't know the details and haven't actually tested it, but EMC
provides QoS in their CLARiiON line...

-- 
Best regards,
 Robert Milkowski    mailto:[EMAIL PROTECTED]
                     http://milek.blogspot.com



Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-09-01 Thread David Collier-Brown


Richard Elling wrote:
 [what usually concerns me is that the software people spec'ing device
 drivers don't seem to have much training in control systems, which is
 what is being designed]

Or try to develop safety-critical systems based on best effort instead
of first developing a clear and verifiable idea of what is required for
correct functioning.

 
 The feedback loop is troublesome because there is usually at least one
 queue, perhaps 3 queues between the host and the media.  At each
 queue, iops can be reordered.  

And that's evil... A former colleague did a study of how much
reordering could be done and still preserve correctness as
his master's thesis, and it was notable how easily one could
mess up! 


   As Sommerfeld points out, we see the
 same sort of thing in IP networks, but two things bother me about that:
 
 1. latency for disk seeks, rotates, and cache hits look very different
than random IP network latencies.  For example: a TNF trace I
recently examined for an IDE disk (no queues which reorder)
running a single thread read workload showed the following data:
     block      size   latency (ms)
     -------    ----   ------------
      446464     48      1.18
     7180944     16     13.82   (long seek?)
     7181072    112      3.65   (some rotation?)
     7181184    112      2.16
     7181296     16      0.53   (track cache?)
      446512     16      0.57   (track cache?)
 
This same system using a SATA disk might look very
different, because there are 2 additional queues at
work, and (expect) NCQ. OK, so the easy way around
this is to build in a substantial guard band... no
problem, but if you get above about a second, then
you aren't much different than the B_FAILFAST solution
even though...

Fortunately, latencies grow without bound after N*, the saturation point,
so one can distinguish overloads (insanely bad latency and response time)
from normal mismanagement (single orders of magnitude, base 10 (;-))

 
 2. The algorithm *must* be computationally efficient.
We are looking down the tunnel at I/O systems that can
deliver on the order of 5 Million iops.  We really won't
have many (any?) spare cycles to play with.

Ok, I make it two comparisons and a subtract at the decision point,
but a lot of precalculation in user-space, over time.  Very similar
to the IBM mainframe experience with goal-directed management.
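
Roughly (hypothetical sketch -- the threshold names and the split
between user-space precalculation and the in-kernel hot path are mine):

#include <stdint.h>

enum io_verdict { IO_OK, IO_SLOW, IO_OVERLOAD };

/* thresholds precomputed in user space from observed latency history */
struct io_thresholds {
        int64_t slow_us;        /* normal mismanagement: ~10x baseline */
        int64_t overload_us;    /* saturation: latency growing without bound */
};

static enum io_verdict
io_check(const struct io_thresholds *t, int64_t start_us, int64_t now_us)
{
        int64_t lat = now_us - start_us;        /* the subtract */

        if (lat >= t->overload_us)              /* comparison one */
                return (IO_OVERLOAD);
        if (lat >= t->slow_us)                  /* comparison two */
                return (IO_SLOW);
        return (IO_OK);
}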

 
   The second is for resource management, where one throttles
 disk-hog projects when one discovers latency growing without
 bound on disk saturation, and the third is in case of a fault
 other than the above.
   
 
 
 Resource management is difficult when you cannot directly attribute
 physical I/O to a process.

Agreed: we may need a way to associate logical I/Os with the
project which authored them. 

 
   For the latter to work well, I'd like to see the resource management
 and fast/slow mirror adaptation be something one turns on explicitly,
 because then when FMA discovered that you in fact have a fast/slow
 mirror or a Dr. Evil program saturating the array, the fix
 could be to notify the sysadmin that they had a problem and
 suggesting built-in tools to ameliorate it.   
 
 
 Agree 100%.
 
  
 Ian Collins writes:  

 One solution (again, to be used with a remote mirror) is the three 
 way mirror.  If two devices are local and one remote, data is safe 
 once the two local writes return.  I guess the issue then changes 
 from is my data safe to how safe is my data.  I would be 
 reluctant to deploy a remote mirror device without local redundancy, 
 so this probably won't be an uncommon setup.  There would have to be 
 an acceptable window of risk when local data isn't replicated.
 


   And in this case too, I'd prefer the sysadmin provide the information
 to ZFS about what she wants, and have the system adapt to it, and
 report how big the risk window is.

   This would effectively change the FMA behavior, you understand, so 
 as to have it report failures to complete the local writes in time t0 
 and remote in time t1, much as the resource management or fast/slow 
 cases would
 need to be visible to FMA.
   
 
 
 I think this can be reasonably accomplished within the scope of FMA.
 Perhaps we should pick that up on fm-discuss?
 
 But I think the bigger problem is that unless you can solve for the general
 case, you *will* get nailed.  I might even argue that we need a way for
 storage devices to notify hosts of their characteristics, which would
 require protocol adoption and would take years to implement.

Fortunately, the critical metric, latency, is easy to measure.  Noisy!
Indeed, very noisy, but easy for specific cases, as noted above. The
general case you describe below is indeed harder. I suspect we
may need to statically annotate certain devices with critical behavior
information... 


 Consider two 

Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread Miles Nordin
 dc == David Collier-Brown [EMAIL PROTECTED] writes:

dc one discovers latency growing without bound on disk
dc saturation,

yeah, ZFS needs the same thing just for scrub.

I guess if the disks don't let you tag commands with priorities, then
you have to run them at slightly below max throughput in order to QoS
them.

It's sort of like network QoS, but not quite, because: 

  (a) you don't know exactly how big the ``pipe'' is, only
  approximately, 

  (b) you're not QoS'ing half of a bidirectional link---you get
  instant feedback of how long it took to ``send'' each ``packet''
  that you don't get with network QoS, and

  (c) all the fabrics are lossless, so while there are queues which
  undesirably fill up during congestion, these queues never drop
  ``packets'' but instead exert back-pressure all the way up to
  the top of the stack.

I'm surprised we survive as well as we do without disk QoS.  Are the
storage vendors already doing it somehow?




Re: [zfs-discuss] Sidebar to ZFS Availability discussion

2008-08-31 Thread Richard Elling
Miles Nordin wrote:
 dc == David Collier-Brown [EMAIL PROTECTED] writes:
 

 dc one discovers latency growing without bound on disk
 dc saturation,

 yeah, ZFS needs the same thing just for scrub.
   

ZFS already schedules scrubs at a low priority.  However, once the
iops leave ZFS's queue, they can't be rescheduled by ZFS.
 I guess if the disks don't let you tag commands with priorities, then
 you have to run them at slightly below max throughput in order to QoS
 them.

 It's sort of like network QoS, but not quite, because: 

   (a) you don't know exactly how big the ``pipe'' is, only
   approximately, 

   (b) you're not QoS'ing half of a bidirectional link---you get
   instant feedback of how long it took to ``send'' each ``packet''
   that you don't get with network QoS, and

   (c) all the fabrics are lossless, so while there are queues which
   undesirably fill up during congestion, these queues never drop
   ``packets'' but instead exert back-pressure all the way up to
   the top of the stack.

 I'm surprised we survive as well as we do without disk QoS.  Are the
 storage vendors already doing it somehow?
   

Excellent question.  I hope someone will pipe up with an
answer.  In my experience, they get by through overprovisioning.
But I predict that SSDs will render this question moot, at least
for another generation or so.
 -- richard
