Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 12:00 -0700, Richard Elling wrote:
> 2. The algorithm *must* be computationally efficient.  We are looking
> down the tunnel at I/O systems that can deliver on the order of 5
> Million iops.  We really won't have many (any?) spare cycles to play with.

If you pick the constants carefully (powers of two) you can do the TCP RTT + variance estimation using only a handful of shifts, adds, and subtracts.

> In both of these cases, the solutions imply multi-minute timeouts are
> required to maintain a stable system.

Again, there are different uses for timeouts:

 1) how long should we wait on an ordinary request before deciding to try plan B and go elsewhere (a la B_FAILFAST)

 2) how long should we wait (while trying all alternatives) before declaring an overall failure and giving up.

The RTT estimation approach is really only suitable for the former, where you have some alternatives available (retransmission in the case of TCP; trying another disk in the case of mirrors, etc.).  When you've tried all the alternatives and nobody's responding, there's no substitute for just retrying for a long time.

					- Bill
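For the archives, the estimator being referred to is the classic smoothed-latency-plus-mean-deviation scheme.  A minimal sketch in C (illustrative names and scaling, not taken from any ZFS or TCP source; first-sample initialization omitted) looks roughly like this:

    /*
     * Smoothed latency + mean deviation using only shifts, adds, and
     * subtracts.  srtt is kept scaled by 8 and var scaled by 4, so the
     * 1/8 and 1/4 gains become shifts; all values share one time unit.
     */
    #include <stdint.h>

    struct lat_est {
            int64_t srtt;   /* smoothed latency, scaled by 8 */
            int64_t var;    /* mean deviation, scaled by 4 */
    };

    /* Feed one measured service time; returns a "try plan B" deadline. */
    static int64_t
    lat_est_update(struct lat_est *le, int64_t measured)
    {
            int64_t delta = measured - (le->srtt >> 3);

            le->srtt += delta;              /* srtt = 7/8 srtt + 1/8 m */
            if (delta < 0)
                    delta = -delta;
            delta -= (le->var >> 2);
            le->var += delta;               /* var = 3/4 var + 1/4 |err| */

            /* deadline = srtt + 4*var; var already carries the 4x scale */
            return ((le->srtt >> 3) + le->var);
    }

Whether a deadline like that maps cleanly onto a B_FAILFAST-style "try plan B" decision is, of course, the open question in this thread.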
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
On Sun, 2008-08-31 at 15:03 -0400, Miles Nordin wrote:
> It's sort of like network QoS, but not quite, because:
>
> (a) you don't know exactly how big the ``pipe'' is, only approximately,

In an IP network, end nodes generally know no more than the pipe size of the first hop -- and in some cases (such as true CSMA networks like classical ethernet or wireless) only have an upper bound on the pipe size.  Beyond that, they can only estimate the characteristics of the rest of the network by observing its behavior -- all they get is end-to-end latency, and *maybe* a 'congestion observed' mark set by an intermediate system.

> (c) all the fabrics are lossless, so while there are queues which
>     undesirably fill up during congestion, these queues never drop
>     ``packets'' but instead exert back-pressure all the way up to
>     the top of the stack.

Hmm.  I don't think the back pressure makes it all the way up to zfs (the top of the block storage stack) except as added latency.  (On the other hand, if it did, zfs could schedule around it both for reads and writes, avoiding pouring more work on already-congested paths..)

> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?

I bet that (as with networking) in many/most cases overprovisioning the hardware and running at lower average utilization is cheaper in practice than running close to the edge and spending a lot of expensive expert time monitoring performance and tweaking QoS parameters.
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
bs == Bill Sommerfeld [EMAIL PROTECTED] writes:

bs In an ip network, end nodes generally know no more than the
bs pipe size of the first hop -- and in some cases (such as true
bs CSMA networks like classical ethernet or wireless) only have
bs an upper bound on the pipe size.

yeah, but the most complicated and well-studied queueing disciplines (like, everything implemented in ALTQ and, I think, everything implemented by the two different Cisco queueing frameworks (the CBQ process-switched one, and the diffserv-like cat6500 ASIC-switched one)) are (a) hop-by-hop, so the algorithm one discusses only applies to a single hop, a single transmit queue, never to a whole path, and (b) assume a unidirectional link of known fixed size, not a broadcast link or token ring or anything like that.

For wireless they are not using the fancy algorithms.  They're doing really primitive things like ``unsolicited grants''---basically just TDMA channels.

I wouldn't think of ECN as part of QoS exactly, because it separates so cleanly from your choice of queue discipline.

bs hmm. I don't think the back pressure makes it all the way up
bs to zfs

I guess I was thinking of the lossless fabrics, which might change some of the assumptions behind designing a scheduler that went into IP QoS.

For example, most of the IP QoS systems divide the usual one-big-queue into many smaller queues.  A ``classifier'' picks some packets as pink ones and some as blue, and assigns them thusly to queues, and they always get classified to the end of the queue.  The ``scheduler'' then decides from which queue to take the next packet.  The primitive QoS in Ethernet chips might give you 4 queues that are either strict-priority or weighted-round-robin.  Link-sharing schedulers like CBQ or HFSC make a hierarchy of queues where, to the extent that they're work-conserving, child queues borrow unused transmission slots from their ancestors.  Or there's the flat set of 256 hash-bucket queues for WFQ, which just tries to separate one job from another.

But no matter which of those you choose, within each of the smaller queues you get an orthogonal choice of RED or FIFO.  There's no equivalent of RED or tail-drop FIFO with queues in storage networks because there is no packet dropping.  This confuses the implementation of the upper queueing discipline, because what happens when one of the small queues fills up?  How can you push up the stack, ``I will not accept another CDB if I would classify it as a Pink CDB, because the Pink queue is full.  I will still accept Blue CDB's though.''  Needing to express this destroys the modularity of the IP QoS model.  We can only say ``block---no more CDB's accepted,'' but that defeats the whole purpose of the QoS!  So how do we say ``no more CDB's of the pink kind''?  With normal hop-by-hop QoS, I don't think we can.

This inexpressibility of ``no more pink CDB's'' is the same reason enterprise Ethernet switches never actually use the gigabit ethernet ``flow control'' mechanism.  Yeah, they negotiate flow control and obey received flow control signals, but they never _assert_ a flow control signal, at least not for normal output-queue congestion, because this would block reception of packets that would get switched to uncongested output ports, too.  Proper enterprise switches would assert flow control only for rare pathological cases like backplane saturation or cheap oversubscribed line cards.
No matter what overzealous powerpoint monkeys claim, CEE/FCoE is _not_ going to use ``pause frames.''

I guess you're right that some of the ``queues'' in storage are sort of arbitrarily sized, like the write queue which could take up the whole buffer cache, so back pressure might not be the right way to imagine it.
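For illustration, here is a toy sketch in C of the classifier / per-class-queue model and the admission problem described above.  The ``pink''/``blue'' classes and every name in it are just this thread's example, not code from any real storage or network stack:

    /*
     * Toy model: a classifier sorts requests into fixed-depth per-class
     * queues; a scheduler would drain them strict-priority or weighted
     * round robin.  Admission is where the lossless-fabric problem shows
     * up: when one class's queue is full, the only thing the block
     * interface lets us say upstream is "no more CDBs at all", not
     * "no more pink CDBs".
     */
    #include <stdbool.h>
    #include <stddef.h>

    #define QDEPTH 64

    enum qos_class { QOS_PINK, QOS_BLUE, QOS_NCLASS };

    struct class_q {
            void   *req[QDEPTH];
            size_t  head, tail, len;
    };

    static struct class_q classq[QOS_NCLASS];

    /* Classifier: a real one would inspect the CDB; here, just a stub. */
    static enum qos_class
    classify(const void *req)
    {
            (void)req;
            return (QOS_BLUE);
    }

    static bool
    qos_admit(void *req)
    {
            struct class_q *q = &classq[classify(req)];

            if (q->len == QDEPTH) {
                    /*
                     * An IP router would drop (or RED-drop) here; a
                     * lossless fabric can only back-pressure, which
                     * blocks the other classes too.
                     */
                    return (false);
            }
            q->req[q->tail] = req;
            q->tail = (q->tail + 1) % QDEPTH;
            q->len++;
            return (true);
    }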
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
Hello Miles,

Sunday, August 31, 2008, 8:03:45 PM, you wrote:

dc == David Collier-Brown [EMAIL PROTECTED] writes:

MN dc one discovers latency growing without bound on disk
MN dc saturation,

MN yeah, ZFS needs the same thing just for scrub.

MN I guess if the disks don't let you tag commands with priorities, then
MN you have to run them at slightly below max throughput in order to QoS
MN them.

MN It's sort of like network QoS, but not quite, because:

MN (a) you don't know exactly how big the ``pipe'' is, only
MN     approximately,

MN (b) you're not QoS'ing half of a bidirectional link---you get
MN     instant feedback of how long it took to ``send'' each ``packet''
MN     that you don't get with network QoS, and

MN (c) all the fabrics are lossless, so while there are queues which
MN     undesirably fill up during congestion, these queues never drop
MN     ``packets'' but instead exert back-pressure all the way up to
MN     the top of the stack.

MN I'm surprised we survive as well as we do without disk QoS.  Are the
MN storage vendors already doing it somehow?

I don't know the details and haven't actually tested it, but EMC provides QoS in their Clariion line...

-- 
Best regards,
 Robert Milkowski                       mailto:[EMAIL PROTECTED]
                                        http://milek.blogspot.com
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
Richard Elling wrote:
[what usually concerns me is that the software people spec'ing device drivers don't seem to have much training in control systems, which is what is being designed]

Or try to develop safety-critical systems based on best effort instead of first developing a clear and verifiable idea of what is required for correct functioning.

The feedback loop is troublesome because there is usually at least one queue, perhaps 3 queues between the host and the media.  At each queue, iops can be reordered.

And that's evil...  A former colleague did a study of how much reordering could be done and still preserve correctness as his master's thesis, and it was notable how easily one could mess up!

As Sommerfeld points out, we see the same sort of thing in IP networks, but two things bother me about that:

1. Latency for disk seeks, rotates, and cache hits looks very different from random IP network latencies.  For example, a TNF trace I recently examined for an IDE disk (no queues which reorder) running a single-thread read workload showed the following data:

    block      size   latency (ms)
    446464     48     1.18
    7180944    16     13.82  (long seek?)
    7181072    112    3.65   (some rotation?)
    7181184    112    2.16
    7181296    16     0.53   (track cache?)
    446512     16     0.57   (track cache?)

This same system using a SATA disk might look very different, because there are 2 additional queues at work, and (expect) NCQ.  OK, so the easy way around this is to build in a substantial guard band... no problem, but if you get above about a second, then you aren't much different than the B_FAILFAST solution even though...

Fortunately, latencies grow without bound after N*, the saturation point, so one can distinguish overloads (insanely bad latency response time) from normal mismanagement (single orders of magnitude, base 10 (;-)).

2. The algorithm *must* be computationally efficient.  We are looking down the tunnel at I/O systems that can deliver on the order of 5 Million iops.  We really won't have many (any?) spare cycles to play with.

Ok, I make it two comparisons and a subtract at the decision point, but a lot of precalculation in user-space, over time.  Very similar to the IBM mainframe experience with goal-directed management.

The second is for resource management, where one throttles disk-hog projects when one discovers latency growing without bound on disk saturation, and the third is in case of a fault other than the above.

Resource management is difficult when you cannot directly attribute physical I/O to a process.

Agreed: we may need a way to associate logical I/Os with the project which authored them.

For the latter to work well, I'd like to see the resource management and fast/slow mirror adaptation be something one turns on explicitly, because then when FMA discovered that you in fact have a fast/slow mirror or a Dr. Evil program saturating the array, the fix could be to notify the sysadmin that they had a problem, suggesting built-in tools to ameliorate it.

Agree 100%.

Ian Collins writes:
One solution (again, to be used with a remote mirror) is the three-way mirror.  If two devices are local and one remote, data is safe once the two local writes return.  I guess the issue then changes from "is my data safe" to "how safe is my data".  I would be reluctant to deploy a remote mirror device without local redundancy, so this probably won't be an uncommon setup.  There would have to be an acceptable window of risk when local data isn't replicated.
And in this case too, I'd prefer the sysadmin provide the information to ZFS about what she wants, and have the system adapt to it, and report how big the risk window is.

This would effectively change the FMA behavior, you understand, so as to have it report failures to complete the local writes in time t0 and remote in time t1, much as the resource management or fast/slow cases would need to be visible to FMA.

I think this can be reasonably accomplished within the scope of FMA.  Perhaps we should pick that up on fm-discuss?  But I think the bigger problem is that unless you can solve for the general case, you *will* get nailed.  I might even argue that we need a way for storage devices to notify hosts of their characteristics, which would require protocol adoption and would take years to implement.  Fortunately, the critical metric, latency, is easy to measure.

Noisy!

Indeed, very noisy, but easy for specific cases, as noted above.  The general case you describe below is indeed harder.  I suspect we may need to statically annotate certain devices with critical behavior information...  Consider two
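A minimal sketch of the ``two comparisons and a subtract'' decision point mentioned above, with the thresholds assumed to be precomputed elsewhere (e.g. in user-space, from observed latency distributions); all names are illustrative, not ZFS or FMA code:

    /*
     * Hot-path check: one subtract, two comparisons.  The thresholds
     * are precalculated off the hot path, distinguishing "overload"
     * (latency growing without bound past the saturation point) from
     * merely "degraded" (an order of magnitude or so worse than
     * expected).
     */
    #include <stdint.h>

    struct lat_policy {
            int64_t degraded_ns;    /* precomputed: ~10x expected latency */
            int64_t overload_ns;    /* precomputed: saturation threshold */
    };

    enum lat_verdict { LAT_OK, LAT_DEGRADED, LAT_OVERLOAD };

    static enum lat_verdict
    lat_check(const struct lat_policy *p, int64_t issued_ns, int64_t now_ns)
    {
            int64_t lat = now_ns - issued_ns;       /* the subtract */

            if (lat > p->overload_ns)               /* comparison 1 */
                    return (LAT_OVERLOAD);
            if (lat > p->degraded_ns)               /* comparison 2 */
                    return (LAT_DEGRADED);
            return (LAT_OK);
    }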
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
dc == David Collier-Brown [EMAIL PROTECTED] writes:

dc one discovers latency growing without bound on disk
dc saturation,

yeah, ZFS needs the same thing just for scrub.

I guess if the disks don't let you tag commands with priorities, then you have to run them at slightly below max throughput in order to QoS them.

It's sort of like network QoS, but not quite, because:

(a) you don't know exactly how big the ``pipe'' is, only approximately,

(b) you're not QoS'ing half of a bidirectional link---you get instant feedback of how long it took to ``send'' each ``packet'' that you don't get with network QoS, and

(c) all the fabrics are lossless, so while there are queues which undesirably fill up during congestion, these queues never drop ``packets'' but instead exert back-pressure all the way up to the top of the stack.

I'm surprised we survive as well as we do without disk QoS.  Are the storage vendors already doing it somehow?
Re: [zfs-discuss] Sidebar to ZFS Availability discussion
Miles Nordin wrote:
> dc == David Collier-Brown [EMAIL PROTECTED] writes:
>
> dc one discovers latency growing without bound on disk
> dc saturation,
>
> yeah, ZFS needs the same thing just for scrub.

ZFS already schedules scrubs at a low priority.  However, once the iops leave ZFS's queue, they can't be rescheduled by ZFS.

> I guess if the disks don't let you tag commands with priorities, then
> you have to run them at slightly below max throughput in order to QoS
> them.
>
> It's sort of like network QoS, but not quite, because:
>
> (a) you don't know exactly how big the ``pipe'' is, only approximately,
>
> (b) you're not QoS'ing half of a bidirectional link---you get
>     instant feedback of how long it took to ``send'' each ``packet''
>     that you don't get with network QoS, and
>
> (c) all the fabrics are lossless, so while there are queues which
>     undesirably fill up during congestion, these queues never drop
>     ``packets'' but instead exert back-pressure all the way up to
>     the top of the stack.
>
> I'm surprised we survive as well as we do without disk QoS.  Are the
> storage vendors already doing it somehow?

Excellent question.  I hope someone will pipe up with an answer.  In my experience, they get by through overprovisioning.  But I predict that SSDs will render this question moot, at least for another generation or so.
 -- richard