Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Guy Watkins wrote:
> } -----Original Message-----
> } From: [EMAIL PROTECTED] [mailto:linux-raid-[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
> } Sent: Thursday, July 12, 2007 1:35 PM
> } To: [EMAIL PROTECTED]
> } Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
> } Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
> }
> } On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
> } [EMAIL PROTECTED] wrote:
> } On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> }
> } All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
> }
> } OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?
> }
> } There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this.
> }
> } In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs.
> }
> } I don't have a problem with the sysadmin *telling* the system the other end of that fiber cable has characteristics X, Y and Z. What worried me was that it looked like conflating "device reported writeback cache" with "device actually has enough battery/hamster/whatever backup to flush everything on a power loss".
> } (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's a lot of battery..)
>
> Most hardware RAID devices I know of use the battery to save the cache while the power is off. When the power is restored it flushes the cache to disk. If the power failure lasts longer than the batteries then the cache data is lost, but the batteries last 24+ hours I believe.

Most mid-range and high end arrays actually use that battery to ensure that data is all written out to permanent media when the power is lost. I won't go into how that is done, but it clearly would not be a safe assumption to assume that your power outage is only going to be a certain length of time (and if not, you would lose data).

> A big EMC array we had had enough battery power to power about 400 disks while the 16 Gig of cache was flushed. I think EMC told me the batteries would last about 20 minutes. I don't recall if the array was usable during the 20 minutes. We never tested a power failure.
>
> Guy

I worked on the team that designed that big array. At one point, we had an array on loan to a partner who tried to put it in a very small data center. A few weeks later, they brought in an electrician who needed to run more power into the center. It was pretty funny - he tried to find a power button to turn it off and then just walked over and dropped power trying to get the Symm to turn off.

When that didn't work, he was really, really confused ;-)

ric
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
> [EMAIL PROTECTED] wrote:
> > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> > > All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
> >
> > OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?
>
> There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this.
>
> In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs.

I don't have a problem with the sysadmin *telling* the system the other end of that fiber cable has characteristics X, Y and Z. What worried me was that it looked like conflating "device reported writeback cache" with "device actually has enough battery/hamster/whatever backup to flush everything on a power loss".

(My back-of-envelope calculation shows for a worst-case of needing a 1ms seek for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's a lot of battery..)
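[For reference, that back-of-envelope number worked through as a standalone sketch; the 1 GiB cache, 4 KiB blocks and 1 ms-per-block worst case are just the figures assumed above, not measurements from any real array:]

    /* drain_time.c - rough worst-case time to drain a write cache */
    #include <stdio.h>

    int main(void)
    {
            long cache_bytes = 1L << 30;        /* 1 GiB of dirty cache      */
            long block_bytes = 4096;            /* 4 KiB per cached block    */
            double ms_per_block = 1.0;          /* worst case: one seek each */

            long blocks = cache_bytes / block_bytes;          /* 262,144     */
            double seconds = blocks * ms_per_block / 1000.0;  /* ~262 s      */

            printf("%ld blocks, ~%.0f s (~%.1f min) to drain\n",
                   blocks, seconds, seconds / 60.0);
            return 0;
    }

[Which comes out to roughly 262 seconds, i.e. the "4 1/2 minutes" quoted above.]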
RE: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
} -----Original Message-----
} From: [EMAIL PROTECTED] [mailto:linux-raid-[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
}
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
} [EMAIL PROTECTED] wrote:
} On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
}
} All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
}
} OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?
}
} There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this.
}
} In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs.
}
} I don't have a problem with the sysadmin *telling* the system the other end of that fiber cable has characteristics X, Y and Z. What worried me was that it looked like conflating "device reported writeback cache" with "device actually has enough battery/hamster/whatever backup to flush everything on a power loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync. That's a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while the power is off. When the power is restored it flushes the cache to disk. If the power failure lasts longer than the batteries then the cache data is lost, but the batteries last 24+ hours I believe.

A big EMC array we had had enough battery power to power about 400 disks while the 16 Gig of cache was flushed. I think EMC told me the batteries would last about 20 minutes. I don't recall if the array was usable during the 20 minutes. We never tested a power failure.

Guy
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> > All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
>
> OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?

There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI has similar queries) to see what kind of device you are talking to. I am not sure it is worth the trouble to do any automatic detection/handling of this.

In this specific case, it is more a case of when you attach a high end (or mid-tier) device to a server, you should configure it without barriers for its exported LUNs.

ric
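[As one small illustration of the kind of query Ric mentions: the sd driver exposes the cache mode a SCSI/SATA disk reports through sysfs. A minimal userspace sketch follows; the 0:0:0:0 SCSI address is a placeholder chosen for the example, and the attribute is only present for sd-driven disks, so treat the exact path as an assumption:]

    /* cache_type.c - print the write-cache mode sd reports for one disk */
    #include <stdio.h>

    int main(void)
    {
            const char *path = "/sys/class/scsi_disk/0:0:0:0/cache_type";
            char buf[64];
            FILE *f = fopen(path, "r");

            if (!f) {
                    perror(path);
                    return 1;
            }
            if (fgets(buf, sizeof(buf), f))
                    printf("cache_type: %s", buf);  /* e.g. "write back" */
            fclose(f);
            return 0;
    }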
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Tejun Heo wrote:
> [ cc'ing Ric Wheeler for storage array thingie. Hi, whole thread is at http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

I am actually on the list, just really, really far behind in the thread ;-)

> Hello,
>
> [EMAIL PROTECTED] wrote:
> > but when you consider the self-contained disk arrays it's an entirely different story. you can easily have a few gig of cache and a complete OS pretending to be a single drive as far as you are concerned. and the price of such devices is plummeting (in large part thanks to Linux moving into this space), you can now readily buy a 10TB array for $10k that looks like a single drive.
>
> Don't those thingies usually have NV cache or backed by battery such that ORDERED_DRAIN is enough?

All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)

The size of the NV cache can run from a few gigabytes up to hundreds of gigabytes, so you really don't want to invoke cache flushes here if you can avoid it.

For this class of device, you can get the required in order completion and data integrity semantics as long as we send the IO's to the device in the correct order.

> The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway and can achieve most of performance gain if the storage plays it smart.
>
> Thanks.

I am not really sure that you need this ORDERED_DRAIN for big arrays...

ric
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> > All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
>
> OK, I'll bite - how does the kernel know whether the other end of that fiberchannel cable is attached to a DMX-3 or to some no-name product that may not have the same assurances? Is there an "I'm a high-end array" bit in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write back caching. The kernel automatically selects ORDERED_DRAIN in such a case.

--
tejun
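[Roughly, this is the selection Tejun is describing: the SCSI disk driver reads the caching mode page and picks an ordered mode from what the disk claims. A hedged sketch of that decision follows; the QUEUE_ORDERED_* names are the block-layer constants of that era, while the helper and its arguments are illustrative stand-ins, not the actual sd code:]

    #include <linux/blkdev.h>

    /* Illustrative only: map what the disk reports (write cache enabled?
     * FUA usable?) onto a block-layer ordered mode, sd-style.
     */
    static unsigned int pick_ordered_mode(int wce, int dpofua)
    {
            if (!wce)
                    return QUEUE_ORDERED_DRAIN;     /* write-through: drain suffices  */
            if (dpofua)
                    return QUEUE_ORDERED_DRAIN_FUA; /* pre-flush + FUA barrier write  */
            return QUEUE_ORDERED_DRAIN_FLUSH;       /* flush + write + flush          */
    }

[If memory serves, the chosen mode was then handed to the queue with blk_queue_ordered().]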
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Ric Wheeler wrote:
> > Don't those thingies usually have NV cache or backed by battery such that ORDERED_DRAIN is enough?
>
> All of the high end arrays have non-volatile cache (read, on power loss, it is a promise that it will get all of your data out to permanent storage). You don't need to ask this kind of array to drain the cache. In fact, it might just ignore you if you send it that kind of request ;-)
>
> The size of the NV cache can run from a few gigabytes up to hundreds of gigabytes, so you really don't want to invoke cache flushes here if you can avoid it.
>
> For this class of device, you can get the required in order completion and data integrity semantics as long as we send the IO's to the device in the correct order.

Thanks for clarification.

> > The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway and can achieve most of performance gain if the storage plays it smart.
>
> I am not really sure that you need this ORDERED_DRAIN for big arrays...

ORDERED_DRAIN is to properly order requests from host request queue (elevator/iosched). We can make it finer-grained but we do need to put some ordering restrictions.

--
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
[ cc'ing Ric Wheeler for storage array thingie. Hi, whole thread is at http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely different story. you can easily have a few gig of cache and a complete OS pretending to be a single drive as far as you are concerned. and the price of such devices is plummeting (in large part thanks to Linux moving into this space), you can now readily buy a 10TB array for $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway and can achieve most of performance gain if the storage plays it smart.

Thanks.

--
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
> Don't those thingies usually have NV cache or backed by battery such that ORDERED_DRAIN is enough?

Probably *most* do, but do you really want to bet the user's data on it?

> The problem is that the interface between the host and a storage device (ATA or SCSI) is not built to communicate that kind of information (grouped flush, relaxed ordering...). I think battery backed ORDERED_DRAIN combined with fine-grained host queue flush would be pretty good. It doesn't require some fancy new interface which isn't gonna be used widely anyway and can achieve most of performance gain if the storage plays it smart.

Yes, that would probably be pretty good. But how do you get the storage device to *reliably* tell the truth about what it actually implements? (Consider the number of devices that downright lie about their implementation of cache flushing)
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> On Thu, May 31 2007, Phillip Susi wrote:
> > Jens Axboe wrote:
> > > No, Stefan is right, the barrier is both an ordering and integrity constraint. If a driver completes a barrier request before that request and previously submitted requests are on STABLE storage, then it violates that principle. Look at the code and the various ordering options.
> >
> > I am saying that is the wrong thing to do. Barrier should be about ordering only. So long as the order they hit the media is maintained, the order the requests are completed in can change. barrier.txt bears
>
> But you can't guarantee ordering without flushing the data out as well. It all depends on the type of cache on the device, of course. If you look at the ordinary sata/ide drive with write back caching, you can't just issue the requests in order and pray that the drive cache will make it to platter.
>
> If you don't have write back caching, or if the cache is battery backed and thus guaranteed to never be lost, maintaining order is naturally enough.

Do I misread this? If ordered doesn't reach all the way to the platter then there will be failure modes which result in order not preserved. Battery backed cache doesn't prevent failures between the cache and the platter.

--
bill davidsen [EMAIL PROTECTED]
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, May 30 2007, Phillip Susi wrote:
> > That would be exactly how I understand Documentation/block/barrier.txt:
> >
> > In other words, I/O barrier requests have the following two properties.
> > 1. Request ordering
> > ...
> > 2. Forced flushing to physical medium
> >
> > So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order.
>
> I think you misinterpret this, and it probably could be worded a bit better. The barrier request is about constraining the order. The forced flushing is one means to implement that constraint. The other alternative mentioned there is to use ordered tags.
>
> The key part there is "requests actually get written to non-volatile medium _in order_", not before the request completes, which would be synchronous IO.

No, Stefan is right, the barrier is both an ordering and integrity constraint. If a driver completes a barrier request before that request and previously submitted requests are on STABLE storage, then it violates that principle. Look at the code and the various ordering options.

--
Jens Axboe
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
2007/5/30, Phillip Susi [EMAIL PROTECTED]:
> Stefan Bader wrote:
> > Since drive a supports barrier requests we don't get -EOPNOTSUPP but the request with block y might get written before block x since the disks are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper.
>
> The device mapper needs to ensure that ALL underlying devices get a barrier request when one comes down from above, even if it has to construct zero length barriers to send to most of them.

And somehow also make sure all of the barriers have been processed before returning the barrier that came in. Plus it would have to queue all mapping requests until the barrier is done (if strictly acting according to barrier.txt).

But I am wondering a bit whether the requirements on barriers are really as tight as described in Tejun's document (the barrier request is only started if everything before it is safe, the barrier itself isn't returned until it is safe, too, and all requests after the barrier aren't started before the barrier is done).

Is it really necessary to defer any further requests until the barrier has been written to safe storage? Or would it be sufficient to guarantee that, if a barrier request returns, everything up to and including the barrier is on safe storage?

Stefan
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> No, Stefan is right, the barrier is both an ordering and integrity constraint. If a driver completes a barrier request before that request and previously submitted requests are on STABLE storage, then it violates that principle. Look at the code and the various ordering options.

I am saying that is the wrong thing to do. Barrier should be about ordering only. So long as the order they hit the media is maintained, the order the requests are completed in can change. barrier.txt bears this out:

    Requests in ordered sequence are issued in order, but not required to
    finish in order. Barrier implementation can handle out-of-order
    completion of ordered sequence. IOW, the requests MUST be processed in
    order but the hardware/software completion paths are allowed to reorder
    completion notifications - eg. current SCSI midlayer doesn't preserve
    completion order during error handling.
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31 2007, Phillip Susi wrote:
> Jens Axboe wrote:
> > No, Stefan is right, the barrier is both an ordering and integrity constraint. If a driver completes a barrier request before that request and previously submitted requests are on STABLE storage, then it violates that principle. Look at the code and the various ordering options.
>
> I am saying that is the wrong thing to do. Barrier should be about ordering only. So long as the order they hit the media is maintained, the order the requests are completed in can change. barrier.txt bears

But you can't guarantee ordering without flushing the data out as well. It all depends on the type of cache on the device, of course. If you look at the ordinary sata/ide drive with write back caching, you can't just issue the requests in order and pray that the drive cache will make it to platter.

If you don't have write back caching, or if the cache is battery backed and thus guaranteed to never be lost, maintaining order is naturally enough. Or if the drive can do ordered queued commands, you can relax the flushing (again depending on the cache type, you may need to take different paths).

> Requests in ordered sequence are issued in order, but not required to finish in order. Barrier implementation can handle out-of-order completion of ordered sequence. IOW, the requests MUST be processed in order but the hardware/software completion paths are allowed to reorder completion notifications - eg. current SCSI midlayer doesn't preserve completion order during error handling.

If you carefully re-read that paragraph, then it just tells you that the software implementation can deal with reordered completions. It doesn't relax the constraints on ordering and integrity AT ALL.

--
Jens Axboe
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Stefan Bader wrote:
> 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
> > Stefan Bader wrote:
> > > Since drive a supports barrier requests we don't get -EOPNOTSUPP but the request with block y might get written before block x since the disks are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper.
> >
> > The device mapper needs to ensure that ALL underlying devices get a barrier request when one comes down from above, even if it has to construct zero length barriers to send to most of them.
>
> And somehow also make sure all of the barriers have been processed before returning the barrier that came in. Plus it would have to queue all mapping requests until the barrier is done (if strictly acting according to barrier.txt).
>
> But I am wondering a bit whether the requirements on barriers are really as tight as described in Tejun's document (the barrier request is only started if everything before it is safe, the barrier itself isn't returned until it is safe, too, and all requests after the barrier aren't started before the barrier is done).
>
> Is it really necessary to defer any further requests until the barrier has been written to safe storage? Or would it be sufficient to guarantee that, if a barrier request returns, everything up to and including the barrier is on safe storage?

Well, what's described in barrier.txt is the current implemented semantics and what filesystems expect, so we can't change it underneath them, but we definitely can introduce new, more relaxed variants. One thing we should bear in mind, though, is that harddisks don't have humongous caches or a very smart controller / instruction set. No matter how relaxed an interface the block layer provides, in the end it just has to issue a whole-sale FLUSH CACHE on the device to guarantee data ordering on the media.

IMHO, we can do better by paying more attention to how we do things in the request queue, which can be deeper and more intelligent than the device queue.

Thanks.

--
tejun
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Fri, 1 Jun 2007, Tejun Heo wrote:
> but one thing we should bear in mind is that harddisks don't have humongous caches or a very smart controller / instruction set. No matter how relaxed an interface the block layer provides, in the end it just has to issue a whole-sale FLUSH CACHE on the device to guarantee data ordering on the media.

if you are talking about individual drives you may be right for the moment (but 16M cache on drives is a _lot_ larger than people imagined would be there a few years ago)

but when you consider the self-contained disk arrays it's an entirely different story. you can easily have a few gig of cache and a complete OS pretending to be a single drive as far as you are concerned. and the price of such devices is plummeting (in large part thanks to Linux moving into this space), you can now readily buy a 10TB array for $10k that looks like a single drive.

David Lang
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
> > in-flight I/O to go to zero?
>
> Something like that is needed for some dm targets to support barriers. (We needn't always wait for *all* in-flight I/O.)
>
> When faced with -EOPNOTSUPP, do all callers fall back to a sync in the places a barrier would have been used, or are there any more sophisticated strategies attempting to optimise code without barriers?

If I didn't misunderstand, the idea is that no caller will face an -EOPNOTSUPP in future. IOW every layer or driver somehow makes sure the right thing happens.

> An efficient I/O barrier implementation would not normally involve flushing AFAIK: dm surely wouldn't cause a higher layer to assume stronger semantics than are provided.

Seems there are at least two assumptions about what the semantics exactly _are_. Based on Documentation/block/barriers.txt I understand a barrier implies ordering and flushing.

But regardless of that, assume the (admittedly constructed) following case: you got a linear target that consists of two disks. One drive (a) supports barriers and the other one (b) doesn't. Device-mapper just maps the requests to the appropriate disk. Now the following sequence happens:

1. block x gets mapped to drive b
2. block y (with barrier) gets mapped to drive a

Since drive a supports barrier requests we don't get -EOPNOTSUPP, but the request with block y might get written before block x since the disks are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b, but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper.

Stefan
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Wed, May 30, 2007 at 11:12:37AM +0200, Stefan Bader wrote:
> it might be better to indicate -EOPNOTSUPP right from device-mapper.

Indeed we should. For support, on receipt of a barrier, dm core should send a zero-length barrier to all active underlying paths, and delay mapping any further I/O.

Alasdair
--
[EMAIL PROTECTED]
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Stefan Bader wrote:
> You got a linear target that consists of two disks. One drive (a) supports barriers and the other one (b) doesn't. Device-mapper just maps the requests to the appropriate disk. Now the following sequence happens:
>
> 1. block x gets mapped to drive b
> 2. block y (with barrier) gets mapped to drive a
>
> Since drive a supports barrier requests we don't get -EOPNOTSUPP, but the request with block y might get written before block x since the disks are independent. I guess the chances of this are quite low since at some point a barrier request will also hit drive b, but for the time being it might be better to indicate -EOPNOTSUPP right from device-mapper.

The device mapper needs to ensure that ALL underlying devices get a barrier request when one comes down from above, even if it has to construct zero length barriers to send to most of them.
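[A hedged sketch of the fan-out Phillip describes, at the bio level. This is not the actual dm code: the bio field names and WRITE_BARRIER follow the ~2.6.2x block API as best I recall, completion handling is omitted, and a real implementation would have to wait for every one of these emulated barriers to complete before acknowledging the original one:]

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/fs.h>

    /* Send an empty (zero-length) barrier to one underlying device. */
    static void send_empty_barrier(struct block_device *bdev)
    {
            struct bio *bio = bio_alloc(GFP_NOIO, 0);   /* no data pages */

            bio->bi_bdev   = bdev;
            bio->bi_sector = 0;     /* zero-length: the sector is irrelevant */
            /* bio->bi_end_io / bi_private omitted in this sketch; the caller
             * must track completion of every such bio before finishing the
             * original barrier. */
            submit_bio(WRITE_BARRIER, bio);
    }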
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Phillip Susi wrote:
> Hrm... I may have misunderstood the perspective you were talking from. Yes, when the bio is completed it must be on the media, but the filesystem should issue both requests, and then really not care when they complete. That is to say, the filesystem should not wait for block A to finish before issuing block B; it should issue both, and use barriers to make sure they hit the disk in the correct order.

Actually now that I think about it, that wasn't correct. The request CAN be completed before the data has hit the medium. The barrier just constrains the ordering of the writes, but they can still sit in the disk write back cache for some time.

Stefan Bader wrote:
> That would be exactly how I understand Documentation/block/barrier.txt:
>
> In other words, I/O barrier requests have the following two properties.
> 1. Request ordering
> ...
> 2. Forced flushing to physical medium
>
> So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order.

I think you misinterpret this, and it probably could be worded a bit better. The barrier request is about constraining the order. The forced flushing is one means to implement that constraint. The other alternative mentioned there is to use ordered tags.

The key part there is "requests actually get written to non-volatile medium _in order_", not before the request completes, which would be synchronous IO.
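[For reference, the block layer of that era expressed the two implementation strategies Phillip mentions (drain plus explicit flushing versus ordered tags) as per-queue ordered modes. The following list is reproduced from memory and is only indicative, not a verbatim copy of <linux/blkdev.h>:]

    /* Approximate ~2.6.2x ordered modes (illustrative list): */
    enum {
            QUEUE_ORDERED_NONE,         /* no barrier support advertised       */
            QUEUE_ORDERED_DRAIN,        /* drain queue; write-through cache    */
            QUEUE_ORDERED_DRAIN_FLUSH,  /* drain + pre-flush + post-flush      */
            QUEUE_ORDERED_DRAIN_FUA,    /* drain + pre-flush + FUA write       */
            QUEUE_ORDERED_TAG,          /* ordered tags do the sequencing      */
            QUEUE_ORDERED_TAG_FLUSH,    /* ordered tags + cache flushes        */
            QUEUE_ORDERED_TAG_FUA,      /* ordered tags + FUA write            */
    };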
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> What if the truth changes (as can happen with md or dm)?

You get notified in endio() that the barrier had to be emulated?

Alasdair
--
[EMAIL PROTECTED]
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> If a filesystem cares, it could 'ask' as suggested above.

What would be a good interface for asking? XFS already tests:

    bd_disk->queue->ordered == QUEUE_ORDERED_NONE

Alasdair
--
[EMAIL PROTECTED]
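[In other words, a filesystem can peek at the queue's advertised ordered mode at mount time and disable its own barrier usage up front. A minimal hedged sketch; the field names follow the ~2.6.2x block layer, and the wrapper function is made up for illustration rather than taken from XFS:]

    #include <linux/blkdev.h>
    #include <linux/genhd.h>

    /* Nonzero if the underlying queue advertises any barrier support. */
    static int queue_supports_barriers(struct block_device *bdev)
    {
            return bdev->bd_disk->queue->ordered != QUEUE_ORDERED_NONE;
    }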
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote:
> On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> > If a filesystem cares, it could 'ask' as suggested above.
>
> What would be a good interface for asking? XFS already tests:
>
>     bd_disk->queue->ordered == QUEUE_ORDERED_NONE

The side effects of removing that check are what started this whole discussion.

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Tue, May 29, 2007 at 11:25:42AM +0200, Stefan Bader wrote:
> doing a sort of suspend, issuing the barrier request, calling flush to all mapped devices and then wait for in-flight I/O to go to zero?

Something like that is needed for some dm targets to support barriers. (We needn't always wait for *all* in-flight I/O.)

When faced with -EOPNOTSUPP, do all callers fall back to a sync in the places a barrier would have been used, or are there any more sophisticated strategies attempting to optimise code without barriers?

> I am not a hundred percent sure about that but I think that just passing the barrier flag on to mapped devices might in some (maybe they are rare) cases cause a layer above to think all data is on-disk while this isn't necessarily true (see my previous post). What do you think?

An efficient I/O barrier implementation would not normally involve flushing AFAIK: dm surely wouldn't cause a higher layer to assume stronger semantics than are provided.

Alasdair
--
[EMAIL PROTECTED]
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
> 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.

The device-mapper position has always been that we require a zero-length BIO_RW_BARRIER (i.e. containing no data to read or write - or emulated, possibly device-specific) before we can provide full barrier support. (Consider multiple active paths - each must see the barrier.)

Until every device supports barriers, -EOPNOTSUPP support is required. (Consider reconfiguration of stacks of devices - barrier support is a dynamic block device property that can switch between available and unavailable at any time.)

Alasdair
--
[EMAIL PROTECTED]
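[That dynamic property is why callers still need a fallback path. A hedged sketch of what such a caller might do when a barrier write comes back -EOPNOTSUPP: write_block() is a made-up placeholder rather than a kernel interface, and blkdev_issue_flush() is used with its old two-argument form as I recall it:]

    #include <linux/blkdev.h>
    #include <linux/errno.h>

    /* Placeholder: submit one data write, optionally flagged as a barrier. */
    extern int write_block(struct block_device *bdev, struct page *page,
                           sector_t sector, int barrier);

    /* Illustrative fallback: barrier write, or flush + plain write + flush. */
    static int write_with_ordering(struct block_device *bdev,
                                   struct page *page, sector_t sector)
    {
            int err = write_block(bdev, page, sector, 1 /* barrier */);

            if (err == -EOPNOTSUPP) {                /* support just vanished  */
                    blkdev_issue_flush(bdev, NULL);  /* drain what came before */
                    err = write_block(bdev, page, sector, 0 /* plain write */);
                    if (!err)
                            blkdev_issue_flush(bdev, NULL); /* force to media  */
            }
            return err;
    }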
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
(dunno why you explicitly dropped me off the cc/to list when replying to my email, hence I missed it for 3 days)

On Fri, May 25 2007, Phillip Susi wrote:
> Jens Axboe wrote:
> > A barrier write will include a flush, but it may also use the FUA bit to ensure data is on platter. So the only situation where a fallback from a barrier to flush would be valid, is if the device lied and told you it could do FUA but it could not and that is the reason why the barrier write failed. If that is the case, the block layer should stop using FUA and fall back to flush-write-flush. And if it does that, then there's never a valid reason to switch from using barrier writes to blkdev_issue_flush() since both methods would either both work or both fail.
>
> IIRC, the FUA bit only forces THAT request to hit the platter before it is completed; it does not flush any previous requests still sitting in the write back queue. Because all io before the barrier must be on the platter as well, setting the FUA bit on the barrier request means you don't have to follow it with a flush, but you still have to precede it with a flush.

I'm well aware of how FUA works, hence the barrier FUA implementation does flush and then write-fua. The win compared to flush-write-flush is just a saved command, essentially.

> > It's not block layer breakage, it's a device issue.
>
> How isn't it block layer breakage? If the device does not support barriers, isn't it the job of the block layer (probably the scheduler) to fall back to flush-write-flush?

The problem is flush not working, the block layer can't fix that for you obviously. If it's FUA not working, the block layer should fall back to flush-write-flush, as they are obviously functionally equivalent.

--
Jens Axboe
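[To spell out the two sequences Jens is comparing, a hedged sketch of the ordering logic only; issue_flush(), issue_write() and issue_write_fua() are placeholders for whatever the queue actually does, so nothing here is the real implementation:]

    #include <linux/blkdev.h>

    /* Placeholders standing in for the queue's real operations. */
    extern void issue_flush(struct request_queue *q);
    extern void issue_write(struct request_queue *q, struct request *rq);
    extern void issue_write_fua(struct request_queue *q, struct request *rq);

    /* Barrier strategy: pre-flush always; then either a FUA write, or a
     * plain write followed by a post-flush.
     */
    static void barrier_write(struct request_queue *q, struct request *rq,
                              int fua_works)
    {
            issue_flush(q);                 /* everything before the barrier
                                               must reach the platter first  */
            if (fua_works) {
                    issue_write_fua(q, rq); /* FUA: bypasses the write cache,
                                               so no post-flush is needed    */
            } else {
                    issue_write(q, rq);     /* lands in the write cache...   */
                    issue_flush(q);         /* ...then force it out          */
            }
    }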
Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.
Jens Axboe wrote:
> A barrier write will include a flush, but it may also use the FUA bit to ensure data is on platter. So the only situation where a fallback from a barrier to flush would be valid, is if the device lied and told you it could do FUA but it could not and that is the reason why the barrier write failed. If that is the case, the block layer should stop using FUA and fall back to flush-write-flush. And if it does that, then there's never a valid reason to switch from using barrier writes to blkdev_issue_flush() since both methods would either both work or both fail.

IIRC, the FUA bit only forces THAT request to hit the platter before it is completed; it does not flush any previous requests still sitting in the write back queue. Because all io before the barrier must be on the platter as well, setting the FUA bit on the barrier request means you don't have to follow it with a flush, but you still have to precede it with a flush.

> It's not block layer breakage, it's a device issue.

How isn't it block layer breakage? If the device does not support barriers, isn't it the job of the block layer (probably the scheduler) to fall back to flush-write-flush?