Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-13 Thread Ric Wheeler



Guy Watkins wrote:

} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED];
} [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

}  [EMAIL PROTECTED] wrote:
}   On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
}  
}   All of the high end arrays have non-volatile cache (read, on power
} loss, it is a
}   promise that it will get all of your data out to permanent storage).
} You don't
}   need to ask this kind of array to drain the cache. In fact, it might
} just ignore
}   you if you send it that kind of request ;-)
}  
}   OK, I'll bite - how does the kernel know whether the other end of that
}   fiberchannel cable is attached to a DMX-3 or to some no-name product that
}   may not have the same assurances?  Is there a "I'm a high-end array" bit
}   in the sense data that I'm unaware of?
}  
} 
}  There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives,
}  SCSI has similar queries) to see what kind of device you are talking to. I
}  am not sure it is worth the trouble to do any automatic detection/handling
}  of this.
} 
}  In this specific case, it is more a case of when you attach a high end (or
}  mid-tier) device to a server, you should configure it without barriers for
}  its exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system the other end
} of that fiber cable has characteristics X, Y and Z.  What worried me was that
} it looked like conflating "device reported writeback cache" with "device
} actually has enough battery/hamster/whatever backup to flush everything on a
} power loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while
the power is off.  When the power is restored it flushes the cache to disk.
If the power failure lasts longer than the batteries then the cache data is
lost, but the batteries last 24+ hours I believe.


Most mid-range and high end arrays actually use that battery to ensure that data 
is all written out to permanent media when the power is lost. I won't go into 
how that is done, but it clearly would not be safe to assume that your power 
outage is only going to last a certain length of time (and if it lasts longer, 
you would lose data).




A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy


I worked on the team that designed that big array.

At one point, we had an array on loan to a partner who tried to put it in a very 
small data center. A few weeks later, they brought in an electrician who needed 
to run more power into the center.  It was pretty funny - he tried to find a 
power button to turn it off and then just walked over and dropped power trying 
to get the Symm to turn off.  When that didn't work, he was really, really 
confused ;-)


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Valdis . Kletnieks
On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
 [EMAIL PROTECTED] wrote:
  On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
  
  All of the high end arrays have non-volatile cache (read, on power loss, 
  it is a 
  promise that it will get all of your data out to permanent storage). You 
  don't 
  need to ask this kind of array to drain the cache. In fact, it might just 
  ignore 
  you if you send it that kind of request ;-)
  
  OK, I'll bite - how does the kernel know whether the other end of that
  fiberchannel cable is attached to a DMX-3 or to some no-name product that
  may not have the same assurances?  Is there a "I'm a high-end array" bit
  in the sense data that I'm unaware of?
  
 
 There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, 
 SCSI 
 has similar queries) to see what kind of device you are talking to. I am not
 sure it is worth the trouble to do any automatic detection/handling of this.
 
 In this specific case, it is more a case of when you attach a high end (or 
 mid-tier) device to a server, you should configure it without barriers for its
 exported LUNs.

I don't have a problem with the sysadmin *telling* the system the other end of
that fiber cable has characteristics X, Y and Z.  What worried me was that it
looked like conflating "device reported writeback cache" with "device actually
has enough battery/hamster/whatever backup to flush everything on a power loss".
(My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
a lot of battery..)
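
For reference, the arithmetic behind that estimate: 1 GiB of cache holds
1 GiB / 4 KiB = 262,144 blocks, and at a worst case of 1 ms per block that is
about 262 seconds, i.e. roughly 4 1/2 minutes of battery-backed writing.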




RE: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; linux-fsdevel@vger.kernel.org; [EMAIL PROTECTED];
} [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
}  [EMAIL PROTECTED] wrote:
}   On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
}  
}   All of the high end arrays have non-volatile cache (read, on power
} loss, it is a
}   promise that it will get all of your data out to permanent storage).
} You don't
}   need to ask this kind of array to drain the cache. In fact, it might
} just ignore
}   you if you send it that kind of request ;-)
}  
}   OK, I'll bite - how does the kernel know whether the other end of that
}   fiberchannel cable is attached to a DMX-3 or to some no-name product that
}   may not have the same assurances?  Is there a "I'm a high-end array" bit
}   in the sense data that I'm unaware of?
}  
} 
}  There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives,
}  SCSI has similar queries) to see what kind of device you are talking to. I
}  am not sure it is worth the trouble to do any automatic detection/handling
}  of this.
} 
}  In this specific case, it is more a case of when you attach a high end (or
}  mid-tier) device to a server, you should configure it without barriers for
}  its exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system the other end
} of that fiber cable has characteristics X, Y and Z.  What worried me was that
} it looked like conflating "device reported writeback cache" with "device
} actually has enough battery/hamster/whatever backup to flush everything on a
} power loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while
the power is off.  When the power is restored it flushes the cache to disk.
If the power failure lasts longer than the batteries then the cache data is
lost, but the batteries last 24+ hours I believe.

A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-11 Thread Ric Wheeler


[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?



There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not 
sure it is worth the trouble to do any automatic detection/handling of this.


In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its 
exported LUNs.


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Ric Wheeler



Tejun Heo wrote:

[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]


I am actually on the list, just really, really far behind in the thread ;-)



Hello,

[EMAIL PROTECTED] wrote:

but when you consider the self-contained disk arrays it's an entirely
different story. you can easily have a few gig of cache and a complete
OS pretending to be a single drive as far as you are concerned.

and the price of such devices is plummeting (in large part thanks to
Linux moving into this space), you can now readily buy a 10TB array for
$10k that looks like a single drive.


Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?


All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


The size of the NV cache can run from a few gigabytes up to hundreds of 
gigabytes, so you really don't want to invoke cache flushes here if you can 
avoid it.


For this class of device, you can get the required in order completion and data 
integrity semantics as long as we send the IO's to the device in the correct order.




The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.



I am not really sure that you need this ORDERED_DRAIN for big arrays...

ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Valdis . Kletnieks
On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

 All of the high end arrays have non-volatile cache (read, on power loss, it 
 is a 
 promise that it will get all of your data out to permanent storage). You 
 don't 
 need to ask this kind of array to drain the cache. In fact, it might just 
 ignore 
 you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?





Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
 On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
 
 All of the high end arrays have non-volatile cache (read, on power loss, it 
 is a 
 promise that it will get all of your data out to permanent storage). You 
 don't 
 need to ask this kind of array to drain the cache. In fact, it might just 
 ignore 
 you if you send it that kind of request ;-)
 
 OK, I'll bite - how does the kernel know whether the other end of that
 fiberchannel cable is attached to a DMX-3 or to some no-name product that
 may not have the same assurances?  Is there a "I'm a high-end array" bit
 in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write
back caching.  The kernel automatically selects ORDERED_DRAIN in such a case.
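
A minimal sketch of what that selection could look like on the driver side,
assuming the 2.6.2x-era blk_queue_ordered() interface (the helper and its
write_cache_enabled flag are hypothetical, not the actual code path being
described here):

#include <linux/blkdev.h>

/* Hypothetical helper: pick the barrier ordering mode for a queue based on
 * whether the device admits to having a volatile write-back cache. */
static void example_set_ordered_mode(struct request_queue *q,
                                     int write_cache_enabled,
                                     prepare_flush_fn *flush_fn)
{
        if (!write_cache_enabled)
                /* No volatile cache: draining the queue is enough. */
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
        else
                /* Volatile write-back cache: barriers need real cache flushes. */
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, flush_fn);
}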

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
Ric Wheeler wrote:
 Don't those thingies usually have NV cache or backed by battery such
 that ORDERED_DRAIN is enough?
 
 All of the high end arrays have non-volatile cache (read, on power loss,
 it is a promise that it will get all of your data out to permanent
 storage). You don't need to ask this kind of array to drain the cache.
 In fact, it might just ignore you if you send it that kind of request ;-)
 
 The size of the NV cache can run from a few gigabytes up to hundreds of
 gigabytes, so you really don't want to invoke cache flushes here if you
 can avoid it.
 
 For this class of device, you can get the required in order completion
 and data integrity semantics as long as we send the IO's to the device
 in the correct order.

Thanks for clarification.

 The problem is that the interface between the host and a storage device
 (ATA or SCSI) is not built to communicate that kind of information
 (grouped flush, relaxed ordering...).  I think battery backed
 ORDERED_DRAIN combined with fine-grained host queue flush would be
 pretty good.  It doesn't require some fancy new interface which isn't
 gonna be used widely anyway and can achieve most of performance gain if
 the storage plays it smart.
 
 I am not really sure that you need this ORDERED_DRAIN for big arrays...

ORDERED_DRAIN is to properly order requests from host request queue
(elevator/iosched).  We can make it finer-grained but we do need to put
some ordering restrictions.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
 but when you consider the self-contained disk arrays it's an entirely
 different story. you can easily have a few gig of cache and a complete
 OS pretending to be a single drive as far as you are concerned.
 
 and the price of such devices is plummeting (in large part thanks to
 Linux moving into this space), you can now readily buy a 10TB array for
 $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Valdis . Kletnieks
On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
 Don't those thingies usually have NV cache or backed by battery such
 that ORDERED_DRAIN is enough?

Probably *most* do, but do you really want to bet the user's data on it?

 The problem is that the interface between the host and a storage device
 (ATA or SCSI) is not built to communicate that kind of information
 (grouped flush, relaxed ordering...).  I think battery backed
 ORDERED_DRAIN combined with fine-grained host queue flush would be
 pretty good.  It doesn't require some fancy new interface which isn't
 gonna be used widely anyway and can achieve most of performance gain if
 the storage plays it smart.

Yes, that would probably be pretty good.  But how do you get the storage
device to *reliably* tell the truth about what it actually implements? (Consider
the number of devices that downright lie about their implementation of cache
flushing)




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, Phillip Susi wrote:
  

Jens Axboe wrote:


No, Stefan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.
  
I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 



But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter.

If you don't have write back caching, or if the cache is battery backed
and thus guaranteed to never be lost, maintaining order is naturally
enough.
  


Do I misread this? If ordering doesn't reach all the way to the platter 
then there will be failure modes which result in order not being preserved. 
Battery backed cache doesn't prevent failures between the cache and the 
platter.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Wed, May 30 2007, Phillip Susi wrote:
 That would be exactly how I understand Documentation/block/barrier.txt:
 
 In other words, I/O barrier requests have the following two properties.
 1. Request ordering
 ...
 2. Forced flushing to physical medium
 
 So, I/O barriers need to guarantee that requests actually get written
 to non-volatile medium in order. 
 
 I think you misinterpret this, and it probably could be worded a bit 
 better.  The barrier request is about constraining the order.  The 
 forced flushing is one means to implement that constraint.  The other 
 alternative mentioned there is to use ordered tags.  The key part there 
 is requests actually get written to non-volatile medium _in order_, 
 not before the request completes, which would be synchronous IO.

No, Stefan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.

-- 
Jens Axboe



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Stefan Bader

2007/5/30, Phillip Susi [EMAIL PROTECTED]:

Stefan Bader wrote:

 Since drive a supports barrier request we don't get -EOPNOTSUPP but
 the request with block y might get written before block x since the
 disks are independent. I guess the chances of this are quite low since
 at some point a barrier request will also hit drive b but for the time
 being it might be better to indicate -EOPNOTSUPP right from
 device-mapper.

The device mapper needs to ensure that ALL underlying devices get a
barrier request when one comes down from above, even if it has to
construct zero length barriers to send to most of them.



And somehow also make sure all of the barriers have been processed
before returning the barrier that came in. Plus it would have to queue
all mapping requests until the barrier is done (if strictly acting
according to barrier.txt).

But I am wondering a bit whether the requirements on barriers are
really as tight as described in Tejun's document (a barrier request is
only started if everything before is safe, the barrier itself isn't
returned until it is safe, too, and all requests after the barrier
aren't started before the barrier is done). Is it really necessary to
defer any further requests until the barrier has been written to safe
storage? Or would it be sufficient to guarantee that, if a barrier
request returns, everything up to (and including) the barrier is on safe
storage?

Stefan


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

Jens Axboe wrote:

No, Stefan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.


I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 
this out:


Requests in ordered sequence are issued in order, but not required to 
finish in order.  Barrier implementation can handle out-of-order 
completion of ordered sequence.  IOW, the requests MUST be processed in 
order but the hardware/software completion paths are allowed to reorder 
completion notifications - eg. current SCSI midlayer doesn't preserve 
completion order during error handling.





Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 Jens Axboe wrote:
 No, Stefan is right, the barrier is both an ordering and integrity
 constraint. If a driver completes a barrier request before that request
 and previously submitted requests are on STABLE storage, then it
 violates that principle. Look at the code and the various ordering
 options.
 
 I am saying that is the wrong thing to do.  Barrier should be about 
 ordering only.  So long as the order they hit the media is maintained, 
 the order the requests are completed in can change.  barrier.txt bears 

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter.

If you don't have write back caching, or if the cache is battery backed
and thus guaranteed to never be lost, maintaining order is naturally
enough.

Or if the drive can do ordered queued commands, you can relax the
flushing (again depending on the cache type, you may need to take
different paths).

 Requests in ordered sequence are issued in order, but not required to 
 finish in order.  Barrier implementation can handle out-of-order 
 completion of ordered sequence.  IOW, the requests MUST be processed in 
 order but the hardware/software completion paths are allowed to reorder 
 completion notifications - eg. current SCSI midlayer doesn't preserve 
 completion order during error handling.

If you carefully re-read that paragraph, then it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.

-- 
Jens Axboe



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
 Stefan Bader wrote:
 
  Since drive a supports barrier request we don't get -EOPNOTSUPP but
  the request with block y might get written before block x since the
  disks are independent. I guess the chances of this are quite low since
  at some point a barrier request will also hit drive b but for the time
  being it might be better to indicate -EOPNOTSUPP right from
  device-mapper.

 The device mapper needs to ensure that ALL underlying devices get a
 barrier request when one comes down from above, even if it has to
 construct zero length barriers to send to most of them.

 
 And somehow also make sure all of the barriers have been processed
 before returning the barrier that came in. Plus it would have to queue
 all mapping requests until the barrier is done (if strictly acting
 according to barrier.txt).
 
 But I am wondering a bit whether the requirements on barriers are
 really as tight as described in Tejun's document (a barrier request is
 only started if everything before is safe, the barrier itself isn't
 returned until it is safe, too, and all requests after the barrier
 aren't started before the barrier is done). Is it really necessary to
 defer any further requests until the barrier has been written to safe
 storage? Or would it be sufficient to guarantee that, if a barrier
 request returns, everything up to (and including) the barrier is on safe
 storage?

Well, what's described in barrier.txt is the current implemented
semantics and what filesystems expect, so we can't change it underneath
them but we definitely can introduce new more relaxed variants, but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed an interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Fri, 1 Jun 2007, Tejun Heo wrote:


but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed an interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.


if you are talking about individual drives you may be right for the moment 
(but 16M cache on drives is a _lot_ larger than people imagined would be 
there a few years ago)


but when you consider the self-contained disk arrays it's an entirely 
different story. you can easily have a few gig of cache and a complete OS 
pretending to be a single drive as far as you are concerned.


and the price of such devices is plummeting (in large part thanks to Linux 
moving into this space), you can now readily buy a 10TB array for $10k 
that looks like a single drive.


David Lang


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Stefan Bader

 in-flight I/O to go to zero?

Something like that is needed for some dm targets to support barriers.
(We needn't always wait for *all* in-flight I/O.)
When faced with -EOPNOTSUPP, do all callers fall back to a sync in
the places a barrier would have been used, or are there any more
sophisticated strategies attempting to optimise code without barriers?


If I didn't misunderstand, the idea is that no caller will face an
-EOPNOTSUPP in future. IOW every layer or driver somehow makes sure
the right thing happens.



An efficient I/O barrier implementation would not normally involve
flushing AFAIK: dm surely wouldn't cause a higher layer to assume
stronger semantics than are provided.


Seems there are at least two assumptions about what the semantics
exactly _are_. Based on Documentation/block/barrier.txt I understand
a barrier implies ordering and flushing.
But regardless of that, assume the (admittedly constructed) following case:

You got a linear target that consists of two disks. One drive (a)
supports barriers and the other one (b) doesn't. Device-mapper just
maps the requests to the appropriate disk. Now the following sequence
happens:

1. block x gets mapped to drive b
2. block y (with barrier) gets mapped to drive a

Since drive a supports barrier request we don't get -EOPNOTSUPP but
the request with block y might get written before block x since the
disks are independent. I guess the chances of this are quite low since
at some point a barrier request will also hit drive b but for the time
being it might be better to indicate -EOPNOTSUPP right from
device-mapper.

Stefan


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Wed, May 30, 2007 at 11:12:37AM +0200, Stefan Bader wrote:
 it might be better to indicate -EOPNOTSUPP right from
 device-mapper.
 
Indeed we should.  For support, on receipt of a barrier, dm core should
send a zero-length barrier to all active underlying paths, and delay
mapping any further I/O.
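
A rough sketch of that idea using the bio interface of the time (illustrative
only; the helper and barrier_done_fn plumbing are hypothetical, and zero-length
barriers were, per the rest of the thread, something the block layer still
needed to provide):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/fs.h>

/* Hypothetical: send an empty barrier bio down one underlying path so that
 * every active device sees the same ordering point. */
static void send_empty_barrier(struct block_device *bdev,
                               bio_end_io_t *barrier_done_fn, void *private)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 0);      /* no data pages */

        bio->bi_bdev = bdev;
        bio->bi_end_io = barrier_done_fn;
        bio->bi_private = private;
        submit_bio(WRITE_BARRIER, bio);
}

dm core would call something like this once per active path and hold back any
further mapped I/O until all of the completions have come back.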

Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Phillip Susi

Stefan Bader wrote:

You got a linear target that consists of two disks. One drive (a)
supports barriers and the other one (b) doesn't. Device-mapper just
maps the requests to the appropriate disk. Now the following sequence
happens:

1. block x gets mapped to drive b
2. block y (with barrier) gets mapped to drive a

Since drive a supports barrier request we don't get -EOPNOTSUPP but
the request with block y might get written before block x since the
disks are independent. I guess the chances of this are quite low since
at some point a barrier request will also hit drive b but for the time
being it might be better to indicate -EOPNOTSUPP right from
device-mapper.


The device mapper needs to ensure that ALL underlying devices get a 
barrier request when one comes down from above, even if it has to 
construct zero length barriers to send to most of them.




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Phillip Susi

Phillip Susi wrote:
Hrm... I may have misunderstood the perspective you were talking from. 
Yes, when the bio is completed it must be on the media, but the 
filesystem should issue both requests, and then really not care when 
they complete.  That is to say, the filesystem should not wait for block 
A to finish before issuing block B; it should issue both, and use 
barriers to make sure they hit the disk in the correct order.


Actually now that I think about it, that wasn't correct.  The request 
CAN be completed before the data has hit the medium.  The barrier just 
constrains the ordering of the writes, but they can still sit in the 
disk write back cache for some time.
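
A minimal sketch of that pattern, assuming the 2.6-era submit_bio() and
WRITE_BARRIER interfaces and two already-constructed bios for blocks A and B
on the same device (illustrative, not taken from any filesystem):

#include <linux/bio.h>
#include <linux/fs.h>

/* Issue block A, then block B with a barrier, without waiting for A to
 * complete.  The barrier constrains the order in which the blocks reach the
 * media; the completion notifications may still arrive in either order. */
static void issue_ordered_pair(struct bio *bio_a, struct bio *bio_b)
{
        submit_bio(WRITE, bio_a);               /* may sit in the drive cache */
        submit_bio(WRITE_BARRIER, bio_b);       /* reaches media after A      */
}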


Stefan Bader wrote:

That would be exactly how I understand Documentation/block/barrier.txt:

In other words, I/O barrier requests have the following two properties.
1. Request ordering
...
2. Forced flushing to physical medium

So, I/O barriers need to guarantee that requests actually get written
to non-volatile medium in order. 


I think you misinterpret this, and it probably could be worded a bit 
better.  The barrier request is about constraining the order.  The 
forced flushing is one means to implement that constraint.  The other 
alternative mentioned there is to use ordered tags.  The key part there 
is requests actually get written to non-volatile medium _in order_, 
not before the request completes, which would be synchronous IO.




Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
 What if the truth changes (as can happen with md or dm)?

You get notified in endio() that the barrier had to be emulated?
 
Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
 If a filesystem cares, it could 'ask' as suggested above.
 What would be a good interface for asking?

XFS already tests:
  bd_disk->queue->ordered == QUEUE_ORDERED_NONE
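
A minimal sketch of such a check at mount time, assuming the request_queue
layout referred to above (the helper name is hypothetical):

#include <linux/blkdev.h>
#include <linux/fs.h>

/* Hypothetical mount-time probe: can this block device do barriers at all? */
static int example_barriers_usable(struct block_device *bdev)
{
        struct request_queue *q = bdev_get_queue(bdev);

        return q && q->ordered != QUEUE_ORDERED_NONE;
}

As Neil notes elsewhere in the thread, for md or dm the answer can change at
any time, so a one-time check like this can go stale.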

Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote:
 On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
  If a filesystem cares, it could 'ask' as suggested above.
  What would be a good interface for asking?
 
 XFS already tests:
  bd_disk->queue->ordered == QUEUE_ORDERED_NONE

The side effects of removing that check are what started
this whole discussion.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-29 Thread Alasdair G Kergon
On Tue, May 29, 2007 at 11:25:42AM +0200, Stefan Bader wrote:
 doing a sort of suspend, issuing the
 barrier request, calling flush to all mapped devices and then wait for
 in-flight I/O to go to zero? 

Something like that is needed for some dm targets to support barriers.
(We needn't always wait for *all* in-flight I/O.)
When faced with -EOPNOTSUPP, do all callers fall back to a sync in
the places a barrier would have been used, or are there any more
sophisticated strategies attempting to optimise code without barriers?

 I am not a hundred percent sure about
 that but I think that just passing the barrier flag on to mapped
 devices might in some (maybe they are rare) cases cause a layer above
 to think all data is on-disk while this isn't necessarily true (see my
 previous post). What do you think?

An efficient I/O barrier implementation would not normally involve
flushing AFAIK: dm surely wouldn't cause a higher layer to assume
stronger semantics than are provided.
 
Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-28 Thread Alasdair G Kergon
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUPP.
 
The device-mapper position has always been that we require

  a zero-length BIO_RW_BARRIER 

(i.e. containing no data to read or write - or emulated, possibly
device-specific)

before we can provide full barrier support.
  (Consider multiple active paths - each must see barrier.)

Until every device supports barriers, -EOPNOTSUPP support is required.
  (Consider reconfiguration of stacks of devices - barrier support is a
   dynamic block device property that can switch between available and
   unavailable at any time.)
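
A sketch of the kind of fallback a caller needs, loosely modeled on how
journaling filesystems of the era coped with barriers failing at runtime (the
helper is illustrative; bh is assumed to be a set-up, dirty buffer for the
block being committed):

#include <linux/blkdev.h>
#include <linux/buffer_head.h>
#include <linux/fs.h>

/* Hypothetical: try a barrier write; if the stack answers -EOPNOTSUPP
 * (barriers just became unavailable), retry as a plain write followed by an
 * explicit cache flush. */
static int write_with_barrier_or_flush(struct buffer_head *bh,
                                       struct block_device *bdev)
{
        int ret;

        set_buffer_ordered(bh);                 /* ask for a barrier write */
        ret = sync_dirty_buffer(bh);
        clear_buffer_ordered(bh);

        if (ret == -EOPNOTSUPP) {
                set_buffer_uptodate(bh);
                set_buffer_dirty(bh);
                ret = sync_dirty_buffer(bh);    /* plain write */
                if (!ret)
                        ret = blkdev_issue_flush(bdev, NULL);
        }
        return ret;
}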

Alasdair
-- 
[EMAIL PROTECTED]


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-28 Thread Jens Axboe

(dunno why you explicitly dropped me off the cc/to list when replying to
my email, hence I missed it for 3 days)

On Fri, May 25 2007, Phillip Susi wrote:
 Jens Axboe wrote:
 A barrier write will include a flush, but it may also use the FUA bit to
 ensure data is on platter. So the only situation where a fallback from a
 barrier to flush would be valid, is if the device lied and told you it
 could do FUA but it could not and that is the reason why the barrier
 write failed. If that is the case, the block layer should stop using FUA
 and fallback to flush-write-flush. And if it does that, then there's
 never a valid reason to switch from using barrier writes to
 blkdev_issue_flush() since both methods would either both work or both
 fail.
 
 IIRC, the FUA bit only forces THAT request to hit the platter before it 
 is completed; it does not flush any previous requests still sitting in 
 the write back queue.  Because all io before the barrier must be on the 
 platter as well, setting the FUA bit on the barrier request means you 
 don't have to follow it with a flush, but you still have to precede it 
 with a flush.

I'm well aware of how FUA works, hence the barrier FUA implementation
does flush and then write-fua. The win compared to flush-write-flush is
just a saved command, essentially.

 It's not block layer breakage, it's a device issue.
 
 How isn't it block layer breakage?  If the device does not support 
 barriers, isn't it the job of the block layer ( probably the scheduler ) 
 to fall back to flush-write-flush?

The problem is flush not working, the block layer can't fix that for you
obviously. If it's FUA not working, the block layer should fall back to
flush-write-flush, as they are obviously functionally equivalent.
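
In terms of the queue ordered modes of that era, the choice looks roughly like
this (a sketch only; the fua_works flag is hypothetical, and whether FUA can
actually be trusted is exactly the problem being discussed):

#include <linux/blkdev.h>

/* Hypothetical: pick the barrier sequence for a write-back-cached device. */
static void example_pick_barrier_sequence(struct request_queue *q,
                                          int fua_works,
                                          prepare_flush_fn *flush_fn)
{
        if (fua_works)
                /* pre-flush, then the barrier write with FUA: one command saved */
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FUA, flush_fn);
        else
                /* pre-flush, barrier write, post-flush */
                blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, flush_fn);
}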

-- 
Jens Axboe



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread Phillip Susi

Jens Axboe wrote:

A barrier write will include a flush, but it may also use the FUA bit to
ensure data is on platter. So the only situation where a fallback from a
barrier to flush would be valid, is if the device lied and told you it
could do FUA but it could not and that is the reason why the barrier
write failed. If that is the case, the block layer should stop using FUA
and fallback to flush-write-flush. And if it does that, then there's
never a valid reason to switch from using barrier writes to
blkdev_issue_flush() since both methods would either both work or both
fail.


IIRC, the FUA bit only forces THAT request to hit the platter before it 
is completed; it does not flush any previous requests still sitting in 
the write back queue.  Because all io before the barrier must be on the 
platter as well, setting the FUA bit on the barrier request means you 
don't have to follow it with a flush, but you still have to precede it 
with a flush.



It's not block layer breakage, it's a device issue.


How isn't it block layer breakage?  If the device does not support 
barriers, isn't it the job of the block layer ( probably the scheduler ) 
to fall back to flush-write-flush?


