Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-13 Thread Ric Wheeler



Guy Watkins wrote:

} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org;
} [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

} > [EMAIL PROTECTED] wrote:
} > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
} > >
} > >> All of the high end arrays have non-volatile cache (read, on power
} loss, it is a
} > >> promise that it will get all of your data out to permanent storage).
} You don't
} > >> need to ask this kind of array to drain the cache. In fact, it might
} just ignore
} > >> you if you send it that kind of request ;-)
} > >
} > > OK, I'll bite - how does the kernel know whether the other end of that
} > > fiberchannel cable is attached to a DMX-3 or to some no-name product
} that
} > > may not have the same assurances?  Is there a "I'm a high-end array"
} bit
} > > in the sense data that I'm unaware of?
} > >
} >
} > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA
} drives, SCSI
} > has similar queries) to see what kind of device you are talking to. I am
} not
} > sure it is worth the trouble to do any automatic detection/handling of
} this.
} >
} > In this specific case, it is more a case of when you attach a high end
} (or
} > mid-tier) device to a server, you should configure it without barriers
} for its
} > exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system "the other
} end of
} that fiber cable has characteristics X, Y and Z".  What worried me was
} that it
} looked like conflating "device reported writeback cache" with "device
} actually
} has enough battery/hamster/whatever backup to flush everything on a power
} loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms
} seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.
} That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while
the power is off.  When the power is restored it flushes the cache to disk.
If the power failure lasts longer than the batteries then the cache data is
lost, but the batteries last 24+ hours, I believe.


Most mid-range and high end arrays actually use that battery to ensure that data 
is all written out to permanent media when the power is lost. I won't go into 
how that is done, but it clearly would not be safe to assume that your power 
outage will only last a certain length of time (and if it lasts longer, you 
would lose data).




A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy


I worked on the team that designed that big array.

At one point, we had an array on loan to a partner who tried to put it in a very 
small data center. A few weeks later, they brought in an electrician who needed 
to run more power into the center.  It was pretty funny - he tried to find a 
power button to turn it off and then just walked over and dropped power trying 
to get the Symm to turn off.  When that didn't work, he was really, really 
confused ;-)


ric


RE: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED]
} Sent: Thursday, July 12, 2007 1:35 PM
} To: [EMAIL PROTECTED]
} Cc: Tejun Heo; [EMAIL PROTECTED]; Stefan Bader; Phillip Susi; device-mapper
} development; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org;
} [EMAIL PROTECTED]; Jens Axboe; David Chinner; Andreas Dilger
} Subject: Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for
} devices, filesystems, and dm/md.
} 
} On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
} > [EMAIL PROTECTED] wrote:
} > > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
} > >
} > >> All of the high end arrays have non-volatile cache (read, on power
} loss, it is a
} > >> promise that it will get all of your data out to permanent storage).
} You don't
} > >> need to ask this kind of array to drain the cache. In fact, it might
} just ignore
} > >> you if you send it that kind of request ;-)
} > >
} > > OK, I'll bite - how does the kernel know whether the other end of that
} > > fiberchannel cable is attached to a DMX-3 or to some no-name product
} that
} > > may not have the same assurances?  Is there a "I'm a high-end array"
} bit
} > > in the sense data that I'm unaware of?
} > >
} >
} > There are ways to query devices (think of hdparm -I in S-ATA/P-ATA
} drives, SCSI
} > has similar queries) to see what kind of device you are talking to. I am
} not
} > sure it is worth the trouble to do any automatic detection/handling of
} this.
} >
} > In this specific case, it is more a case of when you attach a high end
} (or
} > mid-tier) device to a server, you should configure it without barriers
} for its
} > exported LUNs.
} 
} I don't have a problem with the sysadmin *telling* the system "the other
} end of
} that fiber cable has characteristics X, Y and Z".  What worried me was
} that it
} looked like conflating "device reported writeback cache" with "device
} actually
} has enough battery/hamster/whatever backup to flush everything on a power
} loss".
} (My back-of-envelope calculation shows for a worst-case of needing a 1ms
} seek
} for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.
} That's
} a lot of battery..)

Most hardware RAID devices I know of use the battery to save the cache while
the power is off.  When the power is restored it flushes the cache to disk.
If the power failure lasts longer than the batteries then the cache data is
lost, but the batteries last 24+ hours, I believe.

A big EMC array we had had enough battery power to power about 400 disks
while the 16 Gig of cache was flushed.  I think EMC told me the batteries
would last about 20 minutes.  I don't recall if the array was usable during
the 20 minutes.  We never tested a power failure.

Guy



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Ric Wheeler



[EMAIL PROTECTED] wrote:

On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:

[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?

There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not

sure it is worth the trouble to do any automatic detection/handling of this.

In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its

exported LUNs.


I don't have a problem with the sysadmin *telling* the system "the other end of
that fiber cable has characteristics X, Y and Z".  What worried me was that it
looked like conflating "device reported writeback cache" with "device actually
has enough battery/hamster/whatever backup to flush everything on a power loss".
(My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
a lot of battery..)


I think that we are on the same page here - just let the sys admin mount without 
barriers for big arrays.


1GB of cache, by the way, is really small for some of us ;-)

ric



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-12 Thread Valdis . Kletnieks
On Wed, 11 Jul 2007 18:44:21 EDT, Ric Wheeler said:
> [EMAIL PROTECTED] wrote:
> > On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> > 
> >> All of the high end arrays have non-volatile cache (read, on power loss, 
> >> it is a 
> >> promise that it will get all of your data out to permanent storage). You 
> >> don't 
> >> need to ask this kind of array to drain the cache. In fact, it might just 
> >> ignore 
> >> you if you send it that kind of request ;-)
> > 
> > OK, I'll bite - how does the kernel know whether the other end of that
> > fiberchannel cable is attached to a DMX-3 or to some no-name product that
> > may not have the same assurances?  Is there a "I'm a high-end array" bit
> > in the sense data that I'm unaware of?
> > 
> 
> There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, 
> SCSI 
> has similar queries) to see what kind of device you are talking to. I am not
> sure it is worth the trouble to do any automatic detection/handling of this.
> 
> In this specific case, it is more a case of when you attach a high end (or 
> mid-tier) device to a server, you should configure it without barriers for its
> exported LUNs.

I don't have a problem with the sysadmin *telling* the system "the other end of
that fiber cable has characteristics X, Y and Z".  What worried me was that it
looked like conflating "device reported writeback cache" with "device actually
has enough battery/hamster/whatever backup to flush everything on a power loss".
(My back-of-envelope calculation shows for a worst-case of needing a 1ms seek
for each 4K block, a 1G cache can take up to 4 1/2 minutes to sync.  That's
a lot of battery..)
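
For what it's worth, that number worked out (a rough sketch only, assuming
exactly one 1 ms seek per 4 KiB block and no write coalescing at all):

#include <stdio.h>

int main(void)
{
        long long cache_bytes = 1LL << 30;      /* 1 GiB of dirty cache */
        long long block_bytes = 4096;           /* 4 KiB per block */
        double ms_per_block = 1.0;              /* worst case: one 1 ms seek each */
        long long blocks = cache_bytes / block_bytes;   /* 262,144 blocks */
        double total_sec = (double)blocks * ms_per_block / 1000.0;

        printf("%lld blocks -> %.0f seconds (about %.1f minutes)\n",
               blocks, total_sec, total_sec / 60.0);    /* ~4.4 minutes */
        return 0;
}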


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-11 Thread Ric Wheeler


[EMAIL PROTECTED] wrote:

On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?



There are ways to query devices (think of hdparm -I in S-ATA/P-ATA drives, SCSI 
has similar queries) to see what kind of device you are talking to. I am not 
sure it is worth the trouble to do any automatic detection/handling of this.


In this specific case, it is more a case of when you attach a high end (or 
mid-tier) device to a server, you should configure it without barriers for its 
exported LUNs.
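
To make that concrete, a rough userspace sketch of such a query (hdparm -W
reports the ATA write-cache setting; for SCSI disks the sd driver exposes the
drive's caching mode page as a sysfs cache_type attribute, which the snippet
below simply prints). Note that this only distinguishes write-back from
write-through; it says nothing about whether the cache is battery backed,
which is the real question here:

/* Sketch: print the write-cache mode the sd driver read from each SCSI disk.
 * "write back" vs "write through" only; battery backing is not visible here. */
#include <stdio.h>
#include <string.h>
#include <glob.h>

int main(void)
{
        glob_t g;
        size_t i;

        if (glob("/sys/class/scsi_disk/*/cache_type", 0, NULL, &g) != 0) {
                fprintf(stderr, "no scsi_disk entries found\n");
                return 1;
        }
        for (i = 0; i < g.gl_pathc; i++) {
                char buf[64] = "";
                FILE *f = fopen(g.gl_pathv[i], "r");

                if (!f)
                        continue;
                if (fgets(buf, sizeof(buf), f))
                        buf[strcspn(buf, "\n")] = '\0';
                fclose(f);
                printf("%s: %s\n", g.gl_pathv[i], buf);
        }
        globfree(&g);
        return 0;
}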


ric


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
Ric Wheeler wrote:
>> Don't those thingies usually have NV cache or backed by battery such
>> that ORDERED_DRAIN is enough?
> 
> All of the high end arrays have non-volatile cache (read, on power loss,
> it is a promise that it will get all of your data out to permanent
> storage). You don't need to ask this kind of array to drain the cache.
> In fact, it might just ignore you if you send it that kind of request ;-)
> 
> The size of the NV cache can run from a few gigabytes up to hundreds of
> gigabytes, so you really don't want to invoke cache flushes here if you
> can avoid it.
> 
> For this class of device, you can get the required in order completion
> and data integrity semantics as long as we send the IO's to the device
> in the correct order.

Thanks for clarification.

>> The problem is that the interface between the host and a storage device
>> (ATA or SCSI) is not built to communicate that kind of information
>> (grouped flush, relaxed ordering...).  I think battery backed
>> ORDERED_DRAIN combined with fine-grained host queue flush would be
>> pretty good.  It doesn't require some fancy new interface which isn't
>> gonna be used widely anyway and can achieve most of performance gain if
>> the storage plays it smart.
> 
> I am not really sure that you need this ORDERED_DRAIN for big arrays...

ORDERED_DRAIN is to properly order requests from host request queue
(elevator/iosched).  We can make it finer-grained but we do need to put
some ordering restrictions.
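
Roughly what that drain amounts to, as an illustrative stub-based sketch (not
the real elevator code): everything already dispatched must complete before
the barrier request is sent, and requests queued after the barrier are held
back until the barrier itself completes.

#include <stdio.h>

static int in_flight;   /* requests dispatched to the device, not yet done */

static void dispatch(const char *req)
{
        printf("  dispatch %s\n", req);
        in_flight++;
}

static void wait_for_completions(void)
{
        printf("  wait for %d in-flight request(s)\n", in_flight);
        in_flight = 0;
}

/* ORDERED_DRAIN-style sequencing for one barrier request. */
static void queue_barrier(const char *req)
{
        wait_for_completions();         /* drain: nothing may still be in flight */
        dispatch(req);                  /* the barrier request itself */
        wait_for_completions();         /* later requests wait for its completion */
}

int main(void)
{
        dispatch("write A");
        dispatch("write B");
        queue_barrier("barrier (commit block)");
        dispatch("write C");            /* only dispatched after the barrier is done */
        return 0;
}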

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:
> 
>> All of the high end arrays have non-volatile cache (read, on power loss, it 
>> is a 
>> promise that it will get all of your data out to permanent storage). You 
>> don't 
>> need to ask this kind of array to drain the cache. In fact, it might just 
>> ignore 
>> you if you send it that kind of request ;-)
> 
> OK, I'll bite - how does the kernel know whether the other end of that
> fiberchannel cable is attached to a DMX-3 or to some no-name product that
> may not have the same assurances?  Is there a "I'm a high-end array" bit
> in the sense data that I'm unaware of?

Well, the array just has to tell the kernel that it doesn't do write-back
caching.  The kernel automatically selects ORDERED_DRAIN in such a case.
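
A minimal sketch of that selection, purely illustrative and only loosely
modelled on the QUEUE_ORDERED_* idea (not the actual block-layer code): if the
device reports no write-back cache, draining the queue is enough; otherwise
cache flushes are still needed around the barrier write.

#include <stdio.h>

/* Illustrative ordering modes, loosely named after the kernel's
 * QUEUE_ORDERED_* concepts; not the real definitions. */
enum ordered_mode {
        ORDERED_DRAIN,          /* drain the queue, no flushes needed */
        ORDERED_DRAIN_FLUSH,    /* drain, pre-flush, barrier write, post-flush */
};

static enum ordered_mode pick_ordered_mode(int reports_write_back_cache)
{
        /* No volatile write-back cache: a drain of the host queue already
         * orders the barrier.  Otherwise the device cache must be flushed
         * around the barrier write as well. */
        return reports_write_back_cache ? ORDERED_DRAIN_FLUSH : ORDERED_DRAIN;
}

int main(void)
{
        printf("array, write-through: mode %d\n", pick_ordered_mode(0));
        printf("disk, write-back:     mode %d\n", pick_ordered_mode(1));
        return 0;
}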

-- 
tejun


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Valdis . Kletnieks
On Tue, 10 Jul 2007 14:39:41 EDT, Ric Wheeler said:

> All of the high end arrays have non-volatile cache (read, on power loss, it 
> is a 
> promise that it will get all of your data out to permanent storage). You 
> don't 
> need to ask this kind of array to drain the cache. In fact, it might just 
> ignore 
> you if you send it that kind of request ;-)

OK, I'll bite - how does the kernel know whether the other end of that
fiberchannel cable is attached to a DMX-3 or to some no-name product that
may not have the same assurances?  Is there a "I'm a high-end array" bit
in the sense data that I'm unaware of?



Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-10 Thread Ric Wheeler



Tejun Heo wrote:

[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]


I am actually on the list, just really, really far behind in the thread ;-)



Hello,

[EMAIL PROTECTED] wrote:

but when you consider the self-contained disk arrays it's an entirely
different story. you can easily have a few gig of cache and a complete
OS pretending to be a single drive as far as you are concerned.

and the price of such devices is plummeting (in large part thanks to
Linux moving into this space), you can now readily buy a 10TB array for
$10k that looks like a single drive.


Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?


All of the high end arrays have non-volatile cache (read, on power loss, it is a 
promise that it will get all of your data out to permanent storage). You don't 
need to ask this kind of array to drain the cache. In fact, it might just ignore 
you if you send it that kind of request ;-)


The size of the NV cache can run from a few gigabytes up to hundreds of 
gigabytes, so you really don't want to invoke cache flushes here if you can 
avoid it.


For this class of device, you can get the required in order completion and data 
integrity semantics as long as we send the IO's to the device in the correct order.




The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.



I am not really sure that you need this ORDERED_DRAIN for big arrays...

ric


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-09 Thread Jens Axboe
On Thu, Jul 05 2007, Tejun Heo wrote:
> Hello, Jens.
> 
> Jens Axboe wrote:
> > On Mon, May 28 2007, Neil Brown wrote:
> >> I think the implementation priorities here are:
> >>
> >> 1/ implement a zero-length BIO_RW_BARRIER option.
> >> 2/ Use it (or otherwise) to make all dm and md modules handle
> >>barriers (and loop?).
> >> 3/ Devise and implement appropriate fall-backs with-in the block layer
> >>so that  -EOPNOTSUP is never returned.
> >> 4/ Remove unneeded cruft from filesystems (and elsewhere).
> > 
> > This is the start of 1/ above. It's very lightly tested, it's verified
> > to DTRT here at least and not crash :-)
> > 
> > It gets rid of the ->issue_flush_fn() queue callback, all the driver
> > knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush()
> > then just reuses the empty-bio approach to queue an empty barrier, this
> > should work equally well for stacked and non-stacked devices.
> > 
> > While this patch isn't complete yet, it's clearly the right direction to
> > go.
> 
> Finally took a brief look. :-) I think the sequencing for zero-length
> barrier can be better done by pre-setting QUEUE_ORDSEQ_BAR in
> start_ordered() rather than short circuiting the request after it's
> issued.  What do you think?

Yeah, that might be cleaner and should achieve the same effect. I'll
test!

-- 
Jens Axboe



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-07-05 Thread Tejun Heo
Hello, Jens.

Jens Axboe wrote:
> On Mon, May 28 2007, Neil Brown wrote:
>> I think the implementation priorities here are:
>>
>> 1/ implement a zero-length BIO_RW_BARRIER option.
>> 2/ Use it (or otherwise) to make all dm and md modules handle
>>barriers (and loop?).
>> 3/ Devise and implement appropriate fall-backs with-in the block layer
>>so that  -EOPNOTSUP is never returned.
>> 4/ Remove unneeded cruft from filesystems (and elsewhere).
> 
> This is the start of 1/ above. It's very lightly tested, it's verified
> to DTRT here at least and not crash :-)
> 
> It gets rid of the ->issue_flush_fn() queue callback, all the driver
> knowledge resides in ->prepare_flush_fn() anyways. blkdev_issue_flush()
> then just reuses the empty-bio approach to queue an empty barrier, this
> should work equally well for stacked and non-stacked devices.
> 
> While this patch isn't complete yet, it's clearly the right direction to
> go.

Finally took a brief look. :-) I think the sequencing for zero-length
barrier can be better done by pre-setting QUEUE_ORDSEQ_BAR in
start_ordered() rather than short circuiting the request after it's
issued.  What do you think?

Thanks.

-- 
tejun


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-04 Thread Tejun Heo
Jens Axboe wrote:
> On Sat, Jun 02 2007, Tejun Heo wrote:
>> Hello,
>>
>> Jens Axboe wrote:
 Would that be very different from issuing barrier and not waiting for
 its completion?  For ATA and SCSI, we'll have to flush write back cache
 anyway, so I don't see how we can get performance advantage by
 implementing separate WRITE_ORDERED.  I think zero-length barrier
 (haven't looked at the code yet, still recovering from jet lag :-) can
 serve as genuine barrier without the extra write tho.
>>> As always, it depends :-)
>>>
>>> If you are doing pure flush barriers, then there's no difference. Unless
>>> you only guarantee ordering wrt previously submitted requests, in which
>>> case you can eliminate the post flush.
>>>
>>> If you are doing ordered tags, then just setting the ordered bit is
>>> enough. That is different from the barrier in that we don't need a flush
>>> of FUA bit set.
>> Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
>> flush to separate requests before and after it (haven't looked at the
>> code yet, will soon).  Can you enlighten me?
> 
> Yeah, that's what the zero-length barrier implementation I posted does.
> Not sure if you have a question beyond that, if so fire away :-)

I thought you were talking about adding BIO_RW_ORDERED instead of
exposing zero length BIO_RW_BARRIER.  Sorry about the confusion.  :-)

-- 
tejun


RE: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of Jens Axboe
} Sent: Saturday, June 02, 2007 10:35 AM
} To: Tejun Heo
} Cc: David Chinner; [EMAIL PROTECTED]; Phillip Susi; Neil Brown; linux-
} [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; dm-
} [EMAIL PROTECTED]; [EMAIL PROTECTED]; Stefan Bader; Andreas Dilger
} Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices,
} filesystems, and dm/md.
} 
} On Sat, Jun 02 2007, Tejun Heo wrote:
} > Hello,
} >
} > Jens Axboe wrote:
} > >> Would that be very different from issuing barrier and not waiting for
} > >> its completion?  For ATA and SCSI, we'll have to flush write back
} cache
} > >> anyway, so I don't see how we can get performance advantage by
} > >> implementing separate WRITE_ORDERED.  I think zero-length barrier
} > >> (haven't looked at the code yet, still recovering from jet lag :-)
} can
} > >> serve as genuine barrier without the extra write tho.
} > >
} > > As always, it depends :-)
} > >
} > > If you are doing pure flush barriers, then there's no difference.
} Unless
} > > you only guarantee ordering wrt previously submitted requests, in
} which
} > > case you can eliminate the post flush.
} > >
} > > If you are doing ordered tags, then just setting the ordered bit is
} > > enough. That is different from the barrier in that we don't need a
} flush
} > > of FUA bit set.
} >
} > Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
} > flush to separate requests before and after it (haven't looked at the
} > code yet, will soon).  Can you enlighten me?
} 
} Yeah, that's what the zero-length barrier implementation I posted does.
} Not sure if you have a question beyond that, if so fire away :-)
} 
} --
} Jens Axboe

I must admit I have only read some of the barrier related posts, so this
issue may have been covered.  If so, sorry.

What I have read seems to be related to a single disk.  What if a logical
disk is used (md, LVM, ...)?  If a barrier is issued to a logical disk and
that driver issues barriers to all related devices (logical or physical),
all the devices MUST honor the barrier together.  If 1 device crosses the
barrier before another reaches the barrier, corruption should be assumed.
It seems to me each block device that represents more than 2 other devices
must do a flush at a barrier so that all devices will cross the barrier at
the same time.
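
A toy illustration of that requirement, with made-up helper names rather than
real md/dm code: the stacked driver issues the barrier/flush to every member
and waits for all of them to complete before it releases any post-barrier
write.

#include <stdio.h>

#define NR_MEMBERS 4

/* Stand-ins for per-member device operations (hypothetical helpers). */
static void member_flush(int dev)         { printf("  flush member %d\n", dev); }
static void member_wait_complete(int dev) { printf("  wait for member %d\n", dev); }
static void member_write(int dev, const char *what)
{
        printf("  write '%s' to member %d\n", what, dev);
}

/* A barrier on the logical device: every member must reach and complete the
 * barrier before any member sees post-barrier writes. */
static void logical_barrier(void)
{
        int i;

        for (i = 0; i < NR_MEMBERS; i++)
                member_flush(i);                /* issue barrier/flush to all */
        for (i = 0; i < NR_MEMBERS; i++)
                member_wait_complete(i);        /* ...and wait for all of them */
}

int main(void)
{
        int i;

        for (i = 0; i < NR_MEMBERS; i++)
                member_write(i, "pre-barrier data");
        printf("barrier on logical device:\n");
        logical_barrier();
        for (i = 0; i < NR_MEMBERS; i++)
                member_write(i, "post-barrier commit block");
        return 0;
}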

Guy



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Bill Davidsen

Jens Axboe wrote:

On Fri, Jun 01 2007, Bill Davidsen wrote:

Jens Axboe wrote:

On Thu, May 31 2007, Bill Davidsen wrote:

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

if I am understanding it correctly, the big win for barriers is that
you do NOT have to stop and wait until the data is on persistant
media before you can continue.

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented

The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

Wait. Do filesystems expect (depend on) anything but ordering now? Does
md? Having users of barriers as they currently behave suddenly getting
SYNC behavior where they expect ORDERED is likely to have a negative
effect on performance. Or do I misread what is actually guaranteed by
WRITE_BARRIER now, and a flush is currently happening in all cases?

See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.

I was reading the documentation in block/biodoc.txt, which seems to just
say ordered:

   1.2.1 I/O Barriers

   There is a way to enforce strict ordering for i/os through barriers.
   All requests before a barrier point must be serviced before the barrier
   request and any other requests arriving after the barrier will not be
   serviced until after the barrier has completed. This is useful for higher
   level control on write ordering, e.g flushing a log of committed updates
   to disk before the corresponding updates themselves.

   A flag in the bio structure, BIO_BARRIER is used to identify a barrier i/o.
   The generic i/o scheduler would make sure that it places the barrier
   request and all other requests coming after it after all the previous
   requests in the queue. Barriers may be implemented in different ways
   depending on the driver. A SCSI driver for example could make use of
   ordered tags to preserve the necessary ordering with a lower impact on
   throughput. For IDE this might be two sync cache flush: a pre and post
   flush when encountering a barrier write.

The "flush" comment is associated with IDE, so it wasn't clear that the
device cache is always cleared to force the data to the platter.

The above should mention that the ordered tag comment for SCSI assumes
that the drive uses write through caching. If it does, then an ordered
tag is enough. If it doesn't, then you need a bit more than that (a post
flush, after the ordered tag has completed).

Thanks, got it.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Jens Axboe
On Fri, Jun 01 2007, Bill Davidsen wrote:
> Jens Axboe wrote:
> >On Thu, May 31 2007, Bill Davidsen wrote:
> >  
> >>Jens Axboe wrote:
> >>
> >>>On Thu, May 31 2007, David Chinner wrote:
> >>> 
> >>>  
> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>    
> 
> >On Thu, May 31 2007, David Chinner wrote:
> > 
> >  
> >>IOWs, there are two parts to the problem:
> >>
> >>1 - guaranteeing I/O ordering
> >>2 - guaranteeing blocks are on persistent storage.
> >>
> >>Right now, a single barrier I/O is used to provide both of these
> >>guarantees. In most cases, all we really need to provide is 1); the
> >>need for 2) is a much rarer condition but still needs to be
> >>provided.
> >>
> >>   
> >>
> >>>if I am understanding it correctly, the big win for barriers is that 
> >>>you do NOT have to stop and wait until the data is on persistant 
> >>>media before you can continue.
> >>> 
> >>>  
> >>Yes, if we define a barrier to only guarantee 1), then yes this
> >>would be a big win (esp. for XFS). But that requires all filesystems
> >>to handle sync writes differently, and sync_blockdev() needs to
> >>call blkdev_issue_flush() as well
> >>
> >>So, what do we do here? Do we define a barrier I/O to only provide
> >>ordering, or do we define it to also provide persistent storage
> >>writeback? Whatever we decide, it needs to be documented
> >>   
> >>
> >The block layer already has a notion of the two types of barriers, with
> >a very small amount of tweaking we could expose that. There's 
> >absolutely
> >zero reason we can't easily support both types of barriers.
> > 
> >  
> That sounds like a good idea - we can leave the existing
> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> behaviour that only guarantees ordering. The filesystem can then
> choose which to use where appropriate
>    
> 
> >>>Precisely. The current definition of barriers are what Chris and I came
> >>>up with many years ago, when solving the problem for reiserfs
> >>>originally. It is by no means the only feasible approach.
> >>>
> >>>I'll add a WRITE_ORDERED command to the #barrier branch, it already
> >>>contains the empty-bio barrier support I posted yesterday (well a
> >>>slightly modified and cleaned up version).
> >>>
> >>> 
> >>>  
> >>Wait. Do filesystems expect (depend on) anything but ordering now? Does 
> >>md? Having users of barriers as they currently behave suddenly getting 
> >>SYNC behavior where they expect ORDERED is likely to have a negative 
> >>effect on performance. Or do I misread what is actually guaranteed by 
> >>WRITE_BARRIER now, and a flush is currently happening in all cases?
> >>
> >
> >See the above stuff you quote, it's answered there. It's not a change,
> >this is how the Linux barrier write has always worked since I first
> >implemented it. What David and I are talking about is adding a more
> >relaxed version as well, that just implies ordering.
> >  
> 
> I was reading the documentation in block/biodoc.txt, which seems to just 
> say ordered:
> 
>1.2.1 I/O Barriers
> 
>There is a way to enforce strict ordering for i/os through barriers.
>All requests before a barrier point must be serviced before the barrier
>request and any other requests arriving after the barrier will not be
>serviced until after the barrier has completed. This is useful for
>higher
>level control on write ordering, e.g flushing a log of committed updates
>to disk before the corresponding updates themselves.
> 
>A flag in the bio structure, BIO_BARRIER is used to identify a
>barrier i/o.
>The generic i/o scheduler would make sure that it places the barrier
>request and
>all other requests coming after it after all the previous requests
>in the
>queue. Barriers may be implemented in different ways depending on the
>driver. A SCSI driver for example could make use of ordered tags to
>preserve the necessary ordering with a lower impact on throughput.
>For IDE
>this might be two sync cache flush: a pre and post flush when
>encountering
>a barrier write.
> 
> The "flush" comment is associated with IDE, so it wasn't clear that the 
> device cache is always cleared to force the data to the platter.

The above should mention that the ordered tag comment for SCSI assumes
that the drive uses write through caching. If it does, then an ordered
tag is enough. If it doesn't, then you need a bit more than that (a post
flush, after the ordered tag has completed).
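
A minimal sketch of that distinction at the driver level, assuming the
2.6-era blk_queue_ordered() interface (QUEUE_ORDERED_TAG and
QUEUE_ORDERED_TAG_FLUSH exist in that tree; the function names below are
made up for illustration and the callback signature should be checked
against the kernel actually in use):

    #include <linux/blkdev.h>

    /* Called for the pre/post flush requests; a real driver would build
     * a SYNCHRONIZE CACHE command for its device here. */
    static void sample_prepare_flush(struct request_queue *q,
                                     struct request *rq)
    {
    }

    static void sample_setup_barrier_mode(struct request_queue *q,
                                          int write_through)
    {
        if (write_through)
            /* write-through cache: an ordered tag alone is enough */
            blk_queue_ordered(q, QUEUE_ORDERED_TAG, NULL);
        else
            /* write-back cache: ordered tag plus pre/post cache flush */
            blk_queue_ordered(q, QUEUE_ORDERED_TAG_FLUSH,
                              sample_prepare_flush);
    }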

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Jens Axboe
On Sat, Jun 02 2007, Tejun Heo wrote:
> Hello,
> 
> Jens Axboe wrote:
> >> Would that be very different from issuing barrier and not waiting for
> >> its completion?  For ATA and SCSI, we'll have to flush write back cache
> >> anyway, so I don't see how we can get performance advantage by
> >> implementing separate WRITE_ORDERED.  I think zero-length barrier
> >> (haven't looked at the code yet, still recovering from jet lag :-) can
> >> serve as genuine barrier without the extra write tho.
> > 
> > As always, it depends :-)
> > 
> > If you are doing pure flush barriers, then there's no difference. Unless
> > you only guarantee ordering wrt previously submitted requests, in which
> > case you can eliminate the post flush.
> > 
> > If you are doing ordered tags, then just setting the ordered bit is
> > enough. That is different from the barrier in that we don't need a flush
> > of FUA bit set.
> 
> Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
> flush to separate requests before and after it (haven't looked at the
> code yet, will soon).  Can you enlighten me?

Yeah, that's what the zero-length barrier implementation I posted does.
Not sure if you have a question beyond that, if so fire away :-)

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Tejun Heo
Hello,

Jens Axboe wrote:
>> Would that be very different from issuing barrier and not waiting for
>> its completion?  For ATA and SCSI, we'll have to flush write back cache
>> anyway, so I don't see how we can get performance advantage by
>> implementing separate WRITE_ORDERED.  I think zero-length barrier
>> (haven't looked at the code yet, still recovering from jet lag :-) can
>> serve as genuine barrier without the extra write tho.
> 
> As always, it depends :-)
> 
> If you are doing pure flush barriers, then there's no difference. Unless
> you only guarantee ordering wrt previously submitted requests, in which
> case you can eliminate the post flush.
> 
> If you are doing ordered tags, then just setting the ordered bit is
> enough. That is different from the barrier in that we don't need a flush
> of FUA bit set.

Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
flush to separate requests before and after it (haven't looked at the
code yet, will soon).  Can you enlighten me?

Thanks.

-- 
tejun

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-02 Thread Guy Watkins
} -Original Message-
} From: [EMAIL PROTECTED] [mailto:linux-raid-
} [EMAIL PROTECTED] On Behalf Of Jens Axboe
} Sent: Saturday, June 02, 2007 10:35 AM
} To: Tejun Heo
} Cc: David Chinner; [EMAIL PROTECTED]; Phillip Susi; Neil Brown; linux-
} [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; dm-
} [EMAIL PROTECTED]; [EMAIL PROTECTED]; Stefan Bader; Andreas Dilger
} Subject: Re: [RFD] BIO_RW_BARRIER - what it means for devices,
} filesystems, and dm/md.
} 
} On Sat, Jun 02 2007, Tejun Heo wrote:
}  Hello,
} 
}  Jens Axboe wrote:
}   Would that be very different from issuing barrier and not waiting for
}   its completion?  For ATA and SCSI, we'll have to flush write back
} cache
}   anyway, so I don't see how we can get performance advantage by
}   implementing separate WRITE_ORDERED.  I think zero-length barrier
}   (haven't looked at the code yet, still recovering from jet lag :-)
} can
}   serve as genuine barrier without the extra write tho.
}  
}   As always, it depends :-)
}  
}   If you are doing pure flush barriers, then there's no difference.
} Unless
}   you only guarantee ordering wrt previously submitted requests, in
} which
}   case you can eliminate the post flush.
}  
}   If you are doing ordered tags, then just setting the ordered bit is
}   enough. That is different from the barrier in that we don't need a
} flush
}   of FUA bit set.
} 
}  Hmmm... I'm feeling dense.  Zero-length barrier also requires only one
}  flush to separate requests before and after it (haven't looked at the
}  code yet, will soon).  Can you enlighten me?
} 
} Yeah, that's what the zero-length barrier implementation I posted does.
} Not sure if you have a question beyond that, if so fire away :-)
} 
} --
} Jens Axboe

I must admit I have only read some of the barrier related posts, so this
issue may have been covered.  If so, sorry.

What I have read seems to be related to a single disk.  What if a logical
disk is used (md, LVM, ...)?  If a barrier is issued to a logical disk and
that driver issues barriers to all related devices (logical or physical),
all the devices MUST honor the barrier together.  If 1 device crosses the
barrier before another reaches the barrier, corruption should be assumed.
It seems to me each block device that represents two or more other devices
must do a flush at a barrier so that all devices will cross the barrier at
the same time.
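
As a purely illustrative sketch of that obligation (not how md or dm
actually implement barriers), a stacking driver would have to do roughly
the following before letting any post-barrier request through, assuming
the 2.6-era blkdev_issue_flush() signature; draining of in-flight requests
is omitted:

    #include <linux/blkdev.h>

    /* Flush every component device so that all of them cross the
     * barrier point together; only then may requests that were queued
     * after the barrier be issued. */
    static int sample_flush_all_components(struct block_device **comp,
                                           int ncomp)
    {
        int i, err = 0;

        for (i = 0; i < ncomp; i++) {
            int ret = blkdev_issue_flush(comp[i], NULL);
            if (ret && !err)
                err = ret;
        }
        return err;
    }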

Guy

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, Phillip Susi wrote:
  

Jens Axboe wrote:


No Stephan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.
  
I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 



But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to the platter.

If you don't have write back caching, or if the cache is battery backed
and thus guaranteed to never be lost, maintaining order is naturally
enough.
  


Do I misread this? If ordered doesn't reach all the way to the platter 
then there will be failure modes which result in order not being preserved. 
Battery backed cache doesn't prevent failures between the cache and the 
platter.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Neil Brown wrote:

On Friday June 1, [EMAIL PROTECTED] wrote:
  

On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:


David Chinner wrote:
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

So what if you want a synchronous write, but DON'T care about the order? 
  

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.



Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns.  That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.


My impression is that the sync will return when the i/o has been 
delivered to the device, and will get special treatment by the elevator 
code (I looked quickly, more is needed). I'm sure someone will tell me 
if I misread this. ;-)


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
> On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
>> Don't those thingies usually have NV cache or backed by battery such
>> that ORDERED_DRAIN is enough?
> 
> Probably *most* do, but do you really want to bet the user's data on it?

Thought we were talking about high-end storage stuff.  I don't think
I'll be too uncomfortable.  The reason why we're talking about this at
all is because high-end stuff with fancy NV cache and a hunk of battery
will unnecessarily suffer from the current barrier implementation.

>> The problem is that the interface between the host and a storage device
>> (ATA or SCSI) is not built to communicate that kind of information
>> (grouped flush, relaxed ordering...).  I think battery backed
>> ORDERED_DRAIN combined with fine-grained host queue flush would be
>> pretty good.  It doesn't require some fancy new interface which isn't
>> gonna be used widely anyway and can achieve most of performance gain if
>> the storage plays it smart.
> 
> Yes, that would probably be "pretty good".  But how do you get the storage
> device to *reliably* tell the truth about what it actually implements? 
> (Consider
> the number of devices that downright lie about their implementation of cache
> flushing)

SCSI NV bit or report write through cache?  Again, we're talking about
large arrays and we already trust the write through thing even on cheap
single spindle drives.  sd currently doesn't honor the NV bit and it's
causing some troubles on some arrays.  We'll probably have to honor it
at least conditionally.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Valdis . Kletnieks
On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
> Don't those thingies usually have NV cache or backed by battery such
> that ORDERED_DRAIN is enough?

Probably *most* do, but do you really want to bet the user's data on it?

> The problem is that the interface between the host and a storage device
> (ATA or SCSI) is not built to communicate that kind of information
> (grouped flush, relaxed ordering...).  I think battery backed
> ORDERED_DRAIN combined with fine-grained host queue flush would be
> pretty good.  It doesn't require some fancy new interface which isn't
> gonna be used widely anyway and can achieve most of performance gain if
> the storage plays it smart.

Yes, that would probably be "pretty good".  But how do you get the storage
device to *reliably* tell the truth about what it actually implements? (Consider
the number of devices that downright lie about their implementation of cache
flushing)




Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, Bill Davidsen wrote:
  

Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
 
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
   


On Thu, May 31 2007, David Chinner wrote:
 
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

   

if I am understanding it correctly, the big win for barriers is that 
you do NOT have to stop and wait until the data is on persistant media 
before you can continue.
 
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented
   


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
 
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate
   


Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

 
  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?



See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.
  


I was reading the documentation in block/biodoc.txt, which seems to just 
say ordered:


   1.2.1 I/O Barriers

   There is a way to enforce strict ordering for i/os through barriers.
   All requests before a barrier point must be serviced before the barrier
   request and any other requests arriving after the barrier will not be
   serviced until after the barrier has completed. This is useful for
   higher
   level control on write ordering, e.g flushing a log of committed updates
   to disk before the corresponding updates themselves.

   A flag in the bio structure, BIO_BARRIER is used to identify a
   barrier i/o.
   The generic i/o scheduler would make sure that it places the barrier
   request and
   all other requests coming after it after all the previous requests
   in the
   queue. Barriers may be implemented in different ways depending on the
   driver. A SCSI driver for example could make use of ordered tags to
   preserve the necessary ordering with a lower impact on throughput.
   For IDE
   this might be two sync cache flush: a pre and post flush when
   encountering
   a barrier write.

The "flush" comment is associated with IDE, so it wasn't clear that the 
device cache is always cleared to force the data to the platter.


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(



I see several uses for that, so I'd hope so.

  
I think the goal is good, more choice is almost always better; I just 
want to be sure there won't be big disk performance regressions.



We can't get more heavy weight than the current barrier, it's about as
conservative as you can get.

  



--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Jens Axboe
On Fri, Jun 01 2007, Tejun Heo wrote:
> Jens Axboe wrote:
> > On Thu, May 31 2007, David Chinner wrote:
> >> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> >>> On Thu, May 31 2007, David Chinner wrote:
>  IOWs, there are two parts to the problem:
> 
>   1 - guaranteeing I/O ordering
>   2 - guaranteeing blocks are on persistent storage.
> 
>  Right now, a single barrier I/O is used to provide both of these
>  guarantees. In most cases, all we really need to provide is 1); the
>  need for 2) is a much rarer condition but still needs to be
>  provided.
> 
> > if I am understanding it correctly, the big win for barriers is that 
> > you 
> > do NOT have to stop and wait until the data is on persistant media 
> > before 
> > you can continue.
>  Yes, if we define a barrier to only guarantee 1), then yes this
>  would be a big win (esp. for XFS). But that requires all filesystems
>  to handle sync writes differently, and sync_blockdev() needs to
>  call blkdev_issue_flush() as well
> 
>  So, what do we do here? Do we define a barrier I/O to only provide
>  ordering, or do we define it to also provide persistent storage
>  writeback? Whatever we decide, it needs to be documented
> >>> The block layer already has a notion of the two types of barriers, with
> >>> a very small amount of tweaking we could expose that. There's absolutely
> >>> zero reason we can't easily support both types of barriers.
> >> That sounds like a good idea - we can leave the existing
> >> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >> behaviour that only guarantees ordering. The filesystem can then
> >> choose which to use where appropriate
> > 
> > Precisely. The current definition of barriers are what Chris and I came
> > up with many years ago, when solving the problem for reiserfs
> > originally. It is by no means the only feasible approach.
> > 
> > I'll add a WRITE_ORDERED command to the #barrier branch, it already
> > contains the empty-bio barrier support I posted yesterday (well a
> > slightly modified and cleaned up version).
> 
> Would that be very different from issuing barrier and not waiting for
> its completion?  For ATA and SCSI, we'll have to flush write back cache
> anyway, so I don't see how we can get performance advantage by
> implementing separate WRITE_ORDERED.  I think zero-length barrier
> (haven't looked at the code yet, still recovering from jet lag :-) can
> serve as genuine barrier without the extra write tho.

As always, it depends :-)

If you are doing pure flush barriers, then there's no difference. Unless
you only guarantee ordering wrt previously submitted requests, in which
case you can eliminate the post flush.

If you are doing ordered tags, then just setting the ordered bit is
enough. That is different from the barrier in that we don't need a flush
or FUA bit set.

In reality maybe the difference isn't all that great, at least we can
start by having WRITE_ORDERED == WRITE_BARRIER.
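
As a sketch only: WRITE_BARRIER already exists in the 2.6-era
include/linux/fs.h, while WRITE_ORDERED below is hypothetical, just the
"start out identical" alias described above, submitted the same way a
barrier write is today:

    #include <linux/bio.h>
    #include <linux/fs.h>

    /* Hypothetical relaxed variant: initially an alias for WRITE_BARRIER,
     * so behaviour is unchanged until an ordering-only implementation
     * exists in the block layer. */
    #ifndef WRITE_ORDERED
    #define WRITE_ORDERED  WRITE_BARRIER
    #endif

    static void sample_submit_ordered(struct bio *bio)
    {
        /* ordering-only intent; today this still flushes like a barrier */
        submit_bio(WRITE_ORDERED, bio);
    }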

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread David Chinner
On Fri, Jun 01, 2007 at 03:59:51PM +1000, Neil Brown wrote:
> On Friday June 1, [EMAIL PROTECTED] wrote:
> > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > > David Chinner wrote:
> > > >That sounds like a good idea - we can leave the existing
> > > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > > >behaviour that only guarantees ordering. The filesystem can then
> > > >choose which to use where appropriate
> > > 
> > > So what if you want a synchronous write, but DON'T care about the order? 
> > 
> > submit_bio(WRITE_SYNC, bio);
> > 
> > Already there, already used by XFS, JFS and direct I/O.
> 
> Are you sure?
> 
> You seem to be saying that WRITE_SYNC causes the write to be safe on
> media before the request returns.

Sorry, I wasn't really all that clear :/

What I'm saying is that the *interface* for a higher layer to tell the block
layer that a sync write is being executed is already there, i.e. we can
already tell the block layer that we are doing a synchronous I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
> but when you consider the self-contained disk arrays it's an entirely
> different story. you can easily have a few gig of cache and a complete
> OS pretending to be a single drive as far as you are concerned.
> 
> and the price of such devices is plummeting (in large part thanks to
> Linux moving into this space), you can now readily buy a 10TB array for
> $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of the performance gain if
the storage plays it smart.
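
In block-layer terms that corresponds roughly to the drain-only ordered
mode; the sketch below is illustrative only, uses the 2.6-era constants,
and assumes the driver somehow knows its cache is non-volatile:

    #include <linux/blkdev.h>

    /* An array with battery/NV backed cache can satisfy barriers by
     * draining the queue alone; no SYNCHRONIZE CACHE is ever sent.
     * Without that protection one of the *_FLUSH modes (plus a
     * prepare_flush callback) would be needed instead. */
    static void sample_nv_array_init(struct request_queue *q)
    {
        blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
    }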

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Jens Axboe
On Fri, Jun 01 2007, Neil Brown wrote:
> On Friday June 1, [EMAIL PROTECTED] wrote:
> > On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > > David Chinner wrote:
> > > >That sounds like a good idea - we can leave the existing
> > > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > > >behaviour that only guarantees ordering. The filesystem can then
> > > >choose which to use where appropriate
> > > 
> > > So what if you want a synchronous write, but DON'T care about the order? 
> > 
> > submit_bio(WRITE_SYNC, bio);
> > 
> > Already there, already used by XFS, JFS and direct I/O.
> 
> Are you sure?
> 
> You seem to be saying that WRITE_SYNC causes the write to be safe on
> media before the request returns.  That isn't my understanding.
> I think (from comments near the definition and a quick grep through
> the code) that WRITE_SYNC expedites the delivery of the request
> through the elevator, but doesn't do anything special about getting it
> onto the media.
> It essentially say "Submit this request now, don't wait for more
> request to bundle with it for better bandwidth utilisation"

That is exactly right. WRITE_SYNC doesn't give any integrity guarantees,
it just makes sure it goes straight through the io scheduler.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Neil Brown
On Friday June 1, [EMAIL PROTECTED] wrote:
> On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> > David Chinner wrote:
> > >That sounds like a good idea - we can leave the existing
> > >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> > >behaviour that only guarantees ordering. The filesystem can then
> > >choose which to use where appropriate
> > 
> > So what if you want a synchronous write, but DON'T care about the order? 
> 
> submit_bio(WRITE_SYNC, bio);
> 
> Already there, already used by XFS, JFS and direct I/O.

Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns.  That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.
It essentially says "Submit this request now, don't wait for more
requests to bundle with it for better bandwidth utilisation"
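
A sketch of that distinction (illustrative only, not an in-tree helper;
the blkdev_issue_flush() signature is the 2.6-era one and waiting for bio
completion is elided):

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/fs.h>

    static void sample_sync_write_then_flush(struct block_device *bdev,
                                             struct bio *bio)
    {
        /* expedited through the elevator, but not guaranteed on media */
        submit_bio(WRITE_SYNC, bio);

        /* ... wait for bio completion here ... */

        /* only this forces the device cache out to the media */
        blkdev_issue_flush(bdev, NULL);
    }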

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Neil Brown
On Friday June 1, [EMAIL PROTECTED] wrote:
 On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
  David Chinner wrote:
  That sounds like a good idea - we can leave the existing
  WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
  behaviour that only guarantees ordering. The filesystem can then
  choose which to use where appropriate
  
  So what if you want a synchronous write, but DON'T care about the order? 
 
 submit_bio(WRITE_SYNC, bio);
 
 Already there, already used by XFS, JFS and direct I/O.

Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns.  That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.
It essentially say Submit this request now, don't wait for more
request to bundle with it for better bandwidth utilisation

NeilBrown
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Jens Axboe
On Fri, Jun 01 2007, Neil Brown wrote:
 On Friday June 1, [EMAIL PROTECTED] wrote:
  On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
   David Chinner wrote:
   That sounds like a good idea - we can leave the existing
   WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
   behaviour that only guarantees ordering. The filesystem can then
   choose which to use where appropriate
   
   So what if you want a synchronous write, but DON'T care about the order? 
  
  submit_bio(WRITE_SYNC, bio);
  
  Already there, already used by XFS, JFS and direct I/O.
 
 Are you sure?
 
 You seem to be saying that WRITE_SYNC causes the write to be safe on
 media before the request returns.  That isn't my understanding.
 I think (from comments near the definition and a quick grep through
 the code) that WRITE_SYNC expedites the delivery of the request
 through the elevator, but doesn't do anything special about getting it
 onto the media.
 It essentially say Submit this request now, don't wait for more
 request to bundle with it for better bandwidth utilisation

That is exactly right. WRITE_SYNC doesn't give any integrity guarentees,
it's just makes sure it goes straight through the io scheduler.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[ cc'ing Ric Wheeler for storage array thingie.  Hi, whole thread is at
http://thread.gmane.org/gmane.linux.kernel.device-mapper.devel/3344 ]

Hello,

[EMAIL PROTECTED] wrote:
 but when you consider the self-contained disk arrays it's an entirely
 different story. you can easily have a few gig of cache and a complete
 OS pretending to be a single drive as far as you are concerned.
 
 and the price of such devices is plummeting (in large part thanks to
 Linux moving into this space), you can now readily buy a 10TB array for
 $10k that looks like a single drive.

Don't those thingies usually have NV cache or backed by battery such
that ORDERED_DRAIN is enough?

The problem is that the interface between the host and a storage device
(ATA or SCSI) is not built to communicate that kind of information
(grouped flush, relaxed ordering...).  I think battery backed
ORDERED_DRAIN combined with fine-grained host queue flush would be
pretty good.  It doesn't require some fancy new interface which isn't
gonna be used widely anyway and can achieve most of performance gain if
the storage plays it smart.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Jens Axboe
On Fri, Jun 01 2007, Tejun Heo wrote:
 Jens Axboe wrote:
  On Thu, May 31 2007, David Chinner wrote:
  On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
  On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
 
  if I am understanding it correctly, the big win for barriers is that 
  you 
  do NOT have to stop and wait until the data is on persistant media 
  before 
  you can continue.
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
 
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
  The block layer already has a notion of the two types of barriers, with
  a very small amount of tweaking we could expose that. There's absolutely
  zero reason we can't easily support both types of barriers.
  That sounds like a good idea - we can leave the existing
  WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
  behaviour that only guarantees ordering. The filesystem can then
  choose which to use where appropriate
  
  Precisely. The current definition of barriers are what Chris and I came
  up with many years ago, when solving the problem for reiserfs
  originally. It is by no means the only feasible approach.
  
  I'll add a WRITE_ORDERED command to the #barrier branch, it already
  contains the empty-bio barrier support I posted yesterday (well a
  slightly modified and cleaned up version).
 
 Would that be very different from issuing barrier and not waiting for
 its completion?  For ATA and SCSI, we'll have to flush write back cache
 anyway, so I don't see how we can get performance advantage by
 implementing separate WRITE_ORDERED.  I think zero-length barrier
 (haven't looked at the code yet, still recovering from jet lag :-) can
 serve as genuine barrier without the extra write tho.

As always, it depends :-)

If you are doing pure flush barriers, then there's no difference. Unless
you only guarantee ordering wrt previously submitted requests, in which
case you can eliminate the post flush.

If you are doing ordered tags, then just setting the ordered bit is
enough. That is different from the barrier in that we don't need a flush
of FUA bit set.

In reality maybe the difference isn't all that great, at least we can
start by having WRITE_ORDERED == WRITE_BARRIER.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread David Chinner
On Fri, Jun 01, 2007 at 03:59:51PM +1000, Neil Brown wrote:
 On Friday June 1, [EMAIL PROTECTED] wrote:
  On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
   David Chinner wrote:
   That sounds like a good idea - we can leave the existing
   WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
   behaviour that only guarantees ordering. The filesystem can then
   choose which to use where appropriate
   
   So what if you want a synchronous write, but DON'T care about the order? 
  
  submit_bio(WRITE_SYNC, bio);
  
  Already there, already used by XFS, JFS and direct I/O.
 
 Are you sure?
 
 You seem to be saying that WRITE_SYNC causes the write to be safe on
 media before the request returns.

Sorry, I wasn't really all that clear :/

What I'm saying the *interface* for higher layer to tell the block layers
that a sync write is being executed is already there. i.e. we can already
tell the block layer that we are doing a synchronous I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, Bill Davidsen wrote:
  

Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
 
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
   


On Thu, May 31 2007, David Chinner wrote:
 
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

   

if I am understanding it correctly, the big win for barriers is that 
you do NOT have to stop and wait until the data is on persistant media 
before you can continue.
 
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented
   


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
 
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate
   


Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

 
  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?



See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.
  


I was reading the documentation in block/biodoc.txt, which seems to just 
say ordered:


   1.2.1 I/O Barriers

   There is a way to enforce strict ordering for i/os through barriers.
   All requests before a barrier point must be serviced before the barrier
   request and any other requests arriving after the barrier will not be
   serviced until after the barrier has completed. This is useful for
   higher
   level control on write ordering, e.g flushing a log of committed updates
   to disk before the corresponding updates themselves.

   A flag in the bio structure, BIO_BARRIER is used to identify a
   barrier i/o.
   The generic i/o scheduler would make sure that it places the barrier
   request and
   all other requests coming after it after all the previous requests
   in the
   queue. Barriers may be implemented in different ways depending on the
   driver. A SCSI driver for example could make use of ordered tags to
   preserve the necessary ordering with a lower impact on throughput.
   For IDE
   this might be two sync cache flush: a pre and post flush when
   encountering
   a barrier write.

The flush comment is associated with IDE, so it wasn't clear that the 
device cache is always cleared to force the data to the platter.


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(



I see several uses for that, so I'd hope so.

  
I think the goal is good, more choice is almost always better choice, I 
just want to be sure there won't be big disk performance regressions.



We can't get more heavy weight than the current barrier, it's about as
conservative as you can get.

  



--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Valdis . Kletnieks
On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
 Don't those thingies usually have NV cache or backed by battery such
 that ORDERED_DRAIN is enough?

Probably *most* do, but do you really want to bet the user's data on it?

 The problem is that the interface between the host and a storage device
 (ATA or SCSI) is not built to communicate that kind of information
 (grouped flush, relaxed ordering...).  I think battery backed
 ORDERED_DRAIN combined with fine-grained host queue flush would be
 pretty good.  It doesn't require some fancy new interface which isn't
 gonna be used widely anyway and can achieve most of performance gain if
 the storage plays it smart.

Yes, that would probably be pretty good.  But how do you get the storage
device to *reliably* tell the truth about what it actually implements? (Consider
the number of devices that downright lie about their implementation of cache
flushing)


pgpdwBQb4rzZE.pgp
Description: PGP signature


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Tejun Heo
[EMAIL PROTECTED] wrote:
 On Fri, 01 Jun 2007 16:16:01 +0900, Tejun Heo said:
 Don't those thingies usually have NV cache or backed by battery such
 that ORDERED_DRAIN is enough?
 
 Probably *most* do, but do you really want to bet the user's data on it?

Thought we were talking about high-end storage stuff.  I don't think
I'll be too uncomfortable.  The reason why we're talking about this at
all is because high-end stuff with fancy NV cache and a hunk of battery
will unnecessarily suffer from the current barrier implementation.

 The problem is that the interface between the host and a storage device
 (ATA or SCSI) is not built to communicate that kind of information
 (grouped flush, relaxed ordering...).  I think battery backed
 ORDERED_DRAIN combined with fine-grained host queue flush would be
 pretty good.  It doesn't require some fancy new interface which isn't
 gonna be used widely anyway and can achieve most of performance gain if
 the storage plays it smart.
 
 Yes, that would probably be pretty good.  But how do you get the storage
 device to *reliably* tell the truth about what it actually implements? 
 (Consider
 the number of devices that downright lie about their implementation of cache
 flushing)

SCSI NV bit or report write through cache?  Again, we're talking about
large arrays and we already trust the write through thing even on cheap
single spindle drives.  sd currently doesn't honor NV bit and it's
causing some troubles on some arrays.  We'll probably have to honor them
at least conditionally.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Neil Brown wrote:

On Friday June 1, [EMAIL PROTECTED] wrote:
  

On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:


David Chinner wrote:
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

So what if you want a synchronous write, but DON'T care about the order? 
  

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.



Are you sure?

You seem to be saying that WRITE_SYNC causes the write to be safe on
media before the request returns.  That isn't my understanding.
I think (from comments near the definition and a quick grep through
the code) that WRITE_SYNC expedites the delivery of the request
through the elevator, but doesn't do anything special about getting it
onto the media.


My impression is that the sync will return when the i/o has been 
delivered to the device, and will get special treatment by the elevator 
code (I looked quickly, more is needed). I'm sure someone will tell me 
if I misread this. ;-)
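A minimal sketch of what Neil's reading implies for a filesystem that needs durability, using the buffer_head helpers of the time: WRITE_SYNC only expedites the request through the elevator, so a separate device cache flush is still needed afterwards. write_block_durably() is a hypothetical helper, not something in the tree.

#include <linux/fs.h>
#include <linux/buffer_head.h>
#include <linux/blkdev.h>

static int write_block_durably(struct buffer_head *bh, struct block_device *bdev)
{
	lock_buffer(bh);
	bh->b_end_io = end_buffer_write_sync;
	get_bh(bh);
	submit_bh(WRITE_SYNC, bh);	/* expedited, but may still sit in the drive cache */
	wait_on_buffer(bh);

	if (!buffer_uptodate(bh))
		return -EIO;

	/* only this makes the write safe on a write-back caching drive */
	return blkdev_issue_flush(bdev, NULL);
}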


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-06-01 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, Phillip Susi wrote:
  

Jens Axboe wrote:


No Stephan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.
  
I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 



But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter.

If you don't have write back caching, or if the cache is battery backed
and thus guaranteed never to be lost, then maintaining order is naturally
enough.
  


Do I misread this? If ordered doesn't reach all the way to the platter 
then there will be failure modes which result in the order not being preserved. 
Battery backed cache doesn't prevent failures between the cache and the 
platter.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Fri, 1 Jun 2007, Tejun Heo wrote:


but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed an interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.


if you are talking about individual drives you may be right for the moment 
(but 16M cache on drives is a _lot_ larger than people imagined would be 
there a few years ago)


but when you consider the self-contained disk arrays it's an entirely 
different story. you can easily have a few gig of cache and a complete OS 
pretending to be a single drive as far as you are concerned.


and the price of such devices is plummeting (in large part thanks to Linux 
moving into this space), you can now readily buy a 10TB array for $10k 
that looks like a single drive.


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
> 2007/5/30, Phillip Susi <[EMAIL PROTECTED]>:
>> Stefan Bader wrote:
>> >
>> > Since drive a supports barrier request we don't get -EOPNOTSUPP but
>> > the request with block y might get written before block x since the
>> > disk are independent. I guess the chances of this are quite low since
>> > at some point a barrier request will also hit drive b but for the time
>> > being it might be better to indicate -EOPNOTSUPP right from
>> > device-mapper.
>>
>> The device mapper needs to ensure that ALL underlying devices get a
>> barrier request when one comes down from above, even if it has to
>> construct zero length barriers to send to most of them.
>>
> 
> And somehow also make sure all of the barriers have been processed
> before returning the barrier that came in. Plus it would have to queue
> all mapping requests until the barrier is done (if strictly acting
> according to barrier.txt).
> 
> But I am wondering a bit whether the requirements to barriers are
> really that tight as described in Tejun's document (barrier request is
> only started if everything before is safe, the barrier itself isn't
> returned until it is safe, too, and all requests after the barrier
> aren't started before the barrier is done). Is it really necessary to
> defer any further requests until the barrier has been written to save
> storage? Or would it be sufficient to guarantee that, if a barrier
> request returns, everything up to (including the barrier) is on safe
> storage?

Well, what's described in barrier.txt is the current implemented
semantics and what filesystems expect, so we can't change it underneath
them but we definitely can introduce new more relaxed variants, but one
thing we should bear in mind is that harddisks don't have humongous
caches or very smart controller / instruction set.  No matter how
relaxed an interface the block layer provides, in the end, it just has to
issue whole-sale FLUSH CACHE on the device to guarantee data ordering on
the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
>> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
>>> On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

> if I am understanding it correctly, the big win for barriers is that you 
> do NOT have to stop and wait until the data is on persistent media before 
> you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
>>> The block layer already has a notion of the two types of barriers, with
>>> a very small amount of tweaking we could expose that. There's absolutely
>>> zero reason we can't easily support both types of barriers.
>> That sounds like a good idea - we can leave the existing
>> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
>> behaviour that only guarantees ordering. The filesystem can then
>> choose which to use where appropriate
> 
> Precisely. The current definition of barriers are what Chris and I came
> up with many years ago, when solving the problem for reiserfs
> originally. It is by no means the only feasible approach.
> 
> I'll add a WRITE_ORDERED command to the #barrier branch, it already
> contains the empty-bio barrier support I posted yesterday (well a
> slightly modified and cleaned up version).

Would that be very different from issuing barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush write back cache
anyway, so I don't see how we can get performance advantage by
implementing separate WRITE_ORDERED.  I think zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as genuine barrier without the extra write tho.
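To illustrate, a sketch of what submitting such a zero-length barrier might look like from the submitter's side, written against the bio interfaces of that period (pre-2.6.24 bi_end_io prototype). This is illustrative only: mainline at the time did not yet accept empty barrier bios (that is exactly what the patch mentioned above adds), and issue_empty_barrier() is a hypothetical name.

#include <linux/fs.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/completion.h>

static int empty_barrier_end_io(struct bio *bio, unsigned int bytes_done, int err)
{
	if (bio->bi_size)
		return 1;			/* not finished yet */
	complete(bio->bi_private);
	return 0;
}

/* Issue a barrier with no payload: a pure ordering/flush point, no extra write. */
static int issue_empty_barrier(struct block_device *bdev)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct bio *bio = bio_alloc(GFP_NOIO, 0);
	int ret = 0;

	if (!bio)
		return -ENOMEM;

	bio->bi_bdev = bdev;
	bio->bi_end_io = empty_barrier_end_io;
	bio->bi_private = &done;

	submit_bio(WRITE_BARRIER, bio);
	wait_for_completion(&done);

	if (!bio_flagged(bio, BIO_UPTODATE))
		ret = -EIO;
	bio_put(bio);
	return ret;
}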

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
> David Chinner wrote:
> >That sounds like a good idea - we can leave the existing
> >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >behaviour that only guarantees ordering. The filesystem can then
> >choose which to use where appropriate
> 
> So what if you want a synchronous write, but DON'T care about the order? 

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, [EMAIL PROTECTED] wrote:
> On Thu, 31 May 2007, Jens Axboe wrote:
> 
> >On Thu, May 31 2007, Phillip Susi wrote:
> >>David Chinner wrote:
> >>>That sounds like a good idea - we can leave the existing
> >>>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >>>behaviour that only guarantees ordering. The filesystem can then
> >>>choose which to use where appropriate
> >>
> >>So what if you want a synchronous write, but DON'T care about the order?
> >>  They need to be two completely different flags which you can choose
> >>to combine, or use individually.
> >
> >If you have a use case for that, we can easily support it as well...
> >Depending on the drive capabilities (FUA support or not), it may be
> >nearly as slow as a "real" barrier write.
> 
> true, but a "real" barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
> can have other, unrelated writes re-ordered around it, a barrier write 
> can't)

That is true, the sync write also has side effects at the drive side
since it may have a varied cost depending on the workload (eg what
already resides in the cache when it is issued), unless FUA is active.
That is also true for the barrier of course, but only for previously
submitted IO as we don't reorder.

I'm not saying that a SYNC write won't be potentially useful, just that
it's definitely not free even outside of the write itself.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a "real" barrier write.


true, but a "real" barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it, a barrier write 
can't)


David Lang
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
> David Chinner wrote:
> >That sounds like a good idea - we can leave the existing
> >WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >behaviour that only guarantees ordering. The filesystem can then
> >choose which to use where appropriate
> 
> So what if you want a synchronous write, but DON'T care about the order? 
>   They need to be two completely different flags which you can choose 
> to combine, or use individually.

If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a "real" barrier write.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
> Jens Axboe wrote:
> >No Stephan is right, the barrier is both an ordering and integrity
> >constraint. If a driver completes a barrier request before that request
> >and previously submitted requests are on STABLE storage, then it
> >violates that principle. Look at the code and the various ordering
> >options.
> 
> I am saying that is the wrong thing to do.  Barrier should be about 
> ordering only.  So long as the order they hit the media is maintained, 
> the order the requests are completed in can change.  barrier.txt bears 

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter.

If you don't have write back caching, or if the cache is battery backed
and thus guaranteed never to be lost, then maintaining order is naturally
enough.

Or if the drive can do ordered queued commands, you can relax the
flushing (again depending on the cache type, you may need to take
different paths).

> "Requests in ordered sequence are issued in order, but not required to 
> finish in order.  Barrier implementation can handle out-of-order 
> completion of ordered sequence.  IOW, the requests MUST be processed in 
> order but the hardware/software completion paths are allowed to reorder 
> completion notifications - eg. current SCSI midlayer doesn't preserve 
> completion order during error handling."

If you carefully re-read that paragraph, then it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

Jens Axboe wrote:

No Stephan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.


I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 
this out:


"Requests in ordered sequence are issued in order, but not required to 
finish in order.  Barrier implementation can handle out-of-order 
completion of ordered sequence.  IOW, the requests MUST be processed in 
order but the hardware/software completion paths are allowed to reorder 
completion notifications - eg. current SCSI midlayer doesn't preserve 
completion order during error handling."



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order? 
  They need to be two completely different flags which you can choose 
to combine, or use individually.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:
you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns).


No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below


You say no, but then you go on to contradict yourself below.


Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.


There, you just ascribed the synchronous property to barrier requests. 
This is false.  Barriers are about ordering, synchronous writes are 
another thing entirely.  The filesystem is supposed to use barriers to 
maintain ordering for journal data.  If you are trying to handle a 
synchronous write request, that's another flag.



This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.


That's why for synchronous writes, you set the flag to mark the request 
as synchronous, which has nothing at all to do with barriers.  You are 
trying to use barriers to solve two different problems.  Use one flag to 
indicate ordering, and another to indicate synchrony.



Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.


This is a side effect of the implementation of the barrier, not part of 
the semantics of barriers, so you shouldn't rely on this behavior.  You 
don't have to use FUA to handle the barrier request, and if you don't, 
then the request can be completed while the data is still in the write 
cache.  You just have to make sure to flush it before any subsequent 
requests.



IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


Yep... two problems... two flags.


Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


We do the former or we end up in the same boat as O_DIRECT; where you 
have one flag that means several things, and no way to specify you only 
need some of those and not the others.
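To make the "two flags" split concrete, a sketch of how a journalling filesystem might choose submission flags. The JW_* kinds and submit_journal_write() are hypothetical; the WRITE_ORDERED mentioned in the comment is the flag proposed in this thread, not an existing one, so both the commit and the fsync case currently fall back to the full WRITE_BARRIER.

#include <linux/fs.h>
#include <linux/bio.h>

enum journal_write_kind {
	JW_ORDINARY,	/* plain journal/data write			*/
	JW_COMMIT,	/* journal commit: must not pass earlier writes	*/
	JW_FSYNC,	/* fsync()/O_SYNC: must be durable on return	*/
};

static void submit_journal_write(enum journal_write_kind kind, struct bio *bio)
{
	switch (kind) {
	case JW_COMMIT:
		/*
		 * Today this has to be a full barrier (ordering + flush).
		 * Under the proposal it could become a relaxed, ordering-only
		 * WRITE_ORDERED submission instead.
		 */
		submit_bio(WRITE_BARRIER, bio);
		break;
	case JW_FSYNC:
		/* ordering + durability: must be on media before returning */
		submit_bio(WRITE_BARRIER, bio);
		break;
	default:
		submit_bio(WRITE, bio);
		break;
	}
}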



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Bill Davidsen wrote:
> Jens Axboe wrote:
> >On Thu, May 31 2007, David Chinner wrote:
> >  
> >>On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> >>
> >>>On Thu, May 31 2007, David Chinner wrote:
> >>>  
> IOWs, there are two parts to the problem:
> 
>   1 - guaranteeing I/O ordering
>   2 - guaranteeing blocks are on persistent storage.
> 
> Right now, a single barrier I/O is used to provide both of these
> guarantees. In most cases, all we really need to provide is 1); the
> need for 2) is a much rarer condition but still needs to be
> provided.
> 
> 
> >if I am understanding it correctly, the big win for barriers is that 
> >you do NOT have to stop and wait until the data is on persistent media 
> >before you can continue.
> >  
> Yes, if we define a barrier to only guarantee 1), then yes this
> would be a big win (esp. for XFS). But that requires all filesystems
> to handle sync writes differently, and sync_blockdev() needs to
> call blkdev_issue_flush() as well
> 
> So, what do we do here? Do we define a barrier I/O to only provide
> ordering, or do we define it to also provide persistent storage
> writeback? Whatever we decide, it needs to be documented
> 
> >>>The block layer already has a notion of the two types of barriers, with
> >>>a very small amount of tweaking we could expose that. There's absolutely
> >>>zero reason we can't easily support both types of barriers.
> >>>  
> >>That sounds like a good idea - we can leave the existing
> >>WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> >>behaviour that only guarantees ordering. The filesystem can then
> >>choose which to use where appropriate
> >>
> >
> >Precisely. The current definition of barriers are what Chris and I came
> >up with many years ago, when solving the problem for reiserfs
> >originally. It is by no means the only feasible approach.
> >
> >I'll add a WRITE_ORDERED command to the #barrier branch, it already
> >contains the empty-bio barrier support I posted yesterday (well a
> >slightly modified and cleaned up version).
> >
> >  
> Wait. Do filesystems expect (depend on) anything but ordering now? Does 
> md? Having users of barriers as they currently behave suddenly getting 
> SYNC behavior where they expect ORDERED is likely to have a negative 
> effect on performance. Or do I misread what is actually guaranteed by 
> WRITE_BARRIER now, and a flush is currently happening in all cases?

See the above stuff you quote, it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.

> And will this also be available to user space f/s, since I just proposed 
> a project which uses one? :-(

I see several uses for that, so I'd hope so.

> I think the goal is good, more choice is almost always better choice, I 
> just want to be sure there won't be big disk performance regressions.

We can't get more heavy weight than the current barrier, it's about as
conservative as you can get.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate



Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(
I think the goal is good, more choice is almost always better choice, I 
just want to be sure there won't be big disk performance regressions.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Neil Brown wrote:

On Monday May 28, [EMAIL PROTECTED] wrote:
  

There are two things I'm not sure you covered.

First, disks which don't support flush but do have a "cache dirty" 
status bit you can poll at times like shutdown. If there are no drivers 
which support these, it can be ignored.



There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
  


Yes, there really are (or were). But I don't think that there are 
drivers, so it's not an issue.

That wouldn't be very good for performance  maybe you just
wouldn't bother with barriers on that sort of device?
  


That is why there are no drivers...

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have "-o nobarriers" or "-o barriers=0",
or the inverse.
  


If they can function usefully without, the admin gets to make that choice.

md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.
  


I don't see how you can have reliable operation without it, particularly 
WRT bitmap.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

  
Since this is device dependent, it really should be in the device 
driver, and requests should have status of success, failure, or feature 
unavailability.




Second, NAS (including nbd?). Is there enough information to handle this "really right"?



NAS means lots of things, including NFS and CIFS where this doesn't
apply.
  


Well, we're really talking about network attached devices rather than 
network filesystems. I guess people do lump them together.



For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
  


Pretty much agrees with what I said above, it's at a level closer to the 
device, and status should come back from the physical i/o request.

For 'iscsi', I guess it works just the same as SCSI...
  


Hopefully.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Stefan Bader

2007/5/30, Phillip Susi <[EMAIL PROTECTED]>:

Stefan Bader wrote:
>
> Since drive a supports barrier request we don't get -EOPNOTSUPP but
> the request with block y might get written before block x since the
> disk are independent. I guess the chances of this are quite low since
> at some point a barrier request will also hit drive b but for the time
> being it might be better to indicate -EOPNOTSUPP right from
> device-mapper.

The device mapper needs to ensure that ALL underlying devices get a
barrier request when one comes down from above, even if it has to
construct zero length barriers to send to most of them.



And somehow also make sure all of the barriers have been processed
before returning the barrier that came in. Plus it would have to queue
all mapping requests until the barrier is done (if strictly acting
according to barrier.txt).

But I am wondering a bit whether the requirements to barriers are
really that tight as described in Tejun's document (barrier request is
only started if everything before is safe, the barrier itself isn't
returned until it is safe, too, and all requests after the barrier
aren't started before the barrier is done). Is it really necessary to
defer any further requests until the barrier has been written to save
storage? Or would it be sufficient to guarantee that, if a barrier
request returns, everything up to (including the barrier) is on safe
storage?
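A sketch of the fan-out being described, for a hypothetical dm target with an array of underlying devices. It reuses the issue_empty_barrier() sketch from earlier in this archive and issues the per-device barriers serially for simplicity; a real implementation would submit them in parallel, wait for all of them, and also queue incoming requests around the barrier as barrier.txt requires.

/*
 * Illustrative only: dm_flush_all_devices() and the dev[]/ndev layout are
 * hypothetical, and issue_empty_barrier() is the sketch shown earlier.
 */
static int dm_flush_all_devices(struct block_device **dev, int ndev)
{
	int i, err, ret = 0;

	for (i = 0; i < ndev; i++) {
		/* zero-length barrier to every underlying device */
		err = issue_empty_barrier(dev[i]);
		if (err && !ret)
			ret = err;
	}

	/* only once all of them have completed may the original barrier be ended */
	return ret;
}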

Stefan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
> On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> > On Thu, May 31 2007, David Chinner wrote:
> > > IOWs, there are two parts to the problem:
> > > 
> > >   1 - guaranteeing I/O ordering
> > >   2 - guaranteeing blocks are on persistent storage.
> > > 
> > > Right now, a single barrier I/O is used to provide both of these
> > > guarantees. In most cases, all we really need to provide is 1); the
> > > need for 2) is a much rarer condition but still needs to be
> > > provided.
> > > 
> > > > if I am understanding it correctly, the big win for barriers is that 
> > > > you 
> > > > do NOT have to stop and wait until the data is on persistent media 
> > > > before 
> > > > you can continue.
> > > 
> > > Yes, if we define a barrier to only guarantee 1), then yes this
> > > would be a big win (esp. for XFS). But that requires all filesystems
> > > to handle sync writes differently, and sync_blockdev() needs to
> > > call blkdev_issue_flush() as well
> > > 
> > > So, what do we do here? Do we define a barrier I/O to only provide
> > > ordering, or do we define it to also provide persistent storage
> > > writeback? Whatever we decide, it needs to be documented
> > 
> > The block layer already has a notion of the two types of barriers, with
> > a very small amount of tweaking we could expose that. There's absolutely
> > zero reason we can't easily support both types of barriers.
> 
> That sounds like a good idea - we can leave the existing
> WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
> behaviour that only guarantees ordering. The filesystem can then
> choose which to use where appropriate

Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
> On Thu, May 31 2007, David Chinner wrote:
> > IOWs, there are two parts to the problem:
> > 
> > 1 - guaranteeing I/O ordering
> > 2 - guaranteeing blocks are on persistent storage.
> > 
> > Right now, a single barrier I/O is used to provide both of these
> > guarantees. In most cases, all we really need to provide is 1); the
> > need for 2) is a much rarer condition but still needs to be
> > provided.
> > 
> > > if I am understanding it correctly, the big win for barriers is that you 
> > > do NOT have to stop and wait until the data is on persistent media before 
> > > you can continue.
> > 
> > Yes, if we define a barrier to only guarantee 1), then yes this
> > would be a big win (esp. for XFS). But that requires all filesystems
> > to handle sync writes differently, and sync_blockdev() needs to
> > call blkdev_issue_flush() as well
> > 
> > So, what do we do here? Do we define a barrier I/O to only provide
> > ordering, or do we define it to also provide persistent storage
> > writeback? Whatever we decide, it needs to be documented
> 
> The block layer already has a notion of the two types of barriers, with
> a very small amount of tweaking we could expose that. There's absolutely
> zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
> IOWs, there are two parts to the problem:
> 
>   1 - guaranteeing I/O ordering
>   2 - guaranteeing blocks are on persistent storage.
> 
> Right now, a single barrier I/O is used to provide both of these
> guarantees. In most cases, all we really need to provide is 1); the
> need for 2) is a much rarer condition but still needs to be
> provided.
> 
> > if I am understanding it correctly, the big win for barriers is that you 
> > do NOT have to stop and wait until the data is on persistent media before 
> > you can continue.
> 
> Yes, if we define a barrier to only guarantee 1), then yes this
> would be a big win (esp. for XFS). But that requires all filesystems
> to handle sync writes differently, and sync_blockdev() needs to
> call blkdev_issue_flush() as well
> 
> So, what do we do here? Do we define a barrier I/O to only provide
> ordering, or do we define it to also provide persistent storage
> writeback? Whatever we decide, it needs to be documented

The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Wed, May 30 2007, Phillip Susi wrote:
> >That would be the exactly how I understand Documentation/block/barrier.txt:
> >
> >"In other words, I/O barrier requests have the following two properties.
> >1. Request ordering
> >...
> >2. Forced flushing to physical medium"
> >
> >"So, I/O barriers need to guarantee that requests actually get written
> >to non-volatile medium in order." 
> 
> I think you misinterpret this, and it probably could be worded a bit 
> better.  The barrier request is about constraining the order.  The 
> forced flushing is one means to implement that constraint.  The other 
> alternative mentioned there is to use ordered tags.  The key part there 
> is "requests actually get written to non-volatile medium _in order_", 
> not "before the request completes", which would be synchronous IO.

No Stephan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Wed, May 30 2007, Phillip Susi wrote:
 That would be the exactly how I understand Documentation/block/barrier.txt:
 
 In other words, I/O barrier requests have the following two properties.
 1. Request ordering
 ...
 2. Forced flushing to physical medium
 
 So, I/O barriers need to guarantee that requests actually get written
 to non-volatile medium in order. 
 
 I think you misinterpret this, and it probably could be worded a bit 
 better.  The barrier request is about constraining the order.  The 
 forced flushing is one means to implement that constraint.  The other 
 alternative mentioned there is to use ordered tags.  The key part there 
 is requests actually get written to non-volatile medium _in order_, 
 not before the request completes, which would be synchronous IO.

No Stephan is right, the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.
 
  if I am understanding it correctly, the big win for barriers is that you 
  do NOT have to stop and wait until the data is on persistent media before 
  you can continue.
 
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well
 
 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented

The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
  
  1 - guaranteeing I/O ordering
  2 - guaranteeing blocks are on persistent storage.
  
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
  
   if I am understanding it correctly, the big win for barriers is that you 
   do NOT have to stop and wait until the data is on persistent media before 
   you can continue.
  
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
  
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
  On Thu, May 31 2007, David Chinner wrote:
   IOWs, there are two parts to the problem:
   
 1 - guaranteeing I/O ordering
 2 - guaranteeing blocks are on persistent storage.
   
   Right now, a single barrier I/O is used to provide both of these
   guarantees. In most cases, all we really need to provide is 1); the
   need for 2) is a much rarer condition but still needs to be
   provided.
   
if I am understanding it correctly, the big win for barriers is that 
you 
do NOT have to stop and wait until the data is on persistent media 
before 
you can continue.
   
   Yes, if we define a barrier to only guarantee 1), then yes this
   would be a big win (esp. for XFS). But that requires all filesystems
   to handle sync writes differently, and sync_blockdev() needs to
   call blkdev_issue_flush() as well
   
   So, what do we do here? Do we define a barrier I/O to only provide
   ordering, or do we define it to also provide persistent storage
   writeback? Whatever we decide, it needs to be documented
  
  The block layer already has a notion of the two types of barriers, with
  a very small amount of tweaking we could expose that. There's absolutely
  zero reason we can't easily support both types of barriers.
 
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate

Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Stefan Bader

2007/5/30, Phillip Susi [EMAIL PROTECTED]:

Stefan Bader wrote:

 Since drive a supports barrier request we don't get -EOPNOTSUPP but
 the request with block y might get written before block x since the
 disk are independent. I guess the chances of this are quite low since
 at some point a barrier request will also hit drive b but for the time
 being it might be better to indicate -EOPNOTSUPP right from
 device-mapper.

The device mapper needs to ensure that ALL underlying devices get a
barrier request when one comes down from above, even if it has to
construct zero length barriers to send to most of them.



And somehow also make sure all of the barriers have been processed
before returning the barrier that came in. Plus it would have to queue
all mapping requests until the barrier is done (if strictly acting
according to barrier.txt).

But I am wondering a bit whether the requirements to barriers are
really that tight as described in Tejun's document (barrier request is
only started if everything before is safe, the barrier itself isn't
returned until it is safe, too, and all requests after the barrier
aren't started before the barrier is done). Is it really necessary to
defer any further requests until the barrier has been written to save
storage? Or would it be sufficient to guarantee that, if a barrier
request returns, everything up to (including the barrier) is on safe
storage?

Stefan
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Neil Brown wrote:

On Monday May 28, [EMAIL PROTECTED] wrote:
  

There are two things I'm not sure you covered.

First, disks which don't support flush but do have a cache dirty 
status bit you can poll at times like shutdown. If there are no drivers 
which support these, it can be ignored.



There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
  


Yes, there really are (or were). But I don't think that there are 
drivers, so it's not an issue.

That wouldn't be very good for performance  maybe you just
wouldn't bother with barriers on that sort of device?
  


That is why there are no drivers...

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have -o nobarriers or -o barriers=0,
or the inverse.
  


If they can function usefully without, the admin gets to make that choice.

md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.
  


I don't see how you can have reliable operation without it, particularly 
WRT bitmap.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

  
Since this is device dependent, it really should be in the device 
driver, and requests should have status of success, failure, or feature 
unavailability.




Second, NAS (including nbd?). Is there enough information to handle this really right?



NAS means lots of things, including NFS and CIFS where this doesn't
apply.
  


Well, we're really talking about network attached devices rather than 
network filesystems. I guess people do lump them together.



For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
  


Pretty much agrees with what I said above, it's at a level closer to the 
device, and status should come back from the physical i/o request.

For 'iscsi', I guess it works just the same as SCSI...
  


Hopefully.

--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Bill Davidsen

Jens Axboe wrote:

On Thu, May 31 2007, David Chinner wrote:
  

On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:


On Thu, May 31 2007, David Chinner wrote:
  

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


if I am understanding it correctly, the big win for barriers is that you 
do NOT have to stop and wait until the data is on persistent media before 
you can continue.
  

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


The block layer already has a notion of the two types of barriers, with
a very small amount of tweaking we could expose that. There's absolutely
zero reason we can't easily support both types of barriers.
  

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate



Precisely. The current definition of barriers are what Chris and I came
up with many years ago, when solving the problem for reiserfs
originally. It is by no means the only feasible approach.

I'll add a WRITE_ORDERED command to the #barrier branch, it already
contains the empty-bio barrier support I posted yesterday (well a
slightly modified and cleaned up version).

  
Wait. Do filesystems expect (depend on) anything but ordering now? Does 
md? Having users of barriers as they currently behave suddenly getting 
SYNC behavior where they expect ORDERED is likely to have a negative 
effect on performance. Or do I misread what is actually guaranteed by 
WRITE_BARRIER now, and a flush is currently happening in all cases?


And will this also be available to user space f/s, since I just proposed 
a project which uses one? :-(
I think the goal is good, more choice is almost always better choice, I 
just want to be sure there won't be big disk performance regressions.


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Bill Davidsen wrote:
 Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
   
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 
 On Thu, May 31 2007, David Chinner wrote:
   
 IOWs, there are two parts to the problem:
 
   1 - guaranteeing I/O ordering
   2 - guaranteeing blocks are on persistent storage.
 
 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.
 
 
 if I am understanding it correctly, the big win for barriers is that 
 you do NOT have to stop and wait until the data is on persistent media 
 before you can continue.
   
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well
 
 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
   
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).
 
   
 Wait. Do filesystems expect (depend on) anything but ordering now? Does 
 md? Having users of barriers as they currently behave suddenly getting 
 SYNC behavior where they expect ORDERED is likely to have a negative 
 effect on performance. Or do I misread what is actually guaranteed by 
 WRITE_BARRIER now, and a flush is currently happening in all cases?

See the above stuff you quote; it's answered there. It's not a change,
this is how the Linux barrier write has always worked since I first
implemented it. What David and I are talking about is adding a more
relaxed version as well, that just implies ordering.
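
A rough sketch of how such a relaxed variant could sit next to the existing
request types (WRITE_ORDERED and BIO_RW_ORDERED below are hypothetical and
never part of the kernel; the other two macros are approximately what
include/linux/fs.h carried at the time, and commit_bio stands for a
caller-prepared bio):

  /* approximate existing definitions */
  #define WRITE_SYNC     (WRITE | (1 << BIO_RW_SYNC))             /* fast completion, no ordering */
  #define WRITE_BARRIER  ((1 << BIO_RW) | (1 << BIO_RW_BARRIER))  /* ordering + flush to media */

  /* hypothetical relaxed variant: ordering only, no cache flush implied */
  #define WRITE_ORDERED  ((1 << BIO_RW) | (1 << BIO_RW_ORDERED))

  /* a journal commit that only needs ordering would then do: */
  submit_bio(WRITE_ORDERED, commit_bio);

  /* while an fsync-style durability point keeps using: */
  submit_bio(WRITE_BARRIER, commit_bio);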

 And will this also be available to user space f/s, since I just proposed 
 a project which uses one? :-(

I see several uses for that, so I'd hope so.

 I think the goal is good, more choice is almost always better choice, I 
 just want to be sure there won't be big disk performance regressions.

We can't get more heavy weight than the current barrier, it's about as
conservative as you can get.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:
you are understanding barriers to be the same as synchronous writes (and 
therefore the data is on persistent media before the call returns).


No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below


You say no, but then you go on to contradict yourself below.


Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.


There, you just ascribed the synchronous property to barrier requests. 
This is false.  Barriers are about ordering, synchronous writes are 
another thing entirely.  The filesystem is supposed to use barriers to 
maintain ordering for journal data.  If you are trying to handle a 
synchronous write request, that's another flag.



This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.


That's why for synchronous writes, you set the flag to mark the request 
as synchronous, which has nothing at all to do with barriers.  You are 
trying to use barriers to solve two different problems.  Use one flag to 
indicate ordering, and another to indicate synchronicity.



Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.


This is a side effect of the implementation of the barrier, not part of 
the semantics of barriers, so you shouldn't rely on this behavior.  You 
don't have to use FUA to handle the barrier request, and if you don't, 
then the request can be completed while the data is still in the write 
cache.  You just have to make sure to flush it before any subsequent 
requests.
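
For reference, the block layer of that period modelled both strategies as
ordered-mode constants that a low-level driver registers; a sketch (q and
my_prepare_flush are illustrative; as far as I recall, the QUEUE_ORDERED_*
constants and blk_queue_ordered() existed in include/linux/blkdev.h of that
era):

  /* drive honours FUA: drain the queue, pre-flush, then write the
   * barrier block with Force Unit Access set */
  blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FUA, my_prepare_flush);

  /* no FUA: drain the queue, pre-flush, write the barrier block into the
   * cache, then post-flush before completing it -- the "flush it before
   * any subsequent requests" case above */
  blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, my_prepare_flush);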



IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.


Yep... two problems... two flags.


Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented


We do the former, or we end up in the same boat as O_DIRECT, where you 
have one flag that means several things and no way to specify that you only 
need some of those and not the others.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order? 
  They need to be two completely different flags which you can choose 
to combine, or use individually.


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Phillip Susi

Jens Axboe wrote:

No, Stefan is right: the barrier is both an ordering and integrity
constraint. If a driver completes a barrier request before that request
and previously submitted requests are on STABLE storage, then it
violates that principle. Look at the code and the various ordering
options.


I am saying that is the wrong thing to do.  Barrier should be about 
ordering only.  So long as the order they hit the media is maintained, 
the order the requests are completed in can change.  barrier.txt bears 
this out:


Requests in ordered sequence are issued in order, but not required to 
finish in order.  Barrier implementation can handle out-of-order 
completion of ordered sequence.  IOW, the requests MUST be processed in 
order but the hardware/software completion paths are allowed to reorder 
completion notifications - eg. current SCSI midlayer doesn't preserve 
completion order during error handling.



-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 Jens Axboe wrote:
 No, Stefan is right: the barrier is both an ordering and integrity
 constraint. If a driver completes a barrier request before that request
 and previously submitted requests are on STABLE storage, then it
 violates that principle. Look at the code and the various ordering
 options.
 
 I am saying that is the wrong thing to do.  Barrier should be about 
 ordering only.  So long as the order they hit the media is maintained, 
 the order the requests are completed in can change.  barrier.txt bears 

But you can't guarantee ordering without flushing the data out as well.
It all depends on the type of cache on the device, of course. If you
look at the ordinary sata/ide drive with write back caching, you can't
just issue the requests in order and pray that the drive cache will make
it to platter.

If you don't have write-back caching, or if the cache is battery backed
and thus guaranteed never to be lost, then simply maintaining the issue
order is enough.

Or if the drive can do ordered queued commands, you can relax the
flushing (again depending on the cache type, you may need to take
different paths).
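
That decision is essentially what a low-level driver already tells the block
layer; a sketch of the choice described above (the drive_* predicates and
my_prepare_flush are invented placeholders; the QUEUE_ORDERED_* modes and
blk_queue_ordered() are, to the best of my recollection, the interfaces of
that kernel generation):

  if (!drive_has_writeback_cache(drive) ||
      drive_cache_is_battery_backed(drive))
          /* nothing to flush: draining the queue preserves order */
          blk_queue_ordered(q, QUEUE_ORDERED_DRAIN, NULL);
  else if (drive_supports_ordered_tags(drive))
          /* ordered tags keep the queue busy, but the write-back cache
           * still needs flushing around the barrier */
          blk_queue_ordered(q, QUEUE_ORDERED_TAG_FLUSH, my_prepare_flush);
  else
          /* plain write-back cache: drain and flush around the barrier */
          blk_queue_ordered(q, QUEUE_ORDERED_DRAIN_FLUSH, my_prepare_flush);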

 Requests in ordered sequence are issued in order, but not required to 
 finish in order.  Barrier implementation can handle out-of-order 
 completion of ordered sequence.  IOW, the requests MUST be processed in 
 order but the hardware/software completion paths are allowed to reorder 
 completion notifications - eg. current SCSI midlayer doesn't preserve 
 completion order during error handling.

If you carefully re-read that paragraph, then it just tells you that the
software implementation can deal with reordered completions. It doesn't
relax the constraints on ordering and integrity AT ALL.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 
   They need to be two completely different flags which you can choose 
 to combine, or use individually.

If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Thu, 31 May 2007, Jens Axboe wrote:


On Thu, May 31 2007, Phillip Susi wrote:

David Chinner wrote:

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate


So what if you want a synchronous write, but DON'T care about the order?
  They need to be two completely different flags which you can choose
to combine, or use individually.


If you have a use case for that, we can easily support it as well...
Depending on the drive capabilities (FUA support or not), it may be
nearly as slow as a real barrier write.


true, but a real barrier write could have significant side effects on 
other writes that wouldn't happen with a synchronous write (a sync write 
can have other, unrelated writes re-ordered around it, a barrier write 
can't)


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Jens Axboe
On Thu, May 31 2007, [EMAIL PROTECTED] wrote:
 On Thu, 31 May 2007, Jens Axboe wrote:
 
 On Thu, May 31 2007, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order?
   They need to be two completely different flags which you can choose
 to combine, or use individually.
 
 If you have a use case for that, we can easily support it as well...
 Depending on the drive capabilities (FUA support or not), it may be
 nearly as slow as a real barrier write.
 
 true, but a real barrier write could have significant side effects on 
 other writes that wouldn't happen with a synchronous write (a sync write 
 can have other, unrelated writes re-ordered around it, a barrier write 
 can't)

That is true, the sync write also has side effects at the drive side
since it may have a varied cost depending on the workload (eg what
already resides in the cache when it is issued), unless FUA is active.
That is also true for the barrier of course, but only for previously
submitted IO as we don't reorder.

I'm not saying that a SYNC write won't be potentially useful, just that
it's definitely not free even outside of the write itself.

-- 
Jens Axboe

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
 IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

 Right now, a single barrier I/O is used to provide both of these
 guarantees. In most cases, all we really need to provide is 1); the
 need for 2) is a much rarer condition but still needs to be
 provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistent media before 
 you can continue.
 Yes, if we define a barrier to only guarantee 1), then yes this
 would be a big win (esp. for XFS). But that requires all filesystems
 to handle sync writes differently, and sync_blockdev() needs to
 call blkdev_issue_flush() as well

 So, what do we do here? Do we define a barrier I/O to only provide
 ordering, or do we define it to also provide persistent storage
 writeback? Whatever we decide, it needs to be documented
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 Precisely. The current definition of barriers is what Chris and I came
 up with many years ago, when solving the problem for reiserfs
 originally. It is by no means the only feasible approach.
 
 I'll add a WRITE_ORDERED command to the #barrier branch, it already
 contains the empty-bio barrier support I posted yesterday (well a
 slightly modified and cleaned up version).

Would that be very different from issuing a barrier and not waiting for
its completion?  For ATA and SCSI, we'll have to flush the write-back cache
anyway, so I don't see how we can get a performance advantage by
implementing a separate WRITE_ORDERED.  I think a zero-length barrier
(haven't looked at the code yet, still recovering from jet lag :-) can
serve as a genuine barrier without the extra write, though.
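
As an illustration, a zero-length barrier is just a bio with no payload and
the barrier flag set, much like what blkdev_issue_flush() builds internally;
a sketch assuming the empty-bio barrier patches mentioned in this thread
(the completion handler and cookie are invented, and waiting for completion
is left out):

  struct bio *bio = bio_alloc(GFP_KERNEL, 0);   /* zero data pages */

  bio->bi_bdev = bdev;                  /* device to order against */
  bio->bi_end_io = my_barrier_end_io;   /* hypothetical completion handler */
  bio->bi_private = my_cookie;

  submit_bio(WRITE_BARRIER, bio);       /* no payload: a pure ordering point */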

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread Tejun Heo
Stefan Bader wrote:
 2007/5/30, Phillip Susi [EMAIL PROTECTED]:
 Stefan Bader wrote:
 
  Since drive a supports barrier requests we don't get -EOPNOTSUPP, but
  the request with block y might get written before block x since the
  disks are independent. I guess the chances of this are quite low since
  at some point a barrier request will also hit drive b, but for the time
  being it might be better to indicate -EOPNOTSUPP right from
  device-mapper.

 The device mapper needs to ensure that ALL underlying devices get a
 barrier request when one comes down from above, even if it has to
 construct zero length barriers to send to most of them.

 
 And somehow also make sure all of the barriers have been processed
 before returning the barrier that came in. Plus it would have to queue
 all mapping requests until the barrier is done (if strictly acting
 according to barrier.txt).
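
A very rough sketch of what that means for a striped target, ignoring the
request-queuing part and all error handling (the conf structure and the
send/wait helpers are invented; only the zero-length barrier idea and the
three-argument bio_endio() of that kernel generation are drawn from the
thread and the API of the time):

  static void forward_barrier(struct my_stripe_conf *conf,
                              struct bio *barrier_bio)
  {
          int i;

          /* 1. hold back any new mapping requests (not shown) */

          /* 2. push a zero-length barrier down every leg */
          for (i = 0; i < conf->nr_devs; i++)
                  send_empty_barrier(conf->devs[i]);

          /* 3. wait until every leg has completed its barrier ... */
          wait_for_all_sub_barriers(conf);

          /* 4. ... and only then complete the barrier from above */
          bio_endio(barrier_bio, barrier_bio->bi_size, 0);
  }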
 
 But I am wondering a bit whether the requirements on barriers are
 really as tight as described in Tejun's document (a barrier request is
 only started if everything before it is safe, the barrier itself isn't
 returned until it is safe, too, and all requests after the barrier
 aren't started before the barrier is done). Is it really necessary to
 defer any further requests until the barrier has been written to safe
 storage? Or would it be sufficient to guarantee that, if a barrier
 request returns, everything up to and including the barrier is on safe
 storage?

Well, what's described in barrier.txt is the currently implemented
semantics and what filesystems expect, so we can't change it underneath
them, but we can definitely introduce new, more relaxed variants.  One
thing we should bear in mind is that hard disks don't have humongous
caches or very smart controllers / instruction sets.  No matter how
relaxed an interface the block layer provides, in the end it just has to
issue a wholesale FLUSH CACHE on the device to guarantee data ordering on
the media.

IMHO, we can do better by paying more attention to how we do things in
the request queue which can be deeper and more intelligent than the
device queue.

Thanks.

-- 
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread david

On Fri, 1 Jun 2007, Tejun Heo wrote:


but one
thing we should bear in mind is that hard disks don't have humongous
caches or very smart controllers / instruction sets.  No matter how
relaxed an interface the block layer provides, in the end it just has to
issue a wholesale FLUSH CACHE on the device to guarantee data ordering on
the media.


if you are talking about individual drives you may be right for the moment 
(but 16M cache on drives is a _lot_ larger than people imagined would be 
there a few years ago)


but when you consider the self-contained disk arrays it's an entirely 
different story. you can easily have a few gig of cache and a complete OS 
pretending to be a single drive as far as you are concerned.


and the price of such devices is plummeting (in large part thanks to Linux 
moving into this space); you can now readily buy a 10TB array for $10k 
that looks like a single drive.


David Lang
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
> Neil Brown writes:
>  > 
> 
> [...]
> 
>  > Thus the general sequence might be:
>  > 
>  >   a/ issue all "preceding writes".
>  >   b/ issue the commit write with BIO_RW_BARRIER
>  >   c/ wait for the commit to complete.
>  >  If it was successful - done.
>  >  If it failed other than with EOPNOTSUPP, abort
>  >  else continue
>  >   d/ wait for all 'preceding writes' to complete
>  >   e/ call blkdev_issue_flush
>  >   f/ issue commit write without BIO_RW_BARRIER
>  >   g/ wait for commit write to complete
>  >if it failed, abort
>  >   h/ call blkdev_issue
>  >   DONE
>  > 
>  > steps b and c can be left out if it is known that the device does not
>  > support barriers.  The only way to discover this to try and see if it
>  > fails.
>  > 
>  > I don't think any filesystem follows all these steps.
> 
> It seems that steps b/ -- h/ are quite generic, and can be implemented
> once in a generic code (with some synchronization mechanism like
> wait-queue at d/).

Yes and no.
It depends on what you mean by "preceding write".

If you implement this in the filesystem, the filesystem can wait only
for those writes where it has an ordering dependency.   If you
implement it in common code, then you have to wait for all writes
that were previously issued.

e.g.
  If you have two different filesystems on two different partitions on
  the one device, why should writes in one filesystem wait for a
  barrier issued in the other filesystem?
  If you have a single filesystem with one thread doing lots of
  over-writes (no metadata changes) and another doing lots of
  metadata changes (requiring journalling and barriers), why should the
  data write be held up by the metadata updates?

So I'm not actually convinced that doing this in common code is the
best approach.  But it is the easiest.  The common code should provide
the barrier and flushing primitives, and the filesystem gets to use
them however it likes.
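
As a sketch of what such a primitive might look like if it were provided,
following steps b/ to h/ quoted above (everything here is hypothetical:
write_commit_block() and wait_for_my_dependent_writes() stand in for
filesystem-specific pieces, and step h/ is read here as a second
blkdev_issue_flush(); only WRITE / WRITE_BARRIER and blkdev_issue_flush()
are interfaces of that era):

  int commit_with_barrier_fallback(struct block_device *bdev)
  {
          int err;

          /* b/ + c/: try the commit as a barrier write and wait for it */
          err = write_commit_block(bdev, WRITE_BARRIER);
          if (err == 0)
                  return 0;             /* ordering and flush both implied */
          if (err != -EOPNOTSUPP)
                  return err;           /* real I/O error: abort */

          /* d/: no barrier support; wait only for the preceding writes
           *     this commit actually depends on */
          wait_for_my_dependent_writes(bdev);

          /* e/: push everything out of the device cache */
          err = blkdev_issue_flush(bdev, NULL);
          if (err)
                  return err;

          /* f/ + g/: redo the commit as a plain write and wait for it */
          err = write_commit_block(bdev, WRITE);
          if (err)
                  return err;

          /* h/: flush again so the commit itself reaches the media */
          return blkdev_issue_flush(bdev, NULL);
  }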

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote:
> On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> > If a filesystem cares, it could 'ask' as suggested above.
> > What would be a good interface for asking?
> 
> XFS already tests:
>   bd_disk->queue->ordered == QUEUE_ORDERED_NONE

The side effects of removing that check is what started
this whole discussion.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> If a filesystem cares, it could 'ask' as suggested above.
> What would be a good interface for asking?

XFS already tests:
  bd_disk->queue->ordered == QUEUE_ORDERED_NONE
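
i.e. something along these lines at mount time (a paraphrase, not the
actual XFS code; mp_bdev and use_barriers are illustrative):

  struct request_queue *q = mp_bdev->bd_disk->queue;

  if (q->ordered == QUEUE_ORDERED_NONE) {
          printk(KERN_WARNING
                 "device does not advertise barrier support, disabling barriers\n");
          use_barriers = 0;
  }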

Alasdair
-- 
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Alasdair G Kergon
On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
> What if the truth changes (as can happen with md or dm)?

You get notified in endio() that the barrier had to be emulated?
 
Alasdair
-- 
[EMAIL PROTECTED]
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
> On Mon, May 28, 2007 at 12:57:53PM +1000, Neil Brown wrote:
> > What exactly do you want to know, and why do you care?
> 
> If someone explicitly mounts "-o barrier" and the underlying device
> cannot do it, then we want to issue a warning or reject the
> mount.

I guess that makes sense.
But apparently you cannot tell what a device supports until you write
to it.
So maybe you need to write some metadata with a barrier, then ask
the device what its barrier status is.  The options might be:
  YES - barriers are fully handled
  NO  - best effort, but due to missing device features, it might not
work
  DISABLED - admin has requested that barriers be ignored.

??
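
Purely as a sketch of that proposed interface (nothing like this exists in
the kernel; all names are invented and just restate the three options
above):

  enum barrier_status {
          BARRIER_YES,            /* barriers are fully handled */
          BARRIER_BEST_EFFORT,    /* missing device features, might not work */
          BARRIER_DISABLED,       /* admin asked for barriers to be ignored */
  };

  enum barrier_status bdev_barrier_status(struct block_device *bdev);

  /* e.g. after writing some metadata with BIO_RW_BARRIER set: */
  if (bdev_barrier_status(bdev) != BARRIER_YES)
          printk(KERN_WARNING "barriers not fully supported on this device\n");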
 
> 
> > The idea is that every "struct block_device" supports barriers.  If the
> > underlying hardware doesn't support them directly, then they get
> > simulated by draining the queue and issuing a flush.
> 
> Ok. But you also seem to be implying that there will be devices that
> cannot support barriers.

It seems there will always be hardware that doesn't meet specs.  If a
device doesn't support SYNCHRONIZE_CACHE or FUA, then implementing
barriers all the way to the media would be hard...

> 
> Even if all devices do eventually support barriers, it may take some
> time before we reach that goal.  Why not start by making it easy to
> determine what the capabilities of each device are. This can then be
> removed once we reach the holy grail

I'd rather not add something that we plan to remove.  We currently
have -EOPNOTSUPP.  I don't think there is much point in having more than
that.

I would really like to get to the stage where -EOPNOTSUPP is never
returned.  If a filesystem cares, it could 'ask' as suggested above.
What would be a good interface for asking?
What if the truth changes (as can happen with md or dm)?

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Monday May 28, [EMAIL PROTECTED] wrote:
> There are two things I'm not sure you covered.
> 
> First, disks which don't support flush but do have a "cache dirty" 
> status bit you can poll at times like shutdown. If there are no drivers 
> which support these, it can be ignored.

There are really devices like that?  So to implement a flush, you have
to stop sending writes and wait and poll - maybe poll every
millisecond?
That wouldn't be very good for performance... maybe you just
wouldn't bother with barriers on that sort of device?

Which reminds me:  What is the best way to turn off barriers?
Several filesystems have "-o nobarriers" or "-o barriers=0",
or the inverse.
md/raid currently uses barriers to write metadata, and there is no
way to turn that off.  I'm beginning to wonder if that is best.

Maybe barrier support should be a function of the device.  i.e. the
filesystem or whatever always sends barrier requests where it thinks
it is appropriate, and the block device tries to honour them to the
best of its ability, but if you run
   blockdev --enforce-barriers=no /dev/sda
then you lose some reliability guarantees, but gain some throughput (a
bit like the 'async' export option for nfsd).

> 
> Second, NAS (including nbd?). Is there enough information to handle this 
> "really rigt?"

NAS means lots of things, including NFS and CIFS where this doesn't
apply.
For 'nbd', it is entirely up to the protocol.  If the protocol allows
a barrier flag to be sent to the server, then barriers should just
work.  If it doesn't, then either the server disables write-back
caching, or flushes every request, or you lose all barrier
guarantees. 
For 'iscsi', I guess it works just the same as SCSI...

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread Neil Brown
On Tuesday May 29, [EMAIL PROTECTED] wrote:
> Neil Brown wrote:
> >  md/dm modules could keep count of requests as has been suggested
> >  (though that would be a fairly big change for raid0 as it currently
> >  doesn't know when a request completes - bi_end_io goes directly to the
> >  filesystem). 
> 
> Are you sure?  I believe that dm handles bi_endio because it waits for 
> all in progress bio to complete before switching tables.

I was talking about md/raid0, not dm-stripe.
md/raid0 (and md/linear) currently never know that a request has
completed.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Wed, May 30, 2007 at 09:52:49AM -0700, [EMAIL PROTECTED] wrote:
> On Wed, 30 May 2007, David Chinner wrote:
> >with the barrier is on stable storage when I/o completion is
> >signalled.  The existing barrier implementation (where it works)
> >provide these requirements. We need barriers to retain these
> >semantics, otherwise we'll still have to do special stuff in
> >the filesystems to get the semantics that we need.
> 
> one of us is misunderstanding barriers here.

No, I think we are both on the same level here - it's what
barriers are used for that is not clearly understood, I think.

> you are understanding barriers to be the same as synchronous writes (and 
> therefore the data is on persistent media before the call returns).

No, I'm describing the high level behaviour that is expected by
a filesystem. The reasons for this are below

> I am understanding barriers to only indicate ordering requirements. Things 
> before the barrier can be reordered freely, things after the barrier can 
> be reordered freely, but things cannot be reordered across the barrier.

Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.

This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.

Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.

This is the crux of my argument - from a filesystem perspective,
there is a *major* difference between a barrier implemented to just
guaranteeing ordering and a barrier implemented with a flush+FUA or
flush+write+flush.

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

> if I am understanding it correctly, the big win for barriers is that you 
> do NOT have to stop and wait until the data is on persistent media before 
> you can continue.

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well
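
i.e. under an ordering-only barrier, every durability point would have to
add an explicit flush itself; a minimal sketch of that pairing (mydev_bdev
is just an illustrative struct block_device pointer):

  sync_blockdev(mydev_bdev);              /* push out dirty pages and wait */
  blkdev_issue_flush(mydev_bdev, NULL);   /* then force the device cache to media */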

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

