Re: How to flush the disk write cache from userspace

2007-01-20 Thread Jens Axboe
On Thu, Jan 18 2007, Robert Hancock wrote:
> Ricardo Correia wrote:
> >On Tuesday 16 January 2007 00:38, you wrote:
> >>As always with these things, the devil is in the details. It requires
> >>the device to support a ->prepare_flush() queue hook, and not all
> >>devices do that. It will work for IDE/SATA/SCSI, though. In some devices
> >>you don't want/need to do a real disk flush, it depends on the write
> >>cache settings, battery backing, etc.
> >
> >Is there any chance that someone could implement this (I don't have the 
> >skills, unfortunately)? Maybe add a new ioctl() to block devices, so that 
> >it doesn't break any existing code?
> 
> I think we really should have support for doing cache flushes 
> automatically on fsync, etc. User space code should not have to worry 
> about this problem, it's pretty silly that for example MySQL has to 
> advise people to use hdparm -W 0 to disable the write cache on their IDE 
> drives in order to get proper data integrity guarantees - and disabling 
> the cache on IDE without command queueing really slaughters the 
> performance, unnecessarily in this case.

Completely agree. If you have barriers enabled in your filesystem, then
it should Just Work when you do fsync(). At least that is the case for
reiserfs and XFS; I'm not completely sure that ext3 also handles it
correctly.

For direct block device access, fsync() does need to provide a commit to
stable storage as well though.

> There may be some cases where the controller provides a battery-backed 
> cache and thus we don't want to actually force the controller to flush 
> everything out to the drive on fsync, so we may need to be able to 
> disable this, but these controllers may ignore flushes anyway. I know 
> IBM ServeRAID appears to fail requests for write cache info and so the 
> kernel assumes drive cache: write through and doesn't do any flushes.

That would be the preferable approach: just have the hardware that
doesn't need a flush ignore the FLUSH_CACHE. Such hardware would then
also need to ignore the FUA bit on writes. I'm not sure what the spec has
to say on this, but if the requirement is just that data is on stable
storage (e.g. survives power failure and so on), then that would be fine.
And I would hope it is, it'd be hard to specify anything else.

> >I believe it's a very useful (and relatively simple) feature that 
> >increases data integrity and reliability for applications that need this 
> >functionality.
> >
> >I think it must be considered that most people have disk write caches 
> >enabled and are using IDE, SATA or SCSI disks.
> >
> >I also think there's no point in disabling disks' write caches, since it 
> >slows writes and decreases disks' lifetime, and because there's a better 
> >solution.
> 
> Yes, ideally doing all writes to the drive with write cache enabled and 
> then flushing them out afterwards would be much more efficient (at least 
> when no command queueing is involved) since the drive can choose what 
> order to complete the writes in.

That only works if you just care about the stream of writes going to
stable storage and don't care about ordering. But the above is
essentially how the barriers work on write back cache + non queued
devices. When the barrier write is received, we commit the previous
writes first with a flush and then write the barrier (followed by
another flush, or possibly not if we have FUA).
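
The sequence described above can be sketched as a small C model. It is
only an illustration of the ordering, not kernel code: the enum values
and the barrier_steps() helper are made-up names, and the model assumes a
write-back cache on a non-queued device.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative step names; these are not kernel identifiers. */
enum step {
    STEP_PRE_FLUSH,         /* commit all previously issued writes */
    STEP_BARRIER_WRITE,     /* the barrier write itself */
    STEP_BARRIER_WRITE_FUA, /* barrier write with the FUA bit set */
    STEP_POST_FLUSH         /* commit the barrier write */
};

/*
 * Fill out[] with the steps for one barrier write and return how many
 * steps were stored (at most 3). With FUA the barrier write is assumed
 * to reach stable storage by itself, so the second flush is skipped.
 */
static size_t barrier_steps(bool has_fua, enum step out[3])
{
    size_t n = 0;

    out[n++] = STEP_PRE_FLUSH;
    if (has_fua) {
        out[n++] = STEP_BARRIER_WRITE_FUA;
    } else {
        out[n++] = STEP_BARRIER_WRITE;
        out[n++] = STEP_POST_FLUSH;
    }
    return n;
}
```

Calling barrier_steps(false, steps) yields three steps (flush, write,
flush), while barrier_steps(true, steps) yields two (flush, FUA write).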

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: How to flush the disk write cache from userspace

2007-01-18 Thread Robert Hancock

Ricardo Correia wrote:
> On Tuesday 16 January 2007 00:38, you wrote:
> > As always with these things, the devil is in the details. It requires
> > the device to support a ->prepare_flush() queue hook, and not all
> > devices do that. It will work for IDE/SATA/SCSI, though. In some devices
> > you don't want/need to do a real disk flush, it depends on the write
> > cache settings, battery backing, etc.
>
> Is there any chance that someone could implement this (I don't have the
> skills, unfortunately)? Maybe add a new ioctl() to block devices, so that it
> doesn't break any existing code?

I think we really should have support for doing cache flushes
automatically on fsync, etc. User space code should not have to worry
about this problem, it's pretty silly that for example MySQL has to
advise people to use hdparm -W 0 to disable the write cache on their IDE
drives in order to get proper data integrity guarantees - and disabling
the cache on IDE without command queueing really slaughters the
performance, unnecessarily in this case.

There may be some cases where the controller provides a battery-backed
cache and thus we don't want to actually force the controller to flush
everything out to the drive on fsync, so we may need to be able to
disable this, but these controllers may ignore flushes anyway. I know
IBM ServeRAID appears to fail requests for write cache info and so the
kernel assumes drive cache: write through and doesn't do any flushes.

> I believe it's a very useful (and relatively simple) feature that increases
> data integrity and reliability for applications that need this functionality.
>
> I think it must be considered that most people have disk write caches enabled
> and are using IDE, SATA or SCSI disks.
>
> I also think there's no point in disabling disks' write caches, since it slows
> writes and decreases disks' lifetime, and because there's a better solution.

Yes, ideally doing all writes to the drive with write cache enabled and
then flushing them out afterwards would be much more efficient (at least
when no command queueing is involved) since the drive can choose what
order to complete the writes in.

> Personally, I'm not really interested in specific filesystem behaviour, since
> my application uses block devices directly (it's a filesystem itself).
> Although I think all filesystems should guarantee data integrity in the face
> of fsync() or metadata modifications, even if it costs a little performance.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: How to flush the disk write cache from userspace

2007-01-17 Thread Ricardo Correia
On Tuesday 16 January 2007 00:38, you wrote:
> As always with these things, the devil is in the details. It requires
> the device to support a ->prepare_flush() queue hook, and not all
> devices do that. It will work for IDE/SATA/SCSI, though. In some devices
> you don't want/need to do a real disk flush, it depends on the write
> cache settings, battery backing, etc.

Is there any chance that someone could implement this (I don't have the 
skills, unfortunately)? Maybe add a new ioctl() to block devices, so that it 
doesn't break any existing code?

I believe it's a very useful (and relatively simple) feature that increases 
data integrity and reliability for applications that need this functionality.

I think it must be considered that most people have disk write caches enabled 
and are using IDE, SATA or SCSI disks.

I also think there's no point in disabling disks' write caches, since it slows 
writes and decreases disks' lifetime, and because there's a better solution.

Personally, I'm not really interested in specific filesystem behaviour, since 
my application uses block devices directly (it's a filesystem itself). 
Although I think all filesystems should guarantee data integrity in the face 
of fsync() or metadata modifications, even if it costs a little performance.

Thank you.


Re: How to flush the disk write cache from userspace

2007-01-15 Thread Jens Axboe
On Sun, Jan 14 2007, Ricardo Correia wrote:
> Hi, (please CC: to my email address, I'm not subscribed)
> 
> Quick question: how can I flush the disk write cache from userspace?
> 
> Long question:
> 
> I'm porting the Solaris ZFS filesystem to the FUSE/Linux filesystem
> framework.  This is a copy-on-write, transactional filesystem and so
> it needs to ensure correct ordering of writes when transactions are
> written to disk.
> 
> At the moment, when transactions end, I'm using an fsync() on the block
> device followed by an ioctl(BLKFLSBUF).
> 
> This is because, according to the fsync manpage, even after fsync()
> returns, data might still be in the disk write cache, so fsync by
> itself doesn't guarantee data safety on power failure.

Depends. It only holds if the file system does the right thing here; iirc
only reiserfs with barriers enabled issues a real disk flush for fsync. So
you can't rely on it in general.

> I was looking for something like the Solaris
> ioctl(DKIOCFLUSHWRITECACHE), which does exactly what I need.
> 
> The most similar thing I could find was ioctl(BLKFLSBUF), however a
> search for BLKFLSBUF on the Linux 2.6.15 source doesn't seem to return
> anything related to IDE or SCSI disks.
> 
> Can I trust ioctl(BLKFLSBUF) to flush disks' write caches (for disks
> that follow the specs)?

BLKFLSBUF doesn't flush the disk cache either; it just flushes
every dirty page in the block device address space. A real cache flush
would not be very hard to add, since we have most of the support code in
place for this for IO barriers. It would be something like:

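int blockdev_cache_flush(struct block_device *bdev)
{
        request_queue_t *q = bdev_get_queue(bdev);
        /* GFP_WHATEVER is a placeholder for a real allocation mask */
        struct request *rq = blk_get_request(q, WRITE, GFP_WHATEVER);
        int ret;

        /* issue the request and wait for it to complete */
        ret = blk_execute_rq(q, bdev->bd_disk, rq, 0);
        blk_put_request(rq);
        return ret;
}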

Somewhat simplified of course, but it should get the point across.
Putting that in fs/buffer.c:sync_blockdev() would make BLKFLSBUF work.

As always with these things, the devil is in the details. It requires
the device to support a ->prepare_flush() queue hook, and not all
devices do that. It will work for IDE/SATA/SCSI, though. In some devices
you don't want/need to do a real disk flush, it depends on the write
cache settings, battery backing, etc.
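
The conditions listed above can be written down as a small decision
function. This is a hedged sketch only: struct dev_cache_info and
should_issue_flush() are hypothetical names that do not exist in the
kernel, and the policy (skip the flush for write-through or
battery-backed caches) is one reading of the points made in this thread.

```c
#include <stdbool.h>

enum cache_mode { CACHE_WRITE_THROUGH, CACHE_WRITE_BACK };

/* Hypothetical summary of what is known about a device's cache. */
struct dev_cache_info {
    enum cache_mode mode;
    bool battery_backed;  /* cache contents survive power failure */
    bool has_flush_hook;  /* stands in for ->prepare_flush() support */
};

/* Return true when a real disk flush should be issued. */
static bool should_issue_flush(const struct dev_cache_info *d)
{
    if (!d->has_flush_hook)
        return false;  /* the device gives us no way to flush */
    if (d->mode == CACHE_WRITE_THROUGH)
        return false;  /* completed writes are already on the media */
    if (d->battery_backed)
        return false;  /* cached data is treated as stable */
    return true;       /* volatile write-back cache */
}
```

Only a volatile write-back cache on a device with a flush hook ends up
needing the flush; every other combination returns false.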

-- 
Jens Axboe


How to flush the disk write cache from userspace

2007-01-13 Thread Ricardo Correia
Hi, (please CC: to my email address, I'm not subscribed)

Quick question: how can I flush the disk write cache from userspace?

Long question:

I'm porting the Solaris ZFS filesystem to the FUSE/Linux filesystem framework.
This is a copy-on-write, transactional filesystem and so it needs to ensure 
correct ordering of writes when transactions are written to disk.

At the moment, when transactions end, I'm using an fsync() on the block device 
followed by an ioctl(BLKFLSBUF).

This is because, according to the fsync manpage, even after fsync() returns, 
data might still be in the disk write cache, so fsync by itself doesn't 
guarantee data safety on power failure.

I was looking for something like the Solaris ioctl(DKIOCFLUSHWRITECACHE), 
which does exactly what I need.

The most similar thing I could find was ioctl(BLKFLSBUF), however a search for 
BLKFLSBUF on the Linux 2.6.15 source doesn't seem to return anything related 
to IDE or SCSI disks.

Can I trust ioctl(BLKFLSBUF) to flush disks' write caches (for disks that 
follow the specs)?

What about block devices of disk partitions, LVM logical volumes and EVMS 
volumes, do they propagate flush commands to the respective disks?

What about loop devices?

Thanks in advance.