Re: assemble vs create an array.......

2007-12-06 Thread David Chinner
On Thu, Dec 06, 2007 at 07:39:28PM +0300, Michael Tokarev wrote:
 What to do is to give xfs_repair a try for each permutation,
 but again without letting it actually fix anything.
 Just run it in read-only mode and see which combination
 of drives gives fewer errors, or no fatal errors (there
 may be several similar combinations, with the same order
 of drives but with a different drive missing).

Ugggh. 

 It's sad that xfs refuses to mount when the structure needs
 cleaning - the best way here is to actually mount it
 and see what it looks like, instead of trying repair
 tools. 

It's self-protection - if you try to write to a corrupted filesystem,
you'll only make the corruption worse. Mounting involves log
recovery, which writes to the filesystem.

 Is there some option to force-mount it anyway
 (in read-only mode, knowing it may oops the kernel, etc.)?

Sure you can: mount -o ro,norecovery dev mtpt

But if you hit corruption it will still shut down on you. If
the machine oopses, then that is a bug.

 This thread prompted me to think.  If I can't force-mount it
 (or browse it some other way) as I can almost always
 do with a (somewhat?) broken ext[23] just to examine things,
 maybe I'm trying it before it's mature enough? ;)

Hehe ;)

For maximum uber-XFS-guru points, learn to browse your filesystem
with xfs_db. :P
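
For the curious, a minimal read-only xfs_db session looks something like
this (the device name is a placeholder - point it at whatever block
device sits under your filesystem):

# dump superblock 0 ('sb 0' selects it, 'print' shows its fields)
xfs_db -r -c 'sb 0' -c 'print' /dev/sdX1
# summarise free space by extent size
xfs_db -r -c 'freesp -s' /dev/sdX1

The -r flag keeps everything read-only, so it's safe to poke around a
damaged filesystem this way.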

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: 3ware 9650 tips

2007-07-16 Thread David Chinner
On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote:
 On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
  On Fri, 13 Jul 2007, Jon Collette wrote:
  
  Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance?
http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% 
  drop according to this article
  
  His 500GB WD drives are 7200RPM compared to the Raptors' 10K.  So his 
  numbers will be slower. 
  Justin, what file system do you have running on the Raptors?  I think that's 
  an interesting point made by Joshua.
  
  I use XFS:
 
 When it comes to bandwidth, there is good reason for that.
 
  Trying to stick with a supported config as much as possible, I need to 
  run ext3.  As per usual, though, initial ext3 numbers are less than 
  impressive. Using bonnie++ to get a baseline, I get (after doing 
  'blockdev --setra 65536' on the device):
  Write: 136MB/s
  Read:  384MB/s
  
  Proving it's not the hardware, with XFS the numbers look like:
  Write: 333MB/s
  Read:  465MB/s
  
 
 Those are pretty typical numbers. In my experience, ext3 is limited to about
 250MB/s buffered write speed. It's not disk limited, it's design limited. e.g.
 on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing
 250MB/s.
 
 http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf
 
 If you've got any sort of serious disk array, ext3 is not the filesystem
 to use

To show what the difference is, I used blktrace and Chris Mason's
seekwatcher script on a simple, single threaded dd command on
a 12 disk dm RAID0 stripe:

# dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync

http://oss.sgi.com/~dgc/writes/ext3_write.png
http://oss.sgi.com/~dgc/writes/xfs_write.png
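
For anyone who wants to generate the same kind of graph, the procedure
was roughly the following (device, paths and trace names are
placeholders; check the option spellings against your blktrace and
seekwatcher versions):

# capture block layer events for the volume while the workload runs
blktrace -d /dev/mapper/dm0 -o ext3_write &
dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync
kill %1
# turn the trace into a seek-rate + throughput graph
seekwatcher -t ext3_write -o ext3_write.png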

You can see from the ext3 graph that it comes to a screeching halt
every 5s (probably when pdflush runs) and at all other times the
seek rate is 10,000 seeks/s. That's pretty bad for a brand new,
empty filesystem and the only way it is sustained is the fact that
the disks have their write caches turned on. ext4 will probably show
better results, but I haven't got any of the tools installed to be
able to test it.

The XFS pattern consistently shows an order of magnitude fewer seeks
and consistent throughput above 600MB/s. To put the number of seeks
in context, XFS is doing 512k I/Os at about 1200-1300 per second. The
number of seeks? A bit above 10^3 per second, or roughly 1 seek per
I/O, which is pretty much optimal.

Cheers,

Dave.

-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: 3ware 9650 tips

2007-07-16 Thread David Chinner
On Mon, Jul 16, 2007 at 10:50:34AM -0500, Eric Sandeen wrote:
 David Chinner wrote:
  On Mon, Jul 16, 2007 at 12:41:15PM +1000, David Chinner wrote:
  On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
 ...
  If you've got any sort of serious disk array, ext3 is not the filesystem
  to use
  
  To show what the difference is, I used blktrace and Chris Mason's
  seekwatcher script on a simple, single threaded dd command on
  a 12 disk dm RAID0 stripe:
  
  # dd if=/dev/zero of=/mnt/scratch/fred bs=1024k count=10k; sync
  
  http://oss.sgi.com/~dgc/writes/ext3_write.png
  http://oss.sgi.com/~dgc/writes/xfs_write.png
 
 Were those all with default mkfs & mount options?  ext3 in writeback
 mode might be an interesting comparison too.

Defaults. i.e.

# mkfs.ext3 /dev/mapper/dm0

# mkfs.xfs /dev/mapper/dm0

The mkfs.xfs picked up sunit/swidth correctly from the dm volume.
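
A quick way to double-check what mkfs.xfs detected is to look at the
geometry of the mounted filesystem (the mount point here is a
placeholder):

# sunit/swidth should reflect the chunk size and data disk count of the volume
xfs_info /mnt/scratch | grep -E 'sunit|swidth'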

Last time I checked, writeback made little difference to ext3 throughput;
maybe 5-10% at most. I'll run it again later today...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: 3ware 9650 tips

2007-07-15 Thread David Chinner
On Fri, Jul 13, 2007 at 03:36:46PM -0400, Justin Piszcz wrote:
 On Fri, 13 Jul 2007, Jon Collette wrote:
 
 Wouldn't Raid 6 be slower than Raid 5 because of the extra fault tolerance?
   http://www.enterprisenetworksandservers.com/monthly/art.php?1754 - 20% 
 drop according to this article
 
 His 500GB WD drives are 7200RPM compared to the Raptors' 10K.  So his 
 numbers will be slower. 
 Justin, what file system do you have running on the Raptors?  I think that's 
 an interesting point made by Joshua.
 
 I use XFS:

When it comes to bandwidth, there is good reason for that.

 Trying to stick with a supported config as much as possible, I need to 
 run ext3.  As per usual, though, initial ext3 numbers are less than 
 impressive. Using bonnie++ to get a baseline, I get (after doing 
 'blockdev --setra 65536' on the device):
 Write: 136MB/s
 Read:  384MB/s
 
 Proving it's not the hardware, with XFS the numbers look like:
 Write: 333MB/s
 Read:  465MB/s
 

Those are pretty typical numbers. In my experience, ext3 is limited to about
250MB/s buffered write speed. It's not disk limited, it's design limited. e.g.
on a disk subsystem where XFS was getting 4-5GB/s buffered write, ext3 was doing
250MB/s.

http://oss.sgi.com/projects/xfs/papers/ols2006/ols-2006-paper.pdf

If you've got any sort of serious disk array, ext3 is not the filesystem
to use.

 How many folks are using these?  Any tuning tips?

Make sure you tell XFS the correct sunit/swidth. For hardware
raid5/6, sunit = per-disk chunksize, swidth = number of *data* disks in
array.
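
As a purely hypothetical example, for a hardware RAID6 with a 64k
per-disk chunk and 12 drives (10 of them holding data):

# su = per-disk chunk size, sw = number of data disks
mkfs.xfs -d su=64k,sw=10 /dev/sdb

mkfs.xfs turns that into the sunit/swidth values the allocator uses to
align and size its I/O.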

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-28 Thread David Chinner
On Thu, Jun 28, 2007 at 04:27:15AM -0400, Justin Piszcz wrote:
 
 
 On Thu, 28 Jun 2007, Peter Rabbitson wrote:
 
 Justin Piszcz wrote:
 mdadm --create \
   --verbose /dev/md3 \
   --level=5 \
   --raid-devices=10 \
   --chunk=1024 \
   --force \
  --run \
  /dev/sd[cdefghijkl]1
 
 Justin.
 
 Interesting, I came up with the same results (1M chunk being superior) 
 with a completely different raid set with XFS on top:
 
 mdadm --create \
  --level=10 \
  --chunk=1024 \
  --raid-devices=4 \
  --layout=f3 \
  ...
 
 Could it be attributed to XFS itself?

More likely it's related to the I/O size being sent to the disks. The larger
the chunk size, the larger the I/O hitting each disk. I think the maximum I/O
size is 512k ATM on x86(_64), so a chunk of 1MB will guarantee that there are
maximally sized I/Os being sent to the disk.
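
If you want to see what a given member disk will actually accept, the
request queue limits are exported in sysfs (sdc is a placeholder device
name):

# hardware limit on a single I/O, in kB
cat /sys/block/sdc/queue/max_hw_sectors_kb
# limit the kernel currently uses, in kB (tunable)
cat /sys/block/sdc/queue/max_sectors_kb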

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-28 Thread David Chinner
On Wed, Jun 27, 2007 at 08:49:24PM +, Pavel Machek wrote:
 Hi!
 
  FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS
  filesystem for a suspend/resume to work safely and have argued that the only
 
 Hmm, so XFS writes to disk even when its threads are frozen?

They issue async I/O before they sleep and expect
processing to be done on I/O completion via workqueues.

  safe thing to do is freeze the filesystem before suspend and thaw it after
  resume. This is why I originally asked you to test that with the other 
  problem
 
 Could you add that to the XFS threads if it is really required? They
 do know that they are being frozen for suspend.

We don't suspend the threads on a filesystem freeze - they continue
to run. A filesystem freeze guarantees the filesystem is clean and that
the in-memory state matches what is on disk. It is not possible for
the filesystem to issue I/O or have outstanding I/O when it is in the
frozen state, so the state of the threads and/or workqueues does not
matter because they will be idle.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [linux-pm] Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-28 Thread David Chinner
On Fri, Jun 29, 2007 at 12:16:44AM +0200, Rafael J. Wysocki wrote:
 There are two solutions possible, IMO.  One would be to make these workqueues
 freezable, which is possible, but hacky and Oleg didn't like that very much.
 The second would be to freeze XFS from within the hibernation code path,
 using freeze_bdev().

The second is much more likely to work reliably. If freezing the
filesystem leaves something in an inconsistent state, then it's
something I can reproduce and debug without needing to
suspend/resume.

FWIW, don't forget you need to thaw the filesystem on resume.
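
Until that is wired into the hibernation code path, the manual
equivalent is a wrapper along these lines (/scratch is a placeholder
for the XFS mount point):

xfs_freeze -f /scratch
echo platform > /sys/power/disk
echo disk > /sys/power/state
# execution continues here after resume
xfs_freeze -u /scratch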

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: Fastest Chunk Size w/XFS For MD Software RAID = 1024k

2007-06-27 Thread David Chinner
On Wed, Jun 27, 2007 at 07:20:42PM -0400, Justin Piszcz wrote:
 For drives with 16MB of cache (in this case, raptors).

That's four (4) drives, right?

If so, how do you get a block read rate of 578MB/s from
4 drives? That's 145MB/s per drive.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [linux-lvm] 2.6.22-rc4 XFS fails after hibernate/resume

2007-06-18 Thread David Chinner
On Mon, Jun 18, 2007 at 08:49:34AM +0100, David Greaves wrote:
 David Greaves wrote:
 OK, that gave me an idea.
 
 Freeze the filesystem
 md5sum the lvm
 hibernate
 resume
 md5sum the lvm
 snip
 So the lvm and below looks OK...
 
 I'll see how it behaves now the filesystem has been frozen/thawed over 
 the hibernate...
 
 
 And it appears to behave well. (A few hours compile/clean cycling kernel 
 builds on that filesystem were OK).
 
 
 Historically I've done:
 sync
 echo platform > /sys/power/disk
 echo disk > /sys/power/state
 # resume
 
 and had filesystem corruption (only on this machine, my other hibernating 
 xfs machines don't have this problem)
 
 So doing:
 xfs_freeze -f /scratch
 sync
 echo platform > /sys/power/disk
 echo disk > /sys/power/state
 # resume
 xfs_freeze -u /scratch

 Works (for now - more usage testing tonight)

Verrry interesting.

What you were seeing was an XFS shutdown occurring because the free space
btree was corrupted. IOWs, the process of suspend/resume has resulted
in either bad data being written to disk, the correct data not being
written to disk or the cached block being corrupted in memory.

If you run xfs_check on the filesystem after it has shut down after a resume,
can you tell us if it reports on-disk corruption? Note: do not run xfs_repair
to check this - it does not check the free space btrees; instead it simply
rebuilds them from scratch. If xfs_check reports an error, then run xfs_repair
to fix it up.
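
Something like the following, where the device name is a placeholder
for whatever sits underneath the XFS filesystem in your stack:

umount /scratch
# read-only check - this is the one that inspects the free space btrees
xfs_check /dev/VolGroup/scratch
# only if xfs_check reported errors:
xfs_repair /dev/VolGroup/scratch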

FWIW, I'm on record stating that sync is not sufficient to quiesce an XFS
filesystem for a suspend/resume to work safely and have argued that the only
safe thing to do is freeze the filesystem before suspend and thaw it after
resume. This is why I originally asked you to test that with the other problem
that you reported. Up until this point in time, there's been no evidence to
prove either side of the argument..

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: limits on raid

2007-06-17 Thread David Chinner
On Sat, Jun 16, 2007 at 07:59:29AM +1000, Neil Brown wrote:
 Combining these thoughts, it would make a lot of sense for the
 filesystem to be able to say to the block device That blocks looks
 wrong - can you find me another copy to try?.  That is an example of
 the sort of closer integration between filesystem and RAID that would
 make sense.

I think that this would only be useful on devices that store
discrete copies of the blocks on different devices, i.e. mirrors. If
it's an XOR-based RAID, you don't have another copy you can
retrieve.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 08:26:45AM +0200, Jens Axboe wrote:
 On Thu, May 31 2007, David Chinner wrote:
  IOWs, there are two parts to the problem:
  
  1 - guaranteeing I/O ordering
  2 - guaranteeing blocks are on persistent storage.
  
  Right now, a single barrier I/O is used to provide both of these
  guarantees. In most cases, all we really need to provide is 1); the
  need for 2) is a much rarer condition but still needs to be
  provided.
  
   if I am understanding it correctly, the big win for barriers is that you 
   do NOT have to stop and wait until the data is on persistent media before 
   you can continue.
  
  Yes, if we define a barrier to only guarantee 1), then yes this
  would be a big win (esp. for XFS). But that requires all filesystems
  to handle sync writes differently, and sync_blockdev() needs to
  call blkdev_issue_flush() as well
  
  So, what do we do here? Do we define a barrier I/O to only provide
  ordering, or do we define it to also provide persistent storage
  writeback? Whatever we decide, it needs to be documented
 
 The block layer already has a notion of the two types of barriers, with
 a very small amount of tweaking we could expose that. There's absolutely
 zero reason we can't easily support both types of barriers.

That sounds like a good idea - we can leave the existing
WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
behaviour that only guarantees ordering. The filesystem can then
choose which to use where appropriate.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-31 Thread David Chinner
On Thu, May 31, 2007 at 02:31:21PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 That sounds like a good idea - we can leave the existing
 WRITE_BARRIER behaviour unchanged and introduce a new WRITE_ORDERED
 behaviour that only guarantees ordering. The filesystem can then
 choose which to use where appropriate
 
 So what if you want a synchronous write, but DON'T care about the order? 

submit_bio(WRITE_SYNC, bio);

Already there, already used by XFS, JFS and direct I/O.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Tue, May 29, 2007 at 05:01:24PM -0700, [EMAIL PROTECTED] wrote:
 On Wed, 30 May 2007, David Chinner wrote:
 
 On Tue, May 29, 2007 at 04:03:43PM -0400, Phillip Susi wrote:
 David Chinner wrote:
 The use of barriers in XFS assumes the commit write to be on stable
 storage before it returns.  One of the ordering guarantees that we
 need is that the transaction (commit write) is on disk before the
 metadata block containing the change in the transaction is written
 to disk and the current barrier behaviour gives us that.
 
 Barrier != synchronous write,
 
 Of course. FYI, XFS only issues barriers on *async* writes.
 
 But barrier semantics - as far as they've been described by everyone
 but you - indicate that the barrier write is guaranteed to be on stable
 storage when it returns.
 
 this doesn't match what I have seen
 
 with barriers it's perfectly legal to have the following sequence of 
 events
 
 1. app writes block 10 to OS
 2. app writes block 4 to OS
 3. app writes barrier to OS
 4. app writes block 5 to OS
 5. app writes block 20 to OS

Hmm - applications can't issue barriers to the filesystem.
However, if you consider the barrier to be an fsync(), for example,
then it's still the filesystem that is issuing the barrier, and
there's a block associated with that barrier (either an inode or a
transaction commit) that needs to be on stable storage before the
filesystem returns to userspace.

 6. OS writes block 4 to disk drive
 7. OS writes block 10 to disk drive
 8. OS writes barrier to disk drive
 9. OS writes block 5 to disk drive
 10. OS writes block 20 to disk drive

Replace OS with filesystem, and combine 7+8 together - we don't have
zero-length barriers and hence they are *always* associated with a
write to a certain block on disk. i.e.:

1. FS writes block 4 to disk drive
2. FS writes block 10 to disk drive
3. FS writes *barrier* block X to disk drive
4. FS writes block 5 to disk drive
5. FS writes block 20 to disk drive

The order that these are expected by the filesystem to hit stable
storage are:

1. block 4 and 10 on stable storage in any order
2. barrier block X on stable storage
3. block 5 and 20 on stable storage in any order

The point I'm trying to make is that in XFS, blocks 5 and 20 cannot
be allowed to hit the disk before the barrier block because they
have a strict order dependency on block X being stable before them,
just like block X has a strict order dependency that blocks 4 and 10
must be stable before we start the barrier block write.

 11. disk drive writes block 10 to platter
 12. disk drive writes block 4 to platter
 13. disk drive writes block 20 to platter
 14. disk drive writes block 5 to platter

 if the disk drive doesn't support barriers then step #8 becomes 'issue 
 flush' and steps 11 and 12 take place before step #9, 13, 14

No, you need a flush on either side of the block X write to maintain
the same semantics as barrier writes currently have.

We have filesystems that require barriers to prevent reordering of
writes in both directions and to ensure that the block associated
with the barrier is on stable storage when I/O completion is
signalled.  The existing barrier implementation (where it works)
provide these requirements. We need barriers to retain these
semantics, otherwise we'll still have to do special stuff in
the filesystems to get the semantics that we need.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Wed, May 30, 2007 at 09:52:49AM -0700, [EMAIL PROTECTED] wrote:
 On Wed, 30 May 2007, David Chinner wrote:
 with the barrier is on stable storage when I/O completion is
 signalled.  The existing barrier implementation (where it works)
 provide these requirements. We need barriers to retain these
 semantics, otherwise we'll still have to do special stuff in
 the filesystems to get the semantics that we need.
 
 one of us is misunderstanding barriers here.

No, I think we are both on the same level here - it's what
barriers are used for that is not clearly understood, I think.

 you are understanding barriers to be the same as synchronous writes (and 
 therefore the data is on persistent media before the call returns).

No, I'm describing the high-level behaviour that is expected by
a filesystem. The reasons for this are below.

 I am understanding barriers to only indicate ordering requirements. Things 
 before the barrier can be reordered freely, things after the barrier can 
 be reordered freely, but things cannot be reordered across the barrier.

Ok, that's my understanding of how *device based barriers* can work,
but there's more to it than that. As far as the filesystem is
concerned the barrier write needs to *behave* exactly like a sync
write because of the guarantees the filesystem has to provide
userspace. Specifically - sync, sync writes and fsync.

This is the big problem, right? If we use barriers for commit
writes, the filesystem can return to userspace after a sync write or
fsync() and an *ordered barrier device implementation* may not have
written the blocks to persistent media. If we then pull the plug on
the box, we've just lost data that sync or fsync said was
successfully on disk. That's BAD.

Right now a barrier write on the last block of the fsync/sync write
is sufficient to prevent that because of the FUA on the barrier
block write. A purely ordered barrier implementation does not
provide this guarantee.

This is the crux of my argument - from a filesystem perspective,
there is a *major* difference between a barrier implemented to just
guaranteeing ordering and a barrier implemented with a flush+FUA or
flush+write+flush.

IOWs, there are two parts to the problem:

1 - guaranteeing I/O ordering
2 - guaranteeing blocks are on persistent storage.

Right now, a single barrier I/O is used to provide both of these
guarantees. In most cases, all we really need to provide is 1); the
need for 2) is a much rarer condition but still needs to be
provided.

 if I am understanding it correctly, the big win for barriers is that you 
 do NOT have to stop and wait until the data is on persistent media before 
 you can continue.

Yes, if we define a barrier to only guarantee 1), then yes this
would be a big win (esp. for XFS). But that requires all filesystems
to handle sync writes differently, and sync_blockdev() needs to
call blkdev_issue_flush() as well.

So, what do we do here? Do we define a barrier I/O to only provide
ordering, or do we define it to also provide persistent storage
writeback? Whatever we decide, it needs to be documented.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [dm-devel] Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-30 Thread David Chinner
On Thu, May 31, 2007 at 02:07:39AM +0100, Alasdair G Kergon wrote:
 On Thu, May 31, 2007 at 10:46:04AM +1000, Neil Brown wrote:
  If a filesystem cares, it could 'ask' as suggested above.
  What would be a good interface for asking?
 
 XFS already tests:
   bd_disk->queue->ordered == QUEUE_ORDERED_NONE

The side effects of removing that check are what started
this whole discussion.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-28 Thread David Chinner
On Mon, May 28, 2007 at 01:17:31PM +0200, Pallai Roland wrote:
 On Monday 28 May 2007 04:17:18 David Chinner wrote:
  Hmmm. A quick look at the Linux code makes me think that background
  writeback on Linux has never been able to cause a shutdown in this case.
  However, the same error on Irix will definitely cause a shutdown,
  though.
  I hope Linux will follow Irix, that's a consistent standpoint.

I raised a bug for this yesterday when writing that reply. It won't
get forgotten now.

  David, have you a plan to implement your reporting raid5 block layer
  idea?  No one else seems to care about this silent data loss on temporarily
  (cable, power) failed raid5 arrays as far as I can see, so I really hope you do at least!

Yeah, I'd love to get something like this happening, but given it's about
halfway down my list of stuff to do when I have some spare time, I'd
say it will be about 2015 before I get to it.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-28 Thread David Chinner
On Mon, May 28, 2007 at 05:30:52PM +0200, Pallai Roland wrote:
 
 On Monday 28 May 2007 14:53:55 Pallai Roland wrote:
  On Friday 25 May 2007 02:05:47 David Chinner wrote:
   -o ro,norecovery will allow you to mount the filesystem and get any
   uncorrupted data off it.
  
   You still may get shutdowns if you trip across corrupted metadata in
   the filesystem, though.
 
  This filesystem is completely dead.
  [...]
 
  I tried to make an md patch to stop writes if a raid5 array has 2+ failed 
 drives, but I found it's already done, oops. :) handle_stripe5() ignores 
 writes in this case quietly; I tried it and it works.

Hmmm - it clears the uptodate bit on the bio, which is supposed to
make the bio return EIO. That looks to be doing the right thing...

 There's another layer I used on this box between md and xfs: loop-aes. I 

Oh, that's kind of an important thing to forget to mention...

 used it for years and it's been rock stable, but now it's my first suspect, because I 
 found a bug in it today:
  I assembled my array from n-1 disks, and I failed a second disk as a test, 
 and I found /dev/loop1 still provides *random* data where /dev/md1 serves 
 nothing; it's definitely a loop-aes bug:

.

 It's not an explanation for my screwed-up file system, but for me it's enough 
 to drop loop-aes. Eh.

If you can get random data back instead of an error from the block device,
then I'm not surprised your filesystem is toast. If it's one sector in a
larger block that is corrupted, then the only thing that will protect you from
this sort of corruption causing problems is metadata checksums (yet another
thing on my list of stuff to do).

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-28 Thread David Chinner
On Mon, May 28, 2007 at 05:45:27PM -0500, Alberto Alonso wrote:
 On Fri, 2007-05-25 at 18:36 +1000, David Chinner wrote:
  On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
   I think his point was that going into a read only mode causes a
   less catastrophic situation (ie. a web server can still serve
   pages).
  
  Sure - but once you've detected one corruption or had metadata
  I/O errors, can you trust the rest of the filesystem?
  
   I think that is a valid point, rather than shutting down
   the file system completely, an automatic switch to where the least
   disruption of service can occur is always desired.
  
  I consider the possibility of serving out bad data (i.e after
  a remount to readonly) to be the worst possible disruption of
  service that can happen ;)
 
 I guess it does depend on the nature of the failure. A write failure
 on block 2000 does not imply corruption of the other 2TB of data.

The rest might not be corrupted, but if block 2000 is an index of
some sort (i.e. metadata), you could reference any of that 2TB
incorrectly and get the wrong data, write to the wrong spot on disk,
etc.

   I personally have found the XFS file system to be great for
   my needs (except issues with NFS interaction, where the bug report
   never got answered), but that doesn't mean it can not be improved.
  
  Got a pointer?
 
 I can't seem to find it. I'm pretty sure I used bugzilla to report
 it. I did find the kernel dump file though, so here it is:
 
 Oct  3 15:34:07 localhost kernel: xfs_iget_core: ambiguous vns:
 vp/0xd1e69c80, invp/0xc989e380

Oh, I haven't seen any of those problems for quite some time.

 = /proc/kmsg started.
 Oct  3 15:51:23 localhost kernel:
 Inspecting /boot/System.map-2.6.8-2-686-smp

Oh, well, yes, kernels that old did have that problem. It got fixed
some time around 2.6.12 or 2.6.13, IIRC.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-27 Thread David Chinner
On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
 
 On Friday 25 May 2007 06:55:00 David Chinner wrote:
  Oh, did you look at your logs and find that XFS had spammed them
  about writes that were failing?
 
 The first message after the incident:
 
 May 24 01:53:50 hq kernel: Filesystem loop1: XFS internal error 
 xfs_btree_check_sblock at line 336 of file fs/xfs/xfs_btree.c.  Caller 
 0xf8ac14f8
 May 24 01:53:50 hq kernel: f8adae69 xfs_btree_check_sblock+0x4f/0xc2 [xfs]  
 f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]
 May 24 01:53:50 HF kernel: f8ac14f8 xfs_alloc_lookup+0x34e/0x47b [xfs]  
 f8b1a9c7 kmem_zone_zalloc+0x1b/0x43 [xfs]
 May 24 01:53:50 hq kernel: f8abe645 xfs_alloc_ag_vextent+0x24d/0x1110 [xfs] 
  f8ac0647 xfs_alloc_vextent+0x3bd/0x53b [xfs]
 May 24 01:53:50 hq kernel: f8ad2f7e xfs_bmapi+0x1ac4/0x23cd [xfs]  
 f8acab97 xfs_bmap_search_multi_extents+0x8e/0xd8 [xfs]
 May 24 01:53:50 hq kernel: f8b1 xlog_dealloc_log+0x49/0xea [xfs]  
 f8afdaee xfs_iomap_write_allocate+0x2d9/0x58b [xfs]
 May 24 01:53:50 hq kernel: f8afc3ae xfs_iomap+0x60e/0x82d [xfs]  c0113bc8 
 __wake_up_common+0x39/0x59
 May 24 01:53:50 hq kernel: f8b1ae11 xfs_map_blocks+0x39/0x6c [xfs]  
 f8b1bd7b xfs_page_state_convert+0x644/0xf9c [xfs]
 May 24 01:53:50 hq kernel: c036f384 schedule+0x5d1/0xf4d  f8b1c780 
 xfs_vm_writepage+0x0/0xe0 [xfs]
 May 24 01:53:50 hq kernel: f8b1c7d7 xfs_vm_writepage+0x57/0xe0 [xfs]  
 c01830e8 mpage_writepages+0x1fb/0x3bb
 May 24 01:53:50 hq kernel: c0183020 mpage_writepages+0x133/0x3bb  
 f8b1c780 xfs_vm_writepage+0x0/0xe0 [xfs]
 May 24 01:53:50 hq kernel: c0147bb3 do_writepages+0x35/0x3b  c018135c 
 __writeback_single_inode+0x88/0x387
 May 24 01:53:50 hq kernel: c01819b7 sync_sb_inodes+0x1b4/0x2a8  c0181c63 
 writeback_inodes+0x63/0xdc
 May 24 01:53:50 hq kernel: c0147943 background_writeout+0x66/0x9f  
 c01482b3 pdflush+0x0/0x1ad
 May 24 01:53:50 hq kernel: c01483a2 pdflush+0xef/0x1ad  c01478dd 
 background_writeout+0x0/0x9f
 May 24 01:53:50 hq kernel: c012d10b kthread+0xc2/0xc6  c012d049 
 kthread+0x0/0xc6
 May 24 01:53:50 hq kernel: c0100dd5 kernel_thread_helper+0x5/0xb
 
 ...and my logs are spammed with such messages. This internal error isn't a good 
 reason to shut down the file system?

Actually, that error does shut the filesystem down in most cases. When you
see that output, the function is returning -EFSCORRUPTED. You've got a corrupted
freespace btree.

The reason why you get spammed is that this is happening during background
writeback, and there is no one to return the -EFSCORRUPTED error to. The
background writeback path doesn't specifically detect shut down filesystems or
trigger shutdowns on errors because that happens in different layers so you
just end up with failed data writes. These errors will occur on the next
foreground data or metadata allocation and that will shut the filesystem down
at that point.

I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
this case we should be shutting down the filesystem.  That would certainly cut
down on the spamming and would not appear to change any other
behaviour.

 I think if there's a sign of a corrupted file system, the first thing we should do 
 is to stop writes (or the entire FS) and let the admin examine the 
 situation.

Yes, that's *exactly* what a shutdown does. In this case, your writes are
being stopped - hence the error messages - but the filesystem has not yet
been shutdown.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-27 Thread David Chinner
On Mon, May 28, 2007 at 03:50:17AM +0200, Pallai Roland wrote:
 On Monday 28 May 2007 02:30:11 David Chinner wrote:
  On Fri, May 25, 2007 at 04:35:36PM +0200, Pallai Roland wrote:
   .and I've spammed such messages. This internal error isn't a good
   reason to shut down the file system?
 
  Actually, that error does shut the filesystem down in most cases. When you
  see that output, the function is returning -EFSCORRUPTED. You've got a
  corrupted freespace btree.
 
  The reason why you get spammed is that this is happening during background
  writeback, and there is no one to return the -EFSCORRUPTED error to. The
  background writeback path doesn't specifically detect shut down filesystems
  or trigger shutdowns on errors because that happens in different layers so
  you just end up with failed data writes. These errors will occur on the
  next foreground data or metadata allocation and that will shut the
  filesystem down at that point.
 
  I'm not sure that we should be ignoring EFSCORRUPTED errors here; maybe in
  this case we should be shutting down the filesystem.  That would certainly
  cut down on the spamming and would not appear to change any other
  behaviour.
  If I remember correctly, my file system wasn't shut down at all; it 
 was writeable for the whole night, and yafc slowly wrote files to it. Maybe 
 all write operations had failed, but yafc doesn't warn.

So you never created new files or directories, unlinked files or
directories, did synchronous writes, etc? Just had slowly growing files?

  Spamming is just annoying when we need to find out what went wrong (my 
 kernel.log is 300MB), but for data security it's important to react to the 
 EFSCORRUPTED error in any case, I think. Please consider this.

The filesystem has responded correctly to the corruption in terms of
data security (i.e. failed the data write and warned noisily about
it), but it probably hasn't done everything it should.

Hmmm. A quick look at the Linux code makes me think that background
writeback on Linux has never been able to cause a shutdown in this
case. However, the same error on Irix will definitely cause a
shutdown, though.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-27 Thread David Chinner
On Mon, May 28, 2007 at 11:30:32AM +1000, Neil Brown wrote:
 
 Thanks everyone for your input.  There was some very valuable
 observations in the various emails.
 I will try to pull most of it together and bring out what seem to be
 the important points.
 
 
 1/ A BIO_RW_BARRIER request should never fail with -EOPNOTSUP.

Sounds good to me, but how do we test to see if the underlying
device supports barriers? Do we just assume that they do and
only change behaviour if -o nobarrier is specified in the mount
options?
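
i.e. something like this (device and mount point are hypothetical), for
the case where the admin knows the write cache is battery backed or
turned off:

mount -o nobarrier /dev/md0 /mnt/data
# the resulting mount entry can be checked in /proc/mounts
grep /mnt/data /proc/mounts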

 2/ Maybe barriers provide stronger semantics than are required.
 
  All write requests are synchronised around a barrier write.  This is
  often more than is required and apparently can cause a measurable
  slowdown.
 
  Also the FUA for the actual commit write might not be needed.  It is
  important for consistency that the preceding writes are in safe
  storage before the commit write, but it is not so important that the
  commit write is immediately safe on storage.  That isn't needed until
  a 'sync' or 'fsync' or similar.

The use of barriers in XFS assumes the commit write to be on stable
storage before it returns.  One of the ordering guarantees that we
need is that the transaction (commit write) is on disk before the
metadata block containing the change in the transaction is written
to disk and the current barrier behaviour gives us that.

  One possible alternative is:
- writes can overtake barriers, but barrier cannot overtake writes.

No, that breaks the above usage of a barrier.

- flush before the barrier, not after.
 
  This is considerably weaker, and hence cheaper. But I think it is
  enough for all filesystems (providing it is still an option to call
  blkdev_issue_flush on 'fsync').

No, not enough for XFS.

  Another alternative would be to tag each bio was being in a
  particular barrier-group.  Then bio's in different groups could
  overtake each other in either direction, but a BARRIER request must
  be totally ordered w.r.t. other requests in the barrier group.
  This would require an extra bio field, and would give the filesystem
  more appearance of control.  I'm not yet sure how much it would
  really help...

And that assumes the filesystem is tracking exact dependencies
between I/Os.  Such a mechanism would probably require filesystems
to be redesigned to use this, but I can see how it would be useful
for doing things like ensuring ordering between just an inode and
it's data writes.  What would the overhead of having to support
several hundred thousand different barrier groups be (i.e. one per
dirty inode in a system)?

 I think the implementation priorities here are:

Depending on the answer to my first question:

0/ implement a specific test for filesystems to run at mount time
   to determine if barriers are supported or not.

 1/ implement a zero-length BIO_RW_BARRIER option.
 2/ Use it (or otherwise) to make all dm and md modules handle
barriers (and loop?).
 3/ Devise and implement appropriate fall-backs with-in the block layer
so that  -EOPNOTSUP is never returned.
 4/ Remove unneeded cruft from filesystems (and elsewhere).

Sounds like a good start. ;)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 12:43:51AM -0500, Alberto Alonso wrote:
The difference between ext3 and XFS is that ext3 will remount to
   read-only on the first write error but XFS won't; XFS fails
   only the current operation, IMHO. The method of ext3 isn't perfect, but
   in practice, it's working well.
  
  XFS will shut down the filesystem if metadata corruption would occur
  due to a failed write. We don't immediately fail the filesystem on
  data write errors because on large systems you can get *transient*
  I/O errors (e.g. FC path failover) and so retrying failed data
  writes is useful for preventing unnecessary shutdowns of the
  filesystem.
  
  Different design criteria, different solutions...
 
 I think his point was that going into a read only mode causes a
 less catastrophic situation (ie. a web server can still serve
 pages).

Sure - but once you've detected one corruption or had metadata
I/O errors, can you trust the rest of the filesystem?

 I think that is a valid point, rather than shutting down
 the file system completely, an automatic switch to where the least
 disruption of service can occur is always desired.

I consider the possibility of serving out bad data (i.e after
a remount to readonly) to be the worst possible disruption of
service that can happen ;)

 Maybe the automatic failure mode could be something that is 
 configurable via the mount options.

If only it were that simple. Have you looked to see how many
hooks there are in XFS to shutdown without causing further
damage?

% grep FORCED_SHUTDOWN fs/xfs/*.[ch] fs/xfs/*/*.[ch] | wc -l
116

Changing the way we handle shutdowns would take a lot of time,
effort and testing. When can I expect a patch? ;)

 I personally have found the XFS file system to be great for
 my needs (except issues with NFS interaction, where the bug report
 never got answered), but that doesn't mean it can not be improved.

Got a pointer?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: [RFD] BIO_RW_BARRIER - what it means for devices, filesystems, and dm/md.

2007-05-25 Thread David Chinner
On Fri, May 25, 2007 at 05:58:25PM +1000, Neil Brown wrote:
 We can think of there being three types of devices:
  
 1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
   there is it is non-volatile.  Once a write completes it is 
   completely safe.  Such a device does not require barriers
   or -issue_flush_fn, and can respond to them either by a
 no-op or with -EOPNOTSUPP (the former is preferred).
 
 2/ FLUSHABLE.
   A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush.
 It may not support barrier requests.

So returns -EOPNOTSUPP to any barrier request?

 3/ BARRIER.
 A BARRIER device supports both blkdev_issue_flush and
   BIO_RW_BARRIER.  Either may be used to synchronise any
 write-behind cache to non-volatile storage (media).
 
 Handling of SAFE and FLUSHABLE devices is essentially the same and can
 work on a BARRIER device.  The BARRIER device has the option of more
 efficient handling.
 
 How does a filesystem use this?
 ===

 
 The filesystem will want to ensure that all preceding writes are safe
 before writing the barrier block.  There are two ways to achieve this.

Three, actually.

 1/  Issue all 'preceding writes', wait for them to complete (bi_endio
called), call blkdev_issue_flush, issue the commit write, wait
for it to complete, call blkdev_issue_flush a second time.
(This is needed for FLUSHABLE)

*nod*

 2/ Set the BIO_RW_BARRIER bit in the write request for the commit
 block.
(This is more efficient on BARRIER).

*nod*

3/ Use a SAFE device.

 The second, while much easier, can fail.

So we do a test I/O to see if the device supports them before
enabling that mode.  But, as we've recently discovered, this is not
sufficient to detect *correctly functioning* barrier support.

 So a filesystem should be
 prepared to deal with that failure by falling back to the first
 option.

I don't buy that argument.

 Thus the general sequence might be:
 
   a/ issue all preceding writes.
   b/ issue the commit write with BIO_RW_BARRIER

At this point, the filesystem has done everything it needs to ensure
that the block layer has been informed of the I/O ordering
requirements. Why should the filesystem now have to detect block
layer breakage, and then use a different block layer API to issue
the same I/O under the same constraints?

   c/ wait for the commit to complete.
  If it was successful - done.
  If it failed other than with EOPNOTSUPP, abort
  else continue
   d/ wait for all 'preceding writes' to complete
   e/ call blkdev_issue_flush
   f/ issue commit write without BIO_RW_BARRIER
   g/ wait for commit write to complete
if it failed, abort
    h/ call blkdev_issue_flush?

   DONE
 
 steps b and c can be left out if it is known that the device does not
 support barriers.  The only way to discover this to try and see if it
 fails.

That's a very linear, single-threaded way of looking at it... ;)

 I don't think any filesystem follows all these steps.
 
  ext3 has the right structure, but it doesn't include steps e and h.
  reiserfs is similar.  It does have a call to blkdev_issue_flush, but 
   that is only on the fsync path, so it isn't really protecting
   general journal commits.
  XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
depending on a whether it thinks the device handles barriers,
and finally 'g'.

That's right, except for the g (or c) bit - commit writes are
async and nothing waits for them - the I/O completion wakes anything
waiting on its completion.

(yes, all XFS barrier I/Os are issued async, which is why having to
handle an -EOPNOTSUPP error is a real pain. The fix I currently
have is to reissue the I/O from the completion handler, which is
ugly, ugly, ugly.)

 So for devices that support BIO_RW_BARRIER, and for devices that don't
 need any flush, they work OK, but for device that need flushing, but
 don't support BIO_RW_BARRIER, none of them work.  This should be easy
 to fix.

Right - XFS as it stands was designed to work on SAFE devices, and
we've modified it to work on BARRIER devices. We don't support
FLUSHABLE devices at all.

But if the filesystem supports BARRIER devices, I don't see any
reason why a filesystem needs to be modified to support FLUSHABLE
devices - the key point being that by the time the filesystem
has issued the commit write it has already waited for all its
dependent I/O, and so all the block device needs to do is
issue flushes on either side of the commit write.

 HOW DO MD or DM USE THIS
 
 
 1/ striping devices.
  This includes md/raid0 md/linear dm-linear dm-stripe and probably
  others. 
 
These devices can easily support blkdev_issue_flush by simply
calling blkdev_issue_flush on all component 

Re: raid5: I lost a XFS file system due to a minor IDE cable problem

2007-05-24 Thread David Chinner
On Thu, May 24, 2007 at 07:20:35AM -0400, Justin Piszcz wrote:
 Including XFS mailing list on this one.

Thanks Justin.

 On Thu, 24 May 2007, Pallai Roland wrote:
 
 
 Hi,
 
 I'm wondering why md raid5 accepts writes after 2 disks have failed. I've an
 array built from 7 drives; the filesystem is XFS. Yesterday, an IDE cable failed
 (my friend kicked it off the box onto the floor :) and 2 disks were kicked,
 but my download (yafc) did not stop - it tried to, and could, write to the
 file system for the whole night!
 Now I changed the cable and tried to reassemble the array (mdadm -f --run);
 the event counter increased from 4908158 up to 4929612 on the failed disks,
 but I cannot mount the file system, and 'xfs_repair -n' shows lots of errors
 there. This is explainable by the partially succeeded writes. Ext3 and JFS
 have an errors= mount option to switch the filesystem read-only on any error, but
 XFS hasn't: why?

-o ro,norecovery will allow you to mount the filesystem and get any
uncorrupted data off it.

You still may get shutdowns if you trip across corrupted metadata in
the filesystem, though.
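
A typical salvage sequence looks something like this (device and paths
are placeholders):

mount -o ro,norecovery /dev/md1 /mnt/rescue
# copy off whatever is still readable
cp -a /mnt/rescue/important /somewhere/safe/
umount /mnt/rescue
# then attempt a full repair before putting the filesystem back into service
xfs_repair /dev/md1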

 It's a good question too, but I think the md layer could
 save dumb filesystems like XFS if it denied writes after 2 disks have failed,
 and I cannot see a good reason why it doesn't behave this way.

How is *any* filesystem supposed to know that the underlying block
device has gone bad if it is not returning errors?

I did mention this exact scenario in the filesystems workshop back
in February - we'd *really* like to know if a RAID block device has gone
into degraded mode (i.e. lost a disk) so we can throttle new writes
until the rebuild has been completed.
fatal error (like 2 lost disks in RAID5, and 3 lost disks in RAID6)
would also be possible if only we could get the information out
of the block layer.

 Do you have a better idea of how I can avoid such filesystem corruption in the
 future? No, I don't want to use ext3 on this box. :)

Well, the problem is a bug in MD - it should have detected
drives going away and stopped access to the device until it was
repaired. You would have had the same problem with ext3, or JFS,
or reiser or any other filesystem, too.

 my mount error:
 XFS: Log inconsistent (didn't find previous header)
 XFS: failed to find log head
 XFS: log mount/recovery failed: error 5
 XFS: log mount failed

Your MD device is still hosed - error 5 = EIO; the md device is
reporting errors back to the filesystem now. You need to fix that
before trying to recover any data...

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group


Re: XFS and write barrier

2006-07-18 Thread David Chinner
On Tue, Jul 18, 2006 at 06:58:56PM +1000, Neil Brown wrote:
 On Tuesday July 18, [EMAIL PROTECTED] wrote:
  On Mon, Jul 17, 2006 at 01:32:38AM +0800, Federico Sevilla III wrote:
   On Sat, Jul 15, 2006 at 12:48:56PM +0200, Martin Steigerwald wrote:
I am currently gathering information to write an article about journal
filesystems with emphasis on write barrier functionality, how it
works, why journalling filesystems need write barrier and the current
implementation of write barrier support for different filesystems.
 
 "Journalling filesystems need write barrier" isn't really accurate.
 They can make good use of write barrier if it is supported, and where
 it isn't supported, they should use blkdev_issue_flush in combination
 with regular submit/wait.

blkdev_issue_flush() causes a write cache flush - just like a
barrier typically causes a write cache flush up to the I/O with the
barrier in it.  Both of these mechanisms provide the same thing - an
I/O barrier that enforces ordering of I/Os to disk.

Given that filesystems already indicate to the block layer when they
want a barrier, wouldn't it be better to get the block layer to issue
this cache flush if the underlying device doesn't support barriers
and it receives a barrier request?

FWIW, only XFS and Reiser3 use this function, and then only when
issuing an fsync when barriers are disabled, to make sure a common
test (fsync then power cycle) doesn't result in data loss...

  No one here seems to know; maybe Neil or the other folks on linux-raid
  can help us out with details on status of MD and write barriers?
 
 In 2.6.17, md/raid1 will detect if the underlying devices support
 barriers and if they all do, it will accept barrier requests from the
 filesystem and pass those requests down to all devices.
 
 Other raid levels will reject all barrier requests.

Any particular reason for not supporting barriers on the other types
of RAID?

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group