Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown
On Saturday May 13, [EMAIL PROTECTED] wrote:
 Paul Clements [EMAIL PROTECTED] wrote:
 
  Andrew Morton wrote:
  
   The loss of pagecache coherency seems sad.  I assume there's never a
   requirement for userspace to read this file.
  
  Actually, there is. mdadm reads the bitmap file, so that would be 
  broken. Also, it's just useful for a user to be able to read the bitmap 
  (od -x, or similar) to figure out approximately how much more he's got 
  to resync to get an array in-sync. Other than reading the bitmap file, I 
  don't know of any way to determine that.
 
 Read it with O_DIRECT :(

Which is exactly what the next release of mdadm does.
As the patch comment said:

: With this approach the pagecache may contain data which is inconsistent with
: what is on disk.  To alleviate the problems this can cause, md invalidates
: the pagecache when releasing the file.  If the file is to be examined
: while the array is active (a non-critical but occasionally useful function),
: O_DIRECT io must be used.  And a new version of mdadm will have
: support for this.
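
For the curious, here is a minimal sketch of such an O_DIRECT read
(illustrative only - this is not mdadm's actual code):

    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            void *buf;
            ssize_t n;
            int fd;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <bitmap-file>\n", argv[0]);
                    return 1;
            }
            /* O_DIRECT requires the buffer, length and file offset
             * to be aligned; 4096 bytes is safe on any device. */
            if (posix_memalign(&buf, 4096, 4096))
                    return 1;
            fd = open(argv[1], O_RDONLY | O_DIRECT);
            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            while ((n = read(fd, buf, 4096)) > 0)
                    fwrite(buf, 1, n, stdout);  /* pipe into od -x */
            close(fd);
            free(buf);
            return 0;
    }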



Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Andrew Morton
Neil Brown [EMAIL PROTECTED] wrote:

 On Saturday May 13, [EMAIL PROTECTED] wrote:
  Paul Clements [EMAIL PROTECTED] wrote:
  
   Andrew Morton wrote:
   
The loss of pagecache coherency seems sad.  I assume there's never a
requirement for userspace to read this file.
   
   Actually, there is. mdadm reads the bitmap file, so that would be 
   broken. Also, it's just useful for a user to be able to read the bitmap 
   (od -x, or similar) to figure out approximately how much more he's got 
   to resync to get an array in-sync. Other than reading the bitmap file, I 
   don't know of any way to determine that.
  
  Read it with O_DIRECT :(
 
 Which is exactly what the next release of mdadm does.
 As the patch comment said:
 
 : With this approach the pagecache may contain data which is inconsistent with
 : what is on disk.  To alleviate the problems this can cause, md invalidates
 : the pagecache when releasing the file.  If the file is to be examined
 : while the array is active (a non-critical but occasionally useful function),
 : O_DIRECT io must be used.  And a new version of mdadm will have
 : support for this.

Which doesn't help `od -x' and is going to cause older mdadm userspace to
mysteriously and subtly fail.  Or does the user-kernel interface have
versioning which will prevent this?




Re: softraid and multiple distros

2006-05-14 Thread Mark Hahn
 What do I need to do when I want to install a different distro on the machine 
 with a raid5 array?
 Which files do I need? /etc/mdadm.conf? /etc/raidtab? both?

MD doesn't need any files to function, since it can auto-assemble
arrays based on their superblocks (for partition-type 0xfd).



Re: softraid and multiple distros

2006-05-14 Thread Dexter Filmore
On Sunday, 14 May 2006 16:50, you wrote:
  What do I need to do when I want to install a different distro on the
  machine with a raid5 array?
  Which files do I need? /etc/mdadm.conf? /etc/raidtab? both?

 MD doesn't need any files to function, since it can auto-assemble
 arrays based on their superblocks (for partition-type 0xfd).

I see. Now an issue arises that someone else here mentioned:
My first attempt was to use the entire disks; then it was hinted to me that
this approach wasn't too hot, so I made partitions.
Now the devices all have two superblocks: the ones left from the first try,
which are now kind of orphaned, and those now active.
Can I trust mdadm to handle this properly on its own?

Dex




Re: softraid and multiple distros

2006-05-14 Thread Mark Hahn
 Now the devices all have two superblocks: the ones left from the first try,
 which are now kind of orphaned, and those now active.
 Can I trust mdadm to handle this properly on its own?

I'm not sure what "properly" means.  You should not leave around 0xfd
partitions with bogus superblocks, since MD will certainly try to
assemble them.  I don't know offhand how it decides which components
to make into a single array (the UUID?), but why screw around?
For orphan partitions, either change the partition type,
or zero the superblock, or both...
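
For example (device names here are illustrative): "mdadm
--zero-superblock /dev/hda" clears the orphaned whole-disk
superblock, and fdisk's 't' command changes the type of any
partition you no longer want auto-assembled.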



Re: Problem with large devices > 2TB

2006-05-14 Thread H. Peter Anvin
Followup to:  [EMAIL PROTECTED]
By author: Jim Klimov [EMAIL PROTECTED]
In newsgroup: linux.dev.raid
 
   Since the new parted worked ok (older one didn't), we were happy
   until we tried a reboot. During the device initialization and after
   it the system only recognises the 6 or 7 partitions which start
   before the 2000Gb limit:
 

For a DOS partition table, there is no such thing as a partition
starting beyond 2 TB: the table stores 32-bit sector numbers, and
2^32 * 512 bytes is 2 TB.  You need to use GPT or some other more
sophisticated partition table.
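
(With a reasonably recent parted, for instance, "mklabel gpt" on the
device will create one.)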

-hpa


Re: softraid and multiple distros

2006-05-14 Thread Neil Brown
On Sunday May 14, [EMAIL PROTECTED] wrote:
 On Sunday, 14 May 2006 16:50, you wrote:
   What do I need to do when I want to install a different distro on the
   machine with a raid5 array?
   Which files do I need? /etc/mdadm.conf? /etc/raidtab? both?
 
  MD doesn't need any files to function, since it can auto-assemble
  arrays based on their superblocks (for partition-type 0xfd).
 
 I see. Now an issue arises that someone else here mentioned:
 My first attempt was to use the entire disks; then it was hinted to me
 that this approach wasn't too hot, so I made partitions.

I always use entire disks if I want the entire disks raided (sounds
obvious, doesn't it...)  I only use partitions when I want to vary the
raid layout for different parts of the disk (e.g. mirrored root, mirrored
swap, raid6 for the rest).   But that certainly doesn't mean it is
wrong to use partitions for the whole disk.


 Now the devices all have two superblocks: the ones left from the first try,
 which are now kind of orphaned, and those now active.
 Can I trust mdadm to handle this properly on its own?

You can tell mdadm where to look.  If you want to be sure that it
won't look at entire drives, only partitions, then a line like
   DEVICES /dev/[hs]d*[0-1]
in /etc/mdadm.conf might be what you want. 
However, as you should be listing the UUIDs in /etc/mdadm.conf, any
superblock with an unknown UUID will simply be ignored.
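
For example, a minimal sketch (the UUID is just a placeholder -
"mdadm --detail" prints the real one):

   DEVICE /dev/[hs]d*[0-1]
   ARRAY /dev/md0 UUID=xxxxxxxx:xxxxxxxx:xxxxxxxx:xxxxxxxx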

If you are relying on 0xfd autodetect to assemble your arrays, then
obviously the entire-disk superblocks will be ignored (because they
won't be in the right place in any partition).

NeilBrown


Re: [PATCH 008 of 8] md/bitmap: Change md/bitmap file handling to use bmap to file blocks.

2006-05-14 Thread Neil Brown

(replying to bits of several emails)

On Friday May 12, [EMAIL PROTECTED] wrote:
 Neil Brown [EMAIL PROTECTED] wrote:

  However some IO requests cannot complete until the filesystem I/O
  completes, so we need to be sure that the filesystem I/O won't block
  waiting for memory, or fail with -ENOMEM.
 
 That sounds like a complex deadlock.  Suppose the bitmap writeout requires
 some writeback to happen before it can get enough memory to proceed.
 

Exactly. Bitmap writeout must not block on fs-writeback.  It can block
on device writeout (e.g. queue congestion or mempool exhaustion) but
it must complete without waiting in the fs layer or above, and without
the possibility of any error other than -EIO.  Otherwise we can get
deadlocked writing to the raid array.  submit_bh (or submit_bio) is
certain to be safe in this respect.  I'm not so confident about
anything at the fs level.
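
Concretely, the write path amounts to something like this (a sketch
only, not the actual md/bitmap code):

    #include <linux/buffer_head.h>

    /* Sketch only.  The bh was set up (mapped, uptodate) when the
     * file was attached, and the block number came from bmap() at
     * that time, so nothing here calls into the filesystem - only
     * the block layer.  The only possible failure is -EIO, reported
     * via the completion handler. */
    static void bitmap_write_block(struct buffer_head *bh, sector_t block,
                                   void (*done)(struct buffer_head *, int))
    {
            bh->b_blocknr = block;
            lock_buffer(bh);
            clear_buffer_dirty(bh);
            bh->b_end_io = done;            /* completion sees -EIO, if any */
            submit_bh(WRITE, bh);
    }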

   Read it with O_DIRECT :(
  
  Which is exactly what the next release of mdadm does.
  As the patch comment said:
  
  : With this approach the pagecache may contain data which is inconsistent with
  : what is on disk.  To alleviate the problems this can cause, md invalidates
  : the pagecache when releasing the file.  If the file is to be examined
  : while the array is active (a non-critical but occasionally useful function),
  : O_DIRECT io must be used.  And a new version of mdadm will have
  : support for this.
 
 Which doesn't help `od -x' and is going to cause older mdadm userspace to
 mysteriously and subtly fail.  Or does the user-kernel interface have
 versioning which will prevent this?
 

As I said: 'non-critical'.  Nothing important breaks if reading the
file gets old data.  Reading the file while the array is active is
purely a curiosity thing.  There is information in /proc/mdstat which
gives a fairly coarse view of the same data.  It could lead to some
confusion, but if a compliant mdadm comes out before this gets into a
mainline kernel, I doubt there will be any significant issue.

Reading/writing the bitmap needs to work reliably when the array is not
active, but suitable sync/invalidate calls in the kernel should make
that work perfectly.

I know this is technically a regression in the user-space interface, and
you don't like such regressions, with good reason.  Maybe I could call
invalidate_inode_pages every few seconds, or whenever the atime
changes, just to be on the safe side :-)

   I have a patch which did that,
  but decided that the possibility of kmalloc failure at awkward times
  would make that not suitable.
 
 submit_bh() can and will allocate memory, although most decent device
 drivers should be OK.
 

submit_bh (like all decent device drivers) uses a mempool for memory
allocation so we can be sure that the delay in getting memory is
bounded by the delay for a few IO requests to complete, and we can be
sure the allocation won't fail.  This is perfectly fine.
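
To make that concrete (again a sketch, not md code):

    #include <linux/bio.h>

    /* bio_alloc() draws from a mempool; with a gfp_mask that allows
     * sleeping (GFP_NOIO) mempool_alloc() never returns NULL - at
     * worst it waits for an in-flight bio to complete and be freed
     * back into the pool, so the delay is bounded by IO completing
     * and the allocation cannot fail outright. */
    static struct bio *alloc_bitmap_bio(void)
    {
            return bio_alloc(GFP_NOIO, 1);  /* may sleep; never NULL */
    }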

  
  I don't think a_ops really provides an interface that I can use, partly
  because, as I said in a previous email, it isn't really a public
  interface to a filesystem.
 
 It's publicer than bmap+submit_bh!
 

I don't know how you can say that.

bmap is so public that it is exported to userspace through an IOCTL
and is used by lilo (admittedly only for reading, not writing).  More
significantly, it is used by swapfile, which is a completely independent
subsystem from the filesystem.  Contrast this with a_ops.  The primary
users of a_ops are routines like generic_file_{read,write} and
friends.  These are tools - library routines - that are used by
filesystems to implement their 'file_operations', which are the real
public interface.  As far as these uses go, it is not a public
interface.  Where a filesystem doesn't use some library routine, it
does not need to implement the matching functionality in the a_ops
interface.
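
Just how public?  The FIBMAP ioctl can be driven from a trivial
(root-only) program - a sketch:

    #include <fcntl.h>
    #include <linux/fs.h>           /* FIBMAP */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            int fd, block = 0;  /* in: logical block; out: physical block */

            if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                    return 1;
            if (ioctl(fd, FIBMAP, &block) < 0) {    /* needs root */
                    perror("FIBMAP");
                    return 1;
            }
            printf("block 0 of %s is device block %d\n", argv[1], block);
            close(fd);
            return 0;
    }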

The other main user is the 'VM' which might try to flush out or
invalidate pages.  However the VM isn't using this interface to
interact with files, but only to interact with pages, and it doesn't
much care what is done with the pages providing they get clean, or get
released, or whatever.

The way I perceive Linux design/development, active usage is far more
significant than documented design.  If some feature of an interface
isn't being actively used - by in-kernel code - then you cannot be
sure that feature will be uniformly implemented, or that it won't
change subtly next week.

So when I went looking for the best way to get md/bitmap to write to a
file, I didn't just look at the interface specs (which are pretty
poorly documented anyway), I looked at existing code.
I can find 3 different parts of the kernel that write to a file.
They are
   swap-file
   loop
   nfsd

nfsd uses vfs_read/vfs_write  which have too many failure/delay modes
  for me to safely use.
loop uses prepare_write/commit_write (if available) or f_op->write
  (not vfs_write - I wonder why) which is not much better than what
  nfsd uses.  And as far as I can tell