Re: [PATCH md 0 of 4] Introduction

2005-03-09 Thread Mike Tran
Hi Neil,

On Tue, 2005-03-08 at 21:17, Neil Brown wrote:
 On Monday March 7, [EMAIL PROTECTED] wrote:
  NeilBrown [EMAIL PROTECTED] wrote:
  
   The first two are trivial and should apply equally to 2.6.11
   
    The second two fix bugs that were introduced by the recent
    bitmap-based-intent-logging patches and so are not relevant
    to 2.6.11 yet.
  
  The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
  what the patch does.  What are the user-visible consequences of not fixing
  this?
 
 ---
 This fixes possible inconsistencies that might arise in a version-1 
 superblock when devices fail and are removed.
 
 Usage of version-1 superblocks is not yet widespread and no actual
 problems have been reported.
 

EVMS 2.5.1 (http://evms.sf.net) has provided support for creating MD
arrays with a version-1 superblock, and some EVMS users have actually
tried this new functionality.  You probably remember I posted a problem
and a patch to fix the version-1 superblock update code.

We will continue to test and will report any problems.

--
Regards,
Mike T.




Re: [PATCH md 0 of 4] Introduction

2005-03-09 Thread Peter T. Breuer
Neil Brown [EMAIL PROTECTED] wrote:
 On Tuesday March 8, [EMAIL PROTECTED] wrote:
  Have you remodelled the md/raid1 make_request() fn?
 
 Somewhat.  Write requests are queued, and raid1d submits them when
 it is happy that all bitmap updates have been done.

OK - so a slight modification of the kernel generic_make_request (I
haven't looked).  Mind you, I think that Paul said that just before
clearing bitmap entries, incoming requests were checked to see if a
bitmap entry should be marked again...

Perhaps both things happen: bitmap pages in memory are marked clean
after pending writes have finished, then marked dirty again as
necessary, then flushed; and when the flush finishes, the newly
accumulated requests are started.

 There is no '1/100th' second or anything like that.

I was trying in a way to give a definite image of what happens, rather
than speak abstractly. I'm sure that the ordinary kernel mechanism for
plugging and unplugging is used, as much as is possible. If you
unplug when the request struct reservoir is exhausted, then it will be
at 1K requests. If they are each 4KB, that will be every 4MB. At say
64MB/s, that will be every 1/16 s. And unplugging may happen more
frequently because of other kernel magic mumble mumble ...

 When a write request arrives, the queue is 'plugged', requests are
 queued, and bits in the in-memory bitmap are set.

OK.

 When the queue is unplugged (by the filesystem or timeout) the bitmap
 changes (if any) are flushed to disk, then the queued requests are
 submitted. 

That accumulates bitmap markings into the minimum number of extra
transactions.  It does impose extra latency, however.

I'm intrigued by exactly how you exert the memory pressure required to
force just the dirty bitmap pages out. I'll have to look it up.

 Bits on disk are cleaned lazily.

OK - so the disk bitmap state is always pessimistic. That's fine. Very
good.

 Note that for many applications, the bitmap does not need to be huge.
 4K is enough for 1 bit per 2-3 megabytes on many large drives.
 Having to sync 3 meg when just one block might be out-of-sync may seem
 like a waste, but it is heaps better than syncing 100Gig!!

Yes - I used 1 bit per 1K, falling back to 1 bit per 2MB under memory
pressure.

  And if so, do you also aggregate them? And what steps are taken to
  preserve write ordering constraints (do some overlying file systems
  still require these)?
 
 filesystems have never had any write ordering constraints, except that
 IO must not be processed before it is requested, nor after it has been
 acknowledged.  md continues to obey these constraints.

Out of curiosity, is aggregation done on the queued requests? Or are
they all kept at 4KB? (or whatever - 1KB).

Thanks!

Peter



Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Peter T. Breuer
NeilBrown [EMAIL PROTECTED] wrote:
 The second two fix bugs that were introduced by the recent 
 bitmap-based-intent-logging patches and so are not relevant

Neil - can you describe for me (us all?) what is meant by
intent-logging here.

Well, I can guess - I suppose the driver marks the bitmap before a write
(or group of writes) and unmarks it when they have completed
successfully.  Is that it?

If so, how does it manage to mark what it is _going_ to do (without
psychic powers) on the disk bitmap?  Unmarking is easy - that needs a
queue of things due to be unmarked in the bitmap, and a point in time at
which they are all unmarked at once on disk.

Then resync would only deal with the marked blocks.

Peter



Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Paul Clements
Peter T. Breuer wrote:
 Neil - can you describe for me (us all?) what is meant by
 intent-logging here.

Since I wrote a lot of the code, I guess I'll try...

 Well, I can guess - I suppose the driver marks the bitmap before a write
 (or group of writes) and unmarks it when they have completed
 successfully.  Is that it?

Yes. It marks the bitmap before writing (actually queues up the bitmap
and normal writes in bunches for the sake of performance). The code is
actually (loosely) based on your original bitmap (fr1) code.

 If so, how does it manage to mark what it is _going_ to do (without
 psychic powers) on the disk bitmap?

That's actually fairly easy. The pages for the bitmap are locked in 
memory, so you just dirty the bits you want (which doesn't actually 
incur any I/O) and then when you're about to perform the normal writes, 
you flush the dirty bitmap pages to disk.

Once the writes are complete, a thread (we have the raid1d thread doing 
this) comes back along and flushes the (now clean) bitmap pages back to 
disk. If the pages get dirty again in the meantime (because of more 
I/O), we just leave them dirty and don't touch the disk.
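
A minimal userspace sketch of that page lifecycle may help picture it
(the names, states, and structure below are illustrative stand-ins, not
the actual md/raid1 symbols):

#include <stdio.h>

enum pstate { CLEAN, DIRTY, NEEDS_WRITEBACK };

struct bitmap_page {
    int id;
    enum pstate state;
};

/* The writes covered by this page have completed: schedule a clean-back. */
static void writes_done(struct bitmap_page *pg)
{
    if (pg->state == DIRTY)
        pg->state = NEEDS_WRITEBACK;
}

/* More I/O arrives before the clean-back happens: just re-dirty the page. */
static void new_write(struct bitmap_page *pg)
{
    pg->state = DIRTY;
}

/* raid1d-style pass: write back only pages that stayed clean. */
static void daemon_pass(struct bitmap_page *pages, int n)
{
    for (int i = 0; i < n; i++) {
        if (pages[i].state == NEEDS_WRITEBACK) {
            printf("writing back page %d\n", pages[i].id);
            pages[i].state = CLEAN;
        }
        /* DIRTY pages are skipped: the disk is not touched for them */
    }
}

int main(void)
{
    struct bitmap_page pages[2] = { { 0, DIRTY }, { 1, DIRTY } };

    writes_done(&pages[0]);
    writes_done(&pages[1]);
    new_write(&pages[1]);   /* page 1 re-dirtied by more I/O */
    daemon_pass(pages, 2);  /* only page 0 is written back */
    return 0;
}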

 Then resync would only deal with the marked blocks.

Right. It clears the bitmap once things are back in sync.
--
Paul


Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Peter T. Breuer
Paul Clements [EMAIL PROTECTED] wrote:
 Peter T. Breuer wrote:
  Neil - can you describe for me (us all?) what is meant by
  intent-logging here.
 
 Since I wrote a lot of the code, I guess I'll try...

Hi, Paul. Thanks.

  Well, I can guess - I suppose the driver marks the bitmap before a write
  (or group of writes) and unmarks it when they have completed
  successfully.  Is that it?
 
 Yes. It marks the bitmap before writing (actually queues up the bitmap 
 and normal writes in bunches for the sake of performance). The code is 
 actually (loosely) based on your original bitmap (fr1) code.

Yeah, I can see the traces.  I'm a little tired right now, but some
aspects of this idea vaguely worry me.  I'll see if I manage to
articulate those worries here despite my state. And you can dispel
them :).

Let me first of all guess at the intervals involved. I assume you will
write the marked parts of the bitmap to disk every 1/100th of a second or
so?  (I'd probably opt for 1/10th of a second or even every second, just
to make sure it's not noticeable on bandwidth, and to heck with the
safety until we learn better what the tradeoffs are.)  Or perhaps once
every hundred transactions in busy times.

Now, there are races here.  You must mark the bitmap in memory before
every write, and unmark it after every complete write.  That is an
ordering constraint.  There is a race, however, in recording the bitmap
state to disk.  Without any rendezvous or handshake or other
synchronization, one would simply be snapshotting the in-memory bitmap
to disk every so often, and the on-disk bitmap would not always
accurately reflect the current state of completed transactions to the
mirror. The question is whether it shows an overly-pessimistic picture,
an overly-optimistic picture, or neither.

I would naively imagine straight off that it cannot in general be
(appropriately) pessimistic because it does not know what writes will
occur in the next 1/100th second in order to be able to mark those on
the disk bitmap before they happen.  In the next section of your answer,
however, you say this is what happens, and therefore I deduce that

   a) 1/100th second's worth of writes to the mirror are first queued
   b) the in-memory bitmap is marked for these (if it exists as separate)
   c) the dirty parts of that bitmap are written to disk(s)
   d) the queued writes are carried out on the mirror
   e) the in-memory bitmap is unmarked for these
   f) the newly cleaned parts of that bitmap are written to disk. 

You may even have some sort of direct mapping between the on-disk
bitmap and the memory image, which could be quite effective, but
may run into problems with the address range available (the bitmap must
be less than 2GB, no?), unless it maps only the necessary parts of the
bitmap at a time.  Well, if the kernel can manage that mapping window on
its own, that would be useful, and is probably what you have done.

But I digress. My immediate problem is that writes must be queued
first. I thought md traditionally did not queue requests, but instead
used its own make_request substitute to dispatch incoming requests as
they arrived.

Have you remodelled the md/raid1 make_request() fn?

And if so, do you also aggregate them? And what steps are taken to
preserve write ordering constraints (do some overlying file systems
still require these)?

  If so, how does it manage to mark what it is _going_ to do (without
  psychic powers) on the disk bitmap?  
 
 That's actually fairly easy. The pages for the bitmap are locked in 
 memory,

That limits the size to about 2GB - oh, but perhaps you are doing as I
did and release bitmap pages when they are not dirty. Yes, you must.

  so you just dirty the bits you want (which doesn't actually 
 incur any I/O) and then when you're about to perform the normal writes, 
 you flush the dirty bitmap pages to disk.

Hmm. I don't know how one can select pages to flush, but clearly one
can!  You must maintain a list of dirtied pages. This list cannot be
larger than the list of outstanding requests. If you use the generic
kernel mechanisms, that will be 1000 or so, max.

 Once the writes are complete, a thread (we have the raid1d thread doing 
 this) comes back along and flushes the (now clean) bitmap pages back to 
 disk.

OK... there is a potential race here too, however...

 If the pages get dirty again in the meantime (because of more 
 I/O), we just leave them dirty and don't touch the disk.

Hmm. This appears to me to be an optimization. OK.

  Then resync would only deal with the marked blocks.
 
 Right. It clears the bitmap once things are back in sync.

Well, OK. Thinking it through as I write I see fewer problems. Thank
you for the explanation, and well done.

I have been meaning to merge the patches and see what comes out. I
presume you left out the mechanisms I included to allow a mirror
component to aggressively notify the array when it feels sick, and when
it feels better again. That 

Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Neil Brown
On Monday March 7, [EMAIL PROTECTED] wrote:
 NeilBrown [EMAIL PROTECTED] wrote:
 
  The first two are trivial and should apply equally to 2.6.11
  
   The second two fix bugs that were introduced by the recent 
   bitmap-based-intent-logging patches and so are not relevant
   to 2.6.11 yet. 
 
 The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
 what the patch does.  What are the user-visible consequences of not fixing
 this?

---
This fixes possible inconsistencies that might arise in a version-1 
superblock when devices fail and are removed.

Usage of version-1 superblocks is not yet widespread and no actual
problems have been reported.

 
 
 Is the bitmap stuff now ready for Linus?

I agree with Paul - not yet.
I'd also like to get a bit more functionality in before it goes to
Linus, as that functionality may necessitate an interface change (I'm
not sure).
Specifically, I want the bitmap to be able to live near the superblock
rather than having to be in a file on a different filesystem.

NeilBrown


Re: [PATCH md 0 of 4] Introduction

2005-03-08 Thread Neil Brown
On Tuesday March 8, [EMAIL PROTECTED] wrote:
 
 But I digress. My immediate problem is that writes must be queued
 first. I thought md traditionally did not queue requests, but instead
 used its own make_request substitute to dispatch incoming requests as
 they arrived.
 
 Have you remodelled the md/raid1 make_request() fn?

Somewhat.  Write requests are queued, and raid1d submits them when
it is happy that all bitmap updates have been done.

There is no '1/100th' second or anything like that.
When a write request arrives, the queue is 'plugged', requests are
queued, and bits in the in-memory bitmap are set.
When the queue is unplugged (by the filesystem or timeout) the bitmap
changes (if any) are flushed to disk, then the queued requests are
submitted. 

Bits on disk are cleaned lazily.
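
As a rough userspace sketch of that plug/flush/submit cycle (every name
here is a hypothetical illustration, not the actual md/raid1 code):

#include <stdio.h>

#define MAX_QUEUED 16

static unsigned long mem_bitmap;   /* in-memory intent bitmap        */
static unsigned long disk_bitmap;  /* what is currently on disk      */
static int queued[MAX_QUEUED];     /* chunks of the held-back writes */
static int nqueued;

/* A write arrives while the queue is plugged: set the in-memory
 * intent bit (no I/O yet) and hold the write back. */
static void queue_write(int chunk)
{
    mem_bitmap |= 1UL << chunk;
    queued[nqueued++] = chunk;
}

/* Unplug (by the filesystem or a timeout): flush the bitmap changes
 * to disk first, then submit the queued writes. */
static void unplug(void)
{
    if (disk_bitmap != mem_bitmap)
        disk_bitmap = mem_bitmap;  /* stands in for the bitmap flush */
    for (int i = 0; i < nqueued; i++)
        printf("submitting write to chunk %d\n", queued[i]);
    nqueued = 0;
}

int main(void)
{
    queue_write(3);
    queue_write(7);
    unplug();  /* bits 3 and 7 reach disk before the data writes */
    printf("disk bitmap: %#lx\n", disk_bitmap);  /* prints 0x88 */
    return 0;
}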


Note that for many applications, the bitmap does not need to be huge.
4K is enough for 1 bit per 2-3 megabytes on many large drives.
Having to sync 3 meg when just one block might be out-of-sync may seem
like a waste, but it is heaps better than syncing 100Gig!!
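
A quick back-of-envelope check of that arithmetic, as a runnable toy
(the 100GB device size is just an assumption for illustration):

#include <stdio.h>

int main(void)
{
    unsigned long long dev   = 100ULL << 30;  /* a 100GB device          */
    unsigned long      bits  = 4096 * 8;      /* 4K bitmap = 32768 bits  */
    unsigned long long chunk = dev / bits;    /* bytes covered per bit   */

    /* prints 3276800 -- i.e. ~3MB of resync per dirty bit */
    printf("%llu bytes per bit\n", chunk);
    return 0;
}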

If a resync without bitmap logging takes 1 hour, I suspect a resync
with a 4K bitmap would have a good chance of finishing in under 1
minute (depending on locality of reference).  That is good enough for
me.

Of course, if one mirror is on the other side of the country, and a
normal sync requires 5 days over ADSL, then you would have a strong
case for a finer grained bitmap.

 
 And if so, do you also aggregate them? And what steps are taken to
 preserve write ordering constraints (do some overlying file systems
 still require these)?

filesystems have never had any write ordering constraints, except that
IO must not be processed before it is requested, nor after it has been
acknowledged.  md continues to obey these constraints.

NeilBrown


Re: [PATCH md 0 of 4] Introduction

2005-03-07 Thread Andrew Morton
NeilBrown [EMAIL PROTECTED] wrote:

 The first two are trivial and should apply equally to 2.6.11
 
  The second two fix bugs that were introduced by the recent 
  bitmap-based-intent-logging patches and so are not relevant
  to 2.6.11 yet. 

The changelog for the "Fix typo in super_1_sync" patch doesn't actually say
what the patch does.  What are the user-visible consequences of not fixing
this?


Is the bitmap stuff now ready for Linus?