Re: [PATCH md 0 of 4] Introduction
Hi Neil, On Tue, 2005-03-08 at 21:17, Neil Brown wrote: On Monday March 7, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: The first two are trivial and should apply equally to 2.6.11. The second two fix bugs that were introduced by the recent bitmap-based-intent-logging patches and so are not relevant to 2.6.11 yet. The changelog for the Fix typo in super_1_sync patch doesn't actually say what the patch does. What are the user-visible consequences of not fixing this? --- This fixes possible inconsistencies that might arise in a version-1 superblock when devices fail and are removed. Usage of version-1 superblocks is not yet widespread and no actual problems have been reported. EVMS 2.5.1 (http://evms.sf.net) has provided support for creating MD arrays with version-1 superblocks, and some EVMS users have actually tried this new functionality. You probably remember I posted a problem and a patch to fix the version-1 superblock update code. We will continue to test and will report any problems. -- Regards, Mike T. - To unsubscribe from this list: send the line unsubscribe linux-raid in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH md 0 of 4] Introduction
Neil Brown [EMAIL PROTECTED] wrote: On Tuesday March 8, [EMAIL PROTECTED] wrote: Have you remodelled the md/raid1 make_request() fn? Somewhat. Write requests are queued, and raid1d submits them when it is happy that all bitmap updates have been done. OK - so a slight modification of the kernel generic_make_request (I haven't looked). Mind you, I think that Paul said that just before clearing bitmap entries, incoming requests were checked to see if a bitmap entry should be marked again. Perhaps both things happen: bitmap pages in memory are updated as clean after pending writes have finished, then marked as dirty as necessary, then flushed, and when the flush finishes new accumulated requests are started. There is no '1/100th' second or anything like that. I was trying in a way to give a definite image to what happens, rather than speak abstractly. I'm sure that the ordinary kernel mechanism for plugging and unplugging is used, as much as is possible. If you unplug when the request struct reservoir is exhausted, then it will be at 1K requests. If they are each 4KB, that will be every 4MB. At say 64MB/s, that will be every 1/16 s. And unplugging may happen more frequently because of other kernel magic mumble mumble ... When a write request arrives, the queue is 'plugged', requests are queued, and bits in the in-memory bitmap are set. OK. When the queue is unplugged (by the filesystem or timeout) the bitmap changes (if any) are flushed to disk, then the queued requests are submitted. That accumulates bitmap markings into the minimum number of extra transactions. It does impose extra latency, however. I'm intrigued by exactly how you exert the memory pressure required to force just the dirty bitmap pages out. I'll have to look it up. Bits on disk are cleaned lazily. OK - so the disk bitmap state is always pessimistic. That's fine. Very good. Note that for many applications, the bitmap does not need to be huge. 
4K is enough for 1 bit per 2-3 megabytes on many large drives. Having to sync 3 meg when just one block might be out-of-sync may seem like a waste, but it is heaps better than syncing 100Gig!! Yes - I used 1 bit per 1K, falling back to 1 bit per 2MB under memory pressure. And if so, do you also aggregate them? And what steps are taken to preserve write ordering constraints (do some overlying file systems still require these)? Filesystems have never had any write ordering constraints, except that IO must not be processed before it is requested, nor after it has been acknowledged. md continues to obey these constraints. Out of curiosity, is aggregation done on the queued requests? Or are they all kept at 4KB? (or whatever - 1KB). Thanks! Peter
Re: [PATCH md 0 of 4] Introduction
NeilBrown [EMAIL PROTECTED] wrote: The second two fix bugs that were introduced by the recent bitmap-based-intent-logging patches and so are not relevant to 2.6.11 yet. Neil - can you describe for me (us all?) what is meant by intent-logging here. Well, I can guess - I suppose the driver marks the bitmap before a write (or group of writes) and unmarks it when they have completed successfully. Is that it? If so, how does it manage to mark what it is _going_ to do (without psychic powers) on the disk bitmap? Unmarking is easy - that needs a queue of things due to be unmarked in the bitmap, and a point in time at which they are all unmarked at once on disk. Then resync would only deal with the marked blocks. Peter
Re: [PATCH md 0 of 4] Introduction
Peter T. Breuer wrote: Neil - can you describe for me (us all?) what is meant by intent-logging here. Since I wrote a lot of the code, I guess I'll try... Well, I can guess - I suppose the driver marks the bitmap before a write (or group of writes) and unmarks it when they have completed successfully. Is that it? Yes. It marks the bitmap before writing (actually queues up the bitmap and normal writes in bunches for the sake of performance). The code is actually (loosely) based on your original bitmap (fr1) code. If so, how does it manage to mark what it is _going_ to do (without psychic powers) on the disk bitmap? That's actually fairly easy. The pages for the bitmap are locked in memory, so you just dirty the bits you want (which doesn't actually incur any I/O) and then when you're about to perform the normal writes, you flush the dirty bitmap pages to disk. Once the writes are complete, a thread (we have the raid1d thread doing this) comes back along and flushes the (now clean) bitmap pages back to disk. If the pages get dirty again in the meantime (because of more I/O), we just leave them dirty and don't touch the disk. Then resync would only deal with the marked blocks. Right. It clears the bitmap once things are back in sync. -- Paul
Re: [PATCH md 0 of 4] Introduction
Paul Clements [EMAIL PROTECTED] wrote: Peter T. Breuer wrote: Neil - can you describe for me (us all?) what is meant by intent-logging here. Since I wrote a lot of the code, I guess I'll try... Hi, Paul. Thanks. Well, I can guess - I suppose the driver marks the bitmap before a write (or group of writes) and unmarks it when they have completed successfully. Is that it? Yes. It marks the bitmap before writing (actually queues up the bitmap and normal writes in bunches for the sake of performance). The code is actually (loosely) based on your original bitmap (fr1) code. Yeah, I can see the traces. I'm a little tired right now, but some aspects of this idea vaguely worry me. I'll see if I manage to articulate those worries here despite my state. And you can dispel them :). Let me first of all guess at the intervals involved. I assume you will write the marked parts of the bitmap to disk every 1/100th of a second or so? (I'd probably opt for 1/10th of a second or even every second just to make sure it's not noticeable on bandwidth and to heck with the safety until we learn better what the tradeoffs are). Or perhaps once every hundred transactions in busy times. Now, there are races here. You must mark the bitmap in memory before every write, and unmark it after every complete write. That is an ordering constraint. There is a race, however, to record the bitmap state to disk. Without any rendezvous or handshake or other synchronization, one would simply be snapshotting the in-memory bitmap to disk every so often, and the on-disk bitmap would not always accurately reflect the current state of completed transactions to the mirror. The question is whether it shows an overly-pessimistic picture, an overly-optimistic picture, or neither one nor the other. 
I would naively imagine straight off that it cannot in general be (appropriately) pessimistic because it does not know what writes will occur in the next 1/100th second in order to be able to mark those on the disk bitmap before they happen. In the next section of your answer, however, you say this is what happens, and therefore I deduce that:
a) 1/100th second's worth of writes to the mirror are first queued
b) the in-memory bitmap is marked for these (if it exists as separate)
c) the dirty parts of that bitmap are written to disk(s)
d) the queued writes are carried out on the mirror
e) the in-memory bitmap is unmarked for these
f) the newly cleaned parts of that bitmap are written to disk.
You may even have some sort of direct mapping between the on-disk bitmap and the memory image, which could be quite effective, but may run into problems with the address range available (bitmap must be less than 2GB, no?), unless it maps only the necessary parts of the bitmap at a time. Well, if the kernel can manage that mapping window on its own, it would be useful and probably what you have done. But I digress. My immediate problem is that writes must be queued first. I thought md traditionally did not queue requests, but instead used its own make_request substitute to dispatch incoming requests as they arrived. Have you remodelled the md/raid1 make_request() fn? And if so, do you also aggregate them? And what steps are taken to preserve write ordering constraints (do some overlying file systems still require these)? If so, how does it manage to mark what it is _going_ to do (without psychic powers) on the disk bitmap? That's actually fairly easy. The pages for the bitmap are locked in memory, That limits the size to about 2GB - oh, but perhaps you are doing as I did and release bitmap pages when they are not dirty. Yes, you must. 
so you just dirty the bits you want (which doesn't actually incur any I/O) and then when you're about to perform the normal writes, you flush the dirty bitmap pages to disk. Hmm. I don't know how one can select pages to flush, but clearly one can! You maintain a list of dirtied pages, clearly. This list cannot be larger than the list of outstanding requests. If you use the generic kernel mechanisms, that will be 1000 or so, max. Once the writes are complete, a thread (we have the raid1d thread doing this) comes back along and flushes the (now clean) bitmap pages back to disk. OK .. there is a potential race here too, however, ... If the pages get dirty again in the meantime (because of more I/O), we just leave them dirty and don't touch the disk. Hmm. This appears to me to be an optimization. OK. Then resync would only deal with the marked blocks. Right. It clears the bitmap once things are back in sync. Well, OK. Thinking it through as I write I see fewer problems. Thank you for the explanation, and well done. I have been meaning to merge the patches and see what comes out. I presume you left out the mechanisms I included to allow a mirror component to aggressively notify the array when it feels sick, and when it feels better again. That
Re: [PATCH md 0 of 4] Introduction
On Monday March 7, [EMAIL PROTECTED] wrote: NeilBrown [EMAIL PROTECTED] wrote: The first two are trivial and should apply equally to 2.6.11 The second two fix bugs that were introduced by the recent bitmap-based-intent-logging patches and so are not relevant to 2.6.11 yet. The changelog for the Fix typo in super_1_sync patch doesn't actually say what the patch does. What are the user-visible consequences of not fixing this? --- This fixes possible inconsistencies that might arise in a version-1 superblock when devices fail and are removed. Usage of version-1 superblocks is not yet widespread and no actual problems have been reported. Is the bitmap stuff now ready for Linus? I agree with Paul - not yet. I'd also like to get a bit more functionality in before it goes to Linus, as the functionality may necessitate an interface change (I'm not sure). Specifically, I want the bitmap to be able to live near the superblock rather than having to be in a file on a different filesystem. NeilBrown
Re: [PATCH md 0 of 4] Introduction
On Tuesday March 8, [EMAIL PROTECTED] wrote: But I digress. My immediate problem is that writes must be queued first. I thought md traditionally did not queue requests, but instead used its own make_request substitute to dispatch incoming requests as they arrived. Have you remodelled the md/raid1 make_request() fn? Somewhat. Write requests are queued, and raid1d submits them when it is happy that all bitmap updates have been done. There is no '1/100th' second or anything like that. When a write request arrives, the queue is 'plugged', requests are queued, and bits in the in-memory bitmap are set. When the queue is unplugged (by the filesystem or timeout) the bitmap changes (if any) are flushed to disk, then the queued requests are submitted. Bits on disk are cleaned lazily. Note that for many applications, the bitmap does not need to be huge. 4K is enough for 1 bit per 2-3 megabytes on many large drives. Having to sync 3 meg when just one block might be out-of-sync may seem like a waste, but it is heaps better than syncing 100Gig!! If a resync without bitmap logging takes 1 hour, I suspect a resync with a 4K bitmap would have a good chance of finishing in under 1 minute (depending on locality of reference). That is good enough for me. Of course, if one mirror is on the other side of the country, and a normal sync requires 5 days over ADSL, then you would have a strong case for a finer grained bitmap. And if so, do you also aggregate them? And what steps are taken to preserve write ordering constraints (do some overlying file systems still require these)? Filesystems have never had any write ordering constraints, except that IO must not be processed before it is requested, nor after it has been acknowledged. md continues to obey these constraints. NeilBrown
Re: [PATCH md 0 of 4] Introduction
NeilBrown [EMAIL PROTECTED] wrote: The first two are trivial and should apply equally to 2.6.11 The second two fix bugs that were introduced by the recent bitmap-based-intent-logging patches and so are not relevant to 2.6.11 yet. The changelog for the Fix typo in super_1_sync patch doesn't actually say what the patch does. What are the user-visible consequences of not fixing this? Is the bitmap stuff now ready for Linus?