Re: [patch] rewrite rd

2008-01-14 Thread Jens Axboe
On Mon, Jan 14 2008, Matthew Wilcox wrote:
> On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
> > +static void copy_to_brd(struct brd_device *brd, const void *src,
> > +   sector_t sector, size_t n)
> > +{
> > +   struct page *page;
> > +   void *dst;
> > +   unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> > +   size_t copy;
> > +
> > +   copy = min((unsigned long)n, PAGE_SIZE - offset);
> > +   page = brd_lookup_page(brd, sector);
> > +   BUG_ON(!page);
> > +
> > +   dst = kmap_atomic(page, KM_USER1);
> > +   memcpy(dst + offset, src, copy);
> > +   kunmap_atomic(dst, KM_USER1);
> 
> You're using kmap_atomic, but I see no reason you can't be preempted.
> Don't you need to at least disable preemption while you have stuff
> atomically kmapped?

kmap_atomic() disables preemption through pagefault_disable().

-- 
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2008-01-14 Thread Matthew Wilcox
On Tue, Dec 04, 2007 at 05:26:28AM +0100, Nick Piggin wrote:
> +static void copy_to_brd(struct brd_device *brd, const void *src,
> + sector_t sector, size_t n)
> +{
> + struct page *page;
> + void *dst;
> + unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> + size_t copy;
> +
> + copy = min((unsigned long)n, PAGE_SIZE - offset);
> + page = brd_lookup_page(brd, sector);
> + BUG_ON(!page);
> +
> + dst = kmap_atomic(page, KM_USER1);
> + memcpy(dst + offset, src, copy);
> + kunmap_atomic(dst, KM_USER1);

You're using kmap_atomic, but I see no reason you can't be preempted.
Don't you need to at least disable preemption while you have stuff
atomically kmapped?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-04 Thread Rob Landley
On Tuesday 04 December 2007 03:29:40 Nick Piggin wrote:
> On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote:
> > On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> > > There is one slight downside -- direct block device access and
> > > filesystem metadata access goes through an extra copy and gets stored
> > > in RAM twice. However, this downside is only slight, because the real
> > > buffercache of the device is now reclaimable (because we're not playing
> > > crazy games with it), so under memory intensive situations, footprint
> > > should effectively be the same -- maybe even a slight advantage to the
> > > new driver because it can also reclaim buffer heads.
> >
> > For the embedded world, initramfs has pretty much rendered initrd
> > obsolete, and that was the biggest user of the ramdisk code I know of. 
> > Beyond that, loopback mounts give you more flexible transient block
> > devices than ramdisks do.  (In fact, ramdisks are such an amazing pain to
> > use/size/free that if I really needed something like that I'd just make a
> > loopback mount in a ramfs instance.)
>
> They are, although we could easily hook up a couple more ioctls for them
> now (if anybody is interested).

>From a usability perspective, an ioctl is no substitute for being able to 
resize a device with "dd".  (Or for that matter, make it sparse.)

> The rd driver can potentially be a _lot_ more scalable than the loop
> driver. It's completely lockless in the fastpath and doesn't even dirty any
> cachelines except for the actual destination memory. OK, this is only
> really important for testing purposes (eg. testing scalability of a
> filesystem etc.) But it is one reason why I (personally) want brd...

I wouldn't say not important: The "database in RAM" people will probably love 
you, at least if they insist on using Oracle.  (Filesystem schmilesystem, 
gimme a raw block device and let me implement my own filesystem from 
userspace...)

But I'm not personally likely to care. :)

> > Embedded users who still want a block interface for memory are generally
> > trying to use a cramfs or squashfs image out of ROM or flash, although
> > there are flash-specific filesystems for this and I dunno if they're
> > actually mounting /dev/mem at an offset or something (md?  losetup -o? 
> > Beats me, I haven't tried that myself yet...)
>
> OK, it would be interesting to hear from anyone using rd for things like
> cramfs.

I'd be interested in that too.  I've heard it proposed a lot but not actually 
seen anyone bother to implement it yet...

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-04 Thread Nick Piggin
On Tue, Dec 04, 2007 at 10:54:51AM +0100, Christian Borntraeger wrote:
> Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:
> [...]
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> This is just an idea, I dont know if it is worth the trouble, but have you
> though about implementing direct_access for brd? That would allow 
> execute-in-place (xip) on brd eliminating the extra copy.

Actually that's a pretty good idea. It would allow xip to be tested
without special hardware as well...

I'll see what the patch looks like. Thanks

-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-04 Thread Christian Borntraeger
Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:
[...]
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

This is just an idea, I dont know if it is worth the trouble, but have you
though about implementing direct_access for brd? That would allow 
execute-in-place (xip) on brd eliminating the extra copy.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-04 Thread Nick Piggin
On Tue, Dec 04, 2007 at 01:55:17AM -0600, Rob Landley wrote:
> On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> For the embedded world, initramfs has pretty much rendered initrd obsolete, 
> and that was the biggest user of the ramdisk code I know of.  Beyond that, 
> loopback mounts give you more flexible transient block devices than ramdisks 
> do.  (In fact, ramdisks are such an amazing pain to use/size/free that if I 
> really needed something like that I'd just make a loopback mount in a ramfs 
> instance.)

They are, although we could easily hook up a couple more ioctls for them
now (if anybody is interested).

The rd driver can potentially be a _lot_ more scalable than the loop driver.
It's completely lockless in the fastpath and doesn't even dirty any
cachelines except for the actual destination memory. OK, this is only
really important for testing purposes (eg. testing scalability of a
filesystem etc.) But it is one reason why I (personally) want brd...
 

> Embedded users who still want a block interface for memory are generally 
> trying to use a cramfs or squashfs image out of ROM or flash, although there 
> are flash-specific filesystems for this and I dunno if they're actually 
> mounting /dev/mem at an offset or something (md?  losetup -o?  Beats me, I 
> haven't tried that myself yet...)

OK, it would be interesting to hear from anyone using rd for things like
cramfs. I don't think we can get rid of the code entirely, but it sounds
like the few downsides of the new code won't be big problems.

Thanks for the feedback.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-03 Thread Rob Landley
On Monday 03 December 2007 22:26:28 Nick Piggin wrote:
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

For the embedded world, initramfs has pretty much rendered initrd obsolete, 
and that was the biggest user of the ramdisk code I know of.  Beyond that, 
loopback mounts give you more flexible transient block devices than ramdisks 
do.  (In fact, ramdisks are such an amazing pain to use/size/free that if I 
really needed something like that I'd just make a loopback mount in a ramfs 
instance.)

Embedded users who still want a block interface for memory are generally 
trying to use a cramfs or squashfs image out of ROM or flash, although there 
are flash-specific filesystems for this and I dunno if they're actually 
mounting /dev/mem at an offset or something (md?  losetup -o?  Beats me, I 
haven't tried that myself yet...)

Rob
-- 
"One of my most productive days was throwing away 1000 lines of code."
  - Ken Thompson.
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-03 Thread Nick Piggin
On Tue, Dec 04, 2007 at 08:01:31AM +0100, Nick Piggin wrote:
> Thanks for the review, I'll post an incremental patch in a sec.
 

Index: linux-2.6/drivers/block/brd.c
===
--- linux-2.6.orig/drivers/block/brd.c
+++ linux-2.6/drivers/block/brd.c
@@ -56,7 +56,7 @@ struct brd_device {
  */
 static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
 {
-   unsigned long idx;
+   pgoff_t idx;
struct page *page;
 
/*
@@ -87,13 +87,17 @@ static struct page *brd_lookup_page(stru
  */
 static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
 {
-   unsigned long idx;
+   pgoff_t idx;
struct page *page;
 
page = brd_lookup_page(brd, sector);
if (page)
return page;
 
+   /*
+* Must use NOIO because we don't want to recurse back into the
+* block or filesystem layers from page reclaim.
+*/
page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
if (!page)
return NULL;
@@ -148,6 +152,11 @@ static void brd_free_pages(struct brd_de
 
pos++;
 
+   /*
+* This assumes radix_tree_gang_lookup always returns as
+* many pages as possible. If the radix-tree code changes,
+* so will this have to.
+*/
} while (nr_pages == FREE_BATCH);
 }
 
@@ -159,7 +168,7 @@ static int copy_to_brd_setup(struct brd_
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
size_t copy;
 
-   copy = min((unsigned long)n, PAGE_SIZE - offset);
+   copy = min_t(size_t, n, PAGE_SIZE - offset);
if (!brd_insert_page(brd, sector))
return -ENOMEM;
if (copy < n) {
@@ -181,7 +190,7 @@ static void copy_to_brd(struct brd_devic
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
size_t copy;
 
-   copy = min((unsigned long)n, PAGE_SIZE - offset);
+   copy = min_t(size_t, n, PAGE_SIZE - offset);
page = brd_lookup_page(brd, sector);
BUG_ON(!page);
 
@@ -213,7 +222,7 @@ static void copy_from_brd(void *dst, str
unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
size_t copy;
 
-   copy = min((unsigned long)n, PAGE_SIZE - offset);
+   copy = min_t(size_t, n, PAGE_SIZE - offset);
page = brd_lookup_page(brd, sector);
if (page) {
src = kmap_atomic(page, KM_USER1);
@@ -318,6 +327,9 @@ static int brd_ioctl(struct inode *inode
/*
 * Invalidate the cache first, so it isn't written
 * back to the device.
+*
+* Another thread might instantiate more buffercache here,
+* but there is not much we can do to close that race.
 */
invalidate_bh_lrus();
truncate_inode_pages(bdev->bd_inode->i_mapping, 0);
-
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch] rewrite rd

2007-12-03 Thread Nick Piggin
On Mon, Dec 03, 2007 at 10:29:03PM -0800, Andrew Morton wrote:
> On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:
> > 
> > There is one slight downside -- direct block device access and filesystem
> > metadata access goes through an extra copy and gets stored in RAM twice.
> > However, this downside is only slight, because the real buffercache of the
> > device is now reclaimable (because we're not playing crazy games with it),
> > so under memory intensive situations, footprint should effectively be the
> > same -- maybe even a slight advantage to the new driver because it can also
> > reclaim buffer heads.
> 
> what about mmap(/dev/ram0)?

Same thing I guess -- it will use more memory in the common case, but should
actually be able to reclaim slightly more memory under pressure, so the
tiny systems guys shouldn't have too much trouble.

And we're avoiding the whole class of aliasing problems where mmap on
the old rd driver means that you're creating new mappings to your backing
store pages.

 
> > Text is larger, but data and bss are smaller, making total size smaller.
> > 
> > A few other nice things about it:
> > - Similar structure and layout to the new loop device handlinag.
> > - Dynamic ramdisk creation.
> > - Runtime flexible buffer head size (because it is no longer part of the
> >   ramdisk code).
> > - Boot / load time flexible ramdisk size, which could easily be extended
> >   to a per-ramdisk runtime changeable size (eg. with an ioctl).
> 
> This ramdisk driver can use highmem whereas the existing one can't (yes?). 
> That's a pretty major difference.  Plus look at the revoltingness in rd.c's
> mapping_set_gfp_mask()s.

Ah yep, there is the highmem advantage too.


> > +#define SECTOR_SHIFT   9
> 
> That's our third definition of SECTOR_SHIFT.

Or 7th, depend on how you count ;) I always thought redefining it is a
prerequsite to getting anything merged into the block layer, so I'm too
scared to put it in include/linux/blkdev.h ;)


> > +/*
> > + * Look up and return a brd's page for a given sector.
> > + */
> > +static struct page *brd_lookup_page(struct brd_device *brd, sector_t 
> > sector)
> > +{
> > +   unsigned long idx;
> 
> Could use pgoff_t here if you think that's clearer.

I guess it is. Although on one hand the radix-tree uses unsigned long, but
on the other hand, page->index is pgoff.

> > +{
> > +   unsigned long idx;
> > +   struct page *page;
> > +
> > +   page = brd_lookup_page(brd, sector);
> > +   if (page)
> > +   return page;
> > +
> > +   page = alloc_page(GFP_NOIO | __GFP_HIGHMEM | __GFP_ZERO);
> 
> Why GFP_NOIO?

I guess it has similar issues to rd.c -- we can't risk recursing
into the block layer here. However unlike rd.c, we _could_ easily
add some mode or ioctl to allocate the backing store upfront, with
full reclaim flags...

 
> Have you thought about __GFP_MOVABLE treatment here?  Seems pretty
> desirable as this sucker can be large.

AFAIK that only applies to pagecache but I haven't been paying much
attention to that stuff lately. It wouldn't be hard to move these
pages around, if we had a hook from the vm for it.
 

> > +static void brd_free_pages(struct brd_device *brd)
> > +{
> > +   unsigned long pos = 0;
> > +   struct page *pages[FREE_BATCH];
> > +   int nr_pages;
> > +
> > +   do {
> > +   int i;
> > +
> > +   nr_pages = radix_tree_gang_lookup(&brd->brd_pages,
> > +   (void **)pages, pos, FREE_BATCH);
> > +
> > +   for (i = 0; i < nr_pages; i++) {
> > +   void *ret;
> > +
> > +   BUG_ON(pages[i]->index < pos);
> > +   pos = pages[i]->index;
> > +   ret = radix_tree_delete(&brd->brd_pages, pos);
> > +   BUG_ON(!ret || ret != pages[i]);
> > +   __free_page(pages[i]);
> > +   }
> > +
> > +   pos++;
> > +
> > +   } while (nr_pages == FREE_BATCH);
> > +}
> 
> I have vague memories that radix_tree_gang_lookup()'s "naive"
> implementation may return fewer items than you asked for even when there
> are more items remaining - when it hits certain boundaries.

Good memory, but it's the low level leaf traversal that bales out at
boundaries. The higher level code then retries, so we should be OK
here.

> > +/*
> > + * copy_to_brd_setup must be called before copy_to_brd. It may sleep.
> > + */
> > +static int copy_to_brd_setup(struct brd_device *brd, sector_t sector, 
> > size_t n)
> > +{
> > +   unsigned int offset = (sector & (PAGE_SECTORS-1)) << SECTOR_SHIFT;
> > +   size_t copy;
> > +
> > +   copy = min((unsigned long)n, PAGE_SIZE - offset);
> 
> use min_t.  Or, better, sort out the types.

I guess the API is using size_t, so that would be the more approprite type
to convert to. And I'll use min_t too.


> > +static void copy_to_brd(struct brd_device *brd, const void *src,
> > +   sector_t sector, size_t n)
> > +{
> > +   struct page *page;
> > +  

Re: [patch] rewrite rd

2007-12-03 Thread Andrew Morton
On Tue, 4 Dec 2007 05:26:28 +0100 Nick Piggin <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> This is my proposal for a (hopefully) backwards compatible rd driver.
> The motivation for me is not pressing, except that I have this code
> sitting here that is either going to rot or get merged. I'm happy to
> make myself maintainer of this code, but if any real block device
> driver writer would like to, that would be fine by me ;)
> 
> Comments?
> 
> --
> 
> This is a rewrite of the ramdisk block device driver.
> 
> The old one is really difficult because it effectively implements a block
> device which serves data out of its own buffer cache. It relies on the dirty
> bit being set, to pin its backing store in cache, however there are non
> trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
> which had recently lead to data corruption. And in general it is completely
> wrong for a block device driver to do this.
> 
> The new one is more like a regular block device driver. It has no idea
> about vm/vfs stuff. It's backing store is similar to the buffer cache
> (a simple radix-tree of pages), but it doesn't know anything about page
> cache (the pages in the radix tree are not pagecache pages).
> 
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

what about mmap(/dev/ram0)?

> The fact that it now goes through all the regular vm/fs paths makes it
> much more useful for testing, too.
> 
>textdata bss dec hex filename
>2837 849 3844070 fe6 drivers/block/rd.o
>3528 371  123911 f47 drivers/block/brd.o
> 
> Text is larger, but data and bss are smaller, making total size smaller.
> 
> A few other nice things about it:
> - Similar structure and layout to the new loop device handlinag.
> - Dynamic ramdisk creation.
> - Runtime flexible buffer head size (because it is no longer part of the
>   ramdisk code).
> - Boot / load time flexible ramdisk size, which could easily be extended
>   to a per-ramdisk runtime changeable size (eg. with an ioctl).

This ramdisk driver can use highmem whereas the existing one can't (yes?). 
That's a pretty major difference.  Plus look at the revoltingness in rd.c's
mapping_set_gfp_mask()s.

> +++ linux-2.6/drivers/block/brd.c
> @@ -0,0 +1,536 @@
> +/*
> + * Ram backed block device driver.
> + *
> + * Copyright (C) 2007 Nick Piggin
> + * Copyright (C) 2007 Novell Inc.
> + *
> + * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
> + * of their respective owners.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include  /* invalidate_bh_lrus() */
> +
> +#include 
> +
> +#define SECTOR_SHIFT 9

That's our third definition of SECTOR_SHIFT.

> +#define PAGE_SECTORS_SHIFT   (PAGE_SHIFT - SECTOR_SHIFT)
> +#define PAGE_SECTORS (1 << PAGE_SECTORS_SHIFT)
> +
> +/*
> + * Each block ramdisk device has a radix_tree brd_pages of pages that stores
> + * the pages containing the block device's contents. A brd page's ->index is
> + * its offset in PAGE_SIZE units. This is similar to, but in no way connected
> + * with, the kernel's pagecache or buffer cache (which sit above our block
> + * device).
> + */
> +struct brd_device {
> + int brd_number;
> + int brd_refcnt;
> + loff_t  brd_offset;
> + loff_t  brd_sizelimit;
> + unsignedbrd_blocksize;
> +
> + struct request_queue*brd_queue;
> + struct gendisk  *brd_disk;
> + struct list_headbrd_list;
> +
> + /*
> +  * Backing store of pages and lock to protect it. This is the contents
> +  * of the block device.
> +  */
> + spinlock_t  brd_lock;
> + struct radix_tree_root  brd_pages;
> +};
> +
> +/*
> + * Look up and return a brd's page for a given sector.
> + */
> +static struct page *brd_lookup_page(struct brd_device *brd, sector_t sector)
> +{
> + unsigned long idx;

Could use pgoff_t here if you think that's clearer.

> + struct page *page;
> +
> + /*
> +  * The page lifetime is protected by the fact that we have opened the
> +  * device node -- brd pages will never be deleted under us, so we
> +  * don't need any further locking or refcounting.
> +  *
> +  * This is strictly true for the radix-tree nodes as well (ie. we
> +  * don't actually need the rcu_read_lock()), however that is not a
> +  * documented feature of the radix-tree API so it is better to be
> +  

[patch] rewrite rd

2007-12-03 Thread Nick Piggin
Hi,

This is my proposal for a (hopefully) backwards compatible rd driver.
The motivation for me is not pressing, except that I have this code
sitting here that is either going to rot or get merged. I'm happy to
make myself maintainer of this code, but if any real block device
driver writer would like to, that would be fine by me ;)

Comments?

--

This is a rewrite of the ramdisk block device driver.

The old one is really difficult because it effectively implements a block
device which serves data out of its own buffer cache. It relies on the dirty
bit being set, to pin its backing store in cache, however there are non
trivial paths which can clear the dirty bit (eg. try_to_free_buffers()),
which had recently lead to data corruption. And in general it is completely
wrong for a block device driver to do this.

The new one is more like a regular block device driver. It has no idea
about vm/vfs stuff. It's backing store is similar to the buffer cache
(a simple radix-tree of pages), but it doesn't know anything about page
cache (the pages in the radix tree are not pagecache pages).

There is one slight downside -- direct block device access and filesystem
metadata access goes through an extra copy and gets stored in RAM twice.
However, this downside is only slight, because the real buffercache of the
device is now reclaimable (because we're not playing crazy games with it),
so under memory intensive situations, footprint should effectively be the
same -- maybe even a slight advantage to the new driver because it can also
reclaim buffer heads.

The fact that it now goes through all the regular vm/fs paths makes it
much more useful for testing, too.

   textdata bss dec hex filename
   2837 849 3844070 fe6 drivers/block/rd.o
   3528 371  123911 f47 drivers/block/brd.o

Text is larger, but data and bss are smaller, making total size smaller.

A few other nice things about it:
- Similar structure and layout to the new loop device handlinag.
- Dynamic ramdisk creation.
- Runtime flexible buffer head size (because it is no longer part of the
  ramdisk code).
- Boot / load time flexible ramdisk size, which could easily be extended
  to a per-ramdisk runtime changeable size (eg. with an ioctl).

Signed-off-by: Nick Piggin <[EMAIL PROTECTED]>
---
 MAINTAINERS|5 
 drivers/block/Kconfig  |   12 -
 drivers/block/Makefile |2 
 drivers/block/brd.c|  500 +
 drivers/block/rd.c |  536 -
 fs/buffer.c|1 
 6 files changed, 508 insertions(+), 548 deletions(-)

Index: linux-2.6/drivers/block/Kconfig
===
--- linux-2.6.orig/drivers/block/Kconfig
+++ linux-2.6/drivers/block/Kconfig
@@ -311,7 +311,7 @@ config BLK_DEV_UB
  If unsure, say N.
 
 config BLK_DEV_RAM
-   tristate "RAM disk support"
+   tristate "RAM block device support"
---help---
  Saying Y here will allow you to use a portion of your RAM memory as
  a block device, so that you can make file systems on it, read and
@@ -346,16 +346,6 @@ config BLK_DEV_RAM_SIZE
  The default value is 4096 kilobytes. Only change this if you know
  what you are doing.
 
-config BLK_DEV_RAM_BLOCKSIZE
-   int "Default RAM disk block size (bytes)"
-   depends on BLK_DEV_RAM
-   default "1024"
-   help
- The default value is 1024 bytes.  PAGE_SIZE is a much more
- efficient choice however.  The default is kept to ensure initrd
- setups function - apparently needed by the rd_load_image routine
- that supposes the filesystem in the image uses a 1024 blocksize.
-
 config CDROM_PKTCDVD
tristate "Packet writing on CD/DVD media"
depends on !UML
Index: linux-2.6/drivers/block/Makefile
===
--- linux-2.6.orig/drivers/block/Makefile
+++ linux-2.6/drivers/block/Makefile
@@ -11,7 +11,7 @@ obj-$(CONFIG_AMIGA_FLOPPY)+= amiflop.o
 obj-$(CONFIG_PS3_DISK) += ps3disk.o
 obj-$(CONFIG_ATARI_FLOPPY) += ataflop.o
 obj-$(CONFIG_AMIGA_Z2RAM)  += z2ram.o
-obj-$(CONFIG_BLK_DEV_RAM)  += rd.o
+obj-$(CONFIG_BLK_DEV_RAM)  += brd.o
 obj-$(CONFIG_BLK_DEV_LOOP) += loop.o
 obj-$(CONFIG_BLK_DEV_PS2)  += ps2esdi.o
 obj-$(CONFIG_BLK_DEV_XD)   += xd.o
Index: linux-2.6/drivers/block/brd.c
===
--- /dev/null
+++ linux-2.6/drivers/block/brd.c
@@ -0,0 +1,536 @@
+/*
+ * Ram backed block device driver.
+ *
+ * Copyright (C) 2007 Nick Piggin
+ * Copyright (C) 2007 Novell Inc.
+ *
+ * Parts derived from drivers/block/rd.c, and drivers/block/loop.c, copyright
+ * of their respective owners.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include  /* invalida