Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox  wrote:
> On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
>> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams  
>> wrote:
>> >> The filesystem I'm concerned with is AXFS
>> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf),
>> >> which I've been planning to try to merge again due to a recent
>> >> resurgence of interest.  The device model for AXFS is... weird.  It
>> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> >> block, and unmanaged physical memory.  It's a terribly useful model
>> >> for embedded.  Anyway AXFS is read-only, so hacking in a read-only
>> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> >> enough.  But... it would be cool if similar small embedded-focused RW
>> >> filesystems were enabled.
>> >
>> > Are those also out of tree?
>>
>> Of course.  Merging embedded filesystems is a little like merging regular
>> filesystems, except 98% of the reviewers don't want it merged.
>
> You should at least be able to get it into staging these days.  I mean,
> look at some of the junk that's in staging ... and I don't think AXFS was
> nearly as bad.

Thanks? ;)

>> IMO you're making DAX more complex by overly coupling to the bdev and
>> I think it could bite you later.  I submit this rework of the radix
>> tree and confusion about where to get the real bdev as evidence.  I'm
>> guessing that it won't be the last time.  It's unnecessary to couple
>> it like this, and in fact is not how the vfs has been layered in the
>> past.
>
> Huh?  The rework to use the radix tree for PFNs was done with one eye
> firmly on your usage case.  Just because I had to thread the get_block
> interface through it for the moment doesn't mean that I didn't have
> the "how do we get rid of get_block entirely" question on my mind.

Oh yeah.  I think we're on the same page.  But I'm not sure Dan is.  I
get the need to phase this in too.

> Using get_block seemed like the right idea three years ago.  I didn't
> know just how fundamentally ext4 and XFS disagree on how it should be
> used.

Sure.  I can see that.

>> To look at the downside, consider dax_fault().  It's called on a
>> fault to a user memory map, and uses the filesystem's get_block() to
>> look up a sector so you can ask a block device to convert it to an
>> address on a DIMM.  Come on, that's awkward.  Everything around
>> dax_fault() is dripping with memory semantic interfaces, the
>> dax_fault() calls are fundamentally about memory, the pmem calls are
>> memory, the hardware is memory, and yet it directly calls
>> bdev_direct_access().  It's out of place.
>
> What was out of place was the old 'get_xip_mem' in address_space
> operations.  Returning a kernel virtual address and a PFN from a
> filesystem operation?  That looks awful.

Yes.  Yes it does!  But at least my big hack was just one line. ;)
Nobody really even seemed to notice at the time.

>  All the other operations deal
> in struct pages, file offsets and occasionally sectors.  Of course, we
> don't have a struct page, so a pfn makes sense, but the kernel virtual
> address being returned was a gargantuan layering problem.

Well yes, but it was an expedient hack.

>> The legacy vfs/mm code didn't have this layering problem either.  Even
>> filemap_fault(), which dax_fault() is modeled after, doesn't call any
>> bdev methods directly; when it needs something it asks the filesystem
>> with a ->readpage().  The precedent is that you ask the filesystem
>> for what you need.  Look at the get_bdev() thing you've concluded you
>> need.  It _almost_ makes my point.  I just happen to be of the opinion
>> that you don't actually want or need the bdev, you want the pfn/kaddr
>> so you can flush or map or memcpy().
>
> You want the pfn.  The device driver doesn't have enough information to
> give you a (coherent with userspace) kaddr.  That's what (some future
> arch-specific implementation of) dax_map_pfn() is for.  That's why it
> takes 'index' as a parameter, so you can calculate where it'll be mapped
> in userspace, and determine an appropriate kernel virtual address to
> use for it.

Oh, I think I'm just beginning to catch your vision for dax_map_pfn().
I still don't get why we can't just do semi-arch-specific flushing
instead of the alignment thing.  But that just might be epic ignorance
on my part.  Either way, flush or magic alignments, dax_(un)map_pfn()
would handle it, right?
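For readers following along, the aliasing problem behind the "alignment thing" can be illustrated with a toy colour calculation.  Everything here is invented for illustration (4 KiB pages, a hypothetical 4-colour / 16 KiB aliasing period; real arches derive this from SHMLBA), and these are userspace stand-ins, not the proposed dax_map_pfn():

```c
#include <assert.h>
#include <stdint.h>

/* Toy illustration of the cache-aliasing problem that dax_map_pfn()'s
 * 'index' argument is meant to address on aliasing (VIVT/VIPT) caches:
 * the kernel alias of a page must land on the same cache colour as the
 * userspace mapping, or the two views can go incoherent.
 */
#define PAGE_SHIFT   12
#define COLOUR_MASK  0x3000UL   /* bits 12-13: 4 hypothetical page colours */

/* Which of the 4 colours a virtual address falls on. */
static unsigned int page_colour(uintptr_t addr)
{
    return (unsigned int)((addr & COLOUR_MASK) >> PAGE_SHIFT);
}

/* Pick a kernel virtual address inside a colour-aligned mapping window
 * so that it shares a colour with the given user address.  kwindow is
 * assumed aligned to the 16 KiB aliasing period. */
static uintptr_t kaddr_matching_colour(uintptr_t kwindow, uintptr_t uaddr)
{
    return kwindow + (uaddr & COLOUR_MASK);
}
```

The point is only that knowing where the pfn is mapped in userspace (hence the 'index' parameter) lets you compute a kernel alias that cannot go incoherent, instead of flushing after the fact.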


Re: [PATCH] dax: allow DAX to look up an inode's block device

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 3:41 PM, Dan Williams  wrote:
> On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert  wrote:
>> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro  wrote:
>>>
>>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>>>
>>> > However, for raw block devices and for XFS with a real-time device, the
>>> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
>>> > currently written, an fsync or msync to a DAX enabled raw block device 
>>> > will
>>> > cause a NULL pointer dereference kernel BUG.  For this to work correctly 
>>> > we
>>> > need to ask the block device or filesystem what struct block_device is
>>> > appropriate for our inode.
>>> >
>>> > To that end, add a get_bdev(struct inode *) entry point to struct
>>> > super_operations.  If this function pointer is non-NULL, this notifies DAX
>>> > that it needs to use it to look up the correct block_device.  If
>>> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>>>
>>> Umm...  It assumes that bdev will stay pinned for as long as inode is
>>> referenced, presumably?  If so, that needs to be documented (and verified
>>> for existing fs instances).  In principle, multi-disk fs might want to
>>> support things like "silently move the inodes backed by that disk to other
>>> ones"...
>>
>> Dan, this is exactly the kind of thing I'm talking about WRT the
>> weirder device models and directly calling bdev_direct_access().
>> Filesystems don't have the monogamous relationship with a device that
>> is implicitly assumed in DAX; you have to ask the filesystem what the
>> relationship is and how it is migrating, and allow the filesystem to
>> update DAX when the relationship is changing.
>
> That's precisely what ->get_bdev() does.  When the
> inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().
>
>> As we start to see many
>> DIMMs and 10s-of-TiB pmem systems this is going to be an even bigger
>> deal, as load balancing, wear leveling, and fault tolerance concerns
>> are inevitably driven by the filesystem.
>
> No, there are no plans on the horizon for an fs to manage these media
> specific concerns for persistent memory.

So the filesystem is now directly in charge of mapping user pages to
physical memory.  The filesystem is effectively bypassing NUMA and
zones and all that stuff that tries to balance memory bus and QPI
traffic etc.  You don't think the filesystem will therefore be in
charge of memory bus hotspots?

Alright.  We can just agree to disagree on that point.
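For what it's worth, the lookup precedence Ross's patch describes (use ->get_bdev() when the filesystem provides it, otherwise fall back to inode->i_sb->s_bdev) is simple enough to sketch.  This is a userspace toy with stubbed-out structures, not kernel code; only the get_bdev hook and the s_bdev fallback come from the patch description:

```c
#include <assert.h>
#include <stddef.h>

struct inode;  /* forward declaration */

/* Minimal stand-ins for the kernel structures involved. */
struct block_device { int id; };

struct super_operations {
    /* Proposed entry point: ask the fs which bdev backs this inode. */
    struct block_device *(*get_bdev)(struct inode *inode);
};

struct super_block {
    const struct super_operations *s_op;
    struct block_device *s_bdev;   /* default, single-device case */
};

struct inode { struct super_block *i_sb; };

/* DAX-side helper: prefer the fs callback, fall back to sb->s_bdev. */
static struct block_device *dax_get_bdev(struct inode *inode)
{
    if (inode->i_sb->s_op && inode->i_sb->s_op->get_bdev)
        return inode->i_sb->s_op->get_bdev(inode);
    return inode->i_sb->s_bdev;
}
```

An XFS real-time inode or a raw block device would supply the callback; everyone else keeps the s_bdev default and pays nothing.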


Re: [PATCH] dax: allow DAX to look up an inode's block device

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 3:19 PM, Al Viro  wrote:
>
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
> > However, for raw block devices and for XFS with a real-time device, the
> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
> > currently written, an fsync or msync to a DAX enabled raw block device will
> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> > need to ask the block device or filesystem what struct block_device is
> > appropriate for our inode.
> >
> > To that end, add a get_bdev(struct inode *) entry point to struct
> > super_operations.  If this function pointer is non-NULL, this notifies DAX
> > that it needs to use it to look up the correct block_device.  If
> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm...  It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably?  If so, that needs to be documented (and verified
> for existing fs instances).  In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

Dan, this is exactly the kind of thing I'm talking about WRT the
weirder device models and directly calling bdev_direct_access().
Filesystems don't have the monogamous relationship with a device that
is implicitly assumed in DAX; you have to ask the filesystem what the
relationship is and how it is migrating, and allow the filesystem to
update DAX when the relationship is changing.  As we start to see many
DIMMs and 10s-of-TiB pmem systems this is going to be an even bigger
deal, as load balancing, wear leveling, and fault tolerance concerns
are inevitably driven by the filesystem.


Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams  wrote:
> On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert  wrote:
> [..]
>> Well... CONFIG_BLOCK was not required by filemap_xip.c for a decade,
>> so this CONFIG_BLOCK dependency is a result of an incremental
>> feature, from a certain point of view ;)
>>
>> The obvious 'driver' is physical RAM without a particular driver.
>> Remember please I'm talking about embedded.  RAM measured in MiB and
>> funky one off hardware etc.  In the embedded world there are lots of
>> ways that persistent memory has been supported in device specific ways
>> without the new fancypants NFIT and Intel instructions, so frankly
>> they don't fit in the PMEM stuff.  Maybe they could be supported in
>> PMEM but not without effort to bring embedded players to the table.
>
> Not sure what you're trying to say here.  An ACPI NFIT only feeds the
> generic libnvdimm device model.  You don't need NFIT to get pmem.

Right... I'm just not seeing how the libnvdimm device model fits, is
relevant, or is useful to a persistent SRAM in embedded.  Therefore I
don't see how some of these users will have a driver.

>> The other drivers are the MTD drivers, probably as read-only for now.
>> But the paradigm there isn't so different from what PMEM looks like
>> with asymmetric read/write capabilities.
>>
>> The filesystem I'm concerned with is AXFS
>> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf),
>> which I've been planning to try to merge again due to a recent
>> resurgence of interest.  The device model for AXFS is... weird.  It
>> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> block, and unmanaged physical memory.  It's a terribly useful model
>> for embedded.  Anyway AXFS is read-only, so hacking in a read-only
>> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> enough.  But... it would be cool if similar small embedded-focused RW
>> filesystems were enabled.
>
> Are those also out of tree?

Of course.  Merging embedded filesystems is a little like merging
regular filesystems, except 98% of the reviewers don't want it merged.

>> I don't expect you to taint DAX with design requirements for this
>> stuff that it wasn't built for, nobody ends up happy in that case.
>> However, if enabling the filesystem to manage the bdev_direct_access()
>> interactions solves some of the "alternate device" problems you are
>> discussing here, then there is a chance we can accommodate both.
>> Sometimes that works.
>>
>> So... forget CONFIG_BLOCK=n entirely; I didn't want that to be the
>> focus anyway.  Does it help to support the weirder XFS and btrfs
>> device models to enable the filesystem to handle the
>> bdev_direct_access() stuff?
>
> It's not clear that it does.  We just clarified with xfs and ext4 that
> we can rely on get_blocks().  That solves the immediate concern with
> multi-device filesystems.

IMO you're making DAX more complex by overly coupling to the bdev and
I think it could bite you later.  I submit this rework of the radix
tree and confusion about where to get the real bdev as evidence.  I'm
guessing that it won't be the last time.  It's unnecessary to couple
it like this, and in fact is not how the vfs has been layered in the
past.

The trouble with vfs work has been that it straddles the line between
mm and block; unfortunately that line is a dark chasm with ill-defined
boundaries.  DAX is even more exciting because it's trying to duct-tape
the filesystem even closer to the mm system; one could argue it's
actually in some respects enabling the filesystem to bypass the mm
code.  On top of that, DAX is designed to enable block-based
filesystems to use RAM-like devices.

Bolting the block device interface on to NVDIMM is a brilliant hack
and the right design choice, but it's still a hack.  The upside is it
enables the reuse of all this glorious legacy filesystem code which
does a pretty amazing job of handling what the pmem device
applications need considering they were designed to manage data on
platters of slow spinning rust.  What would DAX look like if it had
been developed with a filesystem purpose-built for pmem?

To look at the downside, consider dax_fault().  It's called on a
fault to a user memory map, and uses the filesystem's get_block() to
look up a sector so you can ask a block device to convert it to an
address on a DIMM.  Come on, that's awkward.  Everything around
dax_fault() is dripping with memory semantic interfaces, the
dax_fault() calls are fundamentally about memory, the pmem calls are
memory, the hardware is memory, and yet it directly calls
bdev_direct_access().  It's out of place.
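The round trip being complained about can be made concrete with a toy model.  The block size, sector math, and base PFN below are invented, and these userspace functions only mimic the shape of the get_block()/bdev_direct_access() hand-off, not the real interfaces:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of the dax_fault() translation chain:
 * file offset -> get_block() -> sector -> bdev_direct_access() -> pfn.
 * 4 KiB fs blocks, 512-byte sectors, a linear file layout, and an
 * arbitrary base PFN for where the DIMM happens to sit. */
#define BLOCK_SHIFT   12
#define SECTOR_SHIFT  9
#define PMEM_BASE_PFN 0x100000UL

/* The filesystem maps a file offset to a device sector... */
static uint64_t get_block_sector(uint64_t file_offset)
{
    uint64_t blk = file_offset >> BLOCK_SHIFT;      /* logical block */
    return blk << (BLOCK_SHIFT - SECTOR_SHIFT);     /* sectors per block */
}

/* ...and the block layer turns the sector back into a memory address. */
static uint64_t bdev_direct_access_pfn(uint64_t sector)
{
    return PMEM_BASE_PFN + (sector >> (BLOCK_SHIFT - SECTOR_SHIFT));
}

/* What dax_fault() effectively does: memory -> sector -> memory. */
static uint64_t fault_to_pfn(uint64_t file_offset)
{
    return bdev_direct_access_pfn(get_block_sector(file_offset));
}
```

The sector in the middle carries no information the fault path actually wants; it exists only because the bdev interface sits between two memory addresses.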

The legacy vfs/mm code didn't have this layering problem either.  Even
filemap_fault(), which dax_fault() is modeled after, doesn't call any
bdev methods directly; when it needs something it asks the filesystem
with a ->readpage().  The precedent is that you ask the filesystem for
what you need.  Look at the get_bdev() thing you've concluded you need.
It _almost_ makes my point.  I just happen to be of the opinion that
you don't actually want or need the bdev, you want the pfn/kaddr so you
can flush or map or memcpy().

Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Mon, Feb 1, 2016 at 10:46 PM, Dan Williams  wrote:
> On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert  wrote:
>> On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner  wrote:
>>> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>>>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>>>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>>> > > I guess I need to go off and understand if we can have DAX mappings on 
>>>> > > such a
>>>> > > device.  If we can, we may have a problem - we can get the 
>>>> > > block_device from
>>>> > > get_block() in I/O path and the various fault paths, but we don't have 
>>>> > > access
>>>> > > to get_block() when flushing via dax_writeback_mapping_range().  We 
>>>> > > avoid
>>>> > > needing it the normal case by storing the sector results from 
>>>> > > get_block() in
>>>> > > the radix tree.
>>>> >
>>>> > I think we're doing it wrong by storing the sector in the radix tree; 
>>>> > we'd
>>>> > really need to store both the sector and the bdev which is too much data.
>>>> >
>>>> > If we store the PFN of the underlying page instead, we don't have this
>>>> > problem.  Instead, we have a different problem; of the device going
>>>> > away under us.  I'm trying to find the code which tears down PTEs when
>>>> > the device goes away, and I'm not seeing it.  What do we do about user
>>>> > mappings of the device?
>>>>
>>>> So I don't have a strong opinion whether storing PFN or sector is better.
>>>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special
>>>> cases like inodes on XFS RT devices would be IMHO fine.
>>>
>>> We need to support alternate devices.
>>
>> Embedded devices trying to use NOR Flash to free up RAM was
>> historically one of the more prevalent real world uses of the old
>> filemap_xip.c code although the users never made it to mainline.  So I
>> spent some time last week trying to figure out how to make a subset of
>> DAX not depend on CONFIG_BLOCK.  It was a very frustrating and
>> unfruitful experience.  I discarded my main conclusion as impractical,
>> but now that I see the difficulty DAX faces in dealing with
>> "alternate devices" especially some of the crazy stuff btrfs can do, I
>> wonder if it's not so crazy after all.
>>
>> Let's stop calling bdev_direct_access() directly from DAX.  Let the
>> filesystems do it.
>>
>> Sure we could enable generic_dax_direct_access() helper for the
>> filesystems that only support single devices to make it easy.  But XFS
>> and btrfs for example, have to do the work of figuring out what bdev
>> is required and then calling bdev_direct_access().
>>
>> My reasoning is that the filesystem knows how to map inodes and
>> offsets to devices and sectors, no matter how complex that is.  It
>> would even enable a filesystem to intelligently use a mix of
>> direct_access and regular block devices down the road.  Of course it
>> would also make the block-less solution doable.
>>
>> Good idea?  Stupid idea?
>
> The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems
> the CONFIG_BLOCK=n is an incremental feature in its own right.  What
> driver and what filesystem are looking to enable this XIP support in?

Well... CONFIG_BLOCK was not required by filemap_xip.c for a decade,
so this CONFIG_BLOCK dependency is a result of an incremental feature,
from a certain point of view ;)

The obvious 'driver' is physical RAM without a particular driver.
Remember please I'm talking about embedded.  RAM measured in MiB and
funky one off hardware etc.  In the embedded world there are lots of
ways that persistent memory has been supported in device specific ways
without the new fancypants NFIT and Intel instructions, so frankly
they don't fit in the PMEM stuff.  Maybe they could be supported in
PMEM but not without effort to bring embedded players to the table.

The other drivers are the MTD drivers, probably as read-only for now.
But the paradigm there isn't so different from what PMEM looks like
with asymmetric read/write capabilities.

The filesystem I'm concerned with is AXFS
(https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf),
which I've been planning to try to merge again due to a recent
resurgence of interest.  The device model for AXFS is... weird.  It
can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
block, and unmanaged physical memory.  It's a terribly useful model
for embedded.  Anyway AXFS is read-only, so hacking in a read-only
dax_fault_nodev() and dax_file_read() would work fine, looks easy
enough.  But... it would be cool if similar small embedded-focused RW
filesystems were enabled.
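As a rough sketch of what "let the filesystems do it" might look like: DAX asks the filesystem for a pfn/kaddr directly, and a simple one-device fs points its hook at a generic helper.  Everything here is hypothetical, including the hook name `direct_access` and the `generic_dax_direct_access()` helper floated earlier in the thread; the linear mapping stands in for whatever a real fs (or bdev_direct_access()) would do:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

struct inode;

/* Hypothetical per-fs hook: map (inode, file offset) straight to memory,
 * however complex the fs's internal device model is. */
struct dax_fs_ops {
    int (*direct_access)(struct inode *inode, uint64_t off,
                         void **kaddr, uint64_t *pfn);
};

struct inode {
    const struct dax_fs_ops *fs_ops;
    /* For the single-device toy fs below: a flat backing window. */
    uint8_t *base;
    uint64_t base_pfn;
};

/* The "generic" helper a simple one-device fs could reuse: a linear
 * offset -> address mapping, standing in for what
 * generic_dax_direct_access() might do via bdev_direct_access(). */
static int generic_dax_direct_access(struct inode *inode, uint64_t off,
                                     void **kaddr, uint64_t *pfn)
{
    *kaddr = inode->base + off;
    *pfn = inode->base_pfn + (off >> 12);   /* 4 KiB pages */
    return 0;
}

/* DAX side: no bdev in sight, just ask the filesystem. */
static int dax_fault_lookup(struct inode *inode, uint64_t off,
                            void **kaddr, uint64_t *pfn)
{
    return inode->fs_ops->direct_access(inode, off, kaddr, pfn);
}
```

A two-device fs like AXFS, or XFS with a real-time device, would supply its own hook instead of the generic one, and DAX never needs to learn which device it landed on.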

Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert <jare...@gmail.com> wrote:
> [..]
>> Well... as CONFIG_BLOCK was not required with filemap_xip.c for a
>> decade.  This CONFIG_BLOCK dependency is a result of an incremental
>> feature from a certain point of view ;)
>>
>> The obvious 'driver' is physical RAM without a particular driver.
>> Remember please I'm talking about embedded.  RAM measured in MiB and
>> funky one off hardware etc.  In the embedded world there are lots of
>> ways that persistent memory has been supported in device specific ways
>> without the new fancypants NFIT and Intel instructions,so frankly
>> they don't fit in the PMEM stuff.  Maybe they could be supported in
>> PMEM but not without effort to bring embedded players to the table.
>
> Not sure what you're trying to say here.  An ACPI NFIT only feeds the
> generic libnvdimm device model.  You don't need NFIT to get pmem.

Right... I'm just not seeing how the libnvdimm device model fits, is
relevant, or useful to a persistent SRAM in embedded.  Therefore I
don't see some of the user will have a driver.

>> The other drivers are the MTD drivers, probably as read-only for now.
>> But the paradigm there isn't so different from what PMEM looks like
>> with asymmetric read/write capabilities.
>>
>> The filesystem I'm concerned with is AXFS
>> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf).
>> Which I've been planning on trying to merge again due to a recent
>> resurgence of interest.  The device model for AXFS is... weird.  It
>> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> block, and unmanaged physical memory.  It's a terribly useful model
>> for embedded.  Anyway AXFS is readonly so hacking in a read only
>> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> enough.  But... it would be cool if similar small embedded focused RW
>> filesystems were enabled.
>
> Are those also out of tree?

Of course.  Merging embedded filesystems is little merging regular
filesystems except 98% of you reviewers don't want it merged.

>> I don't expect you to taint DAX with design requirements for this
>> stuff that it wasn't built for, nobody ends up happy in that case.
>> However, if enabling the filesystem to manage the bdev_direct_access()
>> interactions solves some of the "alternate device" problems you are
>> discussing here, then there is a chance we can accommodate both.
>> Sometimes that works.
>>
>> So... Forget CONFIG_BLOCK=n entirely I didn't want that to be the
>> focus anyway.  Does it help to support the weirder XFS and btrfs
>> device models to enable the filesystem to handle the
>> bdev_direct_access() stuff?
>
> It's not clear that it does.  We just clarified with xfs and ext4 that
> we can really on get_blocks().  That solves the immediate concern with
> multi-device filesystems.

IMO you're making DAX more complex by overly coupling to the bdev and
I think it could bite you later.  I submit this rework of the radix
tree and confusion about where to get the real bdev as evidence.  I'm
guessing that it won't be the last time.  It's unnecessary to couple
it like this, and in fact is not how the vfs has been layered in the
past.

The trouble with vfs work has been that it straddles the line between
mm and block, unfortunately that line is dark chasm with ill defined
boundaries.  DAX is even more exciting because it's trying to duct
tape the filesystem even closer to the mm system, one could argue it's
actually in some respects enabling the filesystem to bypass the mm
code.  On top of that DAX is designed to enable block based
filesystems to use RAM like devices.

Bolting the block device interface on to NVDIMM is a brilliant hack
and the right design choice, but it's still a hack.  The upside is it
enables the reuse of all this glorious legacy filesystem code which
does a pretty amazing job of handling what the pmem device
applications need considering they were designed to manage data on
platters of slow spinning rust.  How would DAX look like developed
with a filesystem purpose built for pmem?

To look at the the downside consider dax_fault().  Its called on a
fault to a user memory map, uses the filesystems get_block() to lookup
a sector so you can ask a block device to convert it to an address on
a DIMM.  Come on, that's awkward.  Everything around dax_fault() is
dripping with memory semantic interfaces, the dax_fault() call are
fundamentally about memory, the pmem calls are memory, the hardware is
memory, and yet it directly calls bdev_direct_access().  It's out of
place.

The legacy vfs/mm code didn't have this layering prob

Re: [PATCH] dax: allow DAX to look up an inode's block device

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 3:19 PM, Al Viro  wrote:
>
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>
> > However, for raw block devices and for XFS with a real-time device, the
> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
> > currently written, an fsync or msync to a DAX enabled raw block device will
> > cause a NULL pointer dereference kernel BUG.  For this to work correctly we
> > need to ask the block device or filesystem what struct block_device is
> > appropriate for our inode.
> >
> > To that end, add a get_bdev(struct inode *) entry point to struct
> > super_operations.  If this function pointer is non-NULL, this notifies DAX
> > that it needs to use it to look up the correct block_device.  If
> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm...  It assumes that bdev will stay pinned for as long as inode is
> referenced, presumably?  If so, that needs to be documented (and verified
> for existing fs instances).  In principle, multi-disk fs might want to
> support things like "silently move the inodes backed by that disk to other
> ones"...

Dan, This is exactly the kind of thing I'm taking about WRT the
weirder device models and directly calling bdev_direct_access().
Filesystems don't have the monogamous relationship with a device that
is implicitly assumed in DAX, you have to ask the filesystem what the
relationship is and is migrating to, and allow the filesystem to
update DAX when the relationship is changing.  As we start to see many
DIMM's and 10s TiB pmem systems this is going be an even bigger deal
as load balancing, wear leveling, and fault tolerance concerned are
inevitably driven by the filesystem.


Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox <wi...@linux.intel.com> wrote:
> On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
>> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams <dan.j.willi...@intel.com> 
>> wrote:
>> >> The filesystem I'm concerned with is AXFS
>> >> (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf).
>> >> Which I've been planning on trying to merge again due to a recent
>> >> resurgence of interest.  The device model for AXFS is... weird.  It
>> >> can use one or two devices at a time of any mix of NOR MTD, NAND MTD,
>> >> block, and unmanaged physical memory.  It's a terribly useful model
>> >> for embedded.  Anyway AXFS is readonly so hacking in a read only
>> >> dax_fault_nodev() and dax_file_read() would work fine, looks easy
>> >> enough.  But... it would be cool if similar small embedded focused RW
>> >> filesystems were enabled.
>> >
>> > Are those also out of tree?
>>
>> Of course.  Merging embedded filesystems is little merging regular
>> filesystems except 98% of you reviewers don't want it merged.
>
> You should at least be able to get it into staging these days.  I mean,
> look at some of the junk that's in staging ... and I don't think AXFS was
> nearly as bad.

Thanks? ;)

>> IMO you're making DAX more complex by overly coupling to the bdev and
>> I think it could bite you later.  I submit this rework of the radix
>> tree and confusion about where to get the real bdev as evidence.  I'm
>> guessing that it won't be the last time.  It's unnecessary to couple
>> it like this, and in fact is not how the vfs has been layered in the
>> past.
>
> Huh?  The rework to use the radix tree for PFNs was done with one eye
> firmly on your usage case.  Just because I had to thread the get_block
> interface through it for the moment doesn't mean that I didn't have
> the "how do we get rid of get_block entirely" question on my mind.

Oh yeah.  I think we're on the same page.  But I'm not sure Dan is.  I
get the need to phase this in too.

> Using get_block seemed like the right idea three years ago.  I didn't
> know just how fundamentally ext4 and XFS disagree on how it should be
> used.

Sure.  I can see that.

>> To look at the the downside consider dax_fault().  Its called on a
>> fault to a user memory map, uses the filesystems get_block() to lookup
>> a sector so you can ask a block device to convert it to an address on
>> a DIMM.  Come on, that's awkward.  Everything around dax_fault() is
>> dripping with memory semantic interfaces, the dax_fault() call are
>> fundamentally about memory, the pmem calls are memory, the hardware is
>> memory, and yet it directly calls bdev_direct_access().  It's out of
>> place.
>
> What was out of place was the old 'get_xip_mem' in address_space
> operations.  Returning a kernel virtual address and a PFN from a
> filesystem operation?  That looks awful.

Yes.  Yes it does!  But at least my big hack was just one line. ;)
Nobody really even seemed to notice at the time.

>  All the other operations deal
> in struct pages, file offsets and occasionally sectors.  Of course, we
> don't have a struct page, so a pfn makes sense, but the kernel virtual
> address being returned was a gargantuan layering problem.

Well yes, but it was an expedient hack.

>> The legacy vfs/mm code didn't have this layering problem either.  Even
>> filemap_fault() that dax_fault() is modeled after doesn't call any
>> bdev methods directly, when it needs something it asks the filesystem
>> with a ->readpage().  The precedence is that you ask the filesystem
>> for what you need.  Look at the get_bdev() thing you've concluded you
>> need.  It _almost_ makes my point.  I just happen to be of the opinion
>> that you don't actually want or need the bdev, you want the pfn/kaddr
>> so you can flush or map or memcpy().
>
> You want the pfn.  The device driver doesn't have enough information to
> give you a (coherent with userspace) kaddr.  That's what (some future
> arch-specific implementation of) dax_map_pfn() is for.  That's why it
> takes 'index' as a parameter, so you can calculate where it'll be mapped
> in userspace, and determine an appropriate kernel virtual address to
> use for it.

Oh I think I'm just beginning to catch your vision for
dax_map_pfn().  I still don't get why we can't just do semi-arch
specific flushing instead of the alignment thing.  But that just might
be epic ignorance on my part.  Either way flush or magic alignments
dax_(un)map_pfn() would handle it, right?


Re: [PATCH] dax: allow DAX to look up an inode's block device

2016-02-02 Thread Jared Hulbert
On Tue, Feb 2, 2016 at 3:41 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert <jare...@gmail.com> wrote:
>> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro <v...@zeniv.linux.org.uk> wrote:
>>>
>>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>>>
>>> > However, for raw block devices and for XFS with a real-time device, the
>>> > value in inode->i_sb->s_bdev is not correct.  With the code as it is
>>> > currently written, an fsync or msync to a DAX enabled raw block device 
>>> > will
>>> > cause a NULL pointer dereference kernel BUG.  For this to work correctly 
>>> > we
>>> > need to ask the block device or filesystem what struct block_device is
>>> > appropriate for our inode.
>>> >
>>> > To that end, add a get_bdev(struct inode *) entry point to struct
>>> > super_operations.  If this function pointer is non-NULL, this notifies DAX
>>> > that it needs to use it to look up the correct block_device.  If
>>> > i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>>>
>>> Umm...  It assumes that bdev will stay pinned for as long as inode is
>>> referenced, presumably?  If so, that needs to be documented (and verified
>>> for existing fs instances).  In principle, multi-disk fs might want to
>>> support things like "silently move the inodes backed by that disk to other
>>> ones"...
>>
>> Dan, This is exactly the kind of thing I'm taking about WRT the
>> weirder device models and directly calling bdev_direct_access().
>> Filesystems don't have the monogamous relationship with a device that
>> is implicitly assumed in DAX, you have to ask the filesystem what the
>> relationship is and is migrating to, and allow the filesystem to
>> update DAX when the relationship is changing.
>
> That's precisely what ->get_bdev() does.  When the answer
> inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().
>
>> As we start to see many
>> DIMM's and 10s TiB pmem systems this is going be an even bigger deal
>> as load balancing, wear leveling, and fault tolerance concerned are
>> inevitably driven by the filesystem.
>
> No, there are no plans on the horizon for an fs to manage these media
> specific concerns for persistent memory.

So the filesystem is now directly in charge of mapping user pages to
physical memory.  The filesystem is effectively bypassing NUMA and
zones and all that stuff that tries to balance memory bus and QPI
traffic etc.  You don't think the filesystem will therefore be in
charge of memory bus hotspots?

Alright.  We can just agree to disagree on that point.


Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-02 Thread Jared Hulbert
On Mon, Feb 1, 2016 at 10:46 PM, Dan Williams <dan.j.willi...@intel.com> wrote:
> On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert <jare...@gmail.com> wrote:
>> On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner <da...@fromorbit.com> wrote:
>>> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>>>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>>>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>>> > > I guess I need to go off and understand if we can have DAX mappings
>>>> > > on such a device.  If we can, we may have a problem - we can get the
>>>> > > block_device from get_block() in the I/O path and the various fault
>>>> > > paths, but we don't have access to get_block() when flushing via
>>>> > > dax_writeback_mapping_range().  We avoid needing it in the normal
>>>> > > case by storing the sector results from get_block() in the radix
>>>> > > tree.
>>>> >
>>>> > I think we're doing it wrong by storing the sector in the radix tree; 
>>>> > we'd
>>>> > really need to store both the sector and the bdev which is too much data.
>>>> >
>>>> > If we store the PFN of the underlying page instead, we don't have this
>>>> > problem.  Instead, we have a different problem; of the device going
>>>> > away under us.  I'm trying to find the code which tears down PTEs when
>>>> > the device goes away, and I'm not seeing it.  What do we do about user
>>>> > mappings of the device?
>>>>
>>>> So I don't have a strong opinion whether storing PFN or sector is better.
>>>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special
>>>> cases like inodes on XFS RT devices would be IMHO fine.
>>>
>>> We need to support alternate devices.
>>
>> Embedded devices trying to use NOR Flash to free up RAM was
>> historically one of the more prevalent real world uses of the old
>> filemap_xip.c code although the users never made it to mainline.  So I
>> spent some time last week trying to figure out how to make a subset of
>> DAX not depend on CONFIG_BLOCK.  It was a very frustrating and
>> unfruitful experience.  I discarded my main conclusion as impractical,
>> but now that I see the difficulty DAX faces in dealing with
>> "alternate devices", especially some of the crazy stuff btrfs can do, I
>> wonder if it's not so crazy after all.
>>
>> Let's stop calling bdev_direct_access() directly from DAX.  Let the
>> filesystems do it.
>>
>> Sure, we could add a generic_dax_direct_access() helper for the
>> filesystems that only support single devices, to make it easy.  But XFS
>> and btrfs, for example, would have to do the work of figuring out what
>> bdev is required and then calling bdev_direct_access().
>>
>> My reasoning is that the filesystem knows how to map inodes and
>> offsets to devices and sectors, no matter how complex that is.  It
>> would even enable a filesystem to intelligently use a mix of
>> direct_access and regular block devices down the road.  Of course it
>> would also make the block-less solution doable.
>>
>> Good idea?  Stupid idea?
>
> The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems
> the CONFIG_BLOCK=n is an incremental feature in its own right.  What
> driver and what filesystem are looking to enable this XIP support in?

Well... CONFIG_BLOCK was not required with filemap_xip.c for a decade,
so from a certain point of view this CONFIG_BLOCK dependency is itself
the result of an incremental feature ;)

The obvious 'driver' is physical RAM without a particular driver.
Please remember I'm talking about embedded: RAM measured in MiB and
funky one-off hardware, etc.  In the embedded world there are lots of
ways that persistent memory has been supported in device-specific ways
without the new fancypants NFIT and Intel instructions, so frankly
they don't fit in the PMEM stuff.  Maybe they could be supported in
PMEM, but not without effort to bring embedded players to the table.

The other drivers are the MTD drivers, probably as read-only for now.
But the paradigm there isn't so different from what PMEM looks like
with asymmetric read/write capabilities.

The filesystem I'm concerned with is AXFS
(https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf),
which I've been planning on trying to merge again due to a recent
resurgence of interest.  The device model for AXFS is... weird.  I

Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-02-01 Thread Jared Hulbert
On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner  wrote:
> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>> > > I guess I need to go off and understand if we can have DAX mappings
>> > > on such a device.  If we can, we may have a problem - we can get the
>> > > block_device from get_block() in the I/O path and the various fault
>> > > paths, but we don't have access to get_block() when flushing via
>> > > dax_writeback_mapping_range().  We avoid needing it in the normal
>> > > case by storing the sector results from get_block() in the radix
>> > > tree.
>> >
>> > I think we're doing it wrong by storing the sector in the radix tree; we'd
>> > really need to store both the sector and the bdev which is too much data.
>> >
>> > If we store the PFN of the underlying page instead, we don't have this
>> > problem.  Instead, we have a different problem; of the device going
>> > away under us.  I'm trying to find the code which tears down PTEs when
>> > the device goes away, and I'm not seeing it.  What do we do about user
>> > mappings of the device?
>>
>> So I don't have a strong opinion whether storing PFN or sector is better.
>> Maybe PFN is somewhat more generic but OTOH turning DAX off for special
>> cases like inodes on XFS RT devices would be IMHO fine.
>
> We need to support alternate devices.

Embedded devices trying to use NOR Flash to free up RAM was
historically one of the more prevalent real world uses of the old
filemap_xip.c code although the users never made it to mainline.  So I
spent some time last week trying to figure out how to make a subset of
DAX not depend on CONFIG_BLOCK.  It was a very frustrating and
unfruitful experience.  I discarded my main conclusion as impractical,
but now that I see the difficulty DAX faces in dealing with
"alternate devices", especially some of the crazy stuff btrfs can do, I
wonder if it's not so crazy after all.

Let's stop calling bdev_direct_access() directly from DAX.  Let the
filesystems do it.

Sure, we could add a generic_dax_direct_access() helper for the
filesystems that only support single devices, to make it easy.  But XFS
and btrfs, for example, would have to do the work of figuring out what
bdev is required and then calling bdev_direct_access().

My reasoning is that the filesystem knows how to map inodes and
offsets to devices and sectors, no matter how complex that is.  It
would even enable a filesystem to intelligently use a mix of
direct_access and regular block devices down the road.  Of course it
would also make the block-less solution doable.

Good idea?  Stupid idea?


Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences

2016-01-29 Thread Jared Hulbert
On Fri, Jan 29, 2016 at 10:01 PM, Dan Williams  wrote:
> On Fri, Jan 29, 2016 at 9:28 PM, Matthew Wilcox  wrote:
>> On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>> I guess I need to go off and understand if we can have DAX mappings on
>>> such a device.  If we can, we may have a problem - we can get the
>>> block_device from get_block() in the I/O path and the various fault
>>> paths, but we don't have access to get_block() when flushing via
>>> dax_writeback_mapping_range().  We avoid needing it in the normal case
>>> by storing the sector results from get_block() in the radix tree.
>>
>> I think we're doing it wrong by storing the sector in the radix tree; we'd
>> really need to store both the sector and the bdev which is too much data.
>>
>> If we store the PFN of the underlying page instead, we don't have this
>> problem.  Instead, we have a different problem; of the device going
>> away under us.  I'm trying to find the code which tears down PTEs when
>> the device goes away, and I'm not seeing it.  What do we do about user
>> mappings of the device?
>>
>
> I deferred the dax tear down code until next cycle as Al rightly
> pointed out some needed re-works:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/003995.html

If you store sectors in the radix and the device gets removed, you
still have to unmap user mappings of PFNs.

So why is device removal harder with the PFN vs bdev+sector radix
entry?  Either way you need a list of PFNs and their corresponding
PTEs, right?

And are we just talking graceful removal?  Any plans for device failures?


Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

2016-01-27 Thread Jared Hulbert
On Mon, Jan 25, 2016 at 1:18 PM, Jared Hulbert  wrote:
> On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox  wrote:
>> On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
>>> In our defense we didn't know we were sinning at the time.
>>
>> Fair enough.  Cache flushing is Hard.
>>
>>> Can you walk me through the cache flushing hole?  How is it okay on
>>> X86 but not VIVT archs?  I'm missing something obvious here.
>>>
>>> I thought earlier that vm_insert_mixed() handled the necessary
>>> flushing.  Is that even the part you are worried about?
>>
>> No, that part should be fine.  My concern is about write() calls to files
>> which are also mmaped.  See Documentation/cachetlb.txt around line 229,
>> starting with "There exists another whole class of cpu cache issues" ...
>
> oh wow.  So aren't all the copy_to/from_user() variants specifically
> supposed to handle such cases?
>
>>> What flushing functions would you call if you did have a cache page.
>>
>> Well, that's the problem; they don't currently exist.
>>
>>> There are all kinds of cache flushing functions that work without a
>>> struct page. If nothing else the specialized ASM instructions that do
>>> the various flushes don't use struct page as a parameter.  This isn't
>>> the first I've run into the lack of a sane cache API.  Grep for
>>> inval_cache in the mtd drivers, should have been much easier.  Isn't
>>> the proper solution to fix update_mmu_cache() or build out a pageless
>>> cache flushing API?
>>>
>>> I don't get the explicit mapping solution.  What are you mapping
>>> where?  What addresses would be SHMLBA?  Phys, kernel, userspace?
>>
>> The problem comes in dax_io() where the kernel stores to an alias of the
>> user address (or reads from an alias of the user address).  Theoretically,
>> we should flush user addresses before we read from the kernel's alias,
>> and flush the kernel's alias after we store to it.
>
> Reasoning this out loud here.  Please correct.
>
> For the dax read case:
> - kernel virt is mapped to pfn
> - data is memcpy'd from kernel virt
>
> For the dax write case:
> - kernel virt is mapped to pfn
> - data is memcpy'd to kernel virt
> - user virt map to pfn attempts to read
>
> Is that right?  I see the x86 does a nocache copy_to/from operation,
> I'm not familiar with the semantics of that call and it would take me
> a while to understand the assembly but I assume it's doing some magic
> opcodes that forces the writes down to physical memory with each
> load/store.  Does the caching model of the x86 arch update the
> cache entries tied to the physical memory on update?
>
> For architectures that don't do auto coherency magic...
>
> For reads:
> - User dcaches need flushing before kernel virtual mapping to ensure
> kernel reads latest data.  If the user has unflushed data in the
> dcache it would not be reflected in the read copy.
> This failure mode only is a problem if the filesystem is RW.
>
> For writes:
> - Unlike the read case we don't need up-to-date data for the user's
> mapping of a pfn.  However, the user will need its caches invalidated
> to get fresh data, so we should make sure to writeback any affected
> lines in the user caches so they don't get lost if we do an
> invalidate.  I suppose uncommitted data might corrupt the new data
> written from the kernel mapping if the cachelines get flushed later.
> - After the data is memcpy'ed to the kernel virt map the cache, and
> possibly the write buffers, should be flushed.  Without this flush the
> data might not ever get to the user mapped versions.
> - Assuming the user maps were all flushed at the outset they should be
> reloaded with fresh data on access.
>
> Do I get it more or less?

I assume the silence means I don't get it.

Moving along...

The need to flush kernel aliases and user aliases without a struct page
was articulated and cited as the reason why DAX doesn't work with
ARM, MIPS, and SPARC.

One of the following routines should work for kernel flushing, right?
--  flush_cache_vmap(unsigned long start, unsigned long end)
--  flush_kernel_vmap_range(void *vaddr, int size)
--  invalidate_kernel_vmap_range(void *vaddr, int size)

For user aliases I'm less confident, but at first glance I don't see
why these wouldn't work:
-- flush_cache_page(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn)
-- flush_cache_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end)

Help?!  Am I missing something here?

>> But if we create a new address for the kernel to use which lands on the
>> same cache line 


Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

2016-01-25 Thread Jared Hulbert
On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox  wrote:
> On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
>> In our defense we didn't know we were sinning at the time.
>
> Fair enough.  Cache flushing is Hard.
>
>> Can you walk me through the cache flushing hole?  How is it okay on
>> X86 but not VIVT archs?  I'm missing something obvious here.
>>
>> I thought earlier that vm_insert_mixed() handled the necessary
>> flushing.  Is that even the part you are worried about?
>
> No, that part should be fine.  My concern is about write() calls to files
> which are also mmaped.  See Documentation/cachetlb.txt around line 229,
> starting with "There exists another whole class of cpu cache issues" ...

oh wow.  So aren't all the copy_to/from_user() variants specifically
supposed to handle such cases?

>> What flushing functions would you call if you did have a cache page.
>
> Well, that's the problem; they don't currently exist.
>
>> There are all kinds of cache flushing functions that work without a
>> struct page. If nothing else the specialized ASM instructions that do
>> the various flushes don't use struct page as a parameter.  This isn't
>> the first I've run into the lack of a sane cache API.  Grep for
>> inval_cache in the mtd drivers, should have been much easier.  Isn't
>> the proper solution to fix update_mmu_cache() or build out a pageless
>> cache flushing API?
>>
>> I don't get the explicit mapping solution.  What are you mapping
>> where?  What addresses would be SHMLBA?  Phys, kernel, userspace?
>
> The problem comes in dax_io() where the kernel stores to an alias of the
> user address (or reads from an alias of the user address).  Theoretically,
> we should flush user addresses before we read from the kernel's alias,
> and flush the kernel's alias after we store to it.

Reasoning this out loud here.  Please correct.

For the dax read case:
- kernel virt is mapped to pfn
- data is memcpy'd from kernel virt

For the dax write case:
- kernel virt is mapped to pfn
- data is memcpy'd to kernel virt
- user virt map to pfn attempts to read

Is that right?  I see the x86 does a nocache copy_to/from operation,
I'm not familiar with the semantics of that call and it would take me
a while to understand the assembly but I assume it's doing some magic
opcodes that forces the writes down to physical memory with each
load/store.  Does the caching model of the x86 arch update the
cache entries tied to the physical memory on update?

For architectures that don't do auto coherency magic...

For reads:
- User dcaches need flushing before kernel virtual mapping to ensure
kernel reads latest data.  If the user has unflushed data in the
dcache it would not be reflected in the read copy.
This failure mode only is a problem if the filesystem is RW.

For writes:
- Unlike the read case we don't need up-to-date data for the user's
mapping of a pfn.  However, the user will need its caches invalidated
to get fresh data, so we should make sure to writeback any affected
lines in the user caches so they don't get lost if we do an
invalidate.  I suppose uncommitted data might corrupt the new data
written from the kernel mapping if the cachelines get flushed later.
- After the data is memcpy'ed to the kernel virt map the cache, and
possibly the write buffers, should be flushed.  Without this flush the
data might not ever get to the user mapped versions.
- Assuming the user maps were all flushed at the outset they should be
reloaded with fresh data on access.

Do I get it more or less?

> But if we create a new address for the kernel to use which lands on the
> same cache line as the user's address (and this is what SHMLBA is used
> to indicate), there is no incoherency between the kernel's view and the
> user's view.  And no new cache flushing API is needed.

So... how exactly would one force the kernel address to be at the
SHMLBA boundary?

> Is that clearer?  I'm not always good at explaining these things in a
> way which makes sense to other people :-(

Yeah.  I think I'm at 80% comprehension here.  Or at least I think I
am.  Thanks.


Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

2016-01-24 Thread Jared Hulbert
In our defense we didn't know we were sinning at the time.

Can you walk me through the cache flushing hole?  How is it okay on
X86 but not VIVT archs?  I'm missing something obvious here.

I thought earlier that vm_insert_mixed() handled the necessary
flushing.  Is that even the part you are worried about?

vm_insert_mixed()->insert_pfn()->update_mmu_cache() _should_ handle
the flush.  Except of course now that I look at the ARM code it looks
like it isn't doing anything if !pfn_valid().  I need to spend
some more time looking at this again.

What flushing functions would you call if you did have a cache page?
There are all kinds of cache flushing functions that work without a
struct page. If nothing else the specialized ASM instructions that do
the various flushes don't use struct page as a parameter.  This isn't
the first time I've run into the lack of a sane cache API.  Grep for
inval_cache in the mtd drivers; that should have been much easier.  Isn't
the proper solution to fix update_mmu_cache() or build out a pageless
cache flushing API?

I don't get the explicit mapping solution.  What are you mapping
where?  What addresses would be SHMLBA?  Phys, kernel, userspace?


Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation

2016-01-21 Thread Jared Hulbert
Hi!  I've been out of the community for a while, but I'm trying to
step back in here and catch up with some of my old areas of specialty.
Couple questions, sorry to drag up such old conversations.

The DAX documentation that made it into kernel 4.0 has the following
line  "The DAX code does not work correctly on architectures which
have virtually mapped caches such as ARM, MIPS and SPARC."

1) It really doesn't support ARM?  I never had problems with
the old filemap_xip.c stuff on ARM, what changed?
2) Is there a thread discussing this?

On Fri, Oct 24, 2014 at 2:20 PM, Matthew Wilcox
 wrote:
> From: Matthew Wilcox 
>
> Based on the original XIP documentation, this documents the current
> state of affairs, and includes instructions on how users can enable DAX
> if their devices and kernel support it.
>
> Signed-off-by: Matthew Wilcox 
> Reviewed-by: Randy Dunlap 
> ---
>  Documentation/filesystems/00-INDEX |  5 ++-
>  Documentation/filesystems/dax.txt  | 89 
> ++
>  Documentation/filesystems/xip.txt  | 71 --
>  3 files changed, 92 insertions(+), 73 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>
> diff --git a/Documentation/filesystems/00-INDEX 
> b/Documentation/filesystems/00-INDEX
> index ac28149..9922939 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -34,6 +34,9 @@ configfs/
> - directory containing configfs documentation and example code.
>  cramfs.txt
> - info on the cram filesystem for small storage (ROMs etc).
> +dax.txt
> +   - info on avoiding the page cache for files stored on CPU-addressable
> + storage devices.
>  debugfs.txt
> - info on the debugfs filesystem.
>  devpts.txt
> @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
> - info on XFS Self Describing Metadata.
>  xfs.txt
> - info and mount options for the XFS filesystem.
> -xip.txt
> -   - info on execute-in-place for file mappings.
> diff --git a/Documentation/filesystems/dax.txt 
> b/Documentation/filesystems/dax.txt
> new file mode 100644
> index 000..635adaa
> --- /dev/null
> +++ b/Documentation/filesystems/dax.txt
> @@ -0,0 +1,89 @@
> +Direct Access for files
> +---
> +
> +Motivation
> +--
> +
> +The page cache is usually used to buffer reads and writes to files.
> +It is also used to provide the pages which are mapped into userspace
> +by a call to mmap.
> +
> +For block devices that are memory-like, the page cache pages would be
> +unnecessary copies of the original storage.  The DAX code removes the
> +extra copy by performing reads and writes directly to the storage device.
> +For file mappings, the storage device is mapped directly into userspace.
> +
> +
> +Usage
> +-
> +
> +If you have a block device which supports DAX, you can make a filesystem
> +on it as usual.  When mounting it, use the -o dax option manually
> +or add 'dax' to the options in /etc/fstab.
> +
> +
> +Implementation Tips for Block Driver Writers
> +
> +
> +To support DAX in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number
> +(expressed in units of 512-byte sectors) to a page frame number (pfn)
> +that identifies the physical page for the memory.  It also returns a
> +kernel virtual address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the
> +number of bytes being requested.  The function should return the number
> +of bytes that can be contiguously accessed at that offset.  It may also
> +return a negative errno if an error occurs.
> +
> +In order to support this method, the storage must be byte-accessible by
> +the CPU at all times.  If your device uses paging techniques to expose
> +a large amount of memory through a smaller window, then you cannot
> +implement direct_access.  Equally, if your device can occasionally
> +stall the CPU for an extended period, you should also not attempt to
> +implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
> +- dcssblk: s390 dcss block device driver
> +
> +
> +Implementation Tips for Filesystem Writers
> +--
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being DAX by setting the S_DAX flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> +- implementing an mmap file operation for DAX files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
> +  for fault and page_mkwrite (which should probably call dax_fault() and
> +  dax_mkwrite(), 


Re: [patch] ext2: xip check fix

2007-12-07 Thread Jared Hulbert
> > I think so.  The filemap_xip.c functionality doesn't work for Flash
> > memory yet.  Flash memory doesn't have struct pages to back it up with
> > which this stuff depends on.
>
> Struct page is not the major issue. The primary problem is writing to
> the media (and I am not a flash expert at all, just relaying here):
> For some period of time, the flash memory is not usable and thus we
> need to make sure we can nuke the page table entries that we have in
> userland page tables. For that, we need a callback from the device so
> that it can ask to get its references back. Oh, and a put_xip_page
> counterpart to get_xip_page, so that the driver knows when it's safe
> to erase.

Well... That's the biggest/hardest problem, yes.  But not the first.
First we've got to tackle the easy read-only case, which doesn't require
any of that unpleasantness, yet which is used in a bunch of out of
tree hacks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [patch] ext2: xip check fix

2007-12-06 Thread Jared Hulbert
> Um, trying to clarify:  S390.  Also known as zSeries, big iron machine, uses
> its own weird processor design rather than x86, x86-64, arm, or mips
> processors.

Right.  filemap_xip.c allows for an XIP filesystem.  The only
filesystem that is supported is ext2.  Even that requires a block
device driver thingy, which I don't understand, that's specific to the
s390.

> How does "struct page" enter into this?

Don't sweat it, it has to do with the way filemap_xip.c works.

> What I want to know is, are you saying execute in place doesn't work on things
> like arm and mips?  (If so, I was unaware of this.  I heard about somebody
> getting it to work on a Nintendo DS:
> http://forums.maxconsole.net/showthread.php?t=18668 )

XIP works fine on things like arm and mips.  However there is mixed
support in the mainline kernel for it.  For example, you can build an
XiP kernel image for arm since like 2.6.10 or 12.  Also MTD has an XiP
aware mode that protects XiP objects in flash from getting screwed up
during programs and erases.  But there is no mainlined solution for
XiP of applications from the filesystem.  However there have been
patches for cramfs to do this for years.  They are kind of messy and
keep getting rejected.  I do have a solution in the works for this
part of it - http://axfs.sf.net.


Re: [patch] ext2: xip check fix

2007-12-06 Thread Jared Hulbert
> > I haven't looked at it yet. I do appreciate it, I think it might
> > broaden the user-base of this feature which is up to now s390 only due
> > to the fact that the flash memory extensions have not been implemented
> > (yet?). And it enables testing xip on other platforms. The patch is on
> > my must-read list.
>
> query: which feature is currently s390 only?  (Execute In Place?)

I think so.  The filemap_xip.c functionality doesn't work for Flash
memory yet.  Flash memory doesn't have struct pages to back it up with
which this stuff depends on.


Re: solid state drive access and context switching

2007-12-05 Thread Jared Hulbert
> Probably about 1000 clocks but its always going to depend upon the
> workload and whether any other work can be done usefully.

Yeah.  Sounds right, in the microsecond range.  Be interesting to see data.

Anybody have ideas on what kind of experiments could confirm this
estimate is right?


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
On Dec 4, 2007 3:24 PM, Alan Cox <[EMAIL PROTECTED]> wrote:
> > Right.  The trend is to hide the nastiness of NAND technology changes
> > behind controllers.  In general I think this is a good thing.
>
> You miss the point - any controller you hide it behind almost inevitably
> adds enough latency you don't want to use it synchronously.

I think I get it.  We keep saying that the latency is too high.
I agree that most technologies out there have latencies that are too
high.  Again I ask the question, what latencies do we have to hit
before the sync options become worth it?


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
> > Maybe I'm missing something but I don't see it.  We want a block
> > interface for these devices, we just need a faster slimmer interface.
> > Maybe a new mtdblock interface that doesn't do erase would be the
> > place for it?
>
> Doesn't do erase?  MTD has to learn almost all tricks from the block
> layer, as devices are becoming high-latency high-bandwidth, compared to
> what MTD was designed for.  In order to get any decent performance, we
> need asynchronous operations, request queues and caching.
>
> The only useful advantage MTD does have over block devices is an
> _explicit_ erase operation.  Did you mean "doesn't do _implicit_ erase".


You're right.  That's the point I was trying to make, albeit badly, MTD
isn't the place for this.  The fact that more and more of what the MTD
is being used for looks a lot like the block layer is a whole
different discussion.


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
> > microseconds level and an order of magnitude higher bandwidth than
> > SATA.  Is that fast enough to warrant this more synchronous IO?
>
> See the mtd layer.

Right.  The trend is to hide the nastiness of NAND technology changes
behind controllers.  In general I think this is a good thing.
Basically the changes in ECC and reliability change very rapidly in
this technology.  Having custom controller hardware to handle this is
faster than handling it in software and makes for a nice modular
interface.  We don't rewrite our SATA drivers and filesystems
every time the magnetic media switches to a new recording scheme, we
just plug it in.  SSD's are going to be like that even if they aren't
SATA. However, the MTD layer is more about managing the chips
themselves, which is what the controllers are for.

Maybe I'm missing something but I don't see it.  We want a block
interface for these devices, we just need a faster slimmer interface.
Maybe a new mtdblock interface that doesn't do erase would be the
place for it?

> > BTW - This trend toward faster, lower latency busses is marching
> > forward.  2 examples; the ioDrive from Fusion IO, Micron's RAM-module
> > like SSD concept.
>
> Very much so but we can do quite a bit in 10,000 processor cycles ...
>
> Alan
>


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
> > refinements could theoretically get us down one more (~100
> > microsecond).
>
> They've already done already better than that.  Here's a solid state
> drive with a claimed 20 microsecond access time:
>
> http://www.curtisssd.com/products/drives/hyperxclr

Right.  That looks to be RAM based, which means expensive compared to NAND,
so that's not going to break out of a server niche.  I imagine the
latency is the device latency not the system latency.  By the time you
send the request through the fibrechannel stack and get the block back
it's gonna be much closer to 100 microseconds.  It's that OS visible
latency that you've got to design to.


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
> > Has anyone played with this concept?
>
> For things like SATA based devices they aren't that fast yet.

What is fast enough?

As I understand the basic memory technology, the hard limit is in the
100's of microseconds range for latency.  SATA adds something to that.
I'd be surprised to see latencies on SATA SSD's, as measured at the OS
level, get below 1 millisecond.

What happens when we start placing NAND technology in lower latency, higher
bandwidth buses?  I'm guessing we'll get down to that 100's of
microseconds level and an order of magnitude higher bandwidth than
SATA.  Is that fast enough to warrant this more synchronous IO?

Magnetic drives have latencies ~10 milliseconds, current SSD's are an
order of magnitude better (~1 millisecond), new interfaces and
refinements could theoretically get us down one more (~100
microsecond).  I'm guessing the current block driver subsystem would
negate a lot of that latency gain.  Am I wrong?

BTW - This trend toward faster, lower latency busses is marching
forward.  2 examples; the ioDrive from Fusion IO, Micron's RAM-module
like SSD concept.


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
  Has anyone played with this concept?

 For things like SATA based devices they aren't that fast yet.

What is fast enough?

As I understand the basic memory technology, the hard limit is in the
100's of microseconds range for latency.  SATA adds something to that.
 I'd be surprised to see latencies on SATA SSD's as measured at the OS
level to get below 1 millisecond.

What happens we start placing NAND technology in lower latency, higher
bandwidth buses?  I'm guessing we'll get down to that 100's of
microseconds level and an order of magnitude higher bandwidth than
SATA.  Is that fast enough to warrant this more synchronous IO?

Magnetic drives have latencies ~10 milliseconds, current SSD's are an
order of magnitude better (~1 millisecond), new interfaces and
refinements could theoretically get us down one more (~100
microsecond).  I'm guessing the current block driver subsystem would
negate a lot of that latency gain.  Am I wrong?

BTW - This trend toward faster, lower latency busses is marching
forward.  2 examples; the ioDrive from Fusion IO, Micron's RAM-module
like SSD concept.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
  refinements could theoretically get us down one more (~100
  microsecond).

 They've already done already better than that.  Here's a solid state
 drive with a claimed 20 microsecond access time:

 http://www.curtisssd.com/products/drives/hyperxclr

Right.  That looks to be RAM based, which means  compared to NAND,
so that's not going to breakout of a server niche.  I imagine the
latency is the device latency not the system latency.  By the time you
send the request through the fibrechannel stack and get the block back
it's gonna be much closer to 100 microseconds.  It's that OS visible
latency that you've got to design to.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
  microseconds level and an order of magnitude higher bandwidth than
  SATA.  Is that fast enough to warrant this more synchronous IO?

 See the mtd layer.

Right.  The trend is to hide the nastiness of NAND technology changes
behind controllers.  In general I think this is a good thing.
Basically the changes in ECC and reliability change very rapidly in
this technology.  Having custom controller hardware to handle this is
faster than handling it in software and makes for a nice modular
interface.  We don't rewrite our SATA drivers and filesystem
everything the magnetic media switches to a new recording scheme, we
just plug it in.  SSD's are going to be like that even if they aren't
SATA. However, the MTD layer is more about managing the chips
themselves, which is what the controllers are for.

Maybe I'm missing something but I don't see it.  We want a block
interface for these devices, we just need a faster slimmer interface.
Maybe a new mtdblock interface that doesn't do erase would be the
place for?

  BTW - This trend toward faster, lower latency busses is marching
  forward.  2 examples; the ioDrive from Fusion IO, Micron's RAM-module
  like SSD concept.

 Very much so but we can do quite a bit in 10,000 processor cycles ...

 Alan

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
  Maybe I'm missing something but I don't see it.  We want a block
  interface for these devices, we just need a faster slimmer interface.
  Maybe a new mtdblock interface that doesn't do erase would be the
  place for?

 Doesn't do erase?  MTD has to learn almost all tricks from the block
 layer, as devices are becoming high-latency high-bandwidth, compared to
 what MTD was designed for.  In order to get any decent performance, we
 need asynchronous operations, request queues and caching.

 The only useful advantage MTD does have over block devices is an
 _explicit_ erase operation.  Did you mean doesn't do _implicit_ erase.


You're right.  That the point I was trying to make, albeit badly, MTD
isn't the place for this.  The fact that more and more of what the MTD
is being used for looks a lot like the block layer is a whole
different discussion.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: solid state drive access and context switching

2007-12-04 Thread Jared Hulbert
On Dec 4, 2007 3:24 PM, Alan Cox [EMAIL PROTECTED] wrote:
  Right.  The trend is to hide the nastiness of NAND technology changes
  behind controllers.  In general I think this is a good thing.

 You miss the point - any controller you hide it behind almost inevitably
 adds enough latency you don't want to use it synchronously.

I think I get it.  We keep saying that the latency is too high.
I agree that most technologies out there have latencies that are too
high.  Again I ask the question, what latencies do we have to hit
before the sync options become worth it?


Re: [Announce] Linux-tiny project revival

2007-09-20 Thread Jared Hulbert
> > I think that this idea is not worth it.

Don't use the config option then

> My problem is that switching off printk is the single biggest bloat cutter in
> the kernel, yet it makes the resulting system very hard to support.  It
> combines a big upside with a big downside, and I'd like something in between.

It's not such a big downside IMHO.  You can support a kernel without
printk.  Need to debug the kernel without printk?  Use a JTAG
debugger...

If you have a system that actually configures out printk's, chances
are you don't have storage and output mechanisms to do much with the
messages anyway.  Think embedded _products_ here.  Sure the
development boards have serial, ethernet, and all that jazz but tens
of millions of ARM based gadgets don't.


2.6.22-rc6-mm1: BUG_ON() mm/memory.c, vm_insert_pfn(), filemap_xip.c, and spufs

2007-07-03 Thread Jared Hulbert

Recently there has been some discussion of the possibility of reworking
some of filemap_xip.c to be pfn oriented.  This would allow an XIP
fork of cramfs to use the filemap_xip framework.  Today this is not
possible.

I've been trying out vm_insert_pfn() to start down that road.  I used
spufs as a reference for how to use it.  The included patch to cramfs
is my hack at it.

When I try to execute an XIP binary I get a BUG() on 2.6.22-rc6-mm1 at
mm/memory.c line 2334.  The way I read this, it says that spufs might
not work.  I can't test it.
In spufs_mem_mmap() line 196 the vma is flagged as VM_PFNMAP:
vma->vm_flags |= VM_IO | VM_PFNMAP;

When you get a fault in a vma  __do_fault() will get this vma and
BUG() on line 2334:
BUG_ON(vma->vm_flags & VM_PFNMAP);

What happened to the functionality of do_no_pfn()?




diff -r 74bad9e01817 fs/Kconfig
--- a/fs/Kconfig	Thu Jun 28 13:49:43 2007 -0700
+++ b/fs/Kconfig	Mon Jul 02 15:47:16 2007 -0700
@@ -65,8 +65,7 @@ config FS_XIP
config FS_XIP
# execute in place
bool
-   depends on EXT2_FS_XIP
-   default y
+   default n

config EXT3_FS
tristate "Ext3 journalling file system support"
@@ -1399,8 +1398,8 @@ endchoice

config CRAMFS
tristate "Compressed ROM file system support (cramfs)"
-   depends on BLOCK
select ZLIB_INFLATE
+   select FS_XIP
help
  Saying Y here includes support for CramFs (Compressed ROM File
  System).  CramFs is designed to be a simple, small, and compressed
diff -r 74bad9e01817 fs/cramfs/inode.c
--- a/fs/cramfs/inode.c Thu Jun 28 13:49:43 2007 -0700
+++ b/fs/cramfs/inode.c Tue Jul 03 17:45:42 2007 -0700
@@ -24,15 +24,21 @@
#include <linux/vfs.h>
#include <linux/mutex.h>
#include <asm/semaphore.h>
-
+#include <linux/vmalloc.h>
#include <asm/uaccess.h>

+static const struct file_operations cramfs_xip_fops;
static const struct super_operations cramfs_ops;
static const struct inode_operations cramfs_dir_inode_operations;
static const struct file_operations cramfs_directory_operations;
static const struct address_space_operations cramfs_aops;
+static const struct address_space_operations cramfs_xip_aops;

static DEFINE_MUTEX(read_mutex);
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+   .ra_pages   = 0,/* No readahead */
+};


/* These two macros may change in future, to provide better st_ino
@@ -77,19 +83,31 @@ static int cramfs_iget5_set(struct inode
/* Struct copy intentional */
inode->i_mtime = inode->i_atime = inode->i_ctime = zerotime;
inode->i_ino = CRAMINO(cramfs_inode);
+
+   if (CRAMFS_INODE_IS_XIP(inode))
+   inode->i_mapping->backing_dev_info = &cramfs_backing_dev_info;
+
/* inode->i_nlink is left 1 - arguably wrong for directories,
   but it's the best we can do without reading the directory
   contents.  1 yields the right result in GNU find, even
   without -noleaf option. */
if (S_ISREG(inode->i_mode)) {
-   inode->i_fop = &generic_ro_fops;
-   inode->i_data.a_ops = &cramfs_aops;
+   if (CRAMFS_INODE_IS_XIP(inode)) {
+   inode->i_fop = &cramfs_xip_fops;
+   inode->i_data.a_ops = &cramfs_xip_aops;
+   } else {
+   inode->i_fop = &generic_ro_fops;
+   inode->i_data.a_ops = &cramfs_aops;
+   }
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &cramfs_dir_inode_operations;
inode->i_fop = &cramfs_directory_operations;
} else if (S_ISLNK(inode->i_mode)) {
inode->i_op = &page_symlink_inode_operations;
-   inode->i_data.a_ops = &cramfs_aops;
+   if (CRAMFS_INODE_IS_XIP(inode))
+   inode->i_data.a_ops = &cramfs_xip_aops;
+   else
+   inode->i_data.a_ops = &cramfs_aops;
} else {
inode->i_size = 0;
inode->i_blocks = 0;
@@ -111,34 +129,6 @@ static struct inode *get_cramfs_inode(st
return inode;
}

-/*
- * We have our own block cache: don't fill up the buffer cache
- * with the rom-image, because the way the filesystem is set
- * up the accesses should be fairly regular and cached in the
- * page cache and dentry tree anyway..
- *
- * This also acts as a way to guarantee contiguous areas of up to
- * BLKS_PER_BUF*PAGE_CACHE_SIZE, so that the caller doesn't need to
- * worry about end-of-buffer issues even when decompressing a full
- * page cache.
- */
-#define READ_BUFFERS (2)
-/* NEXT_BUFFER(): Loop over [0..(READ_BUFFERS-1)]. */
-#define NEXT_BUFFER(_ix) ((_ix) ^ 1)
-
-/*
- * BLKS_PER_BUF_SHIFT should be at least 2 to allow for "compressed"
- * data that takes up more space than the original and with unlucky
- * alignment.
- */
-#define BLKS_PER_BUF_SHIFT (2)
-#define BLKS_PER_BUF   (1 << BLKS_PER_BUF_SHIFT)
-#define BUFFER_SIZE	(BLKS_PER_BUF*PAGE_CACHE_SIZE)
-
-static unsigned char read_buffers[READ_BUFFERS][BUFFER_SIZE];
-static unsigned 


Re: vm/fs meetup in september?

2007-07-02 Thread Jared Hulbert

On 7/2/07, Jörn Engel <[EMAIL PROTECTED]> wrote:

On Mon, 2 July 2007 10:44:00 -0700, Jared Hulbert wrote:
>
> >So what you mean is "swap on flash" ?  Defintively sounds like an
> >interesting topic, although I'm not too sure it's all that
> >filesystem-related.
>
> Maybe not. Yet, it would be a very useful place to store data from a
> file as a non-volatile page cache.
>
> Also it is something that I believe would benefit from a VFS-like API.
> I mean there is a consistent interface a management layer like this
> could use, yet the algorithms used to order the data and the interface
> to the physical media may vary.  There is no single right way to do
> the management layer, much like filesystems.
>
> Given the page orientation of the current VFS seems to me like there
> might be a nice way to use it for this purpose.
>
> Or maybe the real experts on this stuff can tell me how wrong that is
> and where it should go :)

I don't believe anyone has implemented this before, so any experts would
be self-appointed.

Maybe this should be turned into a filesystem subject after all.  The
complexity comes from combining XIP with writes on the same chip.  So
solving your problem should be identical to solving the rw XIP
filesystem problem.

If there is interest in the latter, I'd offer my self-appointed
expertise.


Right, the solution to the swap problem is identical to the rw XIP
filesystem problem.  Jörn, that's why you're the self-appointed
subject matter expert!


Re: vm/fs meetup in september?

2007-07-02 Thread Jared Hulbert

So what you mean is "swap on flash"?  Definitely sounds like an
interesting topic, although I'm not too sure it's all that
filesystem-related.


Maybe not. Yet, it would be a very useful place to store data from a
file as a non-volatile page cache.

Also it is something that I believe would benefit from a VFS-like API.
I mean there is a consistent interface a management layer like this
could use, yet the algorithms used to order the data and the interface
to the physical media may vary.  There is no single right way to do
the management layer, much like filesystems.

Given the page orientation of the current VFS seems to me like there
might be a nice way to use it for this purpose.

Or maybe the real experts on this stuff can tell me how wrong that is
and where it should go :)


Re: vm/fs meetup in september?

2007-07-02 Thread Jared Hulbert

> Christoph> So what you mean is "swap on flash"?  Definitely sounds
> Christoph> like an interesting topic, although I'm not too sure it's
> Christoph> all that filesystem-related.

I wouldn't want to call it swap, as this carries with it block-io
connotations.  It's really mmap on flash.


Yes it is really mmap on flash.  But you are "swapping" pages from RAM
to be mmap'ed on flash.  Also the flash-io complexities are similar to
the block-io layer.  I think "swap on flash" is fair.  Though that
might be confused with making swap work on a NAND flash, which is very
much like the current block-io approach.  "Mmappable swap on flash" is
more exact, I suppose.


> You need either a block translation layer,

Are you suggesting to go through the block layer to reach the flash?


Well the obvious route would be to have this management layer use the
MTD, I can't see anything wrong with that.


> or a (swap) filesystem that
> understands flash peculiarities in order to make such a thing work.
> The standard Linux swap format will not work.

Correct.

BTW, you may want to have a look at my "[RFC] VM: I have a dream..." thread.


Interesting.  This idea does allow for swap to be accessed directly.




Re: vm/fs meetup in september?

2007-06-26 Thread Jared Hulbert

On 6/25/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Mon, Jun 25, 2007 at 05:08:02PM -0700, Jared Hulbert wrote:
> -memory mappable swap file (I'm not sure if this one is appropriate
> for the proposed meeting)

Please explain what this is supposed to mean.


If you have a large array of non-volatile, semi-writeable memory in a
system, such as a high-speed NOR flash or one of the similar emerging
technologies, it would be useful to use that memory as an extension of
RAM.  One of the ways you could do that is to allow pages to be swapped
out to this memory.  Once there, these pages could be read directly,
but would require a COW procedure on a write access.  The reason why I
think this may be a vm/fs topic is that the hardware makes writing to
this memory efficiently a non-trivial operation that requires
management just like a filesystem.  Also it seems to me that there are
probably overlaps between this topic and the recent filemap_xip.c
discussions.




Re: vm/fs meetup in september?

2007-06-25 Thread Jared Hulbert

A few things I'd like to talk about are:

- the address space operations APIs, and their page based nature. I think
  it would be nice to generally move toward offset,length based ones as
  much as possible because it should give more efficiency and flexibility
  in the filesystem.

- write_begin API if it is still an issue by that date. Hope not :)

- truncate races

- fsblock if it hasn't been shot down by then

- how to make complex API changes without having to fix most things
  yourself.


I'd like to add:

-revamping filemap_xip.c

-memory mappable swap file (I'm not sure if this one is appropriate
for the proposed meeting)




Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-15 Thread Jared Hulbert

On 6/15/07, Carsten Otte <[EMAIL PROTECTED]> wrote:

Nick Piggin wrote:
> Carsten Otte wrote:
>> The current xip stack relies on having struct page behind the memory
>> segment. This causes few impact on memory management, but occupies
>> some more memory. The cramfs patch chose to modify copy on write in
>> order to deal with vmas that don't have struct page behind.
>> So far, Hugh and Linus have shown strong opposition against copy on
>> write with no struct page behind. If this implementation is acceptable
>> to the them, it seems preferable to me over wasting memory. The xip
>> stack should be modified to use this vma flag in that case.
>
> I would rather not :P
>
> We can copy on write without a struct page behind the source today, no?
> What is insufficient for the XIP code with the current COW?

I've looked at the -mm version of mm/memory.c today, with intent to
try out VM_PFNMAP for our xip mappings and replace nopage() with fault().
The thing is, I believe it doesn't work for us:
  * The way we recognize those mappings is through the rules set up
  * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set,
  * and the vm_pgoff will point to the first PFN mapped: thus every
  * page that is a raw mapping will always honor the rule
  *
  *  pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >>
PAGE_SHIFT)

This is, as far as I can tell, not true for our xip mappings. Ext2 may
spread the physical pages behind a given file all over its media. That
means, that the pfns of the pages that form a vma may be more or less
random rather than contiguous. The common memory management code
cannot tell whether or not a given page has been COW'ed.
Did I miss something?


I agree, the conditions imposed by the remap_pfn_range() don't work.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-15 Thread Jared Hulbert

If you can write code that doesn't need any struct pages that would make
life a bit easier, since we wouldn't need any pseudo memory hotplug code
that just adds struct pages.


That was my gut feel too.  However, it seems from Carsten and Jörn's
discussion of read/write XIP on flash (and some of the new phase-change)
memories that having the struct pages has a lot of potential benefits.
Wouldn't it also allow most of the mm routines to remain unchanged?
I just worry that it would be difficult to set apart these non-volatile
pages that can't be written to directly.


We would still need to add the kernel mapping though.


But that's handled by ioremap()ing it right?




Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-14 Thread Jared Hulbert

An alternative approach, which does not need to have struct page at
hand, would be to use the nopfn vm operations struct. That one would
have to rely on get_xip_pfn.


Of course!  Okay, now I'm beginning to understand.


The current path would then be deprecated.


Why?  Wouldn't both paths be valid options?


If you're interested in using the latter for xip without
struct page, I would volunteer to go ahead and implement this?


I'm very interested in this.


I'm not opposed to using struct page, but I'm confused as to how to
start that.  As I understand it, which is not well, defined a
CONFIG_DISCONTIGMEM region to cover the Flash memory would add that to
my pool of RAM.  That would be 'bad', right?  I don't see how to
create the page structs and set this memory aside as different.




Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-12 Thread Jared Hulbert

Nick Piggin wrote:
> The question is, why is that not enough (I haven't looked at these
> patches enough to work out if there is anything more they provide).
I think, it just takes trying things out. From reading the code, I
think this should work well for the filemap_xip code with no struct page.
Also, we need to eliminate nopage() to get rid of the struct page.
Unfortunately I don't find time to try this out for now, and on 390 we
can live with struct page for the time being. In contrast to the
embedded platforms, the mem_map array gets swapped out to disk by our
hypervisor.


Can you help me understand the comment about nopage()?  Do you mean
set xip_file_vm_ops.nopage to NULL?


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-08 Thread Jared Hulbert

The downside: We need mem_map[] struct page entries behind all memory
segments. Nowadays we can easily create those via vmem_map/sparsemem.

Opinions?


Frankly this is going to be mostly relevant on ARM architectures at
least at first.  Maybe I'm missing something but I don't see that
sparsemem is supported on ARM...


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-08 Thread Jared Hulbert

On Fri, Jun 08, 2007 at 09:05:32AM -0700, Jared Hulbert wrote:
> Okay so we need some driver that opens/closes this ROM.  This has been
> done from the dcss block device but that doesn't make sense for most
> embedded systems.  The MTD allows for this with point(),unpoint().
> That should work just fine.  It does introduce the MTD as a dependency
> which is unnecessary in many systems, but it will work now.

The Linux solution to this problem would be to introduce an option for
mtd write support.  That way the majority of the code doesn't have to
be compiled for the read-only case but you still get a uniform interface.


You mean make an MTD-light interface possible?


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-08 Thread Jared Hulbert

On 6/8/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Fri, Jun 08, 2007 at 09:59:20AM +0200, Carsten Otte wrote:
> Christoph Hellwig wrote:
> >Jared's patch currently does ioremap on mount (and no iounmap at all).
> >That mapping needs to move from the filesystem to the device driver.
> The device driver needs to do ioremap on open(), and iounmap() on
> release. That's effectively what our block driver does.

Yes, exactly.


Okay so we need some driver that opens/closes this ROM.  This has been
done from the dcss block device but that doesn't make sense for most
embedded systems.  The MTD allows for this with point(),unpoint().
That should work just fine.  It does introduce the MTD as a dependency
which is unnecessary in many systems, but it will work now.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-08 Thread Jared Hulbert

On 6/8/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Thu, Jun 07, 2007 at 01:34:12PM -0700, Jared Hulbert wrote:
> >And we'll need that even when using cramfs.  There's no way we'd
> >merge a hack where the user has to specify a physical address on
> >the mount command line.
>
> Why not?  For the use case in question the user usually manually
> burned the image to a physical address before hand.  Many of these
> system don't have MTD turned on for this Flash, they don't need it
> because they don't write to this Flash once the system is up.

Then add a small device layer for it.  Remember that linux is not all
about hacked up embedded devices that get shipped once and never
touched again.


Remember that linux is not all about big iron machines with lots of
processors and gigabytes of RAM :)

I concede your layer point, ioremap() doesn't belong in the filesystem.


get_xip_page() uncertainty

2007-06-07 Thread Jared Hulbert

I am trying to create valid "struct page* (*get_xip_page)(struct
address_space *, sector_t, int)" to use the filemap_xip.c.

I've been trying to do it as follows:

void *virtual = ioremap(physical, size);

struct page *my_get_xip_page(struct address_space *mapping, sector_t sector,
                             int create)
{
	unsigned long offset;
	/* extract offset from mapping and sector */
	return virt_to_page(virtual + offset);
}

I believe this to be fundamentally flawed.  While this works for
xip_file_read(), it does not work for xip_file_mmap().  I'm not sure I
understand the correct way to do this.  But I assume the problem, and
have some evidence to support it, is that virt_to_page() is not
returning a valid page struct.

How can I get a valid page struct?  The memory is not RAM but Flash.
It is addressable like RAM and I want userspace to use it like
readonly RAM.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-07 Thread Jared Hulbert

If we were actually talking about a complex filesystem I'd agree.  But
the cramfs xip patch posted here touches about 2/3 of the number of
lines that cramfs has in total.


Fair enough.  But look at the complexity rather than number of lines.
It adds tedium to the cramfs_fill_super and one extra level of
indirection to a handful of ops like mmap() and cramfs_read().  But
the changes to the real meat of cramfs, cramfs_readpage(), are limited
to the XIP changes, which I want on block devices anyway.

So if we did fork cramfs I would submit a simple patch to cramfs for
XIP support on block devices and I would submit a patch for a new
filesystem, cramfs-linear.  Cramfs-linear would have an exact copy of
1/3 of the cramfs code such as cramfs_readpage(), it would use the
same headers, and it would use the same userspace tools.

This fork is what the community wants?  Speak up!


And cramfs is not exactly the best base to start with..


This is a moot point; there is a significant installed base issue.
There are lots of cramfs-linear-xip based systems in existence which
can't be easily ported to newer kernels because of a lack of support.


> This is nirvana.   But it is not the goal of the patches in question.
> In fact there are several use cases that don't need and don't value
> the writeability and don't need therefore the overhead.  It is a long
> term goal never the less.

With the filemap_xip.c helpers adding xip support to any filesystem
is pretty trivial for the highlevel filesystem operations.  The only
interesting bit is the lowlevel code (the get_xip_page method and
the others Carsten mentioned), but we need to do these lowlevel
code in a generic and proper way anyway.


It's not that trivial.  The filesystem needs to meet several
requirements such as having data nodes that are page aligned.
Anytime any changes are made to any page in the underlying Flash block
or if the Flash physical partition goes out of read mode you've got to
hide that from userspace or otherwise deal with it.  A filesystem that
doesn't understand these subtle hardware requirements would either not
work at all, have lots of deadlock issues, or at least have terrible
performance problems.  Nevertheless I suppose a simple, but invasive,
hack could likely produce a worthwhile proof of concept.

I think this is worthy of its own thread


I'll try to hack up an xip prototype for jffs2 next week.


Very cool.  I can't wait to see what you have in mind.  But remember
this doesn't solve the problem of the huge installed base of
cramfs-linear-xip images.

Gee I think it seems logfs would be a better choice.  Jffs2 and
ubifs (jffs3) for that matter combine node and node header in series
which means your data nodes aren't aligned to page boundaries. Logfs
nodes could be more easily aligned.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-07 Thread Jared Hulbert

I've had a few beer long discussion with Joern Engel and David
Woodhouse on this one. To cut a long discussion short: the current XIP
infrastructure is not sufficient to be used on top of mtd. We'd need
some extensions:
- on get_xip_page() we'd need to state if we want the reference
read-only or read+write
- we need a put_xip_page() to return references
- and finally we need a callback for the reference, so that the mtd
driver can ask to get its reference back (in order to unmap from
userland when erasing a block)


Yes. And one more thing.  We can't assume that either every page in a file is XIP or none is.

However, I still can't get even the existing get_xip_page() to work
for me so we are getting ahead of ourselves;)  Looking back on this
thread I realize I haven't confirmed if my cramfs_get_xip_page() gets
a page struct.  I assume that is my problem?  The UML find_iomem()
probably returns pseudo iomem with page structs, while ioremap() does
not return page-struct-backed memory.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-07 Thread Jared Hulbert

that even more important doesn't require pulling in
the whole block layer which is especially important for embedded
devices at the lower end of the scale.


Good point.  That is a big oversight.  Though I would prefer to handle
that in the same fs rather than fork.


I still think it'd be even better to just
hook xip support into jffs or logfs because they give you a full
featured flash filesystem for all needs without the complexity
of strictly partitioning between xip-capable and write parts
of your storage.


This is nirvana.   But it is not the goal of the patches in question.
In fact there are several use cases that don't need and don't value
the writeability and therefore don't need the overhead.  It is a
long-term goal nevertheless.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-07 Thread Jared Hulbert

On 6/7/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote:
> I've had a few beer long discussion with Joern Engel and David
> Woodhouse on this one. To cut a long discussion short: the current XIP
> infrastructure is not sufficient to be used on top of mtd. We'd need
> some extenstions:
> - on get_xip_page() we'd need to state if we want the reference
> read-only or read+write
> - we need a put_xip_page() to return references
> - and finally we need a callback for the referece, so that the mtd
> driver can ask to get its reference back (in order to unmap from
> userland when erasing a block)

And we'll need that even when using cramfs.  There's no way we'd
merge a hack where the user has to specify a physical address on
the mount command line.


Why not?  For the use case in question the user usually manually
burned the image to a physical address beforehand.  Many of these
systems don't have MTD turned on for this Flash; they don't need it
because they don't write to this Flash once the system is up.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-07 Thread Jared Hulbert

On 6/7/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Thu, Jun 07, 2007 at 08:37:07PM +0100, Christoph Hellwig wrote:
> The code is at http://verein.lst.de/~hch/cramfs-xip.tar.gz.

And for those just wanting to take a quick glance, this is the
diff vs an out of tree cramfs where uncompress.c and cramfs_fs_sb.h
are merged into inode.c:


Cool.  I notice you removed my UML hacks... Why?

I just don't get one thing.  This is almost a duplicate of
cramfs-block.  Why would we prefer a fork with a lot of code
duplication to adding a couple alternate code paths in cramfs-block?

Also keep in mind there are several reasons why you might want to have
block access to a XIP-built cramfs image.  I am unpersuaded that
this fork approach is fundamentally better.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

On 6/6/07, Carsten Otte <[EMAIL PROTECTED]> wrote:

Jared Hulbert wrote:
> (2) failed with the following messages.  (This wasn't really busybox.
> It was xxd, not statically linked, hence the issue with ld.so)
Could you try to figure out what happened to the subject page before? Was it
subject to copy on write? With what flags has this vma been mmaped?

thanks,
Carsten


The vma->vm_flags = 1875 = 0x753

This is:
VM_READ
VM_WRITE
VM_MAYREAD
VM_MAYEXEC
VM_GROWSDOWN
VM_GROWSUP
VM_PFNMAP

I assume no struct page exists for the pages of this file.  When
vm_no_page was called it seems it failed on a pte check since there is
no backing page structure.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

 The embedded people already use them
on flash which is a little dumb, but now we add even more kludge for
a non-block based access.


Please justify your assertion that using cramfs on flash is dumb.
What would be not dumb?  In an embedded system with addressable Flash
the linear addressing cramfs is a simple and elegant solution.

Removing support for block based access would drastically reduce the
complexity of cramfs.  The non-block access bits of code are trivial
in comparison.  Specifically which part of my patch represents
unwarranted, unfixable kludge?


The right way to architect xip for flash-based devices is to implement
a generic get_xip_page for mtd-based devices and integrate that into
an existing flash filesystem or write a simple new flash filesystem
tailored to that use case.


There is often no need for the complexity of the MTD for a readonly
compressed filesystem in the embedded world.   I am intrigued by the
suggestion of a generic get_xip_page() for mtd-based devices.  I fail
to see how get_xip_page() is not highly filesystem dependent.  How
might a generic one work?


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

On Wed, Jun 06, 2007 at 09:07:16AM -0700, Jared Hulbert wrote:
> I estimate something on the order 5-10 million Linux phones use
> something similar to these patches.  I wonder if there are that many
> provable users of of the simple cramfs.  This is where the community
> has taken cramfs.

This is what a community disjoint to mainline development has hacked
cramfs in their trees into.  Not a good rationale.  This whole
"but we've always done it" attitude is a little annoying, really.


It is that disjointedness we are trying to address.


FYI: Carsten had an xip fs for s390 as well, and that evolved into
the filemap_xip.c bits after a lot of rework and quite a few rounds of
review.


Right.  So now we leverage this filemap_xip.c in cramfs.  Why is this a problem?


> Nevertheless, I understand your point.  I wrote AXFS in part because
> the hacks required to do XIP on cramfs were ugly, hacky, and complex.

I can't find a reference to AXFS anywhere in this thread.


No, it's not here.  There's a year-old thread referencing it.


> > Please
> >use something like the existing ext2 xip mode instead of add support
> >to romfs using the generic filemap methods.
>
> What??  You mean like use xip_file_mmap() and implement
> get_xip_page()?  Did you read my latest patch?

Yes.  This is the highlevel way to go, just please don't hack it into
cramfs.


Right, so this latest patch _does_ implement get_xip_page() and
xip_file_mmap().  Why not hack it into cramfs?


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

On 6/6/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

On Wed, Jun 06, 2007 at 08:17:43AM -0700, Richard Griffiths wrote:
> Too late :) The XIP cramfs patch is widely used in the embedded Linux
> community and has been used for years. It fulfills a need for a small
> XIP Flash file system. Hence our interest in getting it or some
> variation into the mainline kernel.

That's not a reason to put it in as-is.  Maybe you should have showed
up here before making the wrong decision to hack this into cramfs.



Please read the entire thread before passing judgements like this.
The hacks evolved over the last 8 years, and are really handy.  We're
just trying to figure out the best way to share them.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

On 6/6/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote:

I might be a little late in the discussion, but I somehow missed this
before.  Please don't add this xip support to cramfs, because the
whole point of cramfs is to be a simple _compressed_ filesystem,
and we really don't want to add more complexity to it.


I estimate something on the order of 5-10 million Linux phones use
something similar to these patches.  I wonder if there are that many
provable users of the simple cramfs.  This is where the community
has taken cramfs.

Nevertheless, I understand your point.  I wrote AXFS in part because
the hacks required to do XIP on cramfs were ugly, hacky, and complex.
Please review the latest patch in the thread; it's just a draft, but
the changes required are not very complex now, especially in light of
the filemap_xip.c APIs being used.  It just happens not to work, yet.


 Please
use something like the existing ext2 xip mode instead of adding support
to romfs using the generic filemap methods.


What??  You mean like use xip_file_mmap() and implement
get_xip_page()?  Did you read my latest patch?


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert
@@ -9,6 +9,46 @@
/*
 * These are the VFS interfaces to the compressed rom filesystem.
 * The actual compression is based on zlib, see the other files.
+ */
+
+/* Linear Addressing code
+ *
+ * Copyright (C) 2000 Shane Nay.
+ *
+ * Allows you to have a linearly addressed cramfs filesystem.
+ * Saves the need for buffer, and the munging of the buffer.
+ * Savings a bit over 32k with default PAGE_SIZE, BUFFER_SIZE
+ * etc.  Usefull on embedded platform with ROM :-).
+ *
+ * Downsides- Currently linear addressed cramfs partitions
+ * don't co-exist with block cramfs partitions.
+ *
+ */
+
+/*
+ * 28-Dec-2000: XIP mode for linear cramfs
+ * Copyright (C) 2000 Robert Leslie <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+
+/* filemap_xip.c interfaces - Jared Hulbert 2007
+ * linear + block coexisting - Jared Hulbert 2007
+ *  (inspired by patches from Kyungmin Park of Samsung and others at
+ *   Motorola from the EZX phones)
+ *
 */

#include <linux/module.h>
@@ -24,22 +64,25 @@
#include <linux/vfs.h>
#include <linux/mutex.h>
#include <asm/semaphore.h>
+#include <linux/vmalloc.h>

#include <asm/uaccess.h>
-
-static const struct super_operations cramfs_ops;
-static const struct inode_operations cramfs_dir_inode_operations;
+#include <asm/tlbflush.h>
+#ifdef CONFIG_UML
+#include <mem_user.h>
+#endif
+
+static struct super_operations cramfs_ops;
+static struct inode_operations cramfs_dir_inode_operations;
static const struct file_operations cramfs_directory_operations;
static const struct address_space_operations cramfs_aops;

static DEFINE_MUTEX(read_mutex);
-

/* These two macros may change in future, to provide better st_ino
   semantics. */
#define CRAMINO(x)  (((x)->offset && (x)->size)?(x)->offset<<2:1)
#define OFFSET(x)   ((x)->i_ino)
-

static int cramfs_iget5_test(struct inode *inode, void *opaque)
{
@@ -99,13 +142,77 @@ static int cramfs_iget5_set(struct inode
return 0;
}

+static int cramfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+   
+   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+   return -EINVAL;
+
+   if ((CRAMFS_INODE_IS_XIP(inode)) && !(vma->vm_flags & VM_WRITE) &&
+ (LINEAR_CRAMFS(sbi)))
+   return xip_file_mmap(file, vma);
+
+   return generic_file_mmap(file, vma);
+}
+
+struct page *cramfs_get_xip_page(struct address_space *mapping,
+                                sector_t offset, int create)
+{
+   unsigned long address;
+   unsigned long offs = offset;
+   struct inode *inode = mapping->host;
+   struct super_block *sb = inode->i_sb;
+   struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+   address  = PAGE_ALIGN((unsigned long)(sbi->linear_virt_addr
+   + OFFSET(inode)));
+   offs *= 512; /* FIXME -- This shouldn't be hard coded */
+   address += offs;
+
+   return virt_to_page(address);
+}
+
+ssize_t cramfs_file_read(struct file *file, char __user * buf, size_t len,
+  loff_t * ppos)
+{
+   struct inode *inode = file->f_dentry->d_inode;
+   struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+   
+   if ((CRAMFS_INODE_IS_XIP(inode)) && (LINEAR_CRAMFS(sbi)))
+   return xip_file_read(file, buf, len, ppos);
+   
+   return do_sync_read(file, buf, len, ppos);
+}
+
+static struct file_operations cramfs_linear_xip_fops = {
+   aio_read:   generic_file_aio_read,
+   read:   cramfs_file_read,
+   mmap:   cramfs_mmap,
+};
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+   .ra_pages   = 0,/* No readahead */
+};
+
static struct inode *get_cramfs_inode(struct super_block *sb,
struct cramfs_inode * cramfs_inode)
{
+   struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
struct inode *inode = iget5_locked(sb, CRAMINO(cramfs_inode),
cramfs_iget5_test, cramfs_iget5_set,
cramfs_inode);
if (inode && (inode->i_state & I_NEW)) {
+   if (LINEAR_CRAMFS(sbi))
+   inode->i_mapping->backing_dev_info



Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-06 Thread Jared Hulbert

On 6/6/07, Carsten Otte <[EMAIL PROTECTED]> wrote:

Jared Hulbert wrote:
 (2) failed with the following messages.  (This wasn't really busybox.
 It was xxd, not statically link, hence the issue with ld.so)
Could you try to figure out what happened to the subject page before? Was it
subject to copy on write? With what flags has this vma been mmapped?

thanks,
Carsten


The vma->vm_flags = 1875 = 0x753

This is:
VM_READ
VM_WRITE
VM_MAYREAD
VM_MAYEXEC
VM_GROWSDOWN
VM_GROWSUP
VM_PFNMAP

I assume no struct page exists for the pages of this file.  When
vm_no_page was called, it seems it failed on a pte check since there is
no backing page structure.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-06-01 Thread Jared Hulbert

> The current xip stack relies on having struct page behind the memory
> segment. This has little impact on memory management, but occupies some
> more memory. The cramfs patch chose to modify copy on write in order to
> deal with vmas that don't have struct page behind.
> So far, Hugh and Linus have shown strong opposition against copy on
> write with no struct page behind. If this implementation is acceptable
> to them, it seems preferable to me over wasting memory. The xip
> stack should be modified to use this vma flag in that case.

I would rather not :P

We can copy on write without a struct page behind the source today, no?


The existing COW techniques fail on some corner cases.  I'm not up to
speed on the vm code.  I'll try to look into this a little more but it
might be useful if I knew what questions I need to answer so you vm
experts can understand the problem.

Let me give one example.  If you try to debug an XIP application
without this patch, bad things happen.  XIP in this sense is synonymous
with executing directly out of Flash, and you can't just change the
physical memory to redirect it to the debugger so easily in Flash.
Now I don't know exactly why yet, but some, not all, applications
trigger this added vm hack.  I'm not sure exactly why it would get
triggered under normal circumstances.  Why would a read-only map get
written to?


What is insufficient for the XIP code with the current COW?


So I think the problem may have something to do with the nature of the
memory in question.  We are using Flash that is ioremap()'ed to a
usable virtual address.  And yet we go on to try to use it as if it
were plain old system memory, like any RAM page.  We need it to be
presented as any other memory page, only physically read-only.
ioremap() seems to be a hacky way of accomplishing that, but I can't
think of a better way.  On ARM we even had to invent ioremap_cached() to
improve performance.  Thoughts?




Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-05-24 Thread Jared Hulbert

Yes you can, but I won't have access to a PXA270 for a few weeks. I
assume you don't see the issue if you static link busybox?


I don't know.


Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP

2007-05-24 Thread Jared Hulbert

On 5/22/07, Richard Griffiths <[EMAIL PROTECTED]> wrote:

Venerable cramfs fs Linear XIP patch originally from MontaVista, used in
the embedded Linux community for years, updated for 2.6.21. Tested on
several systems with NOR Flash. PXA270, TI OMAP2430, ARM Versatile and
Freescale iMX31ADS.



When trying to verify this patch on our PXA270 system we get the
following error when running an XIP rootfs:

cramfs: checking physical address 0xa0 for linear cramfs image
cramfs: linear cramfs image appears to be 3236 KB in size
VFS: Mounted root (cramfs filesystem) readonly.
Freeing init memory: 96K
/sbin/init: error while loading shared libraries: libgcc_s.so.1:
failed to map segment from shared object: Error 11
Kernel panic - not syncing: Attempted to kill init!

However, if our busybox binary is XIP while the libgcc_s.so.1 is not
XIP, busybox runs fine.

Richard, may I email you the rootfs tarball so you can recreate what
we are seeing?  It is a little less than 2MiB.  The filesystem
executables will only run on a PXA27x processor.



