Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
On Tue, Feb 2, 2016 at 4:34 PM, Matthew Wilcox wrote:
> On Tue, Feb 02, 2016 at 01:46:06PM -0800, Jared Hulbert wrote:
>> On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams wrote:
>> >> The filesystem I'm concerned with is AXFS (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf), which I've been planning on trying to merge again due to a recent resurgence of interest. The device model for AXFS is... weird. It can use one or two devices at a time of any mix of NOR MTD, NAND MTD, block, and unmanaged physical memory. It's a terribly useful model for embedded. Anyway, AXFS is read-only, so hacking in a read-only dax_fault_nodev() and dax_file_read() would work fine; looks easy enough. But... it would be cool if similar small, embedded-focused RW filesystems were enabled.
>> >
>> > Are those also out of tree?
>>
>> Of course. Merging embedded filesystems is a little like merging regular filesystems, except 98% of you reviewers don't want it merged.
>
> You should at least be able to get it into staging these days. I mean, look at some of the junk that's in staging ... and I don't think AXFS was nearly as bad.

Thanks? ;)

>> IMO you're making DAX more complex by overly coupling it to the bdev, and I think it could bite you later. I submit this rework of the radix tree and the confusion about where to get the real bdev as evidence. I'm guessing that it won't be the last time. It's unnecessary to couple it like this, and in fact it is not how the vfs has been layered in the past.
>
> Huh? The rework to use the radix tree for PFNs was done with one eye firmly on your usage case. Just because I had to thread the get_block interface through it for the moment doesn't mean that I didn't have the "how do we get rid of get_block entirely" question on my mind.

Oh yeah. I think we're on the same page. But I'm not sure Dan is. I get the need to phase this in too.

> Using get_block seemed like the right idea three years ago. I didn't know just how fundamentally ext4 and XFS disagree on how it should be used.

Sure. I can see that.

>> To look at the downside, consider dax_fault(). It's called on a fault to a user memory map, and it uses the filesystem's get_block() to look up a sector so you can ask a block device to convert it to an address on a DIMM. Come on, that's awkward. Everything around dax_fault() is dripping with memory-semantic interfaces, the dax_fault() calls are fundamentally about memory, the pmem calls are memory, the hardware is memory, and yet it directly calls bdev_direct_access(). It's out of place.
>
> What was out of place was the old 'get_xip_mem' in address_space operations. Returning a kernel virtual address and a PFN from a filesystem operation? That looks awful.

Yes. Yes it does! But at least my big hack was just one line. ;) Nobody really even seemed to notice at the time.

> All the other operations deal in struct pages, file offsets and occasionally sectors. Of course, we don't have a struct page, so a pfn makes sense, but the kernel virtual address being returned was a gargantuan layering problem.

Well yes, but it was an expedient hack.

>> The legacy vfs/mm code didn't have this layering problem either. Even filemap_fault(), which dax_fault() is modeled after, doesn't call any bdev methods directly; when it needs something it asks the filesystem with a ->readpage(). The precedent is that you ask the filesystem for what you need. Look at the get_bdev() thing you've concluded you need. It _almost_ makes my point. I just happen to be of the opinion that you don't actually want or need the bdev, you want the pfn/kaddr so you can flush or map or memcpy().
>
> You want the pfn. The device driver doesn't have enough information to give you a (coherent with userspace) kaddr. That's what (some future arch-specific implementation of) dax_map_pfn() is for. That's why it takes 'index' as a parameter, so you can calculate where it'll be mapped in userspace, and determine an appropriate kernel virtual address to use for it.

Oh, I think I'm just beginning to catch your vision for dax_map_pfn(). I still don't get why we can't just do semi-arch-specific flushing instead of the alignment thing. But that just might be epic ignorance on my part. Either way, flush or magic alignments, dax_(un)map_pfn() would handle it, right?
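[To make the 'index' point concrete, here is a minimal sketch of the arithmetic a hypothetical dax_map_pfn() could start from — nothing below exists in mainline; it just computes the user-space cache colour that an arch implementation would have to match when picking a kernel virtual address:]

#include <linux/mm.h>
#include <asm/shmparam.h>	/* SHMLBA */

/*
 * Hypothetical helper: derive the user virtual address that 'index'
 * maps to in this vma, and reduce it to its cache colour. An
 * arch-specific dax_map_pfn() would pick a kernel virtual address
 * with the same colour so both aliases hit the same cache lines.
 */
static unsigned long dax_user_colour(struct vm_area_struct *vma,
				     pgoff_t index)
{
	unsigned long uaddr = vma->vm_start +
			((index - vma->vm_pgoff) << PAGE_SHIFT);

	return uaddr & (SHMLBA - 1);
}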
Re: [PATCH] dax: allow DAX to look up an inode's block device
On Tue, Feb 2, 2016 at 3:41 PM, Dan Williams wrote:
> On Tue, Feb 2, 2016 at 3:36 PM, Jared Hulbert wrote:
>> On Tue, Feb 2, 2016 at 3:19 PM, Al Viro wrote:
>>> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
>>> > However, for raw block devices and for XFS with a real-time device, the value in inode->i_sb->s_bdev is not correct. With the code as it is currently written, an fsync or msync to a DAX enabled raw block device will cause a NULL pointer dereference kernel BUG. For this to work correctly we need to ask the block device or filesystem what struct block_device is appropriate for our inode.
>>> >
>>> > To that end, add a get_bdev(struct inode *) entry point to struct super_operations. If this function pointer is non-NULL, this notifies DAX that it needs to use it to look up the correct block_device. If i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>>>
>>> Umm... It assumes that bdev will stay pinned for as long as inode is referenced, presumably? If so, that needs to be documented (and verified for existing fs instances). In principle, multi-disk fs might want to support things like "silently move the inodes backed by that disk to other ones"...
>>
>> Dan, this is exactly the kind of thing I'm talking about WRT the weirder device models and directly calling bdev_direct_access(). Filesystems don't have the monogamous relationship with a device that is implicitly assumed in DAX. You have to ask the filesystem what the relationship is and is migrating to, and allow the filesystem to update DAX when the relationship is changing.
>
> That's precisely what ->get_bdev() does. When the answer from the inode->i_sb->s_bdev lookup is invalid, use ->get_bdev().
>
>> As we start to see many DIMMs and 10s-of-TiB pmem systems this is going to be an even bigger deal, as load balancing, wear leveling, and fault tolerance concerns are inevitably driven by the filesystem.
>
> No, there are no plans on the horizon for an fs to manage these media specific concerns for persistent memory.

So the filesystem is now directly in charge of mapping user pages to physical memory. The filesystem is effectively bypassing NUMA and zones and all that stuff that tries to balance memory bus and QPI traffic etc. You don't think the filesystem will therefore be in charge of memory bus hotspots? Alright. We can just agree to disagree on that point.
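[For reference, the lookup being argued over reduces to a few lines. This is only a sketch of Ross's proposal as described above — the get_bdev member does not exist in struct super_operations upstream:]

#include <linux/fs.h>
#include <linux/blkdev.h>

/* Sketch: prefer the filesystem's answer, fall back to the
 * superblock's single device (today's implicit assumption). */
static struct block_device *dax_get_bdev(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;

	if (sb->s_op->get_bdev)		/* proposed hook, not upstream */
		return sb->s_op->get_bdev(inode);

	return sb->s_bdev;
}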
Re: [PATCH] dax: allow DAX to look up an inode's block device
On Tue, Feb 2, 2016 at 3:19 PM, Al Viro wrote:
> On Tue, Feb 02, 2016 at 04:11:42PM -0700, Ross Zwisler wrote:
> > However, for raw block devices and for XFS with a real-time device, the value in inode->i_sb->s_bdev is not correct. With the code as it is currently written, an fsync or msync to a DAX enabled raw block device will cause a NULL pointer dereference kernel BUG. For this to work correctly we need to ask the block device or filesystem what struct block_device is appropriate for our inode.
> >
> > To that end, add a get_bdev(struct inode *) entry point to struct super_operations. If this function pointer is non-NULL, this notifies DAX that it needs to use it to look up the correct block_device. If i_sb->get_bdev() is NULL DAX will default to inode->i_sb->s_bdev.
>
> Umm... It assumes that bdev will stay pinned for as long as inode is referenced, presumably? If so, that needs to be documented (and verified for existing fs instances). In principle, multi-disk fs might want to support things like "silently move the inodes backed by that disk to other ones"...

Dan, this is exactly the kind of thing I'm talking about WRT the weirder device models and directly calling bdev_direct_access(). Filesystems don't have the monogamous relationship with a device that is implicitly assumed in DAX. You have to ask the filesystem what the relationship is and is migrating to, and allow the filesystem to update DAX when the relationship is changing. As we start to see many DIMMs and 10s-of-TiB pmem systems this is going to be an even bigger deal, as load balancing, wear leveling, and fault tolerance concerns are inevitably driven by the filesystem.
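[Al's lifetime point, sketched as code. Whether DAX must take its own reference like this, or whether every filesystem guarantees the bdev outlives the inode, is exactly what he is asking to have documented. The function and its internals are hypothetical; bdgrab()/bdput() are the real pinning primitives:]

#include <linux/fs.h>

/* Illustrative only: if the bdev is not pinned for the life of the
 * inode, DAX would have to bracket its use of the device. */
static void dax_with_pinned_bdev(struct inode *inode)
{
	struct block_device *bdev = inode->i_sb->s_bdev; /* or ->get_bdev() */

	bdgrab(bdev);	/* pin across the direct-access window */

	/* ... bdev_direct_access(), pte setup, cache flushing ... */

	bdput(bdev);	/* drop the pin */
}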
Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
On Tue, Feb 2, 2016 at 8:51 AM, Dan Williams wrote:
> On Tue, Feb 2, 2016 at 12:05 AM, Jared Hulbert wrote:
> [..]
>> Well... CONFIG_BLOCK was not required with filemap_xip.c for a decade. This CONFIG_BLOCK dependency is a result of an incremental feature, from a certain point of view ;)
>>
>> The obvious 'driver' is physical RAM without a particular driver. Remember please I'm talking about embedded: RAM measured in MiB and funky one-off hardware, etc. In the embedded world there are lots of ways that persistent memory has been supported in device-specific ways without the new fancypants NFIT and Intel instructions, so frankly they don't fit in the PMEM stuff. Maybe they could be supported in PMEM, but not without effort to bring embedded players to the table.
>
> Not sure what you're trying to say here. An ACPI NFIT only feeds the generic libnvdimm device model. You don't need NFIT to get pmem.

Right... I'm just not seeing how the libnvdimm device model fits, is relevant, or useful to a persistent SRAM in embedded. Therefore I don't see how some of these users would have a driver.

>> The other drivers are the MTD drivers, probably as read-only for now. But the paradigm there isn't so different from what PMEM looks like with asymmetric read/write capabilities.
>>
>> The filesystem I'm concerned with is AXFS (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf), which I've been planning on trying to merge again due to a recent resurgence of interest. The device model for AXFS is... weird. It can use one or two devices at a time of any mix of NOR MTD, NAND MTD, block, and unmanaged physical memory. It's a terribly useful model for embedded. Anyway, AXFS is read-only, so hacking in a read-only dax_fault_nodev() and dax_file_read() would work fine; looks easy enough. But... it would be cool if similar small, embedded-focused RW filesystems were enabled.
>
> Are those also out of tree?

Of course. Merging embedded filesystems is a little like merging regular filesystems, except 98% of you reviewers don't want it merged.

>> I don't expect you to taint DAX with design requirements for this stuff that it wasn't built for; nobody ends up happy in that case. However, if enabling the filesystem to manage the bdev_direct_access() interactions solves some of the "alternate device" problems you are discussing here, then there is a chance we can accommodate both. Sometimes that works.
>>
>> So... Forget CONFIG_BLOCK=n entirely, I didn't want that to be the focus anyway. Does it help to support the weirder XFS and btrfs device models to enable the filesystem to handle the bdev_direct_access() stuff?
>
> It's not clear that it does. We just clarified with xfs and ext4 that we can rely on get_blocks(). That solves the immediate concern with multi-device filesystems.

IMO you're making DAX more complex by overly coupling it to the bdev, and I think it could bite you later. I submit this rework of the radix tree and the confusion about where to get the real bdev as evidence. I'm guessing that it won't be the last time. It's unnecessary to couple it like this, and in fact it is not how the vfs has been layered in the past.

The trouble with vfs work has been that it straddles the line between mm and block; unfortunately that line is a dark chasm with ill-defined boundaries. DAX is even more exciting because it's trying to duct tape the filesystem even closer to the mm system; one could argue it's actually, in some respects, enabling the filesystem to bypass the mm code. On top of that, DAX is designed to enable block-based filesystems to use RAM-like devices. Bolting the block device interface on to NVDIMM is a brilliant hack and the right design choice, but it's still a hack. The upside is it enables the reuse of all this glorious legacy filesystem code, which does a pretty amazing job of handling what the pmem device applications need, considering it was designed to manage data on platters of slow spinning rust. What would DAX look like if developed with a filesystem purpose-built for pmem?

To look at the downside, consider dax_fault(). It's called on a fault to a user memory map, and it uses the filesystem's get_block() to look up a sector so you can ask a block device to convert it to an address on a DIMM. Come on, that's awkward. Everything around dax_fault() is dripping with memory-semantic interfaces, the dax_fault() calls are fundamentally about memory, the pmem calls are memory, the hardware is memory, and yet it directly calls bdev_direct_access(). It's out of place.

The legacy vfs/mm code didn't have this layering problem either. Even filemap_fault(), which dax_fault() is modeled after, doesn't call any bdev methods directly; when it needs something it asks the filesystem with a ->readpage(). The precedent is that you ask the filesystem for what you need. Look at the get_bdev() thing you've concluded you need. It _almost_ makes my point. I just happen to be of the opinion that you don't actually want or need the bdev, you want the pfn/kaddr so you can flush or map or memcpy().
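[To spell the awkwardness out, this is roughly the shape of the 4.5-rc fault path being criticized, collapsed into one sketch — not the real function; struct blk_dax_ctl and bdev_direct_access() are the then-current interfaces, and a PAGE_SIZE filesystem block is assumed:]

#include <linux/mm.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/buffer_head.h>

/* A memory fault answered with block-device calls. */
static int dax_fault_shape(struct inode *inode, pgoff_t pgoff,
			   get_block_t get_block)
{
	struct buffer_head bh = { .b_size = PAGE_SIZE };
	struct blk_dax_ctl dax = { .size = PAGE_SIZE };

	/* 1) file offset -> sector: ask the filesystem */
	if (get_block(inode, pgoff, &bh, 0))
		return VM_FAULT_SIGBUS;

	/* 2) sector -> pfn/kaddr: ask the *block device* */
	dax.sector = bh.b_blocknr << (inode->i_blkbits - 9);
	if (bdev_direct_access(bh.b_bdev, &dax) < 0)
		return VM_FAULT_SIGBUS;

	/* 3) dax.pfn is what finally gets handed to vm_insert_mixed() */
	return 0;
}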
Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
On Mon, Feb 1, 2016 at 10:46 PM, Dan Williams wrote:
> On Mon, Feb 1, 2016 at 10:06 PM, Jared Hulbert wrote:
>> On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner wrote:
>>> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>>>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>>>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>>> > > I guess I need to go off and understand if we can have DAX mappings on such a device. If we can, we may have a problem - we can get the block_device from get_block() in the I/O path and the various fault paths, but we don't have access to get_block() when flushing via dax_writeback_mapping_range(). We avoid needing it in the normal case by storing the sector results from get_block() in the radix tree.
>>>> >
>>>> > I think we're doing it wrong by storing the sector in the radix tree; we'd really need to store both the sector and the bdev, which is too much data.
>>>> >
>>>> > If we store the PFN of the underlying page instead, we don't have this problem. Instead, we have a different problem: of the device going away under us. I'm trying to find the code which tears down PTEs when the device goes away, and I'm not seeing it. What do we do about user mappings of the device?
>>>>
>>>> So I don't have a strong opinion whether storing PFN or sector is better. Maybe PFN is somewhat more generic, but OTOH turning DAX off for special cases like inodes on XFS RT devices would be IMHO fine.
>>>
>>> We need to support alternate devices.
>>
>> Embedded devices trying to use NOR flash to free up RAM were historically one of the more prevalent real-world uses of the old filemap_xip.c code, although those users never made it to mainline. So I spent some time last week trying to figure out how to make a subset of DAX not depend on CONFIG_BLOCK. It was a very frustrating and unfruitful experience. I discarded my main conclusion as impractical, but now that I see the difficulty DAX faces in dealing with "alternate devices", especially some of the crazy stuff btrfs can do, I wonder if it's not so crazy after all.
>>
>> Let's stop calling bdev_direct_access() directly from DAX. Let the filesystems do it.
>>
>> Sure, we could enable a generic_dax_direct_access() helper for the filesystems that only support single devices, to make it easy. But XFS and btrfs, for example, would have to do the work of figuring out what bdev is required and then calling bdev_direct_access().
>>
>> My reasoning is that the filesystem knows how to map inodes and offsets to devices and sectors, no matter how complex that is. It would even enable a filesystem to intelligently use a mix of direct_access and regular block devices down the road. Of course it would also make the block-less solution doable.
>>
>> Good idea? Stupid idea?
>
> The CONFIG_BLOCK=y case isn't going anywhere, so if anything it seems the CONFIG_BLOCK=n is an incremental feature in its own right. What driver and what filesystem are looking to enable this XIP support in?

Well... CONFIG_BLOCK was not required with filemap_xip.c for a decade. This CONFIG_BLOCK dependency is a result of an incremental feature, from a certain point of view ;)

The obvious 'driver' is physical RAM without a particular driver. Remember please I'm talking about embedded: RAM measured in MiB and funky one-off hardware, etc. In the embedded world there are lots of ways that persistent memory has been supported in device-specific ways without the new fancypants NFIT and Intel instructions, so frankly they don't fit in the PMEM stuff. Maybe they could be supported in PMEM, but not without effort to bring embedded players to the table.

The other drivers are the MTD drivers, probably as read-only for now. But the paradigm there isn't so different from what PMEM looks like with asymmetric read/write capabilities (see the sketch after this message).

The filesystem I'm concerned with is AXFS (https://www.kernel.org/doc/ols/2008/ols2008v1-pages-211-218.pdf), which I've been planning on trying to merge again due to a recent resurgence of interest. The device model for AXFS is... weird. It can use one or two devices at a time of any mix of NOR MTD, NAND MTD, block, and unmanaged physical memory. It's a terribly useful model for embedded. Anyway, AXFS is read-only, so hacking in a read-only dax_fault_nodev() and dax_file_read() would work fine; looks easy enough. But... it would be cool if similar small, embedded-focused RW filesystems were enabled.

I don't expect you to taint DAX with design requirements for this stuff that it wasn't built for; nobody ends up happy in that case. However, if enabling the filesystem to manage the bdev_direct_access() interactions solves some of the "alternate device" problems you are discussing here, then there is a chance we can accommodate both. Sometimes that works.

So... Forget CONFIG_BLOCK=n entirely, I didn't want that to be the focus anyway. Does it help to support the weirder XFS and btrfs device models to enable the filesystem to handle the bdev_direct_access() stuff?
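[On the MTD point above: NOR chips already expose a direct-pointer interface that a read-only dax_fault_nodev()-style path could sit on. A sketch around the existing mtd_point() call; the wrapper itself is hypothetical:]

#include <linux/mtd/mtd.h>

/* Hypothetical: how many bytes at 'from' are directly addressable,
 * and at what kernel/physical address. Returns bytes or -errno. */
static long mtd_dax_peek(struct mtd_info *mtd, loff_t from, size_t len,
			 void **kaddr, resource_size_t *phys)
{
	size_t retlen;
	int err = mtd_point(mtd, from, len, &retlen, kaddr, phys);

	if (err)
		return err;	/* chip can't map this range */
	return retlen;
}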
Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
On Mon, Feb 1, 2016 at 1:47 PM, Dave Chinner wrote:
> On Mon, Feb 01, 2016 at 03:51:47PM +0100, Jan Kara wrote:
>> On Sat 30-01-16 00:28:33, Matthew Wilcox wrote:
>> > On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>> > > I guess I need to go off and understand if we can have DAX mappings on such a device. If we can, we may have a problem - we can get the block_device from get_block() in the I/O path and the various fault paths, but we don't have access to get_block() when flushing via dax_writeback_mapping_range(). We avoid needing it in the normal case by storing the sector results from get_block() in the radix tree.
>> >
>> > I think we're doing it wrong by storing the sector in the radix tree; we'd really need to store both the sector and the bdev, which is too much data.
>> >
>> > If we store the PFN of the underlying page instead, we don't have this problem. Instead, we have a different problem: of the device going away under us. I'm trying to find the code which tears down PTEs when the device goes away, and I'm not seeing it. What do we do about user mappings of the device?
>>
>> So I don't have a strong opinion whether storing PFN or sector is better. Maybe PFN is somewhat more generic, but OTOH turning DAX off for special cases like inodes on XFS RT devices would be IMHO fine.
>
> We need to support alternate devices.

Embedded devices trying to use NOR flash to free up RAM were historically one of the more prevalent real-world uses of the old filemap_xip.c code, although those users never made it to mainline. So I spent some time last week trying to figure out how to make a subset of DAX not depend on CONFIG_BLOCK. It was a very frustrating and unfruitful experience. I discarded my main conclusion as impractical, but now that I see the difficulty DAX faces in dealing with "alternate devices", especially some of the crazy stuff btrfs can do, I wonder if it's not so crazy after all.

Let's stop calling bdev_direct_access() directly from DAX. Let the filesystems do it.

Sure, we could enable a generic_dax_direct_access() helper for the filesystems that only support single devices, to make it easy. But XFS and btrfs, for example, would have to do the work of figuring out what bdev is required and then calling bdev_direct_access().

My reasoning is that the filesystem knows how to map inodes and offsets to devices and sectors, no matter how complex that is. It would even enable a filesystem to intelligently use a mix of direct_access and regular block devices down the road. Of course it would also make the block-less solution doable.

Good idea? Stupid idea?
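[As an interface sketch, "let the filesystems do it" might look something like the following. Every name here is invented for illustration; nothing like it exists upstream:]

#include <linux/fs.h>
#include <linux/types.h>

/* DAX asks the filesystem for directly addressable memory backing
 * inode+pos; the filesystem resolves it however it likes -- a single
 * bdev, an XFS RT device, MTD, or raw SRAM. */
struct dax_extent {
	void *kaddr;		/* kernel alias of the media */
	unsigned long pfn;	/* frame for vm_insert_mixed() */
	long len;		/* bytes valid at kaddr/pfn */
};

struct dax_host_operations {
	long (*direct_access)(struct inode *inode, loff_t pos,
			      struct dax_extent *ext, bool write);
};

[A generic_dax_direct_access() for single-bdev filesystems would then be a short wrapper around bdev_direct_access(), while XFS, btrfs, or a block-less embedded fs would supply their own ->direct_access().]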
Re: [PATCH 2/2] dax: fix bdev NULL pointer dereferences
On Fri, Jan 29, 2016 at 10:01 PM, Dan Williams wrote:
> On Fri, Jan 29, 2016 at 9:28 PM, Matthew Wilcox wrote:
>> On Fri, Jan 29, 2016 at 11:28:15AM -0700, Ross Zwisler wrote:
>>> I guess I need to go off and understand if we can have DAX mappings on such a device. If we can, we may have a problem - we can get the block_device from get_block() in the I/O path and the various fault paths, but we don't have access to get_block() when flushing via dax_writeback_mapping_range(). We avoid needing it in the normal case by storing the sector results from get_block() in the radix tree.
>>
>> I think we're doing it wrong by storing the sector in the radix tree; we'd really need to store both the sector and the bdev, which is too much data.
>>
>> If we store the PFN of the underlying page instead, we don't have this problem. Instead, we have a different problem: of the device going away under us. I'm trying to find the code which tears down PTEs when the device goes away, and I'm not seeing it. What do we do about user mappings of the device?
>
> I deferred the dax tear down code until next cycle as Al rightly pointed out some needed re-works:
>
> https://lists.01.org/pipermail/linux-nvdimm/2016-January/003995.html

If you store sectors in the radix tree and the device gets removed, you still have to unmap user mappings of PFNs. So why is device removal harder with the PFN vs the bdev+sector radix entry? Either way you need a list of PFNs and their corresponding PTEs, right?

And are we just talking graceful removal? Any plans for device failures?
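[Whichever entry the radix tree stores, the teardown presumably funnels into the same primitive. A sketch — the wrapper is hypothetical, unmap_mapping_range() is the real API:]

#include <linux/fs.h>
#include <linux/mm.h>

/* On device removal, shoot down every user pte covering the file,
 * COW mappings included, so the next access faults instead of
 * touching vanished media. */
static void dax_zap_file(struct address_space *mapping)
{
	/* holelen == 0 means "to the end of the file" */
	unmap_mapping_range(mapping, 0, 0, 1);
}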
Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
On Mon, Jan 25, 2016 at 1:18 PM, Jared Hulbert wrote:
> On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox wrote:
>> On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote:
>>> In our defense, we didn't know we were sinning at the time.
>>
>> Fair enough. Cache flushing is Hard.
>>
>>> Can you walk me through the cache flushing hole? How is it okay on X86 but not VIVT archs? I'm missing something obvious here.
>>>
>>> I thought earlier that vm_insert_mixed() handled the necessary flushing. Is that even the part you are worried about?
>>
>> No, that part should be fine. My concern is about write() calls to files which are also mmaped. See Documentation/cachetlb.txt around line 229, starting with "There exists another whole class of cpu cache issues" ...
>
> oh wow. So aren't all the copy_to/from_user() variants specifically supposed to handle such cases?
>
>>> What flushing functions would you call if you did have a cache page?
>>
>> Well, that's the problem; they don't currently exist.
>>
>>> There are all kinds of cache flushing functions that work without a struct page. If nothing else, the specialized ASM instructions that do the various flushes don't use struct page as a parameter. This isn't the first time I've run into the lack of a sane cache API. Grep for inval_cache in the mtd drivers; it should have been much easier. Isn't the proper solution to fix update_mmu_cache() or build out a pageless cache flushing API?
>>>
>>> I don't get the explicit mapping solution. What are you mapping where? What addresses would be SHMLBA? Phys, kernel, userspace?
>>
>> The problem comes in dax_io() where the kernel stores to an alias of the user address (or reads from an alias of the user address). Theoretically, we should flush user addresses before we read from the kernel's alias, and flush the kernel's alias after we store to it.
>
> Reasoning this out loud here. Please correct.
>
> For the dax read case:
> - kernel virt is mapped to pfn
> - data is memcpy'd from kernel virt
>
> For the dax write case:
> - kernel virt is mapped to pfn
> - data is memcpy'd to kernel virt
> - user virt mapped to pfn attempts to read
>
> Is that right? I see that x86 does a nocache copy_to/from operation. I'm not familiar with the semantics of that call, and it would take me a while to understand the assembly, but I assume it's doing some magic opcodes that force the writes down to physical memory with each load/store. Does the caching model of the x86 arch update the cache entries tied to the physical memory on update?
>
> For architectures that don't do auto coherency magic...
>
> For reads:
> - User dcaches need flushing before the kernel virtual mapping, to ensure the kernel reads the latest data. If the user has unflushed data in the dcache, it would not be reflected in the read copy. This failure mode is only a problem if the filesystem is RW.
>
> For writes:
> - Unlike the read case, we don't need up-to-date data for the user's mapping of a pfn. However, the user will need the caches invalidated to get fresh data, so we should make sure to write back any affected lines in the user caches so they don't get lost if we do an invalidate. I suppose uncommitted data might corrupt the new data written from the kernel mapping if the cachelines get flushed later.
> - After the data is memcpy'd to the kernel virt map, the cache, and possibly the write buffers, should be flushed. Without this flush the data might not ever get to the user mapped versions.
> - Assuming the user maps were all flushed at the outset, they should be reloaded with fresh data on access.
>
> Do I get it more or less?

I assume the silence means I don't get it. Moving along...

The need to flush kernel aliases and user aliases without a struct page was articulated and cited as the reason why DAX doesn't work with ARM, MIPS, and SPARC.

One of the following routines should work for kernel flushing, right?
-- flush_cache_vmap(unsigned long start, unsigned long end)
-- flush_kernel_vmap_range(void *vaddr, int size)
-- invalidate_kernel_vmap_range(void *vaddr, int size)

For user aliases I'm less confident here, but at first glance I don't see why these wouldn't work?
-- flush_cache_page(struct vm_area_struct *vma, unsigned long addr, unsigned long pfn)
-- flush_cache_range(struct vm_area_struct *vma, unsigned long start, unsigned long end)

Help?! Am I missing something here?

>> But if we create a new address for the kernel to use which lands on the same cache line as the user's address (and this is what SHMLBA is used to indicate), there is no incoherency between the kernel's view and the user's view. And no new cache flushing API is needed.
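[As code, the regimen reasoned out above would bracket the dax_io() copy roughly like this — a sketch for an aliasing-cache arch, assuming we can find the vma/uaddr for the pfn; the flush calls themselves are the real APIs listed above:]

#include <linux/mm.h>
#include <linux/string.h>
#include <linux/highmem.h>

/* Hypothetical: a write() into a file that is also mmaped, where
 * kaddr and (vma, uaddr) are kernel/user aliases of the same pfn. */
static void dax_write_aliased(struct vm_area_struct *vma,
			      unsigned long uaddr, unsigned long pfn,
			      void *kaddr, const void *src, size_t len)
{
	/* write back + invalidate the user alias first, so stale dirty
	 * lines can't later overwrite what we are about to store */
	flush_cache_page(vma, uaddr, pfn);

	memcpy(kaddr, src, len);

	/* push the kernel alias out so the user mapping reads fresh
	 * data on its next miss */
	flush_kernel_vmap_range(kaddr, len);
}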
Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
On Mon, Jan 25, 2016 at 8:52 AM, Matthew Wilcox wrote: > On Sun, Jan 24, 2016 at 01:03:49AM -0800, Jared Hulbert wrote: >> I our defense we didn't know we were sinning at the time. > > Fair enough. Cache flushing is Hard. > >> Can you walk me through the cache flushing hole? How is it okay on >> X86 but not VIVT archs? I'm missing something obvious here. >> >> I thought earlier that vm_insert_mixed() handled the necessary >> flushing. Is that even the part you are worried about? > > No, that part should be fine. My concern is about write() calls to files > which are also mmaped. See Documentation/cachetlb.txt around line 229, > starting with "There exists another whole class of cpu cache issues" ... oh wow. So aren't all the copy_to/from_user() variants specifically supposed to handle such cases? >> What flushing functions would you call if you did have a cache page. > > Well, that's the problem; they don't currently exist. > >> There are all kinds of cache flushing functions that work without a >> struct page. If nothing else the specialized ASM instructions that do >> the various flushes don't use struct page as a parameter. This isn't >> the first I've run into the lack of a sane cache API. Grep for >> inval_cache in the mtd drivers, should have been much easier. Isn't >> the proper solution to fix update_mmu_cache() or build out a pageless >> cache flushing API? >> >> I don't get the explicit mapping solution. What are you mapping >> where? What addresses would be SHMLBA? Phys, kernel, userspace? > > The problem comes in dax_io() where the kernel stores to an alias of the > user address (or reads from an alias of the user address). Theoretically, > we should flush user addresses before we read from the kernel's alias, > and flush the kernel's alias after we store to it. Reasoning this out loud here. Please correct. For the dax read case: - kernel virt is mapped to pfn - data is memcpy'd from kernel virt For the dax write case: - kernel virt is mapped to pfn - data is memcpy'd to kernel virt - user virt map to pfn attempts to read Is that right? I see the x86 does a nocache copy_to/from operation, I'm not familiar with the semantics of that call and it would take me a while to understand the assembly but I assume it's doing some magic opcodes that forces the writes down to physical memory with each load/store. Does the the caching model of the x86 arch update the cache entries tied to the physical memory on update? For architectures that don't do auto coherency magic... For reads: - User dcaches need flushing before kernel virtual mapping to ensure kernel reads latest data. If the user has unflushed data in the dcache it would not be reflected in the read copy. This failure mode only is a problem if the filesystem is RW. For writes: - Unlike the read case we don't need up to date data for the user's mapping of a pfn. However, the user will need to caches invalidated to get fresh data, so we should make sure to writeback any affected lines in the user caches so they don't get lost if we do an invalidate. I suppose uncommitted data might corrupt the new data written from the kernel mapping if the cachelines get flushed later. - After the data is memcpy'ed to the kernel virt map the cache, and possibly the write buffers, should be flushed. Without this flush the data might not ever get to the user mapped versions. - Assuming the user maps were all flushed at the outset they should be reloaded with fresh data on access. Do I get it more or less? 
> But if we create a new address for the kernel to use which lands on the > same cache line as the user's address (and this is what SHMLBA is used > to indicate), there is no incoherency between the kernel's view and the > user's view. And no new cache flushing API is needed. So... how exactly would one force the kernel address to be at the SHMLBA boundary? > Is that clearer? I'm not always good at explaining these things in a > way which makes sense to other people :-( Yeah. I think I'm at 80% comprehension here. Or at least I think I am. Thanks.
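To pin down the ordering walked through above, here is a sketch for a VIVT architecture in the shape of a simplified dax_io()-style copy. flush_user_alias() and flush_kernel_alias() are invented names standing in for the missing pageless flush API being discussed in this thread; copy_to_user()/copy_from_user() are the only real calls here.

/* hypothetical pageless flush API; no such calls exist today */
void flush_user_alias(unsigned long pfn, size_t len);
void flush_kernel_alias(void *kaddr, size_t len);

static long dax_copy_sketch(void *kaddr, void __user *ubuf,
			    unsigned long pfn, size_t len, bool is_write)
{
	/* write back (and, for writes, invalidate) dirty user-alias
	 * lines for this pfn before the kernel alias touches it */
	flush_user_alias(pfn, len);			/* hypothetical */

	if (is_write) {
		if (copy_from_user(kaddr, ubuf, len))
			return -EFAULT;
		/* drain the kernel alias and write buffers so the user
		 * mapping reloads fresh data on its next access */
		flush_kernel_alias(kaddr, len);		/* hypothetical */
	} else {
		if (copy_to_user(ubuf, kaddr, len))
			return -EFAULT;
	}
	return len;
}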
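And the colour-matched alternative described in the reply: SHMLBA is the architecture's cache colouring granule, so what matters is congruence modulo SHMLBA rather than alignment to a boundary. A minimal sketch, assuming colour_window is an SHMLBA-aligned chunk of kernel virtual space reserved for such aliases (that reservation is the invented part):

#include <asm/shmparam.h>	/* SHMLBA, the cache colouring granule */

/*
 * Not kernel API: pick a kernel alias congruent to the user address
 * modulo SHMLBA, so both virtual addresses index the same cache lines
 * on a VIVT cache and no extra flushing is needed.
 */
static void *colour_matched_alias(void *colour_window, unsigned long uaddr)
{
	return colour_window + (uaddr & (SHMLBA - 1));
}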
Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
In our defense we didn't know we were sinning at the time. Can you walk me through the cache flushing hole? How is it okay on X86 but not VIVT archs? I'm missing something obvious here. I thought earlier that vm_insert_mixed() handled the necessary flushing. Is that even the part you are worried about? vm_insert_mixed()->insert_pfn()->update_mmu_cache() _should_ handle the flush. Except of course now that I look at the ARM code it looks like it isn't doing anything if !pfn_valid(). I need to spend some more time looking at this again. What flushing functions would you call if you did have a cache page? There are all kinds of cache flushing functions that work without a struct page. If nothing else the specialized ASM instructions that do the various flushes don't use struct page as a parameter. This isn't the first time I've run into the lack of a sane cache API. Grep for inval_cache in the mtd drivers; it should have been much easier. Isn't the proper solution to fix update_mmu_cache() or build out a pageless cache flushing API? I don't get the explicit mapping solution. What are you mapping where? What addresses would be SHMLBA? Phys, kernel, userspace?
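For reference, the path in question looks roughly like the fault handler below. lookup_pfn() is a made-up stand-in for the filesystem's block-to-pfn lookup, and the vm_insert_mixed() prototype has changed across releases (unsigned long pfn vs. pfn_t), so treat this as a ~4.4-era sketch, not a definitive implementation.

static int xip_fault_sketch(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	unsigned long pfn;
	int rc;

	/* hypothetical: ask the filesystem which pfn backs this offset */
	rc = lookup_pfn(vma->vm_file, vmf->pgoff, &pfn);
	if (rc)
		return VM_FAULT_SIGBUS;

	/* no struct page: insert the raw pfn; update_mmu_cache() is
	 * reached underneath, which is where VIVT flushing would live */
	rc = vm_insert_mixed(vma, (unsigned long)vmf->virtual_address, pfn);
	if (rc == -ENOMEM)
		return VM_FAULT_OOM;
	if (rc < 0 && rc != -EBUSY)
		return VM_FAULT_SIGBUS;
	return VM_FAULT_NOPAGE;
}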
Re: [PATCH v12 10/20] dax: Replace XIP documentation with DAX documentation
Hi! I've been out of the community for a while, but I'm trying to step back in here and catch up with some of my old areas of specialty. A couple of questions, sorry to drag up such old conversations. The DAX documentation that made it into kernel 4.0 has the following line: "The DAX code does not work correctly on architectures which have virtually mapped caches such as ARM, MIPS and SPARC." 1) It really doesn't support ARM? I never had problems with the old filemap_xip.c stuff on ARM, what changed? 2) Is there a thread discussing this? On Fri, Oct 24, 2014 at 2:20 PM, Matthew Wilcox wrote:
> From: Matthew Wilcox
>
> Based on the original XIP documentation, this documents the current
> state of affairs, and includes instructions on how users can enable DAX
> if their devices and kernel support it.
>
> Signed-off-by: Matthew Wilcox
> Reviewed-by: Randy Dunlap
> ---
>  Documentation/filesystems/00-INDEX |  5 ++-
>  Documentation/filesystems/dax.txt  | 89 ++
>  Documentation/filesystems/xip.txt  | 71 --
>  3 files changed, 92 insertions(+), 73 deletions(-)
>  create mode 100644 Documentation/filesystems/dax.txt
>  delete mode 100644 Documentation/filesystems/xip.txt
>
> diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX
> index ac28149..9922939 100644
> --- a/Documentation/filesystems/00-INDEX
> +++ b/Documentation/filesystems/00-INDEX
> @@ -34,6 +34,9 @@ configfs/
>    - directory containing configfs documentation and example code.
>  cramfs.txt
>    - info on the cram filesystem for small storage (ROMs etc).
> +dax.txt
> +  - info on avoiding the page cache for files stored on CPU-addressable
> +    storage devices.
>  debugfs.txt
>    - info on the debugfs filesystem.
>  devpts.txt
> @@ -154,5 +157,3 @@ xfs-self-describing-metadata.txt
>    - info on XFS Self Describing Metadata.
>  xfs.txt
>    - info and mount options for the XFS filesystem.
> -xip.txt
> -  - info on execute-in-place for file mappings.
> diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
> new file mode 100644
> index 0000000..635adaa
> --- /dev/null
> +++ b/Documentation/filesystems/dax.txt
> @@ -0,0 +1,89 @@
> +Direct Access for files
> +-----------------------
> +
> +Motivation
> +----------
> +
> +The page cache is usually used to buffer reads and writes to files.
> +It is also used to provide the pages which are mapped into userspace
> +by a call to mmap.
> +
> +For block devices that are memory-like, the page cache pages would be
> +unnecessary copies of the original storage.  The DAX code removes the
> +extra copy by performing reads and writes directly to the storage device.
> +For file mappings, the storage device is mapped directly into userspace.
> +
> +
> +Usage
> +-----
> +
> +If you have a block device which supports DAX, you can make a filesystem
> +on it as usual.  When mounting it, use the -o dax option manually
> +or add 'dax' to the options in /etc/fstab.
> +
> +
> +Implementation Tips for Block Driver Writers
> +--------------------------------------------
> +
> +To support DAX in your block driver, implement the 'direct_access'
> +block device operation.  It is used to translate the sector number
> +(expressed in units of 512-byte sectors) to a page frame number (pfn)
> +that identifies the physical page for the memory.  It also returns a
> +kernel virtual address that can be used to access the memory.
> +
> +The direct_access method takes a 'size' parameter that indicates the
> +number of bytes being requested.  The function should return the number
> +of bytes that can be contiguously accessed at that offset.  It may also
> +return a negative errno if an error occurs.
> +
> +In order to support this method, the storage must be byte-accessible by
> +the CPU at all times.  If your device uses paging techniques to expose
> +a large amount of memory through a smaller window, then you cannot
> +implement direct_access.  Equally, if your device can occasionally
> +stall the CPU for an extended period, you should also not attempt to
> +implement direct_access.
> +
> +These block devices may be used for inspiration:
> +- axonram: Axon DDR2 device driver
> +- brd: RAM backed block device driver
> +- dcssblk: s390 dcss block device driver
> +
> +
> +Implementation Tips for Filesystem Writers
> +------------------------------------------
> +
> +Filesystem support consists of
> +- adding support to mark inodes as being DAX by setting the S_DAX flag in
> +  i_flags
> +- implementing the direct_IO address space operation, and calling
> +  dax_do_io() instead of blockdev_direct_IO() if S_DAX is set
> +- implementing an mmap file operation for DAX files which sets the
> +  VM_MIXEDMAP flag on the VMA, and setting the vm_ops to include handlers
> +  for fault and page_mkwrite (which should probably call dax_fault() and
> +  dax_mkwrite(), passing the appropriate get_block() callback)
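To make the driver-side contract in that patch concrete, here is a hedged sketch of a direct_access implementation for a brd-like device whose whole capacity is one physically contiguous, permanently mapped region. struct sketch_dev and its fields are invented for illustration, and the direct_access prototype shifted between releases, so check your tree before leaning on the exact signature.

struct sketch_dev {			/* hypothetical driver state */
	void *base_vaddr;		/* kernel mapping of the whole device */
	unsigned long base_pfn;		/* pfn of the first device page */
	loff_t size;			/* capacity in bytes */
};

static long sketch_direct_access(struct block_device *bdev, sector_t sector,
				 void **kaddr, unsigned long *pfn, long size)
{
	struct sketch_dev *d = bdev->bd_disk->private_data;
	loff_t off = (loff_t)sector << 9;	/* 512-byte sectors */

	if (off >= d->size)
		return -ERANGE;

	/* callers pass page-aligned sectors, so kaddr and pfn agree */
	*kaddr = d->base_vaddr + off;
	*pfn = d->base_pfn + (off >> PAGE_SHIFT);

	/* bytes contiguously accessible from this offset */
	return min_t(long, size, d->size - off);
}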
Re: [patch] ext2: xip check fix
> > I think so. The filemap_xip.c functionality doesn't work for Flash > > memory yet. Flash memory doesn't have struct pages to back it up with > > which this stuff depends on. > > Struct page is not the major issue. The primary problem is writing to > the media (and I am not a flash expert at all, just relaying here): > For some period of time, the flash memory is not usable and thus we > need to make sure we can nuke the page table entries that we have in > userland page tables. For that, we need a callback from the device so > that it can ask to get its references back. Oh, and a put_xip_page > counterpart to get_xip_page, so that the driver knows when it's safe > to erase. Well... That's the biggest/hardest problem, yes. But not the first. First we've got to tackle the easy read-only case, which doesn't require any of that unpleasantness, yet which is used in a bunch of out-of-tree hacks.
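To make the ask concrete, here is a purely hypothetical sketch of the callback pairing being described; put_xip_page never existed, and only get_xip_page's prototype resembles the real one from that era. Everything else is invented for illustration.

struct xip_media_ops {			/* hypothetical, for illustration */
	/* roughly the real get_xip_page prototype of the era */
	struct page *(*get_xip_page)(struct address_space *mapping,
				     sector_t sector, int create);
	/* hypothetical: drop a reference so the driver can count users */
	void (*put_xip_page)(struct page *page);
	/* hypothetical: driver-initiated recall; the fs/mm must unmap
	 * every pte it handed out before the erase may begin */
	void (*recall_mappings)(struct address_space *mapping);
};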
Re: [patch] ext2: xip check fix
> Um, trying to clarify: S390. Also known as zSeries, big iron machine, uses > its own weird processor design rather than x86, x86-64, arm, or mips > processors. Right. filemap_xip.c allows for an XIP filesystem. The only filesystem that is supported is ext2. Even that requires a block device driver thingy, which I don't understand, that's specific to the s390. > How does "struct page" enter into this? Don't sweat it, it has to do with the way filemap_xip.c works. > What I want to know is, are you saying execute in place doesn't work on things > like arm and mips? (If so, I was unaware of this. I heard about somebody > getting it to work on a Nintendo DS: > http://forums.maxconsole.net/showthread.php?t=18668 ) XIP works fine on things like arm and mips. However there is mixed support in the mainline kernel for it. For example, you've been able to build an XiP kernel image for arm since like 2.6.10 or 12. Also MTD has an XiP-aware mode that protects XiP objects in flash from getting screwed up during programs and erases. But there is no mainlined solution for XiP of applications from the filesystem. However there have been patches for cramfs to do this for years. They are kind of messy and keep getting rejected. I do have a solution in the works for this part of it - http://axfs.sf.net.
Re: [patch] ext2: xip check fix
> > I haven't looked at it yet. I do appreciate it, I think it might > > broaden the user-base of this feature which is up to now s390 only due > > to the fact that the flash memory extensions have not been implemented > > (yet?). And it enables testing xip on other platforms. The patch is on > > my must-read list. > > query: which feature is currently s390 only? (Execute In Place?) I think so. The filemap_xip.c functionality doesn't work for Flash memory yet. Flash memory doesn't have struct pages to back it up with, which this stuff depends on.
Re: solid state drive access and context switching
> Probably about 1000 clocks but it's always going to depend upon the > workload and whether any other work can be done usefully. Yeah. Sounds right, in the microsecond range. It would be interesting to see data. Anybody have ideas on what kind of experiments could confirm this estimate is right?
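One cheap experiment: bounce a byte between two processes over a pair of pipes and divide wall time by the number of round trips. Each round trip costs two context switches plus pipe overhead, so it brackets the ~1000-clock estimate rather than isolating it. A userspace C sketch:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/time.h>

#define ITERS 100000

int main(void)
{
	int ping[2], pong[2], i;
	char b = 0;
	struct timeval t0, t1;

	if (pipe(ping) || pipe(pong))
		exit(1);
	if (fork() == 0) {			/* child: echo forever */
		while (read(ping[0], &b, 1) == 1)
			if (write(pong[1], &b, 1) != 1)
				break;
		_exit(0);
	}
	gettimeofday(&t0, NULL);
	for (i = 0; i < ITERS; i++) {
		if (write(ping[1], &b, 1) != 1 || read(pong[0], &b, 1) != 1)
			exit(1);
	}
	gettimeofday(&t1, NULL);
	printf("%.2f us per round trip (two switches each)\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / ITERS);
	return 0;
}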
Re: solid state drive access and context switching
On Dec 4, 2007 3:24 PM, Alan Cox <[EMAIL PROTECTED]> wrote: > > Right. The trend is to hide the nastiness of NAND technology changes > > behind controllers. In general I think this is a good thing. > > You miss the point - any controller you hide it behind almost inevitably > adds enough latency you don't want to use it synchronously. I think I get it. We keep saying that the latency is too high. I agree that most technologies out there have latencies that are too high. Again I ask the question: what latencies do we have to hit before the sync options become worth it?
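As rough arithmetic, the break-even looks something like the kernel-style sketch below: polling only wins when the device answers in less than the cost of switching away and back. The 1000-clock figure is the assumption from this thread, not a measurement.

static inline bool worth_polling(unsigned long device_latency_ns)
{
	const unsigned long ctx_switch_ns = 1000;	/* ~1000 clocks at 1GHz, assumed */

	/* sleeping costs a switch out and a switch back in, and that
	 * still ignores the cache/TLB warmth these numbers don't capture */
	return device_latency_ns < 2 * ctx_switch_ns;
}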
Re: solid state drive access and context switching
> > Maybe I'm missing something but I don't see it. We want a block > > interface for these devices, we just need a faster slimmer interface. > > Maybe a new mtdblock interface that doesn't do erase would be the > > place for it? > > Doesn't do erase? MTD has to learn almost all tricks from the block > layer, as devices are becoming high-latency high-bandwidth, compared to > what MTD was designed for. In order to get any decent performance, we > need asynchronous operations, request queues and caching. > > The only useful advantage MTD does have over block devices is an > _explicit_ erase operation. Did you mean "doesn't do _implicit_ erase"? You're right. That's the point I was trying to make, albeit badly: MTD isn't the place for this. The fact that more and more of what the MTD is being used for looks a lot like the block layer is a whole different discussion.
Re: solid state drive access and context switching
> > microseconds level and an order of magnitude higher bandwidth than > > SATA. Is that fast enough to warrant this more synchronous IO? > > See the mtd layer. Right. The trend is to hide the nastiness of NAND technology changes behind controllers. In general I think this is a good thing. Basically the ECC and reliability requirements change very rapidly in this technology. Having custom controller hardware to handle this is faster than handling it in software and makes for a nice modular interface. We don't rewrite our SATA drivers and filesystems every time the magnetic media switches to a new recording scheme, we just plug it in. SSD's are going to be like that even if they aren't SATA. However, the MTD layer is more about managing the chips themselves, which is what the controllers are for. Maybe I'm missing something but I don't see it. We want a block interface for these devices, we just need a faster slimmer interface. Maybe a new mtdblock interface that doesn't do erase would be the place for it? > > BTW - This trend toward faster, lower latency busses is marching > > forward. 2 examples; the ioDrive from Fusion IO, Micron's RAM-module > > like SSD concept. > > Very much so but we can do quite a bit in 10,000 processor cycles ... > > Alan
Re: solid state drive access and context switching
> > refinements could theoretically get us down one more (~100 > > microsecond). > > They've already done better than that. Here's a solid state > drive with a claimed 20 microsecond access time: > > http://www.curtisssd.com/products/drives/hyperxclr Right. That looks to be RAM based, which means expensive compared to NAND, so that's not going to break out of a server niche. I imagine the latency is the device latency, not the system latency. By the time you send the request through the fibrechannel stack and get the block back it's going to be much closer to 100 microseconds. It's that OS-visible latency that you've got to design to.
Re: solid state drive access and context switching
> > Has anyone played with this concept? > > For things like SATA based devices they aren't that fast yet. What is fast enough? As I understand the basic memory technology, the hard limit is in the 100s of microseconds range for latency. SATA adds something to that. I'd be surprised to see latencies on SATA SSD's, as measured at the OS level, get below 1 millisecond. What happens when we start placing NAND technology on lower latency, higher bandwidth buses? I'm guessing we'll get down to that 100s of microseconds level and an order of magnitude higher bandwidth than SATA. Is that fast enough to warrant this more synchronous IO? Magnetic drives have latencies ~10 milliseconds, current SSD's are an order of magnitude better (~1 millisecond), and new interfaces and refinements could theoretically get us down one more (~100 microseconds). I'm guessing the current block driver subsystem would negate a lot of that latency gain. Am I wrong? BTW - this trend toward faster, lower latency busses is marching forward. Two examples: the ioDrive from Fusion IO and Micron's RAM-module-like SSD concept.
Re: [Announce] Linux-tiny project revival
> > I think that this idea is not worth it. Don't use the config option then > My problem is that switching off printk is the single biggest bloat cutter in > the kernel, yet it makes the resulting system very hard to support. It > combines a big upside with a big downside, and I'd like something in between. It's not such a big downside IMHO. You can support a kernel without printk. Need to debug the kernel without printk? Use a JTAG debugger... If you have a system that actually configures out printk's, chances are you don't have storage and output mechanisms to do much with the messages anyway. Think embedded _products_ here. Sure the development boards have serial, ethernet, and all that jazz but tens of millions of ARM based gadgets don't.
2.6.22-rc6-mm1: BUG_ON() mm/memory.c, vm_insert_pfn(), filemap_xip.c, and spufs
Recently there has been some discussion of the possibility of reworking some of filemap_xip.c to be pfn oriented. This would allow an XIP fork of cramfs to use the filemap_xip framework. Today this is not possible. I've been trying out vm_insert_pfn() to start down that road. I used spufs as a reference for how to use it. The included patch to cramfs is my hack at it. When I try to execute an XIP binary I get a BUG() on 2.6.22-rc6-mm1 at mm/memory.c line 2334. The way I read this, it says that spufs might not work. I can't test it. In spufs_mem_mmap() line 196 the vma is flagged as VM_PFNMAP: vma->vm_flags |= VM_IO | VM_PFNMAP; When you get a fault in such a vma, __do_fault() will get this vma and BUG() on line 2334: BUG_ON(vma->vm_flags & VM_PFNMAP); What happened to the functionality of do_no_pfn()?
diff -r 74bad9e01817 fs/Kconfig
--- a/fs/Kconfig	Thu Jun 28 13:49:43 2007 -0700
+++ b/fs/Kconfig	Mon Jul 02 15:47:16 2007 -0700
@@ -65,8 +65,7 @@ config FS_XIP
 config FS_XIP
 # execute in place
 	bool
-	depends on EXT2_FS_XIP
-	default y
+	default n
 
 config EXT3_FS
 	tristate "Ext3 journalling file system support"
@@ -1399,8 +1398,8 @@ endchoice
 config CRAMFS
 	tristate "Compressed ROM file system support (cramfs)"
-	depends on BLOCK
 	select ZLIB_INFLATE
+	select FS_XIP
 	help
 	  Saying Y here includes support for CramFs (Compressed ROM File
 	  System). CramFs is designed to be a simple, small, and compressed
diff -r 74bad9e01817 fs/cramfs/inode.c
--- a/fs/cramfs/inode.c	Thu Jun 28 13:49:43 2007 -0700
+++ b/fs/cramfs/inode.c	Tue Jul 03 17:45:42 2007 -0700
@@ -24,15 +24,21 @@
 #include <linux/vfs.h>
 #include <linux/mutex.h>
 #include <asm/semaphore.h>
-
+#include <linux/vmalloc.h>
 #include <asm/uaccess.h>
 
+static const struct file_operations cramfs_xip_fops;
 static const struct super_operations cramfs_ops;
 static const struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
 static const struct address_space_operations cramfs_aops;
+static const struct address_space_operations cramfs_xip_aops;
 
 static DEFINE_MUTEX(read_mutex);
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+	.ra_pages	= 0,	/* No readahead */
+};
 
 /* These two macros may change in future, to provide better st_ino
@@ -77,19 +83,31 @@ static int cramfs_iget5_set(struct inode
 		/* Struct copy intentional */
 		inode->i_mtime = inode->i_atime = inode->i_ctime = zerotime;
 		inode->i_ino = CRAMINO(cramfs_inode);
+
+		if (CRAMFS_INODE_IS_XIP(inode))
+			inode->i_mapping->backing_dev_info = &cramfs_backing_dev_info;
+
 		/* inode->i_nlink is left 1 - arguably wrong for directories,
 		   but it's the best we can do without reading the directory
 		   contents.  1 yields the right result in GNU find, even
 		   without -noleaf option. */
 		if (S_ISREG(inode->i_mode)) {
-			inode->i_fop = &generic_ro_fops;
-			inode->i_data.a_ops = &cramfs_aops;
+			if (CRAMFS_INODE_IS_XIP(inode)) {
+				inode->i_fop = &cramfs_xip_fops;
+				inode->i_data.a_ops = &cramfs_xip_aops;
+			} else {
+				inode->i_fop = &generic_ro_fops;
+				inode->i_data.a_ops = &cramfs_aops;
+			}
 		} else if (S_ISDIR(inode->i_mode)) {
 			inode->i_op = &cramfs_dir_inode_operations;
 			inode->i_fop = &cramfs_directory_operations;
 		} else if (S_ISLNK(inode->i_mode)) {
 			inode->i_op = &page_symlink_inode_operations;
-			inode->i_data.a_ops = &cramfs_aops;
+			if (CRAMFS_INODE_IS_XIP(inode))
+				inode->i_data.a_ops = &cramfs_xip_aops;
+			else
+				inode->i_data.a_ops = &cramfs_aops;
 		} else {
 			inode->i_size = 0;
 			inode->i_blocks = 0;
@@ -111,34 +129,6 @@ static struct inode *get_cramfs_inode(st
 	return inode;
 }
 
-/*
- * We have our own block cache: don't fill up the buffer cache
- * with the rom-image, because the way the filesystem is set
- * up the accesses should be fairly regular and cached in the
- * page cache and dentry tree anyway..
- *
- * This also acts as a way to guarantee contiguous areas of up to
- * BLKS_PER_BUF*PAGE_CACHE_SIZE, so that the caller doesn't need to
- * worry about end-of-buffer issues even when decompressing a full
- * page cache.
- */
-#define READ_BUFFERS (2)
-/* NEXT_BUFFER(): Loop over [0..(READ_BUFFERS-1)]. */
-#define NEXT_BUFFER(_ix) ((_ix) ^ 1)
-
-/*
- * BLKS_PER_BUF_SHIFT should be at least 2 to allow for "compressed"
- * data that takes up more space than the original and with unlucky
- * alignment.
- */
-#define BLKS_PER_BUF_SHIFT	(2)
-#define BLKS_PER_BUF		(1 << BLKS_PER_BUF_SHIFT)
-#define BUFFER_SIZE		(BLKS_PER_BUF*PAGE_CACHE_SIZE)
-
-static unsigned char read_buffers[READ_BUFFERS][BUFFER_SIZE];
-static unsigned
Re: vm/fs meetup in september?
On 7/2/07, Jörn Engel <[EMAIL PROTECTED]> wrote: On Mon, 2 July 2007 10:44:00 -0700, Jared Hulbert wrote: > > >So what you mean is "swap on flash"? Definitively sounds like an > >interesting topic, although I'm not too sure it's all that > >filesystem-related. > > Maybe not. Yet, it would be a very useful place to store data from a > file as a non-volatile page cache. > > Also it is something that I believe would benefit from a VFS-like API. > I mean there is a consistent interface a management layer like this > could use, yet the algorithms used to order the data and the interface > to the physical media may vary. There is no single right way to do > the management layer, much like filesystems. > > Given the page orientation of the current VFS, it seems to me like there > might be a nice way to use it for this purpose. > > Or maybe the real experts on this stuff can tell me how wrong that is > and where it should go :) I don't believe anyone has implemented this before, so any experts would be self-appointed. Maybe this should be turned into a filesystem subject after all. The complexity comes from combining XIP with writes on the same chip. So solving your problem should be identical to solving the rw XIP filesystem problem. If there is interest in the latter, I'd offer my self-appointed expertise. Right, the solution to the swap problem is identical to the rw XIP filesystem problem. Jörn, that's why you're the self-appointed subject matter expert!
Re: vm/fs meetup in september?
So what you mean is "swap on flash"? Definitively sounds like an interesting topic, although I'm not too sure it's all that filesystem-related. Maybe not. Yet, it would be a very useful place to store data from a file as a non-volatile page cache. Also it is something that I believe would benefit from a VFS-like API. I mean there is a consistent interface a management layer like this could use, yet the algorithms used to order the data and the interface to the physical media may vary. There is no single right way to do the management layer, much like filesystems. Given the page orientation of the current VFS, it seems to me like there might be a nice way to use it for this purpose. Or maybe the real experts on this stuff can tell me how wrong that is and where it should go :)
Re: vm/fs meetup in september?
> Christoph> So what you mean is "swap on flash"? Definitively sounds > Christoph> like an interesting topic, although I'm not too sure it's > Christoph> all that filesystem-related. I wouldn't want to call it swap, as this carries with it block-io connotations. It's really mmap on flash. Yes, it is really mmap on flash. But you are "swapping" pages from RAM to be mmap'ed on flash. Also the flash-io complexities are similar to the block-io layer. I think "swap on flash" is fair. Though that might be confused with making swap work on a NAND flash, which is very much like the current block-io approach. "Mmappable swap on flash" is more exact, I suppose. > You need either a block translation layer, Are you suggesting going through the block layer to reach the flash? Well the obvious route would be to have this management layer use the MTD; I can't see anything wrong with that. > or a (swap) filesystem that > understands flash peculiarities in order to make such a thing work. > The standard Linux swap format will not work. Correct. BTW, you may want to have a look at my "[RFC] VM: I have a dream..." thread. Interesting. This idea does allow for swap to be accessed directly.
Re: vm/fs meetup in september?
On 6/25/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Mon, Jun 25, 2007 at 05:08:02PM -0700, Jared Hulbert wrote: > -memory mappable swap file (I'm not sure if this one is appropriate > for the proposed meeting) Please explain what this is supposed to mean. If you have a system with a large array of non-volatile, semi-writeable memory, such as high-speed NOR Flash or some of the similar emerging technologies, it would be useful to use that memory as an extension of RAM. One of the ways you could do that is to allow pages to be swapped out to this memory. Once there, these pages could be read directly, but would require a COW procedure on a write access. The reason why I think this may be a vm/fs topic is that the hardware makes writing to this memory efficiently a non-trivial operation that requires management just like a filesystem. Also it seems to me that there are probably overlaps between this topic and the recent filemap_xip.c discussions.
Re: vm/fs meetup in september?
A few things I'd like to talk about are:
- the address space operations APIs, and their page based nature. I think it would be nice to generally move toward offset,length based ones as much as possible because it should give more efficiency and flexibility in the filesystem.
- write_begin API if it is still an issue by that date. Hope not :)
- truncate races
- fsblock if it hasn't been shot down by then
- how to make complex API changes without having to fix most things yourself.
I'd like to add:
- revamping filemap_xip.c
- memory mappable swap file (I'm not sure if this one is appropriate for the proposed meeting)
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/15/07, Carsten Otte <[EMAIL PROTECTED]> wrote: Nick Piggin wrote: > Carsten Otte wrote: >> The current xip stack relies on having struct page behind the memory >> segment. This causes few impact on memory management, but occupies >> some more memory. The cramfs patch chose to modify copy on write in >> order to deal with vmas that don't have struct page behind. >> So far, Hugh and Linus have shown strong opposition against copy on >> write with no struct page behind. If this implementation is acceptable >> to them, it seems preferable to me over wasting memory. The xip >> stack should be modified to use this vma flag in that case. > > I would rather not :P > > We can copy on write without a struct page behind the source today, no? > What is insufficient for the XIP code with the current COW? I've looked at the -mm version of mm/memory.c today, with intent to try out VM_PFNMAP for our xip mappings and replace nopage() with fault(). The thing is, I believe it doesn't work for us: * The way we recognize those mappings is through the rules set up * by "remap_pfn_range()": the vma will have the VM_PFNMAP bit set, * and the vm_pgoff will point to the first PFN mapped: thus every * page that is a raw mapping will always honor the rule * * pfn_of_page == vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT) This is, as far as I can tell, not true for our xip mappings. Ext2 may spread the physical pages behind a given file all over its media. That means that the pfns of the pages that form a vma may be more or less random rather than contiguous. The common memory management code cannot tell whether or not a given page has been COW'ed. Did I miss something? I agree, the conditions imposed by remap_pfn_range() don't work.
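The rule quoted from mm/memory.c can be restated as a sketch (illustration only, kernel-style C): COW detection for VM_PFNMAP vmas depends on recomputing the pfn from the vma alone, which scattered XIP blocks cannot satisfy:

/* the linearity assumption behind VM_PFNMAP copy-on-write */
static unsigned long expected_pfn(struct vm_area_struct *vma, unsigned long addr)
{
	return vma->vm_pgoff + ((addr - vma->vm_start) >> PAGE_SHIFT);
}

/* A pte in a VM_PFNMAP vma is treated as a raw, non-COW'ed mapping only
 * if its pfn equals expected_pfn(vma, addr).  An XIP file whose blocks
 * are spread across the media breaks that equality for almost every
 * page, which is exactly Carsten's point. */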
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
If you can write code that doesn't need any struct pages that would make life a bit easier, since we wouldn't need any pseudo memory hotplug code that just adds struct pages. That was my gut feel too. However, it seems from Carsten and Jörn's discussion of read/write XIP on Flash (and some new Phase Change) memories that having the struct pages has a lot of potential benefits. Wouldn't it also allow most of the mm routines to remain unchanged? I just worry that it would be difficult to set apart these non-volatile pages that can't be written to directly. We would still need to add the kernel mapping though. But that's handled by ioremap()ing it, right?
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
An alternative approach, which does not need to have struct page at hand, would be to use the nopfn vm operations struct. That one would have to rely on get_xip_pfn. Of course! Okay, now I'm beginning to understand. The current path would then be deprecated. Why? Wouldn't both paths be valid options? If you're interested in using the latter for xip without struct page, I would volunteer to go ahead and implement this? I'm very interested in this. I'm not opposed to using struct page, but I'm confused as to how to start that. As I understand it, which is not well, defining a CONFIG_DISCONTIGMEM region to cover the Flash memory would add that to my pool of RAM. That would be 'bad', right? I don't see how to create the page structs and set this memory aside as different.
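A hedged sketch of what the nopfn-based path might have looked like; the nopfn vm operation and NOPFN_SIGBUS did exist in kernels of that era, while get_xip_pfn() is hypothetical, standing in for the lookup being proposed:

static unsigned long cramfs_xip_nopfn(struct vm_area_struct *vma,
				      unsigned long address)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t pgoff = vma->vm_pgoff +
			((address - vma->vm_start) >> PAGE_SHIFT);
	unsigned long pfn;

	/* get_xip_pfn() is hypothetical: the filesystem would translate a
	 * page offset into the pfn of the flash page backing it */
	if (get_xip_pfn(mapping, pgoff, &pfn))
		return NOPFN_SIGBUS;
	return pfn;
}

No struct page is ever touched: the fault handler returns a bare pfn and the core inserts the pte.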
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
Nick Piggin wrote: > The question is, why is that not enough (I haven't looked at these > patches enough to work out if there is anything more they provide). I think it just takes trying things out. From reading the code, I think this should work well for the filemap_xip code with no struct page. Also, we need to eliminate nopage() to get rid of the struct page. Unfortunately I don't find time to try this out for now, and on 390 we can live with struct page for the time being. In contrast to the embedded platforms, the mem_map array gets swapped out to disk by our hypervisor. Can you help me understand the comment about nopage()? Do you mean set xip_file_vm_ops.nopage to NULL?
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
The downside: We need mem_map[] struct page entries behind all memory segments. Nowadays we can easily create those via vmem_map/sparsemem. Opinions? Frankly this is going to be mostly relevant on ARM architectures, at least at first. Maybe I'm missing something, but I don't see that sparsemem is supported on ARM...
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On Fri, Jun 08, 2007 at 09:05:32AM -0700, Jared Hulbert wrote: > Okay so we need some driver that opens/closes this ROM. This has been > done from the dcss block device but that doesn't make sense for most > embedded systems. The MTD allows for this with point()/unpoint(). > That should work just fine. It does introduce the MTD as a dependency > which is unnecessary in many systems, but it will work now. The Linux solution to this problem would be to introduce an option for mtd write support. That way the majority of the code doesn't have to be compiled for the read-only case but you still get a uniform interface. You mean make an MTD-light interface possible?
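For reference, a hedged sketch of the point()/unpoint() usage under discussion; the signatures follow the MTD interface roughly as it looked around 2.6.21 and varied between releases:

static void *xip_map_region(struct mtd_info *mtd, loff_t from, size_t len)
{
	size_t retlen;
	u_char *virt;

	if (!mtd->point || mtd->point(mtd, from, len, &retlen, &virt))
		return NULL;			/* chip has no linear mapping */
	if (retlen < len) {			/* partial mapping: hand it back */
		mtd->unpoint(mtd, virt, from, retlen);
		return NULL;
	}
	return virt;
}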
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/8/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Fri, Jun 08, 2007 at 09:59:20AM +0200, Carsten Otte wrote: > Christoph Hellwig wrote: > >Jared's patch currently does ioremap on mount (and no iounmap at all). > >That mapping needs to move from the filesystem to the device driver. > The device driver needs to do ioremap on open(), and iounmap() on > release. That's effectively what our block driver does. Yes, exactly. Okay so we need some driver that opens/closes this ROM. This has been done from the dcss block device but that doesn't make sense for most embedded systems. The MTD allows for this with point()/unpoint(). That should work just fine. It does introduce the MTD as a dependency which is unnecessary in many systems, but it will work now.
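A minimal sketch of that layering, with the mapping owned by a small driver rather than the filesystem; FLASH_PHYS_BASE and FLASH_SIZE are assumed platform constants:

static int flashmem_open(struct inode *inode, struct file *file)
{
	/* map on open, exactly as Carsten describes for the dcss driver */
	file->private_data = ioremap(FLASH_PHYS_BASE, FLASH_SIZE);
	return file->private_data ? 0 : -ENOMEM;
}

static int flashmem_release(struct inode *inode, struct file *file)
{
	iounmap(file->private_data);	/* unmap on release */
	return 0;
}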
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/8/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Thu, Jun 07, 2007 at 01:34:12PM -0700, Jared Hulbert wrote: > >And we'll need that even when using cramfs. There's no way we'd > >merge a hack where the user has to specify a physical address on > >the mount command line. > > Why not? For the use case in question the user usually manually > burned the image to a physical address beforehand. Many of these > systems don't have MTD turned on for this Flash; they don't need it > because they don't write to this Flash once the system is up. Then add a small device layer for it. Remember that linux is not all about hacked up embedded devices that get shipped once and never touched again. Remember that linux is not all about big iron machines with lots of processors and gigabytes of RAM :) I concede your layer point, ioremap() doesn't belong in the filesystem.
get_xip_page() uncertainty
I am trying to create a valid "struct page* (*get_xip_page)(struct address_space *, sector_t, int)" to use the filemap_xip.c. I've been trying to do it as follows:

virtual = ioremap(physical, size);

struct page *my_get_xip_page(struct address_space *mapping, sector_t sector,
			     int create)
{
	unsigned long offset;
	/* extract offset from mapping and sector */
	return virt_to_page(virtual + offset);
}

I believe this to be fundamentally flawed. While this works for xip_file_read(), it does not work for xip_file_mmap(). I'm not sure I understand the correct way to do this. But I assume the problem, and have some evidence to support it, is that virt_to_page() is not returning a valid page struct. How can I get a valid page struct? The memory is not RAM but Flash. It is addressable like RAM and I want userspace to use it like readonly RAM.
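The likely explanation, as a hedged sketch: virt_to_page() is only meaningful for addresses in the kernel's linear lowmem mapping, while ioremap() returns a vmalloc-space address with no mem_map entry behind it. Making the check explicit ('physical' as in the code above):

static struct page *flash_page(unsigned long physical)
{
	unsigned long pfn = physical >> PAGE_SHIFT;

	if (!pfn_valid(pfn))	/* no struct page covers this flash range */
		return NULL;
	return pfn_to_page(pfn);
}

On most configurations the flash range fails pfn_valid(), which is consistent with the mmap failure described.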
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
If we were actually talking about a complex filesystem I'd agree. But the cramfs xip patch posted here touches about 2/3 of the number of lines that cramfs has in total. Fair enough. But look at the complexity rather than the number of lines. It adds tedium to cramfs_fill_super and one extra level of indirection to a handful of ops like mmap() and cramfs_read(). But the changes to the real meat of cramfs, cramfs_readpage(), are limited to the XIP changes, which I want on block devices anyway. So if we did fork cramfs I would submit a simple patch to cramfs for XIP support on block devices and I would submit a patch for a new filesystem, cramfs-linear. Cramfs-linear would have an exact copy of 1/3 of the cramfs code such as cramfs_readpage(), it would use the same headers, and it would use the same userspace tools. This fork is what the community wants? Speak up! And cramfs is not exactly the best base to start with.. This is a moot point; there is a significant installed base issue. There are lots of cramfs-linear-xip based systems in existence which can't be easily ported to newer kernels because of a lack of support. > This is nirvana. But it is not the goal of the patches in question. > In fact there are several use cases that don't need and don't value > the writeability and therefore don't need the overhead. It is a > long-term goal nevertheless. With the filemap_xip.c helpers adding xip support to any filesystem is pretty trivial for the highlevel filesystem operations. The only interesting bit is the lowlevel code (the get_xip_page method and the others Carsten mentioned), but we need to do this lowlevel code in a generic and proper way anyway. It's not that trivial. The filesystem needs to meet several requirements, such as having data nodes that are page aligned. Anytime any changes are made to any page in the underlying Flash block, or if the Flash physical partition goes out of read mode, you've got to hide that from userspace or otherwise deal with it. A filesystem that doesn't understand these subtle hardware requirements would either not work at all, have lots of deadlock issues, or at least have terrible performance problems. Nevertheless I suppose a simple, but invasive, hack could likely produce a worthwhile proof of concept. I think this is worthy of its own thread. I'll try to hack up an xip prototype for jffs2 next week. Very cool. I can't wait to see what you have in mind. But remember this doesn't solve the problem of the huge installed base of cramfs-linear-xip images. Gee, I think logfs would be a better choice. Jffs2, and ubifs (jffs3) for that matter, combine node and node header in series, which means your data nodes aren't aligned to page boundaries. Logfs nodes could be more easily aligned.
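The alignment requirement in code form, as a hedged sketch (node_offset and node_len are illustrative names for a data node's location on the medium):

/* a node can only be handed out XIP if it starts on a page boundary
 * and covers at least the whole page being faulted */
static int node_is_xipable(loff_t node_offset, size_t node_len)
{
	return !(node_offset & (PAGE_SIZE - 1)) && node_len >= PAGE_SIZE;
}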
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
I've had a few-beer-long discussion with Joern Engel and David Woodhouse on this one. To cut a long discussion short: the current XIP infrastructure is not sufficient to be used on top of mtd. We'd need some extensions: - on get_xip_page() we'd need to state if we want the reference read-only or read+write - we need a put_xip_page() to return references - and finally we need a callback for the reference, so that the mtd driver can ask to get its reference back (in order to unmap from userland when erasing a block) Yes. And one more thing. We can't assume a file's pages are all XIP or all not. However, I still can't get even the existing get_xip_page() to work for me, so we are getting ahead of ourselves ;) Looking back on this thread I realize I haven't confirmed if my cramfs_get_xip_page() gets a page struct. I assume that is my problem? The UML find_iomem() probably returns pseudo iomem with page structs, while ioremap() does not return page-struct-backed memory.
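A hypothetical sketch of the three extensions listed above, purely to make their shape concrete; nothing like this was merged in this form:

/* get a reference, stating up front whether it must be writable */
struct page *(*get_xip_page)(struct address_space *mapping, sector_t sector,
			     int create, int writable);

/* return a reference obtained above */
void (*put_xip_page)(struct address_space *mapping, struct page *page);

/* driver-initiated reclaim: unmap userland references so the driver
 * can erase the underlying block */
void (*xip_invalidate)(struct address_space *mapping, pgoff_t start,
		       pgoff_t end);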
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
that even more important doesn't require pulling in the whole block layer, which is especially important for embedded devices at the lower end of the scale. Good point. That is a big oversight. Though I would prefer to handle that in the same fs rather than fork. I still think it'd be even better to just hook xip support into jffs or logfs because they give you a full featured flash filesystem for all needs without the complexity of strictly partitioning between xip-capable and write parts of your storage. This is nirvana. But it is not the goal of the patches in question. In fact there are several use cases that don't need and don't value the writeability and therefore don't need the overhead. It is a long-term goal nevertheless.
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/7/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Thu, Jun 07, 2007 at 07:07:54PM +0200, Carsten Otte wrote: > I've had a few-beer-long discussion with Joern Engel and David > Woodhouse on this one. To cut a long discussion short: the current XIP > infrastructure is not sufficient to be used on top of mtd. We'd need > some extensions: > - on get_xip_page() we'd need to state if we want the reference > read-only or read+write > - we need a put_xip_page() to return references > - and finally we need a callback for the reference, so that the mtd > driver can ask to get its reference back (in order to unmap from > userland when erasing a block) And we'll need that even when using cramfs. There's no way we'd merge a hack where the user has to specify a physical address on the mount command line. Why not? For the use case in question the user usually manually burned the image to a physical address beforehand. Many of these systems don't have MTD turned on for this Flash; they don't need it because they don't write to this Flash once the system is up.
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/7/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Thu, Jun 07, 2007 at 08:37:07PM +0100, Christoph Hellwig wrote: > The code is at http://verein.lst.de/~hch/cramfs-xip.tar.gz. And for those just wanting to take a quick glance, this is the diff vs an out of tree cramfs where uncompress.c and cramfs_fs_sb.h are merged into inode.c: Cool. I notice you removed my UML hacks... Why? I just don't get one thing. This is almost a duplicate of cramfs-block. Why would we prefer a fork with a lot of code duplication to adding a couple alternate code paths in cramfs-block? Also keep in mind there are several reasons why you might want to have block access to a XIP-built cramfs image. I am unpersuaded that this fork approach is fundamentally better.
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Carsten Otte <[EMAIL PROTECTED]> wrote: Jared Hulbert wrote: > (2) failed with the following messages. (This wasn't really busybox. > It was xxd, not statically linked, hence the issue with ld.so) Could you try to figure out what happened to the subject page before? Was it subject to copy on write? With what flags has this vma been mmaped? thanks, Carsten The vma->flags = 1875 = 0x753 This is: VM_READ VM_WRITE VM_MAYREAD VM_MAYEXEC VM_GROWSDOWN VM_GROWSUP VM_PFNMAP I assume no struct page exists for the pages of this file. When vm_no_page was called it seems it failed on a pte check, since there is no backing page structure.
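The decode spelled out, with flag values as defined in <linux/mm.h> of that era; a quick check that 1875 (0x753) matches the flags listed:

#include <assert.h>

enum {	/* values from 2.6.2x <linux/mm.h> */
	VM_READ = 0x001, VM_WRITE = 0x002, VM_MAYREAD = 0x010,
	VM_MAYEXEC = 0x040, VM_GROWSDOWN = 0x100, VM_GROWSUP = 0x200,
	VM_PFNMAP = 0x400,
};

int main(void)
{
	assert((VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYEXEC |
		VM_GROWSDOWN | VM_GROWSUP | VM_PFNMAP) == 0x753);
	assert(0x753 == 1875);
	return 0;
}

Note that VM_MAYWRITE (0x020) is absent while VM_WRITE is set, and VM_PFNMAP (0x400) says there is no struct page behind the mapping, consistent with Jared's conclusion.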
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
The embedded people already use them on flash which is a little dumb, but now we add even more kludge for a non-block based access. Please justify your assertion that using cramfs on flash is dumb. What would be not dumb? In an embedded system with addressable Flash the linear addressing cramfs is a simple and elegant solution. Removing support for block based access would drastically reduce the complexity of cramfs. The non-block access bits of code are trivial in comparison. Specifically, which part of my patch represents unwarranted, unfixable kludge? The right way to architect xip for flash-based devices is to implement a generic get_xip_page for mtd-based devices and integrate that into an existing flash filesystem or write a simple new flash filesystem tailored to that use case. There is often no need for the complexity of the MTD for a readonly compressed filesystem in the embedded world. I am intrigued by the suggestion of a generic get_xip_page() for mtd-based devices. I fail to see how get_xip_page() is not highly filesystem dependent. How might a generic one work?
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On Wed, Jun 06, 2007 at 09:07:16AM -0700, Jared Hulbert wrote: > I estimate something on the order of 5-10 million Linux phones use > something similar to these patches. I wonder if there are that many > provable users of the simple cramfs. This is where the community > has taken cramfs. This is what a community disjoint from mainline development has hacked cramfs in their trees into. Not a good rationale. This whole "but we've always done it" attitude is a little annoying, really. It is that disjointedness we are trying to address. FYI: Carsten had an xip fs for s390 as well, and that evolved into the filemap.c bits after a lot of rework and quite a few rounds of review. Right. So now we leverage this filemap_xip.c in cramfs. Why is this a problem? > Nevertheless, I understand your point. I wrote AXFS in part because > the hacks required to do XIP on cramfs were ugly, hacky, and complex. I can't find a reference to AXFS anywhere in this thread. No, it's not here. There's a year-old thread referencing it. > > Please > >use something like the existing ext2 xip mode instead of add support > >to romfs using the generic filemap methods. > > What?? You mean like use xip_file_mmap() and implement > get_xip_page()? Did you read my latest patch? Yes. This is the highlevel way to go, just please don't hack it into cramfs. Right, so this latest patch _does_ implement get_xip_page() and xip_file_mmap(). Why not hack it into cramfs?
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: On Wed, Jun 06, 2007 at 08:17:43AM -0700, Richard Griffiths wrote: > Too late :) The XIP cramfs patch is widely used in the embedded Linux > community and has been used for years. It fulfills a need for a small > XIP Flash file system. Hence our interest in getting it or some > variation into the mainline kernel. That's not a reason to put it in as-is. Maybe you should have showed up here before making the wrong decision to hack this into cramfs. Please read the entire thread before passing judgements like this. The hacks evolved over the last 8 years, and are really handy. We're just trying to figure out the best way to share them.
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Christoph Hellwig <[EMAIL PROTECTED]> wrote: I might be a little late in the discussion, but I somehow missed this before. Please don't add this xip support to cramfs, because the whole point of cramfs is to be a simple _compressed_ filesystem, and we really don't want to add more complexity to it. I estimate something on the order of 5-10 million Linux phones use something similar to these patches. I wonder if there are that many provable users of the simple cramfs. This is where the community has taken cramfs. Nevertheless, I understand your point. I wrote AXFS in part because the hacks required to do XIP on cramfs were ugly, hacky, and complex. Please review the latest patch in the thread; it's just a draft, but the changes required are not very complex now, especially in light of the filemap_xip.c APIs being used. It just happens not to work, yet. Please use something like the existing ext2 xip mode instead of add support to romfs using the generic filemap methods. What?? You mean like use xip_file_mmap() and implement get_xip_page()? Did you read my latest patch?
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
@@ -9,6 +9,46 @@
 /*
  * These are the VFS interfaces to the compressed rom filesystem.
  * The actual compression is based on zlib, see the other files.
+ */
+
+/* Linear Addressing code
+ *
+ * Copyright (C) 2000 Shane Nay.
+ *
+ * Allows you to have a linearly addressed cramfs filesystem.
+ * Saves the need for buffer, and the munging of the buffer.
+ * Savings a bit over 32k with default PAGE_SIZE, BUFFER_SIZE
+ * etc. Usefull on embedded platform with ROM :-).
+ *
+ * Downsides- Currently linear addressed cramfs partitions
+ * don't co-exist with block cramfs partitions.
+ *
+ */
+
+/*
+ * 28-Dec-2000: XIP mode for linear cramfs
+ * Copyright (C) 2000 Robert Leslie <[EMAIL PROTECTED]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+/* filemap_xip.c interfaces - Jared Hulbert 2007
+ * linear + block coexisting - Jared Hulbert 2007
+ * (inspired by patches from Kyungmin Park of Samsung and others at
+ * Motorola from the EZX phones)
+ *
  */
 
 #include <linux/module.h>
@@ -24,22 +64,25 @@
 #include <linux/vfs.h>
 #include <linux/mutex.h>
 #include <asm/semaphore.h>
+#include <linux/vmalloc.h>
 #include <asm/uaccess.h>
-
-static const struct super_operations cramfs_ops;
-static const struct inode_operations cramfs_dir_inode_operations;
+#include <asm/tlbflush.h>
+#ifdef CONFIG_UML
+#include <mem_user.h>
+#endif
+
+static struct super_operations cramfs_ops;
+static struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
 static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
-
 /* These two macros may change in future, to provide better st_ino semantics.
  */
 #define CRAMINO(x) (((x)->offset && (x)->size)?(x)->offset<<2:1)
 #define OFFSET(x) ((x)->i_ino)
-
 static int cramfs_iget5_test(struct inode *inode, void *opaque)
 {
@@ -99,13 +142,77 @@ static int cramfs_iget5_set(struct inode
 	return 0;
 }
 
+static int cramfs_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	if ((CRAMFS_INODE_IS_XIP(inode)) && !(vma->vm_flags & VM_WRITE) &&
+	    (LINEAR_CRAMFS(sbi)))
+		return xip_file_mmap(file, vma);
+
+	return generic_file_mmap(file, vma);
+}
+
+struct page *cramfs_get_xip_page(struct address_space *mapping, sector_t offset,
+				 int create)
+{
+	unsigned long address;
+	unsigned long offs = offset;
+	struct inode *inode = mapping->host;
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+
+	address = PAGE_ALIGN((unsigned long)(sbi->linear_virt_addr +
+					     OFFSET(inode)));
+	offs *= 512; /* FIXME -- This shouldn't be hard coded */
+	address += offs;
+
+	return virt_to_page(address);
+}
+
+ssize_t cramfs_file_read(struct file *file, char __user * buf, size_t len,
+			 loff_t * ppos)
+{
+	struct inode *inode = file->f_dentry->d_inode;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(inode->i_sb);
+
+	if ((CRAMFS_INODE_IS_XIP(inode)) && (LINEAR_CRAMFS(sbi)))
+		return xip_file_read(file, buf, len, ppos);
+
+	return do_sync_read(file, buf, len, ppos);
+}
+
+static struct file_operations cramfs_linear_xip_fops = {
+	aio_read:	generic_file_aio_read,
+	read:		cramfs_file_read,
+	mmap:		cramfs_mmap,
+};
+
+static struct backing_dev_info cramfs_backing_dev_info = {
+	.ra_pages = 0,	/* No readahead */
+};
+
 static struct inode *get_cramfs_inode(struct super_block *sb,
 	struct cramfs_inode * cramfs_inode)
 {
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
 	struct inode *inode = iget5_locked(sb, CRAMINO(cramfs_inode),
 			cramfs_iget5_test, cramfs_iget5_set, cramfs_inode);
 	if (inode && (inode->i_state & I_NEW)) {
+		if (LINEAR_CRAMFS(sbi))
+			inode->i_mapping->backing_dev_info
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
03:27:31 2007 -0700 @@ -9,6 +9,46 @@ /* * These are the VFS interfaces to the compressed rom filesystem. * The actual compression is based on zlib, see the other files. + */ + +/* Linear Addressing code + * + * Copyright (C) 2000 Shane Nay. + * + * Allows you to have a linearly addressed cramfs filesystem. + * Saves the need for buffer, and the munging of the buffer. + * Savings a bit over 32k with default PAGE_SIZE, BUFFER_SIZE + * etc. Usefull on embedded platform with ROM :-). + * + * Downsides- Currently linear addressed cramfs partitions + * don't co-exist with block cramfs partitions. + * + */ + +/* + * 28-Dec-2000: XIP mode for linear cramfs + * Copyright (C) 2000 Robert Leslie [EMAIL PROTECTED] + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +/* filemap_xip.c interfaces - Jared Hulbert 2007 + * linear + block coexisting - Jared Hulbert 2007 + * (inspired by patches from Kyungmin Park of Samsung and others at + * Motorola from the EZX phones) + * */ #include linux/module.h @@ -24,22 +64,25 @@ #include linux/vfs.h #include linux/mutex.h #include asm/semaphore.h +#include linux/vmalloc.h #include asm/uaccess.h - -static const struct super_operations cramfs_ops; -static const struct inode_operations cramfs_dir_inode_operations; +#include asm/tlbflush.h +#ifdef CONFIG_UML +#include mem_user.h +#endif + +static struct super_operations cramfs_ops; +static struct inode_operations cramfs_dir_inode_operations; static const struct file_operations cramfs_directory_operations; static const struct address_space_operations cramfs_aops; static DEFINE_MUTEX(read_mutex); - /* These two macros may change in future, to provide better st_ino semantics. 
*/ #define CRAMINO(x) (((x)-offset (x)-size)?(x)-offset2:1) #define OFFSET(x) ((x)-i_ino) - static int cramfs_iget5_test(struct inode *inode, void *opaque) { @@ -99,13 +142,77 @@ static int cramfs_iget5_set(struct inode return 0; } +static int cramfs_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode = file-f_dentry-d_inode; + struct cramfs_sb_info *sbi = CRAMFS_SB(inode-i_sb); + + if ((vma-vm_flags VM_SHARED) (vma-vm_flags VM_MAYWRITE)) + return -EINVAL; + + if ((CRAMFS_INODE_IS_XIP(inode)) !(vma-vm_flags VM_WRITE) + (LINEAR_CRAMFS(sbi))) + return xip_file_mmap(file, vma); + + return generic_file_mmap(file, vma); +} + +struct page *cramfs_get_xip_page(struct address_space *mapping, sector_t offset, + int create) +{ + unsigned long address; + unsigned long offs = offset; + struct inode *inode = mapping-host; + struct super_block *sb = inode-i_sb; + struct cramfs_sb_info *sbi = CRAMFS_SB(sb); + + address = PAGE_ALIGN((unsigned long)(sbi-linear_virt_addr + + OFFSET(inode))); + offs *= 512; /* FIXME -- This shouldn't be hard coded */ + address += offs; + + return virt_to_page(address); +} + +ssize_t cramfs_file_read(struct file *file, char __user * buf, size_t len, + loff_t * ppos) +{ + struct inode *inode = file-f_dentry-d_inode; + struct cramfs_sb_info *sbi = CRAMFS_SB(inode-i_sb); + + if ((CRAMFS_INODE_IS_XIP(inode)) (LINEAR_CRAMFS(sbi))) + return xip_file_read(file, buf, len, ppos); + + return do_sync_read(file, buf, len, ppos); +} + +static struct file_operations cramfs_linear_xip_fops = { + aio_read: generic_file_aio_read, + read: cramfs_file_read, + mmap: cramfs_mmap, +}; + +static struct backing_dev_info cramfs_backing_dev_info = { + .ra_pages = 0,/* No readahead */ +}; + static struct inode *get_cramfs_inode(struct super_block *sb, struct cramfs_inode * cramfs_inode) { + struct cramfs_sb_info *sbi = CRAMFS_SB(sb); struct inode *inode = iget5_locked(sb, CRAMINO(cramfs_inode), cramfs_iget5_test, cramfs_iget5_set, cramfs_inode); if (inode (inode-i_state I_NEW)) { + if (LINEAR_CRAMFS(sbi)) + inode-i_mapping-backing_dev_info
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Christoph Hellwig [EMAIL PROTECTED] wrote: I might be a little late in the discussion, but I somehow missed this before. Please don't add this xip support to cramfs, because the whole point of cramfs is to be a simple _compressed_ filesystem, and we really don't want to add more complexity to it. I estimate something on the order 5-10 million Linux phones use something similar to these patches. I wonder if there are that many provable users of of the simple cramfs. This is where the community has taken cramfs. Nevertheless, I understand your point. I wrote AXFS in part because the hacks required to do XIP on cramfs where ugly, hacky, and complex. Please review the latest patch in the thread, it's just a draft but the changes required are not very complex now, especially in light of the filemap_xip.c APIs being used. It just happens not to work, yet. Please use something like the existing ext2 xip mode instead of add support to romfs using the generic filemap methods. What?? You mean like use xip_file_mmap() and implement get_xip_page()? Did you read my latest patch? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Christoph Hellwig [EMAIL PROTECTED] wrote: On Wed, Jun 06, 2007 at 08:17:43AM -0700, Richard Griffiths wrote: Too late :) The XIP cramfs patch is widely used in the embedded Linux community and has been used for years. It fulfills a need for a small XIP Flash file system. Hence our interest in getting it or some variation into the mainline kernel. That's not a reason to put it in as-is. Maybe you should have showed up here before making the wrong decision to hack this into cramfs. Please read the entire thread before passing judgements like this. The hacks evolved over the last 8 years, and are really handy. We're just trying to figure the best way to share them.. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On Wed, Jun 06, 2007 at 09:07:16AM -0700, Jared Hulbert wrote: I estimate something on the order 5-10 million Linux phones use something similar to these patches. I wonder if there are that many provable users of of the simple cramfs. This is where the community has taken cramfs. This is what a community disjoint to mainline development has hacked cramfs in their trees into. Not a good rationale. This whole but we've always done it attitute is a little annoying, really. It is that disjointedness we are trying to address. FYI: Cartsten had an xip fs for s390 aswell, and that evolved into the filemap.c bits after a lot of rework an quite a few round of review. Right. So now we leverage this filemap_xip.c in cramfs. Why is this a problem? Nevertheless, I understand your point. I wrote AXFS in part because the hacks required to do XIP on cramfs where ugly, hacky, and complex. I can't find a reference to AXFS anywhere in this thread. No, it's not here. There's a year old thread referencing it. Please use something like the existing ext2 xip mode instead of add support to romfs using the generic filemap methods. What?? You mean like use xip_file_mmap() and implement get_xip_page()? Did you read my latest patch? Yes. This is the highlevel way to go, just please don't hack it into cramfs. Right, so this latest patch _does_ implement get_xip_page() and xip_file_mmap(). Why not hack it into cramfs? - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
> The embedded people already use them on flash which is a little dumb,
> but now we add even more kludge for non-block based access.

Please justify your assertion that using cramfs on flash is dumb. What would be not dumb? In an embedded system with addressable Flash, the linear addressing cramfs is a simple and elegant solution. Removing support for block based access would drastically reduce the complexity of cramfs; the non-block access bits of code are trivial in comparison. Specifically, which part of my patch represents unwarranted, unfixable kludge?

> The right way to architect xip for flash-based devices is to implement
> a generic get_xip_page for mtd-based devices and integrate that into
> an existing flash filesystem or write a simple new flash filesystem
> tailored to that use case.

There is often no need for the complexity of the MTD layer for a read-only compressed filesystem in the embedded world.

I am intrigued by the suggestion of a generic get_xip_page() for mtd-based devices. I fail to see how get_xip_page() is not highly filesystem dependent. How might a generic one work?
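One way such a generic helper could plausibly work -- an assumption sketched here, not an existing kernel API -- is to lean on the MTD point() method, which on NOR parts hands back a directly addressable view of a flash range, so the filesystem would only have to supply an offset. The hard part is unchanged: turning that mapping into a struct page is only legitimate if the flash window actually has a memmap behind it.

#include <linux/mm.h>
#include <linux/mtd/mtd.h>

/*
 * Hypothetical sketch; mtd_xip_get_page() is an invented name, and
 * the point() signature is the 2.6.21-era one. virt_to_page() here
 * is only valid if the flash window has struct page coverage --
 * exactly the unresolved question in this thread.
 */
static struct page *mtd_xip_get_page(struct mtd_info *mtd, loff_t from)
{
	size_t retlen;
	u_char *virt;

	if (!mtd->point ||
	    mtd->point(mtd, from, PAGE_SIZE, &retlen, &virt))
		return NULL;		/* chip can't be directly mapped */
	if (retlen < PAGE_SIZE)
		return NULL;

	return virt_to_page(virt);
}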
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 6/6/07, Carsten Otte [EMAIL PROTECTED] wrote:
> Jared Hulbert wrote:
>> (2) failed with the following messages. (This wasn't really busybox.
>> It was xxd, not statically linked, hence the issue with ld.so)
>
> Could you try to figure out what happened to the subject page before?
> Was it subject to copy on write? With what flags has this vma been
> mmaped?
> thanks, Carsten

The vma->vm_flags = 1875 = 0x753. This is:

VM_READ | VM_WRITE | VM_MAYREAD | VM_MAYEXEC | VM_GROWSDOWN | VM_GROWSUP | VM_PFNMAP

I assume no struct page exists for the pages of this file. When vm_no_page was called it seems it failed on a pte check, since there is no backing page structure.
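For anyone wanting to double-check that decode, the arithmetic works out with the 2.6-era VM_* bit values, copied below so the snippet builds as a plain userspace program (the values are assumptions taken from that era's <linux/mm.h>):

#include <stdio.h>

#define VM_READ      0x00000001UL
#define VM_WRITE     0x00000002UL
#define VM_MAYREAD   0x00000010UL
#define VM_MAYEXEC   0x00000040UL
#define VM_GROWSDOWN 0x00000100UL
#define VM_GROWSUP   0x00000200UL
#define VM_PFNMAP    0x00000400UL

int main(void)
{
	unsigned long flags = 1875;	/* == 0x753, as reported above */
	static const struct {
		unsigned long bit;
		const char *name;
	} tbl[] = {
		{ VM_READ,      "VM_READ" },
		{ VM_WRITE,     "VM_WRITE" },
		{ VM_MAYREAD,   "VM_MAYREAD" },
		{ VM_MAYEXEC,   "VM_MAYEXEC" },
		{ VM_GROWSDOWN, "VM_GROWSDOWN" },
		{ VM_GROWSUP,   "VM_GROWSUP" },
		{ VM_PFNMAP,    "VM_PFNMAP" },
	};

	/* print every flag set in 0x753; the sum of these seven
	 * bits is 1 + 2 + 16 + 64 + 256 + 512 + 1024 = 1875 */
	for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++)
		if (flags & tbl[i].bit)
			printf("%s\n", tbl[i].name);
	return 0;
}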
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
> The current xip stack relies on having struct page behind the memory
> segment. This has little impact on memory management, but occupies
> some more memory. The cramfs patch chose to modify copy on write in
> order to deal with vmas that don't have struct page behind.
> So far, Hugh and Linus have shown strong opposition against copy on
> write with no struct page behind. If this implementation is acceptable
> to them, it seems preferable to me over wasting memory. The xip stack
> should be modified to use this vma flag in that case.

I would rather not :P We can copy on write without a struct page behind the source today, no? The existing COW techniques fail on some corner cases. I'm not up to speed on the vm code. I'll try to look into this a little more, but it might be useful if I knew what questions I need to answer so you vm experts can understand the problem.

Let me give one example. If you try to debug an XIP application without this patch, bad things happen. XIP in this sense is synonymous with executing directly out of Flash, and you can't just change the physical memory to redirect it to the debugger so easily in Flash.

Now, I don't know exactly why yet, but some applications (not all) trigger this added vm hack. I'm not sure exactly why it would get triggered under normal circumstances. Why would a read-only map get written to? What is insufficient for the XIP code with the current COW?

So I think the problem may have something to do with the nature of the memory in question. We are using Flash that is ioremap()'ed to a usable virtual address, and yet we go on to use it as if it were plain old system memory, like any RAM page. We need it to be presented as any other memory page, only physically read-only. ioremap() seems to be a hacky way of accomplishing that, but I can't think of a better way. In ARM we even had to invent ioremap_cached() to improve performance. Thoughts?
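To make the "no struct page behind the source" situation concrete, a minimal sketch of the kind of mapping being discussed, assuming a NOR window at a hypothetical FLASH_PHYS_BASE (not taken from the patch): remap_pfn_range() hands userspace a VM_PFNMAP mapping with no struct page behind it, and a subsequent write fault on such a vma is exactly the COW corner case at issue.

#include <linux/fs.h>
#include <linux/mm.h>

#define FLASH_PHYS_BASE	0xa0000000UL	/* hypothetical NOR window */

/*
 * Sketch only. remap_pfn_range() marks the vma VM_PFNMAP, so the
 * generic COW path has no struct page to copy from when, say, a
 * debugger tries to plant a breakpoint in the mapped text.
 */
static int flash_xip_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long size = vma->vm_end - vma->vm_start;

	return remap_pfn_range(vma, vma->vm_start,
			       FLASH_PHYS_BASE >> PAGE_SHIFT,
			       size, vma->vm_page_prot);
}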
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
Yes you can, but I won't have access to a PXA270 for a few weeks.

> I assume you don't see the issue if you static link busybox?

I don't know.
Re: [PATCH 2.6.21] cramfs: add cramfs Linear XIP
On 5/22/07, Richard Griffiths <[EMAIL PROTECTED]> wrote:
> Venerable cramfs fs Linear XIP patch originally from MontaVista, used
> in the embedded Linux community for years, updated for 2.6.21.
> Tested on several systems with NOR Flash. PXA270, TI OMAP2430, ARM
> Versatile and Freescale iMX31ADS.

When trying to verify this patch on our PXA270 system we get the following error when running an XIP rootfs:

cramfs: checking physical address 0xa0 for linear cramfs image
cramfs: linear cramfs image appears to be 3236 KB in size
VFS: Mounted root (cramfs filesystem) readonly.
Freeing init memory: 96K
/sbin/init: error while loading shared libraries: libgcc_s.so.1: failed to map segment from shared object: Error 11
Kernel panic - not syncing: Attempted to kill init!

However, if our busybox binary is XIP while the libgcc_s.so.1 is not XIP, busybox runs fine.

Richard, may I email you the rootfs tarball so you can recreate what we are seeing? It is a little less than 2MiB. The filesystem executables will only run on a PXA27x processor.