Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Sat, Feb 23, 2019 at 10:30:38AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong wrote:
> > > >
> > > > Hi all!
> > > >
> > > > Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.
>
> Are you really saying that "mkfs.xfs -d su=2MB,sw=1" is considered "too much legwork" to set up the filesystem for DAX and PMD alignment?

Yes. I mean ... userspace /can/ figure out the page sizes on arm64 & ppc64le (or extract it from sysfs), but why not just advertise it as an io hint on the pmem "block" device?

Hmm, now having watched various xfstests blow up because they don't expect blocks to be larger than 64k, maybe I'll rethink this as a default behavior. :)

> > > > I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.
>
> Still need extent size hints so that writes that are smaller than the PMD size are allocated correctly aligned and sized to map to PMDs...

I think we're generally planning to use the RT device where we can make 2M alignment mandatory, so for the data device the effectiveness of the extent hint doesn't really matter.

> > > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > > >
> > > > --D
> > > >
> > > > ---
> > > > Configure pmem devices to advertise the default page alignment when said block device supports fsdax. Certain filesystems use these iomin/ioopt hints to try to create aligned file extents, which makes it much easier for mmaps to take advantage of huge page table entries.
> > > >
> > > > Signed-off-by: Darrick J. Wong
> > > > ---
> > > >  drivers/nvdimm/pmem.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > > index bc2f700feef8..3eeb9dd117d5 100644
> > > > --- a/drivers/nvdimm/pmem.c
> > > > +++ b/drivers/nvdimm/pmem.c
> > > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > > >  	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > > >  	blk_queue_max_hw_sectors(q, UINT_MAX);
> > > >  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > > -	if (pmem->pfn_flags & PFN_MAP)
> > > > +	if (pmem->pfn_flags & PFN_MAP) {
> > > >  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > > +		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > > +		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> > >
> > > The device alignment might sometimes be bigger than this default. Would there be any detrimental effects for filesystems if io_min and io_opt were set to 1GB?
> >
> > Hmmm, that's going to be a struggle on ext4 and the xfs data device because we'd be preferentially skipping the 1023.8MB immediately after each allocation group's metadata. It already does this now with a 2MB io hint, but losing 1.8MB here and there isn't so bad.
> >
> > We'd have to study it further, though; filesystems historically have interpreted the iomin/ioopt hints as RAID striping geometry, and I don't think very many people set up 1GB raid stripe units.
>
> Setting sunit=1GB is really going to cause havoc with things like inode chunk allocation alignment, and the first write() will either have to be >=1GB or use 1GB extent size hints to trigger alignment. And, AFAICT, it will prevent us from doing 2MB alignment on other files, even with 2MB extent size hints set.
>
> IOWs, I don't think 1GB alignment is a good idea as a default.
>
> > (I doubt very many people have done 2M raid stripes either, but it seems to work easily where we've tried it...)
>
> That's been pretty common with stacked hardware raid for as long as I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6 luns was pretty common with large storage arrays in HPC environments (i.e. huge streaming read/write bandwidth). In these cases, XFS was set up with the RAID5/6 lun width as the stripe unit (commonly 2MB with 8+1 and 256k raid chunk size), and the RAID0 width as the stripe width (commonly 8-16 wide spread across 8-16 FC ports w/ multipath) and it wasn't uncommon to see widths in the 16-32MB range.
>
> This aligned the filesystem to the underlying RAID5/6 luns, and allows stripe width IO to be aligned and hit every RAID5/6 lun evenly. Ensuring applications could do this easily with large direct IO reads and writes is where the swalloc and largeio mount options come into their own.
Re: [PATCH] dax: add a 'modalias' attribute to DAX 'bus' devices
Looks ok, but I think the changelog could be more accurate.

On Fri, Feb 22, 2019 at 3:59 PM Vishal Verma wrote:
>
> Add a 'modalias' attribute to devices under the DAX bus so that userspace
> is able to dynamically load modules as needed.

The modalias is already published in the uevent which is how udev identifies the module. This patch would allow "modalias to module lookups" *outside* of the typical uevent used for dynamically loading modules. Care to fix up the changelog with that detail and why userspace needs to do these lookups in addition to the typical uevent lookups?

> The modalias already
> exists, it was only the sysfs attribute that was missing.
>
> Cc: Dan Williams
> Cc: Dave Hansen
> Signed-off-by: Vishal Verma
> ---
>  drivers/dax/bus.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 28c3324271ac..2109cfe80219 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -295,6 +295,17 @@ static ssize_t target_node_show(struct device *dev,
>  }
>  static DEVICE_ATTR_RO(target_node);
>
> +static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
> +		char *buf)
> +{
> +	/*
> +	 * We only ever expect to handle device-dax instances, i.e. the
> +	 * @type argument to MODULE_ALIAS_DAX_DEVICE() is always zero
> +	 */
> +	return sprintf(buf, DAX_DEVICE_MODALIAS_FMT "\n", 0);
> +}
> +static DEVICE_ATTR_RO(modalias);
> +
>  static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a,
>  		int n)
>  {
>  	struct device *dev = container_of(kobj, struct device, kobj);
> @@ -306,6 +317,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
>  }
>
>  static struct attribute *dev_dax_attributes[] = {
> +	&dev_attr_modalias.attr,
>  	&dev_attr_size.attr,
>  	&dev_attr_target_node.attr,
>  	NULL,
> --
> 2.20.1
> ___
> Linux-nvdimm mailing list
> Linux-nvdimm@lists.01.org
> https://lists.01.org/mailman/listinfo/linux-nvdimm
[PATCH] dax: add a 'modalias' attribute to DAX 'bus' devices
Add a 'modalias' attribute to devices under the DAX bus so that userspace is able to dynamically load modules as needed. The modalias already exists, it was only the sysfs attribute that was missing.

Cc: Dan Williams
Cc: Dave Hansen
Signed-off-by: Vishal Verma
---
 drivers/dax/bus.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 28c3324271ac..2109cfe80219 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -295,6 +295,17 @@ static ssize_t target_node_show(struct device *dev,
 }
 static DEVICE_ATTR_RO(target_node);
 
+static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
+		char *buf)
+{
+	/*
+	 * We only ever expect to handle device-dax instances, i.e. the
+	 * @type argument to MODULE_ALIAS_DAX_DEVICE() is always zero
+	 */
+	return sprintf(buf, DAX_DEVICE_MODALIAS_FMT "\n", 0);
+}
+static DEVICE_ATTR_RO(modalias);
+
 static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a,
 		int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
@@ -306,6 +317,7 @@ static umode_t dev_dax_visible(struct kobject *kobj, struct attribute *a, int n)
 }
 
 static struct attribute *dev_dax_attributes[] = {
+	&dev_attr_modalias.attr,
 	&dev_attr_size.attr,
 	&dev_attr_target_node.attr,
 	NULL,
-- 
2.20.1
Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Fri, Feb 22, 2019 at 10:45:25AM -0800, Darrick J. Wong wrote:
> On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> > On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong wrote:
> > >
> > > Hi all!
> > >
> > > Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.

Are you really saying that "mkfs.xfs -d su=2MB,sw=1" is considered "too much legwork" to set up the filesystem for DAX and PMD alignment?

> > > I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.

Still need extent size hints so that writes that are smaller than the PMD size are allocated correctly aligned and sized to map to PMDs...

> > > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> > >
> > > --D
> > >
> > > ---
> > > Configure pmem devices to advertise the default page alignment when said block device supports fsdax. Certain filesystems use these iomin/ioopt hints to try to create aligned file extents, which makes it much easier for mmaps to take advantage of huge page table entries.
> > >
> > > Signed-off-by: Darrick J. Wong
> > > ---
> > >  drivers/nvdimm/pmem.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > > index bc2f700feef8..3eeb9dd117d5 100644
> > > --- a/drivers/nvdimm/pmem.c
> > > +++ b/drivers/nvdimm/pmem.c
> > > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> > >  	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> > >  	blk_queue_max_hw_sectors(q, UINT_MAX);
> > >  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > > -	if (pmem->pfn_flags & PFN_MAP)
> > > +	if (pmem->pfn_flags & PFN_MAP) {
> > >  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > > +		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > > +		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
> >
> > The device alignment might sometimes be bigger than this default. Would there be any detrimental effects for filesystems if io_min and io_opt were set to 1GB?
>
> Hmmm, that's going to be a struggle on ext4 and the xfs data device because we'd be preferentially skipping the 1023.8MB immediately after each allocation group's metadata. It already does this now with a 2MB io hint, but losing 1.8MB here and there isn't so bad.
>
> We'd have to study it further, though; filesystems historically have interpreted the iomin/ioopt hints as RAID striping geometry, and I don't think very many people set up 1GB raid stripe units.

Setting sunit=1GB is really going to cause havoc with things like inode chunk allocation alignment, and the first write() will either have to be >=1GB or use 1GB extent size hints to trigger alignment. And, AFAICT, it will prevent us from doing 2MB alignment on other files, even with 2MB extent size hints set.

IOWs, I don't think 1GB alignment is a good idea as a default.

> (I doubt very many people have done 2M raid stripes either, but it seems to work easily where we've tried it...)

That's been pretty common with stacked hardware raid for as long as I've worked on XFS. e.g. a software RAID0 stripe of hardware RAID5/6 luns was pretty common with large storage arrays in HPC environments (i.e. huge streaming read/write bandwidth). In these cases, XFS was set up with the RAID5/6 lun width as the stripe unit (commonly 2MB with 8+1 and 256k raid chunk size), and the RAID0 width as the stripe width (commonly 8-16 wide spread across 8-16 FC ports w/ multipath) and it wasn't uncommon to see widths in the 16-32MB range.

This aligned the filesystem to the underlying RAID5/6 luns, and allows stripe width IO to be aligned and hit every RAID5/6 lun evenly. Ensuring applications could do this easily with large direct IO reads and writes is where the swalloc and largeio mount options come into their own.

> > I'm thinking an xfs-realtime configuration might be able to support 1GB mappings in the future.
>
> The xfs realtime device ought to be able to support 1g alignment pretty easily though. :)

Yup, but I think that's the maximum "block" size it can support and DAX will have some serious long tail latency and CPU usage issues at allocation time because each new 1GB "block" that is dynamically allocated will have to be completely zeroed during the allocation inside the page fault handler.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Sat, Feb 23, 2019 at 10:11:36AM +1100, Dave Chinner wrote:
> On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> > Hi all!
> >
> > Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.
> >
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.
>
> What's the before and after mkfs output?
>
> (need to see the context that this "fixes" before I comment)

Here's what we do today assuming no options and 800GB pmem devices:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
4096
0
4096
0

# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0             isize=512    agcount=4, agsize=52428800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=4096   blocks=209715200, rtextents=209715200

And here's what we do to get 2M aligned mappings:

# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1,extsize=2m -d su=2m,sw=1
meta-data=/dev/pmem0             isize=512    agcount=32, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=512    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=2097152 blocks=209715200, rtextents=409600

With this patch, things change as such:

# blockdev --getiomin --getioopt /dev/pmem0 /dev/pmem1
2097152
2097152
2097152
2097152

# mkfs.xfs -N /dev/pmem0 -r rtdev=/dev/pmem1
meta-data=/dev/pmem0             isize=512    agcount=32, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=209715200, imaxpct=25
         =                       sunit=512    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=102400, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/pmem1             extsz=2097152 blocks=209715200, rtextents=409600

I think the only change is the agcount, which for 2M mappings probably isn't a huge deal. It's obviously a bigger deal for 1G pages, assuming we decide that's even advisable.

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Fri, Feb 22, 2019 at 10:20:08AM -0800, Darrick J. Wong wrote:
> Hi all!
>
> Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.
>
> I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.

What's the before and after mkfs output?

(need to see the context that this "fixes" before I comment)

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [RFC v4 00/17] kunit: introduce KUnit, the Linux kernel unit testing framework
Frank Rowand writes:

> On 2/19/19 10:34 PM, Brendan Higgins wrote:
>> On Mon, Feb 18, 2019 at 12:02 PM Frank Rowand wrote:
>>
>>> I have not read through the patches in any detail. I have read some of the code to try to understand the patches to the devicetree unit tests. So that may limit how valid my comments below are.
>>
>> No problem.
>>
>>> I found the code difficult to read in places where it should have been much simpler to read. Structuring the code in a pseudo object oriented style meant that everywhere in a code path that I encountered a dynamic function call, I had to go find where that dynamic function call was initialized (and, being the cautious person that I am, verify that nowhere else was the value of that dynamic function call changed). With primitive vi and tags, that search would have instead just been a simple key press (or at worst a few keys) if hard coded function calls were done instead of dynamic function calls. In the code paths that I looked at, I did not see any case of a dynamic function being anything other than the value it was originally initialized as. There may be such cases, I did not read the entire patch set. There may also be cases envisioned in the architect's mind of how this flexibility may be of future value. Dunno.
>>
>> Yeah, a lot of it is intended to make architecture specific implementations and some other future work easier. Some of it is also for testing purposes. Admittedly some is for neither reason, but given the heavy usage elsewhere, I figured there was no harm since it was all private internal usage anyway.
>
> Increasing the cost for me (and all the other potential code readers) to read the code is harm.

Dynamic function calls aren't necessary for arch-specific implementations either. See for example arch_kexec_image_load() in kernel/kexec_file.c, which uses a weak symbol that is overridden by arch-specific code. Not everybody likes weak symbols, so another alternative (which admittedly not everybody likes either) is to use a macro with the name of the arch-specific function, as used by arch_kexec_post_alloc_pages(), for instance.

--
Thiago Jung Bauermann
IBM Linux Technology Center
Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Fri, Feb 22, 2019 at 10:28:15AM -0800, Dan Williams wrote:
> On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong wrote:
> >
> > Hi all!
> >
> > Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.
> >
> > I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.
> >
> > Comments, flames, "WTF is this guy smoking?" are all welcome. :)
> >
> > --D
> >
> > ---
> > Configure pmem devices to advertise the default page alignment when said block device supports fsdax. Certain filesystems use these iomin/ioopt hints to try to create aligned file extents, which makes it much easier for mmaps to take advantage of huge page table entries.
> >
> > Signed-off-by: Darrick J. Wong
> > ---
> >  drivers/nvdimm/pmem.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index bc2f700feef8..3eeb9dd117d5 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
> >  	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
> >  	blk_queue_max_hw_sectors(q, UINT_MAX);
> >  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> > -	if (pmem->pfn_flags & PFN_MAP)
> > +	if (pmem->pfn_flags & PFN_MAP) {
> >  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> > +		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> > +		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
>
> The device alignment might sometimes be bigger than this default. Would there be any detrimental effects for filesystems if io_min and io_opt were set to 1GB?

Hmmm, that's going to be a struggle on ext4 and the xfs data device because we'd be preferentially skipping the 1023.8MB immediately after each allocation group's metadata. It already does this now with a 2MB io hint, but losing 1.8MB here and there isn't so bad.

We'd have to study it further, though; filesystems historically have interpreted the iomin/ioopt hints as RAID striping geometry, and I don't think very many people set up 1GB raid stripe units.

(I doubt very many people have done 2M raid stripes either, but it seems to work easily where we've tried it...)

> I'm thinking an xfs-realtime configuration might be able to support 1GB mappings in the future.

The xfs realtime device ought to be able to support 1g alignment pretty easily though. :)

--D
Re: [RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
On Fri, Feb 22, 2019 at 10:21 AM Darrick J. Wong wrote:
>
> Hi all!
>
> Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.
>
> I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.
>
> Comments, flames, "WTF is this guy smoking?" are all welcome. :)
>
> --D
>
> ---
> Configure pmem devices to advertise the default page alignment when said block device supports fsdax. Certain filesystems use these iomin/ioopt hints to try to create aligned file extents, which makes it much easier for mmaps to take advantage of huge page table entries.
>
> Signed-off-by: Darrick J. Wong
> ---
>  drivers/nvdimm/pmem.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index bc2f700feef8..3eeb9dd117d5 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
>  	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
>  	blk_queue_max_hw_sectors(q, UINT_MAX);
>  	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
> -	if (pmem->pfn_flags & PFN_MAP)
> +	if (pmem->pfn_flags & PFN_MAP) {
>  		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
> +		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
> +		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);

The device alignment might sometimes be bigger than this default. Would there be any detrimental effects for filesystems if io_min and io_opt were set to 1GB?

I'm thinking an xfs-realtime configuration might be able to support 1GB mappings in the future.
[RFC PATCH] pmem: advertise page alignment for pmem devices supporting fsdax
Hi all!

Uh, we have an internal customer who's been trying out MAP_SYNC on pmem, and they've observed that one has to do a fair amount of legwork (in the form of mkfs.xfs parameters) to get the kernel to set up 2M PMD mappings. They (of course) want to mmap hundreds of GB of pmem, so the PMD mappings are much more efficient.

I started poking around w.r.t. what mkfs.xfs was doing and realized that if the fsdax pmem device advertised iomin/ioopt of 2MB, then mkfs will set up all the parameters automatically. Below is my ham-handed attempt to teach the kernel to do this.

Comments, flames, "WTF is this guy smoking?" are all welcome. :)

--D

---
Configure pmem devices to advertise the default page alignment when said block device supports fsdax. Certain filesystems use these iomin/ioopt hints to try to create aligned file extents, which makes it much easier for mmaps to take advantage of huge page table entries.

Signed-off-by: Darrick J. Wong
---
 drivers/nvdimm/pmem.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index bc2f700feef8..3eeb9dd117d5 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -441,8 +441,11 @@ static int pmem_attach_disk(struct device *dev,
 	blk_queue_logical_block_size(q, pmem_sector_size(ndns));
 	blk_queue_max_hw_sectors(q, UINT_MAX);
 	blk_queue_flag_set(QUEUE_FLAG_NONROT, q);
-	if (pmem->pfn_flags & PFN_MAP)
+	if (pmem->pfn_flags & PFN_MAP) {
 		blk_queue_flag_set(QUEUE_FLAG_DAX, q);
+		blk_queue_io_min(q, PFN_DEFAULT_ALIGNMENT);
+		blk_queue_io_opt(q, PFN_DEFAULT_ALIGNMENT);
+	}
 
 	q->queuedata = pmem;
 	disk = alloc_disk_node(0, nid);
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
Dan Williams writes:

>> Great! Now let's create another one.
>>
>> # ndctl create-namespace -m fsdax -s 132m
>> libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
>> Error: namespace1.2: failed to enable
>>
>> failed to create namespace: No such device or address
>>
>> (along with a kernel warning spew)
>
> I assume you're seeing this on the libnvdimm-pending branch?

Yes, but also on linus' master branch. Things have been operating in this manner for some time.

>> I understand the desire for expediency. At some point, though, we have to address the root of the problem.
>
> Well, you've defibrillated me back to reality. We've suffered the incomplete broken hacks for 2 years, what's another 10 weeks? I'll dust off the sub-section patches and take another run at it.

OK, thanks. Let me know if I can help at all.

Cheers,

Jeff
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
On Fri, Feb 22, 2019 at 7:42 AM Jeff Moyer wrote:
>
> Dan Williams writes:
>
> >> > However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map (formerly ->virt_addr).
> >> >
> >> > The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data access, not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
> >>
> >> What if the core mm grew support for hotplug on sub-section boundaries? Wouldn't that fix this problem (and others)?
> >
> > Yes, I think it would, and I had patches along these lines [2]. Last time I looked at this I was asked by core-mm folks to await some general refactoring of hotplug [3], and I wasn't proud about some of the hacks I used to make it work. In general I'm less confident about being able to get sub-section-hotplug over the goal line (core-mm resistance to hotplug complexity) vs the local hacks in nvdimm to deal with this breakage.
>
> You first posted that patch series in December of 2016. How long do we wait for this refactoring to happen?
>
> Meanwhile, we've been kicking this can down the road for far too long. Simple namespace creation fails to work. For example:
>
> # ndctl create-namespace -m fsdax -s 128m
> Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
>   did you intend --size=132M?
>
> failed to create namespace: Invalid argument
>
> ok, I can't actually create a small, section-aligned namespace. Let's bump it up:
>
> # ndctl create-namespace -m fsdax -s 132m
> {
>   "dev":"namespace1.0",
>   "mode":"fsdax",
>   "map":"dev",
>   "size":"126.00 MiB (132.12 MB)",
>   "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a",
>   "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2",
>   "sector_size":512,
>   "blockdev":"pmem1",
>   "numa_node":1
> }
>
> Great! Now let's create another one.
>
> # ndctl create-namespace -m fsdax -s 132m
> libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
> Error: namespace1.2: failed to enable
>
> failed to create namespace: No such device or address
>
> (along with a kernel warning spew)

I assume you're seeing this on the libnvdimm-pending branch?

> And at this point, all further ndctl create-namespace commands fail. Lovely. This is a wart that was acceptable only because a fix was coming. 2+ years later, and we're still adding hacks to work around it (and there have been *several* hacks).

True.

> > Local hacks are always a sad choice, but I think leaving these configurations stranded for another kernel cycle is not tenable. It wasn't until the github issue that I realized the problem was happening in the wild on NVDIMM-N platforms.
>
> I understand the desire for expediency. At some point, though, we have to address the root of the problem.

Well, you've defibrillated me back to reality. We've suffered the incomplete broken hacks for 2 years, what's another 10 weeks? I'll dust off the sub-section patches and take another run at it.
Re: [PATCH 7/7] libnvdimm/pfn: Fix 'start_pad' implementation
Dan Williams writes:

>> > However, to fix this situation a non-backwards compatible change needs to be made to the interpretation of the nd_pfn info-block. ->start_pad needs to be accounted in ->map.map_offset (formerly ->data_offset), and ->map.map_base (formerly ->phys_addr) needs to be adjusted to the section aligned resource base used to establish ->map.map (formerly ->virt_addr).
>> >
>> > The guiding principles of the info-block compatibility fixup is to maintain the interpretation of ->data_offset for implementations like the EFI driver that only care about data access, not dax, but cause older Linux implementations that care about the mode and dax to fail to parse the new info-block.
>>
>> What if the core mm grew support for hotplug on sub-section boundaries? Wouldn't that fix this problem (and others)?
>
> Yes, I think it would, and I had patches along these lines [2]. Last time I looked at this I was asked by core-mm folks to await some general refactoring of hotplug [3], and I wasn't proud about some of the hacks I used to make it work. In general I'm less confident about being able to get sub-section-hotplug over the goal line (core-mm resistance to hotplug complexity) vs the local hacks in nvdimm to deal with this breakage.

You first posted that patch series in December of 2016. How long do we wait for this refactoring to happen?

Meanwhile, we've been kicking this can down the road for far too long. Simple namespace creation fails to work. For example:

# ndctl create-namespace -m fsdax -s 128m
Error: '--size=' must align to interleave-width: 6 and alignment: 2097152
  did you intend --size=132M?

failed to create namespace: Invalid argument

ok, I can't actually create a small, section-aligned namespace. Let's bump it up:

# ndctl create-namespace -m fsdax -s 132m
{
  "dev":"namespace1.0",
  "mode":"fsdax",
  "map":"dev",
  "size":"126.00 MiB (132.12 MB)",
  "uuid":"2a5f8fe0-69e2-46bf-98bc-0f5667cd810a",
  "raw_uuid":"f7324317-5cd2-491e-8cd1-ad03770593f2",
  "sector_size":512,
  "blockdev":"pmem1",
  "numa_node":1
}

Great! Now let's create another one.

# ndctl create-namespace -m fsdax -s 132m
libndctl: ndctl_pfn_enable: pfn1.1: failed to enable
Error: namespace1.2: failed to enable

failed to create namespace: No such device or address

(along with a kernel warning spew)

And at this point, all further ndctl create-namespace commands fail. Lovely. This is a wart that was acceptable only because a fix was coming. 2+ years later, and we're still adding hacks to work around it (and there have been *several* hacks).

> Local hacks are always a sad choice, but I think leaving these configurations stranded for another kernel cycle is not tenable. It wasn't until the github issue that I realized the problem was happening in the wild on NVDIMM-N platforms.

I understand the desire for expediency. At some point, though, we have to address the root of the problem.

-Jeff

> [2]: https://lore.kernel.org/lkml/148964440651.19438.2288075389153762985.st...@dwillia2-desk3.amr.corp.intel.com/
> [3]: https://lore.kernel.org/lkml/20170319163531.ga25...@dhcp22.suse.cz/

>> -Jeff
>>
>> [1] https://github.com/pmem/ndctl/issues/76