Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
On Thu, 2017-10-26 at 12:58 +0200, Jan Kara wrote: > On Fri 20-10-17 11:31:48, Christoph Hellwig wrote: > > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote: > > > I'd like to brainstorm how we can do something better. > > > > > > How about: > > > > > > If we hit a page with an elevated refcount in truncate / hole punch > > > etc for a DAX file system we do not free the blocks in the file system, > > > but add it to the extent busy list. We mark the page as delayed > > > free (e.g. page flag?) so that when it finally hits refcount zero we > > > call back into the file system to remove it from the busy list. > > > > Brainstorming some more: > > > > Given that on a DAX file there shouldn't be any long-term page > > references after we unmap it from the page table and don't allow > > get_user_pages calls why not wait for the references for all > > DAX pages to go away first? E.g. if we find a DAX page in > > truncate_inode_pages_range that has an elevated refcount we set > > a new flag to prevent new references from showing up, and then > > simply wait for it to go away. Instead of a busy wait we can > > do this through a few hashed waitqueues in dev_pagemap. And in > > fact put_zone_device_page already gets called when putting the > > last page so we can handle the wakeup from there. > > > > In fact if we can't find a page flag for the stop-new-callers > > thing we could probably come up with a way to do that through > > dev_pagemap somehow, but I'm not sure how efficient that would > > be. > > We were talking about this yesterday with Dan so some more brainstorming > from us. We can implement the solution with extent busy list in ext4 > relatively easily - we already have such list currently similarly to XFS. > There would be some modifications needed but nothing too complex.
The > biggest downside of this solution I see is that it requires per-filesystem > solution for busy extents - ext4 and XFS are reasonably fine, however btrfs > may have problems and ext2 definitely will need some modifications. > Invisible used blocks may be surprising to users at times although given > page refs should be relatively short term, that should not be a big issue. > But are we guaranteed page refs are short term? E.g. if someone creates > v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs > can be rather long-term similarly as in RDMA case. Also freeing of blocks > on page reference drop is another async entry point into the filesystem > which could unpleasantly surprise us but I guess workqueues would solve > that reasonably fine. > > WRT waiting for page refs to be dropped before proceeding with truncate (or > punch hole for that matter - that case is even nastier since we don't have > i_size to guard us). What I like about this solution is that it is very > visible there's something unusual going on with the file being truncated / > punched and so problems are easier to diagnose / fix from the admin side. > So far we have guarded hole punching from concurrent faults (and > get_user_pages() does fault once you do unmap_mapping_range()) with > I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page > refs to be dropped under I_MMAP_LOCK as that could deadlock - the most > obvious case Dan came up with is when GUP obtains ref to page A, then hole > punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be > dropped, and then GUP blocks on trying to fault in another page. > > I think we cannot easily prevent new page references to be grabbed as you > write above since nobody expects stuff like get_page() to fail. But I > think that unmapping relevant pages and then preventing them to be faulted > in again is workable and stops GUP as well. 
The problem with that is though > what to do with page faults to such pages - you cannot just fail them for > hole punch, and you cannot easily allocate new blocks either. So we are > back at a situation where we need to detach blocks from the inode and then > wait for page refs to be dropped - so some form of busy extents. Am I > missing something? > No, that's a good summary of what we talked about. However, I did go back and give the new lock approach a try and was able to get my test to pass. The new locking is not pretty, especially since you need to drop and reacquire the lock so that get_user_pages() can finish grabbing all the pages it needs. Here are the two primary patches in the series; do you think the extent-busy approach would be cleaner? --- commit 5023d20a0aa795ddafd43655be1bfb2cbc7f4445 Author: Dan Williams Date: Wed Oct 25 05:14:54 2017 -0700 mm, dax: handle truncate of dma-busy pages get_user_pages() pins file backed memory pages for access by dma devices. However, it only pins the memory pages not the page-to-file offset association. If a file is truncated the pages are mapped out of the file and dma may continue indefinitely into
Re: [bug report] libnvdimm: clear the internal poison_list when clearing badblocks
On Thu, Oct 26, 2017 at 3:29 AM, Dan Carpenter wrote: > Hello Vishal Verma, > > The patch e046114af5fc: "libnvdimm: clear the internal poison_list > when clearing badblocks" from Sep 30, 2016, leads to the following > static checker warning: Thanks for the report Dan, we'll take a look. ___ Linux-nvdimm mailing list Linux-nvdimm@lists.01.org https://lists.01.org/mailman/listinfo/linux-nvdimm
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote: > On Wed 25-10-17 09:23:22, Dave Chinner wrote: > > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > > From: Christoph Hellwig > > > > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > > blocks for writing and the inode is pinned, and has dirty fields other > > > than the timestamps. > > > > That's "fdatasync dirty", not "fsync dirty". > > Correct. > > > IOMAP_F_DIRTY needs a far better description of its semantics than > > "/* block mapping is not yet on persistent storage */" so we know > > exactly what filesystems are supposed to be implementing here. I > > suspect that what it really is meant to say is: > > > > /* > > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata related to > > * written data and requires fdatasync to commit to persistent storage. > > */ > > I'll update the comment. Thanks! > > > [...] > > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > > index f179bdf1644d..b43be199fbdf 100644 > > > --- a/fs/xfs/xfs_iomap.c > > > +++ b/fs/xfs/xfs_iomap.c > > > @@ -33,6 +33,7 @@ > > > #include "xfs_error.h" > > > #include "xfs_trans.h" > > > #include "xfs_trans_space.h" > > > +#include "xfs_inode_item.h" > > > #include "xfs_iomap.h" > > > #include "xfs_trace.h" > > > #include "xfs_icache.h" > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > > } > > > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > > + iomap->flags |= IOMAP_F_DIRTY; > > > > This is the very definition of an inode that is "fdatasync dirty". > > > > Hmmm, shouldn't this also be set for read faults, too? > > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any > data to the page which he'd then like to be persistent.
The only reason why > I thought it could be useful for a while was that it would be nice to make > MAP_SYNC mapping provide the guarantee that data you see now is the data > you'll see after a crash. Isn't that the entire point of MAP_SYNC? i.e. That when we return from a page fault, the app knows that the data and its underlying extent is on persistent storage? > but we cannot provide that guarantee for RO > mapping anyway if someone else has the page mapped as well. So I just > decided not to return IOMAP_F_DIRTY for read faults. If there are multiple MAP_SYNC mappings to the inode, I would have expected that they all sync all of the data/metadata on every page fault, regardless of who dirtied the inode. An RO mapping doesn't mean the data/metadata on the inode can't change, it just means it can't change through that mapping. Running fsync() to guarantee the persistence of that data/metadata doesn't actually change any data. IOWs, if read faults don't guarantee the mapped range has stable extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken because it's not giving consistent guarantees to userspace. Yes, it works fine when only one MAP_SYNC mapping is modifying the inode, but the moment we have concurrent operations on the inode that aren't MAP_SYNC or O_SYNC this goes out the window. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: [PATCH] mmap.2: Add description of MAP_SHARED_VALIDATE and MAP_SYNC
On Thu, Oct 26, 2017 at 03:24:02PM +0200, Jan Kara wrote: > On Tue 24-10-17 15:10:07, Ross Zwisler wrote: > > On Tue, Oct 24, 2017 at 05:24:15PM +0200, Jan Kara wrote: > > > Signed-off-by: Jan Kara > > > > This looks unchanged since the previous version? > > Ah, thanks for checking. I forgot to commit modifications. Attached is > the really updated patch. > > Honza > -- > Jan Kara > SUSE Labs, CR > From 59eeec2998ed9b3840aab951f213148cb1d053a5 Mon Sep 17 00:00:00 2001 > From: Jan Kara > Date: Thu, 19 Oct 2017 14:44:55 +0200 > Subject: [PATCH] mmap.2: Add description of MAP_SHARED_VALIDATE and MAP_SYNC > > Signed-off-by: Jan Kara Looks good, you can add: Reviewed-by: Ross Zwisler
Re: [PATCH 17/17] xfs: support for synchronous DAX faults
On Wed 25-10-17 09:23:22, Dave Chinner wrote: > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote: > > From: Christoph Hellwig > > > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare > > blocks for writing and the inode is pinned, and has dirty fields other > > than the timestamps. > > That's "fdatasync dirty", not "fsync dirty". Correct. > IOMAP_F_DIRTY needs a far better description of its semantics than > "/* block mapping is not yet on persistent storage */" so we know > exactly what filesystems are supposed to be implementing here. I > suspect that what it really is meant to say is: > > /* > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata related to > * written data and requires fdatasync to commit to persistent storage. > */ I'll update the comment. Thanks! > [...] > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > > index f179bdf1644d..b43be199fbdf 100644 > > --- a/fs/xfs/xfs_iomap.c > > +++ b/fs/xfs/xfs_iomap.c > > @@ -33,6 +33,7 @@ > > #include "xfs_error.h" > > #include "xfs_trans.h" > > #include "xfs_trans_space.h" > > +#include "xfs_inode_item.h" > > #include "xfs_iomap.h" > > #include "xfs_trace.h" > > #include "xfs_icache.h" > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin( > > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > > } > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > > + iomap->flags |= IOMAP_F_DIRTY; > > This is the very definition of an inode that is "fdatasync dirty". > > Hmmm, shouldn't this also be set for read faults, too? No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any data to the page which he'd then like to be persistent.
The only reason why I thought it could be useful for a while was that it would be nice to make MAP_SYNC mapping provide the guarantee that data you see now is the data you'll see after a crash, but we cannot provide that guarantee for RO mapping anyway if someone else has the page mapped as well. So I just decided not to return IOMAP_F_DIRTY for read faults. But now that I look at the XFS implementation again, it misses handling of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right). I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite() as well since it mostly duplicates it anyway... Thanks for inquiring! Honza -- Jan Kara SUSE Labs, CR
Re: Enabling peer to peer device transactions for PCIe devices
- Original Message - > From: "David Laight" > To: "Petrosyan, Ludwig", "Logan Gunthorpe" > Cc: "Alexander Deucher", "linux-kernel", "linux-rdma", "linux-nvdimm", "Linux-media", "dri-devel", "linux-pci", "John Bridgman", "Felix Kuehling", "Serguei Sagalovitch", "Paul Blinzer", "Christian Koenig", "Suravee Suthikulpanit", "Ben Sander" > Sent: Tuesday, 24 October, 2017 16:58:24 > Subject: RE: Enabling peer to peer device transactions for PCIe devices > Please don't top post, write shorter lines, and add the odd blank line. > Big blocks of text are hard to read quickly. > OK this time I am very short. peer2peer works. Ludwig
Re: [PATCH] mmap.2: Add description of MAP_SHARED_VALIDATE and MAP_SYNC
On Tue 24-10-17 15:10:07, Ross Zwisler wrote: > On Tue, Oct 24, 2017 at 05:24:15PM +0200, Jan Kara wrote: > > Signed-off-by: Jan Kara > > This looks unchanged since the previous version? Ah, thanks for checking. I forgot to commit modifications. Attached is the really updated patch. Honza -- Jan Kara SUSE Labs, CR From 59eeec2998ed9b3840aab951f213148cb1d053a5 Mon Sep 17 00:00:00 2001 From: Jan Kara Date: Thu, 19 Oct 2017 14:44:55 +0200 Subject: [PATCH] mmap.2: Add description of MAP_SHARED_VALIDATE and MAP_SYNC Signed-off-by: Jan Kara --- man2/mmap.2 | 35 +++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) diff --git a/man2/mmap.2 b/man2/mmap.2 index 47c3148653be..b38ee6809327 100644 --- a/man2/mmap.2 +++ b/man2/mmap.2 @@ -125,6 +125,21 @@ are carried through to the underlying file. to the underlying file requires the use of .BR msync (2).) .TP +.BR MAP_SHARED_VALIDATE " (since Linux 4.15)" +The same as +.B MAP_SHARED +except that +.B MAP_SHARED +mappings ignore unknown flags in +.IR flags . +In contrast when creating a mapping of the +.B MAP_SHARED_VALIDATE +mapping type, the kernel verifies all passed flags are known and fails the +mapping with +.BR EOPNOTSUPP +otherwise. This mapping type is also required to be able to use some mapping +flags. +.TP .B MAP_PRIVATE Create a private copy-on-write mapping. Updates to the mapping are not visible to other processes @@ -134,7 +149,10 @@ It is unspecified whether changes made to the file after the .BR mmap () call are visible in the mapped region. .PP -Both of these flags are described in POSIX.1-2001 and POSIX.1-2008. +.B MAP_SHARED +and +.B MAP_PRIVATE +are described in POSIX.1-2001 and POSIX.1-2008. .PP In addition, zero or more of the following values can be ORed in .IR flags : @@ -352,6 +370,21 @@ option. Because of the security implications, that option is normally enabled only on embedded devices (i.e., devices where one has complete control of the contents of user memory).
+.TP +.BR MAP_SYNC " (since Linux 4.15)" +This flag is available only with +.B MAP_SHARED_VALIDATE +mapping type. Mappings of +.B MAP_SHARED +type will silently ignore this flag. +This flag is supported only for files supporting DAX (direct mapping of persistent +memory). For other files, creating a mapping with this flag results in +.B EOPNOTSUPP +error. Shared file mappings with this flag provide the guarantee that while +some memory is writeably mapped in the address space of the process, it will +be visible in the same file at the same offset even after the system crashes or +is rebooted. This allows users of such mappings to make data modifications +persistent in a more efficient way using appropriate CPU instructions. .PP Of the above flags, only .B MAP_FIXED -- 2.12.3
[bug report] libnvdimm: add support for clear poison list and badblocks for device dax
Hello Dave Jiang,

The patch 006358b35c73: "libnvdimm: add support for clear poison list and badblocks for device dax" from Apr 7, 2017, leads to the following static checker warning:

	drivers/nvdimm/bus.c:852 nd_pmem_forget_poison_check()
	warn: we tested 'nd_dax' before and it was 'false'

drivers/nvdimm/bus.c
   835  static int nd_pmem_forget_poison_check(struct device *dev, void *data)
   836  {
   837          struct nd_cmd_clear_error *clear_err =
   838                  (struct nd_cmd_clear_error *)data;
   839          struct nd_btt *nd_btt = is_nd_btt(dev) ? to_nd_btt(dev) : NULL;
   840          struct nd_pfn *nd_pfn = is_nd_pfn(dev) ? to_nd_pfn(dev) : NULL;
   841          struct nd_dax *nd_dax = is_nd_dax(dev) ? to_nd_dax(dev) : NULL;
                ^^ nd_dax is set here.
   842          struct nd_namespace_common *ndns = NULL;
   843          struct nd_namespace_io *nsio;
   844          resource_size_t offset = 0, end_trunc = 0, start, end, pstart, pend;
   845
   846          if (nd_dax || !dev->driver)
                ^^ We return if it's non-NULL,
   847                  return 0;
   848
   849          start = clear_err->address;
   850          end = clear_err->address + clear_err->cleared - 1;
   851
   852          if (nd_btt || nd_pfn || nd_dax) {
   853                  if (nd_btt)
   854                          ndns = nd_btt->ndns;
   855                  else if (nd_pfn)
   856                          ndns = nd_pfn->ndns;
   857                  else if (nd_dax)
                        ^^ but the rest of the function assumes it can be true. Perhaps we plan to enable it in the future? It's not clear to me.
   858                          ndns = nd_dax->nd_pfn.ndns;
   859
   860                  if (!ndns)
   861                          return 0;
   862          } else
   863                  ndns = to_ndns(dev);
   864
   865          nsio = to_nd_namespace_io(&ndns->dev);
   866          pstart = nsio->res.start + offset;
   867          pend = nsio->res.end - end_trunc;
   868
   869          if ((pstart >= start) && (pend <= end))
   870                  return -EBUSY;
   871
   872          return 0;
   873
   874  }

regards, dan carpenter
Re: [PATCH v3 00/13] dax: fix dma vs truncate and remove 'page-less' support
On Fri 20-10-17 11:31:48, Christoph Hellwig wrote: > On Fri, Oct 20, 2017 at 09:47:50AM +0200, Christoph Hellwig wrote: > > I'd like to brainstorm how we can do something better. > > > > How about: > > > > If we hit a page with an elevated refcount in truncate / hole punch > > etc for a DAX file system we do not free the blocks in the file system, > > but add it to the extent busy list. We mark the page as delayed > > free (e.g. page flag?) so that when it finally hits refcount zero we > > call back into the file system to remove it from the busy list. > > Brainstorming some more: > > Given that on a DAX file there shouldn't be any long-term page > references after we unmap it from the page table and don't allow > get_user_pages calls why not wait for the references for all > DAX pages to go away first? E.g. if we find a DAX page in > truncate_inode_pages_range that has an elevated refcount we set > a new flag to prevent new references from showing up, and then > simply wait for it to go away. Instead of a busy wait we can > do this through a few hashed waitqueues in dev_pagemap. And in > fact put_zone_device_page already gets called when putting the > last page so we can handle the wakeup from there. > > In fact if we can't find a page flag for the stop-new-callers > thing we could probably come up with a way to do that through > dev_pagemap somehow, but I'm not sure how efficient that would > be. We were talking about this yesterday with Dan so some more brainstorming from us. We can implement the solution with extent busy list in ext4 relatively easily - we already have such list currently similarly to XFS. There would be some modifications needed but nothing too complex.
Invisible used blocks may be surprising to users at times although given page refs should be relatively short term, that should not be a big issue. But are we guaranteed page refs are short term? E.g. if someone creates v4l2 videobuf in MAP_SHARED mapping of a file on DAX filesystem, page refs can be rather long-term similarly as in RDMA case. Also freeing of blocks on page reference drop is another async entry point into the filesystem which could unpleasantly surprise us but I guess workqueues would solve that reasonably fine. WRT waiting for page refs to be dropped before proceeding with truncate (or punch hole for that matter - that case is even nastier since we don't have i_size to guard us). What I like about this solution is that it is very visible there's something unusual going on with the file being truncated / punched and so problems are easier to diagnose / fix from the admin side. So far we have guarded hole punching from concurrent faults (and get_user_pages() does fault once you do unmap_mapping_range()) with I_MMAP_LOCK (or its equivalent in ext4). We cannot easily wait for page refs to be dropped under I_MMAP_LOCK as that could deadlock - the most obvious case Dan came up with is when GUP obtains ref to page A, then hole punch comes grabbing I_MMAP_LOCK and waiting for page ref on A to be dropped, and then GUP blocks on trying to fault in another page. I think we cannot easily prevent new page references to be grabbed as you write above since nobody expects stuff like get_page() to fail. But I think that unmapping relevant pages and then preventing them to be faulted in again is workable and stops GUP as well. The problem with that is though what to do with page faults to such pages - you cannot just fail them for hole punch, and you cannot easily allocate new blocks either. So we are back at a situation where we need to detach blocks from the inode and then wait for page refs to be dropped - so some form of busy extents. Am I missing something? 
Honza -- Jan Kara SUSE Labs, CR
[bug report] libnvdimm: clear the internal poison_list when clearing badblocks
Hello Vishal Verma,

The patch e046114af5fc: "libnvdimm: clear the internal poison_list when clearing badblocks" from Sep 30, 2016, leads to the following static checker warning:

	drivers/nvdimm/core.c:601 nvdimm_forget_poison()
	warn: potential integer overflow from user 'start + len'

drivers/nvdimm/core.c
   597  void nvdimm_forget_poison(struct nvdimm_bus *nvdimm_bus, phys_addr_t start,
   598                  unsigned int len)
   599  {
   600          struct list_head *poison_list = &nvdimm_bus->poison_list;
   601          u64 clr_end = start + len - 1;
                ^^ These come from __nd_ioctl() and it looks like they haven't been checked before we call this function. It's hard for me to read this function well enough that I can say for sure the overflow is harmless. Please review?
   602          struct nd_poison *pl, *next;
   603
   604          spin_lock(&nvdimm_bus->poison_lock);
   605          WARN_ON_ONCE(list_empty(poison_list));
   606
   607          /*
   608           * [start, clr_end] is the poison interval being cleared.
   609           * [pl->start, pl_end] is the poison_list entry we're comparing
   610           * the above interval against. The poison list entry may need
   611           * to be modified (update either start or length), deleted, or
   612           * split into two based on the overlap characteristics
   613           */
   614
   615          list_for_each_entry_safe(pl, next, poison_list, list) {
   616                  u64 pl_end = pl->start + pl->length - 1;
   617
   618                  /* Skip intervals with no intersection */
   619                  if (pl_end < start)
   620                          continue;
   621                  if (pl->start > clr_end)
   622                          continue;
   623                  /* Delete completely overlapped poison entries */
   624                  if ((pl->start >= start) && (pl_end <= clr_end)) {
   625                          list_del(&pl->list);

regards, dan carpenter