Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"
On Wed, 7 Apr 2021, Linus Torvalds wrote:

> On Wed, Apr 7, 2021 at 9:33 AM Suren Baghdasaryan wrote:
> >
> > Trying my hand at backporting the patchsets Peter mentioned proved
> > this to be far from easy with many dependencies. Let me look into
> > Vlastimil's suggestion to backport only 17839856fd58 and it sounds
> > like 5.4 already followed that path.
>
> Well, in many ways 17839856fd58 was the "simple and obvious" fix, and
> I do think it's easily backportable.
>
> But it *did* cause problems too. Those problems may not be issues on
> those old kernels, though.
>
> In particular, commit 17839856fd58 caused uffd-wp to stop working
> right, and it caused some issues with debugging (I forget the exact
> details, but I think it was strace accessing PROT_NONE or write-only
> pages or something like that, and COW failed).
>
> But yes, in many ways that commit is a much simpler and more
> straightforward one (which is why I tried it once - we ended up with
> the much more subtle and far-reaching fixes after the UFFD issues
> crept up).
>
> The issues that 17839856fd58 caused may be entire non-events in old
> kernels. In fact, the uffd writeprotect API was added fairly recently
> (see commit 63b2d4174c4a that made it into v5.7), so the uffd-wp issue
> that was triggered probably cannot happen in the old kernels.
>
> The strace issue might not be relevant either, but I forget what the
> details were. Mikulas should know.
>
> See
>
> https://lore.kernel.org/lkml/alpine.lrh.2.02.2009031328040.6...@file01.intranet.prod.int.rdu2.redhat.com/
>
> for Mikulas' report. I never looked into it in detail, because by then
> the uffd-wp issue had already come up, so it was just another nail in
> the coffin for that simpler approach.
>
> Mikulas, do you remember?
>
> Linus

Hi

I think that we never found the root cause of this bug. I was testing
whether the whole system could run from persistent memory and found out
that strace didn't work.
I bisected it, reported it, and when I received Peter Xu's patches (which
fixed it), I stopped worrying about it. So, it was fixed, but we don't
know why. Peter Xu's patchset that fixed it is here:
https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/

Mikulas
[RFC v3] nvfs: a filesystem for persistent memory
Hi

I announce a new version of NVFS - a filesystem for persistent memory.

http://people.redhat.com/~mpatocka/nvfs/
git://leontynka.twibright.com/nvfs.git

Changes since the last release - I reworked file read/write handling:

* the functions nvfs_read and nvfs_write were deleted because it's
  unlikely that the upstream kernel will allow them.
* the functions nvfs_read_iter and nvfs_write_iter have a fast path if
  there is just one segment in the iov_iter - they will call
  nvfs_read_locked and nvfs_write_locked directly. This improves
  performance by 3% on the read path and 1% on the write path.
* read_iter_locked uses copy_to_iter, as suggested by Al Viro.
* write_iter_locked doesn't use copy_from_iter_flushcache, because we
  need a copy that doesn't advance the iter (the "copy" and "advance"
  must be two separate operations). So, I added new operations
  "iov_iter_map" and "iov_iter_unmap" - iov_iter_map will map the first
  segment of the iov and iov_iter_unmap will unmap it.

Do you think that introducing "iov_iter_map" and "iov_iter_unmap" is
appropriate? Do you have some other idea how to handle this?

Mikulas
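The "copy without advancing" split described above can be modelled in user space. The following is only a sketch of the intended semantics - the names iter_map_first and iter_advance are illustrative stand-ins for the proposed iov_iter_map/iov_iter_unmap, which are not upstream kernel APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Minimal user-space model of an iov_iter over an iovec array. */
struct iter {
	const struct iovec *iov;
	size_t nr_segs;
	size_t iov_offset;	/* bytes already consumed from iov[0] */
};

/* "Map" the first segment: expose a pointer and length without
 * consuming anything - the analogue of the proposed iov_iter_map(). */
static void *iter_map_first(struct iter *it, size_t *len)
{
	if (!it->nr_segs)
		return NULL;
	*len = it->iov[0].iov_len - it->iov_offset;
	return (char *)it->iov[0].iov_base + it->iov_offset;
}

/* Advance as a separate second step, done only after the copy
 * (e.g. a flushcache copy) has actually succeeded. */
static void iter_advance(struct iter *it, size_t bytes)
{
	it->iov_offset += bytes;
	if (it->nr_segs && it->iov_offset == it->iov[0].iov_len) {
		it->iov++;
		it->nr_segs--;
		it->iov_offset = 0;
	}
}
```

Separating the two steps lets the caller retry or abort after a partial copy without the iterator having moved, which is the property write_iter_locked needs.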
Re: Expense of read_iter
On Thu, 21 Jan 2021, Matthew Wilcox wrote:

> On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> > Do you have some idea how to optimize the generic code that calls
> > ->read_iter?
>
> Yes.
>
> It might be better to maintain an f_iocb_flags in the
> struct file and just copy that unconditionally. We'd need to remember
> to update it in fcntl(F_SETFL), but I think that's the only place.
> Want to give that a try?

Yes - send me the patch and I'll benchmark it.

Mikulas
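Matthew's suggestion - compute the per-I/O flags once when the file flags change, so the hot path copies a cached value unconditionally - can be sketched in user space. The flag values and helper names below are made up for illustration; the real kernel derives kiocb flags from file->f_flags (a variant of this idea was later merged upstream as file->f_iocb_flags):

```c
#include <assert.h>

/* Illustrative flag bits - not the real kernel encodings. */
#define FL_APPEND 0x01
#define FL_DIRECT 0x02
#define KI_APPEND 0x10
#define KI_DIRECT 0x20

struct file_model {
	unsigned f_flags;	/* as set by open()/fcntl(F_SETFL) */
	unsigned f_iocb_flags;	/* cached translation, copied per I/O */
};

/* The per-call translation that the generic read path currently redoes. */
static unsigned iocb_flags_of(unsigned f_flags)
{
	unsigned fl = 0;
	if (f_flags & FL_APPEND) fl |= KI_APPEND;
	if (f_flags & FL_DIRECT) fl |= KI_DIRECT;
	return fl;
}

/* Translate once at open()/F_SETFL time... */
static void setfl(struct file_model *f, unsigned f_flags)
{
	f->f_flags = f_flags;
	f->f_iocb_flags = iocb_flags_of(f_flags);
}

/* ...so starting an I/O is an unconditional copy, not a recomputation. */
static unsigned start_io(const struct file_model *f)
{
	return f->f_iocb_flags;
}
```

The design point is that F_SETFL is rare and reads are hot, so moving the conditional work to the rare path is a net win.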
Re: Expense of read_iter
On Wed, 20 Jan 2021, Jan Kara wrote:

> Yeah, I agree. I'm against an ext4-private solution for this read
> problem. And I'm also against duplicating ->read_iter functionality in
> the ->read handler. The maintenance burden of this code duplication is
> IMHO just too big. We rather need to improve the generic code so that
> the fast path is faster. And every filesystem will benefit, because
> this is not an ext4-specific problem.
>
> Honza

Do you have some idea how to optimize the generic code that calls
->read_iter?

vfs_read calls ->read if it is present. If not, it calls new_sync_read.
new_sync_read's frame size is 128 bytes - it holds the structures iovec,
kiocb and iov_iter. new_sync_read calls ->read_iter.

I have found out that the cost of calling new_sync_read is 3.3%; Zhongwei
found 3.9%. (The benchmark repeatedly reads the same 4k page.)

I don't see any way to optimize new_sync_read or to reduce its frame
size. Do you?

Mikulas
Re: Expense of read_iter
On Tue, 12 Jan 2021, Zhongwei Cai wrote:

> I'm working with Mingkai on optimizations for Ext4-dax.

What specific patch are you working on? Please post it somewhere.

> We think that optimizing the read-iter method cannot achieve the
> same performance as the read method for Ext4-dax.
> We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
> results are listed below:
>
> Overall time of 2^26 4KB reads:
>
> Method     Time
> read       26.782s
> read-iter  36.477s

What happens if you use this trick
( https://lkml.org/lkml/2021/1/11/1612 ) - detect in the "read_iter"
method that there is just one segment and treat it like a "read" method?
I think that it should improve performance for your case.

Mikulas
Re: Expense of read_iter
On Mon, 11 Jan 2021, Matthew Wilcox wrote:

> On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> > I put counters into vfs_read and vfs_readv.
> >
> > After a fresh boot of the virtual machine, the counters show "13385 4".
> > After a kernel compilation they show "4475220 8".
> >
> > So, the readv path is almost unused.
> >
> > My reasoning was that we should optimize for the "read" path and glue
> > the "readv" path on top of that. Currently, the kernel is doing the
> > opposite - optimizing for "readv" and glueing "read" on top of it.
>
> But it's not about optimising for read vs readv. read_iter handles
> a host of other cases, such as pread(), preadv(), AIO reads, splice,
> and reads to in-kernel buffers.

These things are used rarely compared to "read" and "pread". (BTW,
"pread" could be handled by the read method too.)

What's the reason why you think that the "read" syscall should use the
"read_iter" code path? Is it because duplicating the logic is
discouraged? Or because of code size? Or something else?

> Some device drivers abused read() vs readv() to actually return different
> information, depending which you called. That's why there's now a
> prohibition against both.
>
> So let's figure out how to make iter_read() perform well for sys_read().

I've got another idea - in nvfs_read_iter, test whether the iovec
contains just one entry and call nvfs_read_locked if it does.
diff --git a/file.c b/file.c
index f4b8a1a..e4d87b2 100644
--- a/file.c
+++ b/file.c
@@ -460,6 +460,10 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 	if (!IS_DAX(&nmi->vfs_inode)) {
 		r = generic_file_read_iter(iocb, iov);
 	} else {
+		if (likely(iter_is_iovec(iov)) && likely(!iov->iov_offset) && likely(iov->nr_segs == 1)) {
+			r = nvfs_read_locked(nmi, iocb->ki_filp, iov->iov->iov_base, iov->count, true, &iocb->ki_pos);
+			goto unlock_ret;
+		}
 #if 1
 		r = nvfs_rw_iter_locked(iocb, iov, false);
 #else
@@ -467,6 +471,7 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 #endif
 	}
 
+unlock_ret:
 	inode_unlock_shared(&nmi->vfs_inode);
 	return r;

The result is:

nvfs_read_iter             - 7.307s
Al Viro's read_iter_locked - 7.147s
test for just one entry    - 7.010s
the read method            - 6.782s

So far, this is the best way to do it, but it's still 3.3% worse than the
read method. There's not anything more that could be optimized on the
filesystem level - the rest of the optimizations must be done in the VFS.

Mikulas
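The fast-path test in the patch above - a plain read()/pread() always arrives as a single iovec segment at offset zero - can be modelled in user space. The names below are illustrative; the real check in the patch is iter_is_iovec() && !iov_offset && nr_segs == 1:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

enum path { PATH_FAST, PATH_SLOW };

struct iter {
	const struct iovec *iov;
	size_t nr_segs;
	size_t iov_offset;
};

/* Mirrors the condition in nvfs_read_iter: a single segment at offset
 * zero can skip the generic iov_iter machinery and go straight to a
 * plain buffer+length read path (the nvfs_read_locked analogue);
 * everything else takes the general iterator path. */
static enum path dispatch(const struct iter *it)
{
	if (it->nr_segs == 1 && it->iov_offset == 0)
		return PATH_FAST;	/* direct buffer read */
	return PATH_SLOW;		/* generic iov_iter loop */
}
```

The benchmark numbers above show why this pays off: the overwhelmingly common case never touches iov_iter_advance at all.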
RE: [RFC v2] nvfs: a filesystem for persistent memory
On Mon, 11 Jan 2021, David Laight wrote:

> From: Al Viro On Behalf Of Al Viro
> > Sent: 10 January 2021 16:20
> >
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > Hi
> > >
> > > I announce a new version of NVFS - a filesystem for persistent memory.
> > > http://people.redhat.com/~mpatocka/nvfs/
> >
> > Utilities, AFAICS
> >
> > > git://leontynka.twibright.com/nvfs.git
> >
> > Seems to hang on git pull at the moment... Do you have it anywhere else?
> >
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> >
> > Apples and oranges... What happens if you take
> >
> > ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> > {
> > 	struct inode *inode = file_inode(file);
> > 	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> > 	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> > 	ssize_t total = 0;
> > 	loff_t pos = *ppos;
> > 	int r;
> > 	int shift = nvs->log2_page_size;
> > 	size_t i_size;
> >
> > 	i_size = inode->i_size;
> > 	if (pos >= i_size)
> > 		return 0;
> > 	iov_iter_truncate(to, i_size - pos);
> >
> > 	while (iov_iter_count(to)) {
> > 		void *blk, *ptr;
> > 		size_t page_mask = (1UL << shift) - 1;
> > 		unsigned page_offset = pos & page_mask;
> > 		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> > 		unsigned size;
> >
> > 		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> > 		if (unlikely(IS_ERR(blk))) {
> > 			r = PTR_ERR(blk);
> > 			goto ret_r;
> > 		}
> > 		size = ((size_t)prealloc << shift) - page_offset;
> > 		ptr = blk + page_offset;
> > 		if (unlikely(!blk)) {
> > 			size = min(size, (unsigned)PAGE_SIZE);
> > 			ptr = empty_zero_page;
> > 		}
> > 		size = copy_to_iter(to, ptr, size);
> > 		if (unlikely(!size)) {
> > 			r = -EFAULT;
> > 			goto ret_r;
> > 		}
> >
> > 		pos += size;
> > 		total += size;
> > 	} while (iov_iter_count(to));
>
> That isn't the best formed loop!
>
> David

I removed the second "while" statement and fixed the arguments to
copy_to_iter - other than that, Al's function works.

Mikulas
Re: [RFC v2] nvfs: a filesystem for persistent memory
On Sun, 10 Jan 2021, Al Viro wrote:

> On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:
>
> > That's a good point. I split nvfs_rw_iter to separate functions
> > nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into
> > both of them. It improved performance by 1.3%.
> >
> > > Not that it had been more useful on the write side, really,
> > > but that's another story (nvfs_write_pages() handling of
> > > copyin is... interesting). Let's figure out what's going
> > > on with the read overhead first...
> > >
> > > lib/iov_iter.c primitives certainly could use massage for
> > > better code generation, but let's find out how much of the
> > > PITA is due to those and how much comes from you fighting
> > > the damn thing instead of using it sanely...
> >
> > The results are:
> >
> > read:                                     6.744s
> > read_iter:                                7.417s
> > read_iter - separate read and write path: 7.321s
> > Al's read_iter:                           7.182s
> > Al's read_iter with _copy_to_iter:        7.181s
>
> So
> 	* overhead of hardening stuff is noise here
> 	* switching to more straightforward ->read_iter() cuts
> the overhead by about 1/3.
>
> Interesting... I wonder how much of that is spent in
> iterate_and_advance() glue inside copy_to_iter() here. There's
> certainly quite a bit of optimizations possible in those
> primitives and your usecase makes a decent test for that...
>
> Could you profile that and see where is it spending
> the time, on instruction level?

This is the read method profile (time: 9.056s):

 52.69%  pread  [kernel.vmlinux]  [k] copy_user_generic_string
  6.24%  pread  [kernel.vmlinux]  [k] current_time
  6.22%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64
  4.88%  pread  libc-2.31.so      [.] __libc_pread
  3.75%  pread  [kernel.vmlinux]  [k] syscall_return_via_sysret
  3.63%  pread  [nvfs]            [k] nvfs_read
  2.83%  pread  [nvfs]            [k] nvfs_bmap
  2.81%  pread  [kernel.vmlinux]  [k] vfs_read
  2.63%  pread  [kernel.vmlinux]  [k] __x64_sys_pread64
  2.27%  pread  [kernel.vmlinux]  [k] __fsnotify_parent
  2.19%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
  1.55%  pread  [kernel.vmlinux]  [k] atime_needs_update
  1.17%  pread  [kernel.vmlinux]  [k] syscall_enter_from_user_mode
  1.15%  pread  [kernel.vmlinux]  [k] touch_atime
  0.84%  pread  [kernel.vmlinux]  [k] down_read
  0.82%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
  0.71%  pread  [kernel.vmlinux]  [k] do_syscall_64
  0.68%  pread  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
  0.66%  pread  [kernel.vmlinux]  [k] __fget_light
  0.53%  pread  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
  0.45%  pread  [kernel.vmlinux]  [k] up_read
  0.44%  pread  pread             [.] main
  0.44%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
  0.26%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
  0.12%  pread  pread             [.] pread@plt
  0.07%  pread  [kernel.vmlinux]  [k] __fdget
  0.00%  perf   [kernel.vmlinux]  [k] x86_pmu_enable_all

This is the profile of "read_iter - separate read and write path"
(time: 10.058s):

 53.05%  pread  [kernel.vmlinux]  [k] copy_user_generic_string
  6.82%  pread  [kernel.vmlinux]  [k] current_time
  6.27%  pread  [nvfs]            [k] nvfs_read_iter
  4.70%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64
  3.20%  pread  libc-2.31.so      [.] __libc_pread
  2.77%  pread  [kernel.vmlinux]  [k] syscall_return_via_sysret
  2.31%  pread  [kernel.vmlinux]  [k] vfs_read
  2.15%  pread  [kernel.vmlinux]  [k] new_sync_read
  2.06%  pread  [kernel.vmlinux]  [k] __fsnotify_parent
  2.02%  pread  [nvfs]            [k] nvfs_bmap
  1.87%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
  1.86%  pread  [kernel.vmlinux]  [k] iov_iter_advance
  1.62%  pread  [kernel.vmlinux]  [k] __x64_sys_pread64
  1.40%  pread  [kernel.vmlinux]  [k] atime_needs_update
  0.99%  pread  [kernel.vmlinux]  [k] syscall_enter_from_user_mode
  0.85%  pread  [kernel.vmlinux]  [k] touch_atime
  0.85%  pread  [kernel.vmlinux]  [k] down_read
  0.84%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
  0.78%  pread  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
  0.65%  pread  [kernel.vmlinux]  [k] __fget_light
  0.57%  pread  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
  0.53%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
  0.45%  pread
Re: Expense of read_iter
On Sun, 10 Jan 2021, Matthew Wilcox wrote:

> > That is the reason for that 10% degradation with read_iter.
>
> You seem to be focusing on your argument for "let's just permit
> filesystems to implement both ->read and ->read_iter". My suggestion
> is that we need to optimise the ->read_iter path, but to do that we need
> to know what's expensive.
>
> nvfs_rw_iter_locked() looks very complicated. I suspect it can
> be simplified.

I split it into separate read and write functions and it improved
performance by 1.3%. Using Al Viro's read_iter improves performance by
3%.

> Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

I put counters into vfs_read and vfs_readv.

After a fresh boot of the virtual machine, the counters show "13385 4".
After a kernel compilation they show "4475220 8".

So, the readv path is almost unused.

My reasoning was that we should optimize for the "read" path and glue the
"readv" path on top of that. Currently, the kernel is doing the opposite -
optimizing for "readv" and glueing "read" on top of it.

Mikulas
Re: [RFC v2] nvfs: a filesystem for persistent memory
On Sun, 10 Jan 2021, Al Viro wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> >
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > http://people.redhat.com/~mpatocka/nvfs/
>
> Utilities, AFAICS
>
> > git://leontynka.twibright.com/nvfs.git
>
> Seems to hang on git pull at the moment... Do you have it anywhere else?

I saw some errors 'git-daemon: fatal: the remote end hung up unexpectedly'
in syslog. I don't know what's causing them.

> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Apples and oranges... What happens if you take
>
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
> 	struct inode *inode = file_inode(file);
> 	struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> 	struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> 	ssize_t total = 0;
> 	loff_t pos = *ppos;
> 	int r;
> 	int shift = nvs->log2_page_size;
> 	size_t i_size;
>
> 	i_size = inode->i_size;
> 	if (pos >= i_size)
> 		return 0;
> 	iov_iter_truncate(to, i_size - pos);
>
> 	while (iov_iter_count(to)) {
> 		void *blk, *ptr;
> 		size_t page_mask = (1UL << shift) - 1;
> 		unsigned page_offset = pos & page_mask;
> 		unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> 		unsigned size;
>
> 		blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> 		if (unlikely(IS_ERR(blk))) {
> 			r = PTR_ERR(blk);
> 			goto ret_r;
> 		}
> 		size = ((size_t)prealloc << shift) - page_offset;
> 		ptr = blk + page_offset;
> 		if (unlikely(!blk)) {
> 			size = min(size, (unsigned)PAGE_SIZE);
> 			ptr = empty_zero_page;
> 		}
> 		size = copy_to_iter(to, ptr, size);
> 		if (unlikely(!size)) {
> 			r = -EFAULT;
> 			goto ret_r;
> 		}
>
> 		pos += size;
> 		total += size;
> 	} while (iov_iter_count(to));
>
> 	r = 0;
>
> ret_r:
> 	*ppos = pos;
>
> 	if (file)
> 		file_accessed(file);
>
> 	return total ? total : r;
> }
>
> and use that instead of your nvfs_rw_iter_locked() in your
> ->read_iter() for DAX read case? Then the same with
> s/copy_to_iter/_copy_to_iter/, to see how much of that is
> "hardening" overhead.
>
> Incidentally, what's the point of sharing nvfs_rw_iter() for
> read and write cases? They have practically no overlap -
> count the lines common for wr and !wr cases. And if you
> do the same in nvfs_rw_iter_locked(), you'll see that the
> shared parts _there_ are bloody pointless on the read side.

That's a good point. I split nvfs_rw_iter to separate functions
nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into
both of them. It improved performance by 1.3%.

> Not that it had been more useful on the write side, really,
> but that's another story (nvfs_write_pages() handling of
> copyin is... interesting). Let's figure out what's going
> on with the read overhead first...
>
> lib/iov_iter.c primitives certainly could use massage for
> better code generation, but let's find out how much of the
> PITA is due to those and how much comes from you fighting
> the damn thing instead of using it sanely...

The results are:

read:                                     6.744s
read_iter:                                7.417s
read_iter - separate read and write path: 7.321s
Al's read_iter:                           7.182s
Al's read_iter with _copy_to_iter:        7.181s

Mikulas
Re: Expense of read_iter
On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > 	if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > 		return warn_unsupported...
> > and __kernel_write:
> > 	if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > 		return warn_unsupported...
> >
> > - It exits with an error if both read_iter and read or write_iter and
> > write are present.
> >
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Which part of it is so expensive?

The read_iter path is much bigger:

vfs_read            - 0x160 bytes
new_sync_read       - 0x160 bytes
nvfs_rw_iter        - 0x100 bytes
nvfs_rw_iter_locked - 0x4a0 bytes
iov_iter_advance    - 0x300 bytes

If we go with the "read" method, there's just:

vfs_read  - 0x160 bytes
nvfs_read - 0x200 bytes

> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
>
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>
>  enum iter_type {
>  	/* iter types */
> +	ITER_UBUF = 2,
>  	ITER_IOVEC = 4,
>  	ITER_KVEC = 8,
>  	ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>  	size_t iov_offset;
>  	size_t count;
>  	union {
> +		void __user *buf;
>  		const struct iovec *iov;
>  		const struct kvec *kvec;
>  		const struct bio_vec *bvec;
>
> and then doing all the appropriate changes to make that work.
I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;
	unsigned s = 4096;
	void *a = valloc(s);
	if (!a) perror("malloc"), exit(1);
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != s) perror("read"), exit(1);
	}
	return 0;
}

Result, using the read_iter method:

# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol
#
 47.32%  pread  [kernel.vmlinux]  [k] copy_user_generic_string
  7.83%  pread  [kernel.vmlinux]  [k] current_time
  6.57%  pread  [nvfs]            [k] nvfs_rw_iter_locked
  5.59%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64
  4.23%  pread  libc-2.31.so      [.] __libc_pread
  3.51%  pread  [kernel.vmlinux]  [k] syscall_return_via_sysret
  2.34%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
  2.34%  pread  [kernel.vmlinux]  [k] vfs_read
  2.34%  pread  [kernel.vmlinux]  [k] __fsnotify_parent
  2.31%  pread  [kernel.vmlinux]  [k] new_sync_read
  2.21%  pread  [nvfs]            [k] nvfs_bmap
  1.89%  pread  [kernel.vmlinux]  [k] iov_iter_advance
  1.71%  pread  [kernel.vmlinux]  [k] __x64_sys_pread64
  1.59%  pread  [kernel.vmlinux]  [k] atime_needs_update
  1.24%  pread  [nvfs]            [k] nvfs_rw_iter
  0.94%  pread  [kernel.vmlinux]  [k] touch_atime
  0.75%  pread  [kernel.vmlinux]  [k] syscall_enter_from_user_mode
  0.72%  pread  [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
  0.68%  pread  [kernel.vmlinux]  [k] down_read
  0.62%  pread  [kernel.vmlinux]  [k] exit_to_user_mode_prepare
  0.52%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode
  0.49%  pread  [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
  0.47%  pread  [kernel.vmlinux]  [k] __fget_light
  0.46%  pread  [kernel.vmlinux]  [k] do_syscall_64
  0.42%  pread  pread             [.] main
  0.33%  pread  [kernel.vmlinux]  [k] up_read
  0.29%  pread  [kernel.vmlinux]  [k] iov_iter_init
  0.16%  pread  [kernel.vmlinux]  [k] __fdget
  0.10%  pread  [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
  0.03%  pread  pread             [.] pread@plt
  0.00%  perf   [kernel.vmlinux]  [k] x86_pmu_enable_all

Result, using the read method:

# Samples: 3K of even
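Matthew's ITER_UBUF suggestion above stores the single user buffer directly in the iterator, removing the iovec indirection on the hot path. A user-space model of the tagged union (types and names are illustrative; an ITER_UBUF variant along these lines did eventually land upstream):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

enum iter_type { ITER_UBUF, ITER_IOVEC };

struct iter {
	enum iter_type type;
	size_t count;
	union {
		void *ubuf;		/* single buffer, stored inline */
		const struct iovec *iov;/* general case: one more pointer chase */
	};
};

/* First-segment lookup: the UBUF case reads the answer straight out of
 * the iterator, with no dereference of an external iovec array. */
static void *first_seg(const struct iter *it, size_t *len)
{
	if (it->type == ITER_UBUF) {
		*len = it->count;
		return it->ubuf;
	}
	*len = it->iov[0].iov_len;
	return it->iov[0].iov_base;
}
```

Both variants describe the same buffer; the UBUF form just encodes the common single-segment read()/pread() case without the extra structure.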
[RFC v2] nvfs: a filesystem for persistent memory
Hi

I announce a new version of NVFS - a filesystem for persistent memory.

http://people.redhat.com/~mpatocka/nvfs/
git://leontynka.twibright.com/nvfs.git

Changes since the last release:

* I added a microjournal to the filesystem; it can hold up to 16 entries.
  Each CPU has its own journal, so that there is no lock contention. The
  journal is used to provide atomicity of rename() and extended attribute
  replace. (Note that file creation or deletion doesn't use the journal,
  because these operations can be deterministically cleaned up by fsck.)

* I created a framework that can be used to verify the filesystem driver.
  It logs all writes and memory barriers to a file, the entries in the
  file are randomly reordered (to simulate reordering in the CPU
  write-combining buffers), the sequence is cut at a random point (to
  simulate a system crash) and the result is replayed on a filesystem
  image. With this framework, we can for example check that if a crash
  happens during rename(), either the old file or the new file will be
  present in a directory. This framework helped to find a few bugs in
  sequencing the writes.

* If we map an executable image, we turn off the DAX flag on the inode
  (because executables run 4% slower from persistent memory). There is
  also a switch that can turn DAX always off or always on.

I'd like to ask about this piece of code in __kernel_read:

	if (unlikely(!file->f_op->read_iter || file->f_op->read))
		return warn_unsupported...

and __kernel_write:

	if (unlikely(!file->f_op->write_iter || file->f_op->write))
		return warn_unsupported...

- It exits with an error if both read_iter and read or write_iter and
write are present.

I found out that on NVFS, reading a file with the read method has 10%
better performance than the read_iter method. The benchmark just reads the
same 4k page over and over again - and the cost of creating and parsing
the kiocb and iov_iter structures is just that high.

So, I'd like to have both read and read_iter methods.
Could the above conditions be changed, so that they don't fail with an error if the "read" or "write" method is present? Mikulas
Re: [PATCH v3 0/2] dm crypt: some fixes to support dm-crypt running in softirq context
On Mon, 4 Jan 2021, Ignat Korchagin wrote: > Changes from v1: > * 0001: handle memory allocation failure for GFP_ATOMIC > > Changes from v2: > * reordered patches > * 0002: crypt_convert will be retried from a workqueue, when a crypto > request > allocation fails with GFP_ATOMIC instead of just returning an IO error to > the user > > Ignat Korchagin (2): > dm crypt: do not wait for backlogged crypto request completion in > softirq > dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq > > drivers/md/dm-crypt.c | 138 +- > 1 file changed, 123 insertions(+), 15 deletions(-) > > -- > 2.20.1 > Acked-by: Mikulas Patocka
Re: [PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
On Wed, 30 Dec 2020, Ignat Korchagin wrote:

> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 53791138d78b..e4fd690c70e1 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1539,7 +1549,10 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>
>  	while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
>
> -		crypt_alloc_req(cc, ctx);
> +		r = crypt_alloc_req(cc, ctx);
> +		if (r)
> +			return BLK_STS_RESOURCE;
> +
>  		atomic_inc(&ctx->cc_pending);
>
>  		if (crypt_integrity_aead(cc))
> --
> 2.20.1

I'm not quite convinced that returning BLK_STS_RESOURCE will help. The
block layer will convert this value back to -ENOMEM and return it to the
caller, resulting in an I/O error.

Note that GFP_ATOMIC allocations may fail at any time and you must handle
allocation failure gracefully - i.e. process the request without any
error.

An acceptable solution would be to punt the request to a workqueue and do
a GFP_NOIO allocation from the workqueue. Or add the request to some list
and process the list when some other request completes.

You should write a test that simulates allocation failure and verify that
the kernel handles it gracefully without any I/O error.

Mikulas
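The "punt to a workqueue" fallback described above can be modelled without kernel APIs: a request whose atomic allocation fails is queued and retried later from a context that may block, so the caller never sees an error. All names here are illustrative stand-ins for mempool_alloc(GFP_ATOMIC) and queue_work():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_REQS 16

static int deferred[MAX_REQS];	/* request ids awaiting a retry */
static size_t n_deferred;
static int completed[MAX_REQS];
static size_t n_completed;

/* Stand-in for mempool_alloc(GFP_ATOMIC): may fail at any time;
 * the failure is simulated via the parameter. */
static bool alloc_atomic(bool simulate_failure)
{
	return !simulate_failure;
}

/* Softirq-context path: on allocation failure, defer the request
 * (the queue_work() analogue) instead of returning an I/O error. */
static void submit(int req, bool atomic_alloc_fails)
{
	if (alloc_atomic(atomic_alloc_fails))
		completed[n_completed++] = req;
	else
		deferred[n_deferred++] = req;
}

/* Workqueue context: here the allocation may block (GFP_NOIO), so
 * every deferred request eventually completes without error. */
static void run_worker(void)
{
	for (size_t i = 0; i < n_deferred; i++)
		completed[n_completed++] = deferred[i];
	n_deferred = 0;
}
```

This matches the test Mikulas asks for: with simulated allocation failures, every request must still complete and none may report an error.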
Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
Hi This patch doesn't handle allocation failure gracefully. Mikulas On Tue, 29 Dec 2020, Ignat Korchagin wrote: > Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code > paths in dm-crypt to be executed in softirq context, when the underlying > driver > processes IO requests in interrupt/softirq context. > > In this case sometimes when allocating a new crypto request we may get a > stacktrace like below: > > [ 210.103008][C0] BUG: sleeping function called from invalid context at > mm/mempool.c:381 > [ 210.104746][C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: > 2602, name: fio > [ 210.106599][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW > 5.10.0+ #50 > [ 210.108331][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 0.0.0 02/06/2015 > [ 210.110212][C0] Call Trace: > [ 210.110921][C0] > [ 210.111527][C0] dump_stack+0x7d/0xa3 > [ 210.112411][C0] ___might_sleep.cold+0x122/0x151 > [ 210.113527][C0] mempool_alloc+0x16b/0x2f0 > [ 210.114524][C0] ? __queue_work+0x515/0xde0 > [ 210.115553][C0] ? mempool_resize+0x700/0x700 > [ 210.116586][C0] ? crypt_endio+0x91/0x180 > [ 210.117479][C0] ? blk_update_request+0x757/0x1150 > [ 210.118513][C0] ? blk_mq_end_request+0x4b/0x480 > [ 210.119572][C0] ? blk_done_softirq+0x21d/0x340 > [ 210.120628][C0] ? __do_softirq+0x190/0x611 > [ 210.121626][C0] crypt_convert+0x29f9/0x4c00 > [ 210.122668][C0] ? _raw_spin_lock_irqsave+0x87/0xe0 > [ 210.123824][C0] ? kasan_set_track+0x1c/0x30 > [ 210.124858][C0] ? crypt_iv_tcw_ctr+0x4a0/0x4a0 > [ 210.125930][C0] ? kmem_cache_free+0x104/0x470 > [ 210.126973][C0] ? crypt_endio+0x91/0x180 > [ 210.127947][C0] kcryptd_crypt_read_convert+0x30e/0x420 > [ 210.129165][C0] blk_update_request+0x757/0x1150 > [ 210.130231][C0] blk_mq_end_request+0x4b/0x480 > [ 210.131294][C0] blk_done_softirq+0x21d/0x340 > [ 210.132332][C0] ? _raw_spin_lock+0x81/0xd0 > [ 210.133289][C0] ? blk_mq_stop_hw_queue+0x30/0x30 > [ 210.134399][C0] ? 
_raw_read_lock_irq+0x40/0x40 > [ 210.135458][C0] __do_softirq+0x190/0x611 > [ 210.136409][C0] ? handle_edge_irq+0x221/0xb60 > [ 210.137447][C0] asm_call_irq_on_stack+0x12/0x20 > [ 210.138507][C0] > [ 210.139118][C0] do_softirq_own_stack+0x37/0x40 > [ 210.140191][C0] irq_exit_rcu+0x110/0x1b0 > [ 210.141151][C0] common_interrupt+0x74/0x120 > [ 210.142171][C0] asm_common_interrupt+0x1e/0x40 > [ 210.143206][C0] RIP: 0010:_aesni_enc1+0x65/0xb0 > [ 210.144313][C0] Code: 38 dc c2 41 0f 28 52 d0 66 0f 38 dc c2 41 0f 28 > 52 e0 66 0f 38 dc c2 41 0f 28 52 f0 66 0f 38 dc c2 41 0f 28 12 66 0f 38 dc c2 > <41> 0f 28 52 10 66 0f 38 dc c2 41 0f 28 52 20 66 0f 38 dc c2 41 0f > [ 210.148542][C0] RSP: 0018:88810dbe6db0 EFLAGS: 0286 > [ 210.149842][C0] RAX: 9a90cdc0 RBX: RCX: > 0200 > [ 210.151576][C0] RDX: 888101e5f240 RSI: 888101e5f240 RDI: > 888b5020 > [ 210.153339][C0] RBP: 88810dbe6e20 R08: R09: > 0020 > [ 210.155063][C0] R10: 888b5090 R11: 111021b7cdcc R12: > 9e87cd40 > [ 210.156791][C0] R13: 888b5210 R14: 888101e5f0d8 R15: > > [ 210.158497][C0] ? aesni_set_key+0x1e0/0x1e0 > [ 210.159523][C0] ? aesni_enc+0xf/0x20 > [ 210.160408][C0] ? glue_xts_req_128bit+0x181/0x6f0 > [ 210.161571][C0] ? aesni_set_key+0x1e0/0x1e0 > [ 210.162560][C0] ? glue_ctr_req_128bit+0x630/0x630 > [ 210.163706][C0] ? kasan_save_stack+0x37/0x50 > [ 210.164761][C0] ? kasan_save_stack+0x20/0x50 > [ 210.165786][C0] ? get_page_from_freelist+0x2052/0x36a0 > [ 210.167024][C0] ? __blkdev_direct_IO_simple+0x43b/0x7e0 > [ 210.168288][C0] ? blkdev_direct_IO+0xd16/0x1020 > [ 210.169396][C0] ? generic_file_direct_write+0x1a3/0x480 > [ 210.170648][C0] ? __generic_file_write_iter+0x1d9/0x530 > [ 210.171882][C0] ? blkdev_write_iter+0x20d/0x3e0 > [ 210.172954][C0] ? vfs_write+0x524/0x770 > [ 210.173889][C0] ? do_syscall_64+0x33/0x40 > [ 210.174859][C0] ? __zone_watermark_ok+0x340/0x340 > [ 210.175977][C0] ? crypt_convert+0x28b6/0x4c00 > [ 210.177079][C0] ? mempool_alloc+0x107/0x2f0 > [ 210.178096][C0] ? 
crypt_iv_tcw_ctr+0x4a0/0x4a0 > [ 210.179193][C0] ? bio_add_page+0x111/0x170 > [ 210.180251][C0] ? __bio_try_merge_page+0x480/0x480 > [ 210.181446][C0] ? bio_associate_blkg+0x6d/0x100 > [ 210.182558][C0] ? kcryptd_crypt_write_convert+0x5ea/0x980 > [ 210.183852][C0] ? crypt_map+0x5bf/0xc80 > [ 210.184838][C0] ? bio_clone_blkg_association+0x10e/0x2c0 > [ 210.186125][C0] ? __map_bio.isra.0+0x109/0x3f0 > [ 210.187204][C0] ?
Re: [RFC PATCH v3 1/2] block: add simple copy support
On Fri, 11 Dec 2020, Johannes Thumshirn wrote:

> On 11/12/2020 15:57, SelvaKumar S wrote:
> [...]
> > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload *payload,
> > +		gfp_t gfp_mask)
> > +{
> > +	struct request_queue *q = bdev_get_queue(bdev);
> > +	struct bio *bio;
> > +	void *buf = NULL;
> > +	int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> > +
> > +	nr_srcs = payload->copy_range;
> > +	max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;
> > +	cur_dest = payload->dest;
> > +	buf = kvmalloc(max_range_len, GFP_ATOMIC);
>
> Why GFP_ATOMIC and not the passed in gfp_mask? Especially as this is a
> kvmalloc() which has the potential to grow quite big.

You are right, this is confusing.

There's this piece of code at the top of kvmalloc_node:

	if ((flags & GFP_KERNEL) != GFP_KERNEL)
		return kmalloc_node(size, flags, node);

So, when you use the GFP_ATOMIC flag, it will always fall back to kmalloc.

Mikulas
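The kvmalloc_node() short-circuit Mikulas quotes can be condensed into a user-space model: any flag set that does not contain all of the GFP_KERNEL bits skips the vmalloc fallback entirely, so a GFP_ATOMIC kvmalloc() is just a size-limited kmalloc(). The flag values below are illustrative, not the real GFP encodings:

```c
#include <assert.h>

/* Illustrative flag bits - not the real GFP encodings. */
#define GFP_RECLAIM 0x1
#define GFP_IO      0x2
#define GFP_FS      0x4
#define GFP_KERNEL  (GFP_RECLAIM | GFP_IO | GFP_FS)
#define GFP_ATOMIC  0x8

enum route { ROUTE_KMALLOC_ONLY, ROUTE_KMALLOC_THEN_VMALLOC };

/* Mirrors the test at the top of kvmalloc_node(): vmalloc may sleep,
 * so only requests carrying the full GFP_KERNEL bit set are allowed
 * to fall back to it; everything else stays with plain kmalloc(). */
static enum route kvmalloc_route(unsigned flags)
{
	if ((flags & GFP_KERNEL) != GFP_KERNEL)
		return ROUTE_KMALLOC_ONLY;
	return ROUTE_KMALLOC_THEN_VMALLOC;
}
```

This is why passing GFP_ATOMIC to kvmalloc() for a potentially large buffer is misleading: the "v" part of kvmalloc never happens on that path.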
Re: md: dm-writeback: add __noreturn to BUG-ging function
On Wed, 18 Nov 2020, Mike Snitzer wrote: > On Wed, Nov 18 2020 at 10:49am -0500, > Mike Snitzer wrote: > > > I don't think my suggestion will help.. given it'd still leave > > persistent_memory_claim() without a return statement. > > > > Think it worthwhile to just add a dummy 'return 0;' after the BUG(). > > Decided to go with this, now staged for 5.11: > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.11&id=a1e4865b4dda7071f3707f7e551289ead66e38b1 Hi I would just use "return -EOPNOTSUPP;" and drop the "#ifdef DM_WRITECACHE_HAS_PMEM" that you added. That BUG/return -EOPNOTSUPP code can't happen at all - if DM_WRITECACHE_HAS_PMEM is not defined, WC_MODE_PMEM(wc) always returns false - so persistent_memory_claim and BUG() can't ever be called. And if it can't be called, you don't need to add code that prints an error in that case. If we don't have DM_WRITECACHE_HAS_PMEM, the compiler optimizer will remove all the code guarded with if (WC_MODE_PMEM(wc)) as unreachable. Mikulas

From: Mikulas Patocka
Subject: [PATCH] dm writecache: remove BUG() and fail gracefully instead

Building on arch/s390/ results in this build error:

cc1: some warnings being treated as errors
../drivers/md/dm-writecache.c: In function 'persistent_memory_claim':
../drivers/md/dm-writecache.c:323:1: error: no return statement in function returning non-void [-Werror=return-type]

Fix this by replacing the BUG() with a -EOPNOTSUPP return.

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: sta...@vger.kernel.org # v4.18+
Reported-by: Randy Dunlap
Signed-off-by: Mikulas Patocka

Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c
+++ linux-2.6/drivers/md/dm-writecache.c
@@ -319,7 +319,7 @@ err1:
 #else
 static int persistent_memory_claim(struct dm_writecache *wc)
 {
-	BUG();
+	return -EOPNOTSUPP;
 }
 #endif
Re: [PATCH] Revert "dm cache: fix arm link errors with inline"
Acked-by: Mikulas Patocka On Tue, 10 Nov 2020, Nick Desaulniers wrote: > This reverts commit 43aeaa29573924df76f44eda2bbd94ca36e407b5. > > Since > commit 0bddd227f3dc ("Documentation: update for gcc 4.9 requirement") > the minimum supported version of GCC is gcc-4.9. It's now safe to remove > this code. > > Link: https://github.com/ClangBuiltLinux/linux/issues/427 > Signed-off-by: Nick Desaulniers > --- > drivers/md/dm-cache-target.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c > index 9644424591da..4bc453f5bbaa 100644 > --- a/drivers/md/dm-cache-target.c > +++ b/drivers/md/dm-cache-target.c > @@ -712,10 +712,6 @@ static bool block_size_is_power_of_two(struct cache > *cache) > return cache->sectors_per_block_shift >= 0; > } > > -/* gcc on ARM generates spurious references to __udivdi3 and __umoddi3 */ > -#if defined(CONFIG_ARM) && __GNUC__ == 4 && __GNUC_MINOR__ <= 6 > -__always_inline > -#endif > static dm_block_t block_div(dm_block_t b, uint32_t n) > { > do_div(b, n); > -- > 2.29.2.222.g5d2a92d10f8-goog >
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Thu, 24 Sep 2020, Mikulas Patocka wrote: > On Tue, 22 Sep 2020, Matthew Wilcox wrote: > > > There is a small window when renamed inode is neither in source nor in > > > target directory. Fsck will reclaim such inode and add it to lost+found - > > > just like on EXT2. > > > > ... ouch. If you have to choose, it'd be better to link it to the second > > directory then unlink it from the first one. Then your fsck can detect > > it has the wrong count and fix up the count (ie link it into both > > directories rather than neither). > > I admit that this is lame and I'll fix it. Rename is not so > performance-critical, so I can add a small journal for this. Hi I have implemented transactions in nvfs and I use them for rename, setattr, atomic xattr replacement and for RENAME_EXCHANGE. You can download the current version here: git://leontynka.twibright.com/nvfs.git Mikulas
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Tue, 22 Sep 2020, Matthew Wilcox wrote: > > > The NVFS indirect block tree has a fan-out of 16, > > > > No. The top level in the inode contains 16 blocks (11 direct and 5 > > indirect). And each indirect block can have 512 pointers (4096/8). You can > > format the device with larger block size and this increases the fanout > > (the NVFS block size must be greater or equal than the system page size). > > > > 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can > > map 256 TiB and 5 levels can map 128 PiB. > > But compare to an unfragmented file ... you can map the entire thing with > a single entry. Even if you have to use a leaf node, you can get four > extents in a single cacheline (and that's a fairly naive leaf node layout; > I don't know exactly what XFS uses) But the benchmarks show that it is comparable to extent-based filesystems. > > > Rename is another operation that has specific "operation has atomic > > > behaviour" expectations. I haven't looked at how you've > > > implementated that yet, but I suspect it also is extremely difficult > > > to implement in an atomic manner using direct pmem updates to the > > > directory structures. > > > > There is a small window when renamed inode is neither in source nor in > > target directory. Fsck will reclaim such inode and add it to lost+found - > > just like on EXT2. > > ... ouch. If you have to choose, it'd be better to link it to the second > directory then unlink it from the first one. Then your fsck can detect > it has the wrong count and fix up the count (ie link it into both > directories rather than neither). I admit that this is lame and I'll fix it. Rename is not so performance-critical, so I can add a small journal for this. > > If you think that the lack of journaling is show-stopper, I can implement > > it. But then, I'll have something that has complexity of EXT4 and > > performance of EXT4. So that there will no longer be any reason why to use > > NVFS over EXT4. 
Without journaling, it will be faster than EXT4 and it may > > attract some users who want good performance and who don't care about GID > > and UID being updated atomically, etc. > > Well, what's your intent with nvfs? Do you already have customers in mind > who want to use this in production, or is this somewhere to play with and > develop concepts that might make it into one of the longer-established > filesystems? I develop it just because I thought it might be interesting. So far, it doesn't have any serious users (the physical format is still changing). I hope that it could be usable as a general-purpose root filesystem when Optane DIMMs become common. Mikulas
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Wed, 23 Sep 2020, Dave Chinner wrote: > > > dir-test /mnt/test/linux-2.6 63000 1048576 > > > nvfs 6.6s > > > ext4 dax 8.4s > > > xfs dax 12.2s > > > > > > > > > dir-test /mnt/test/linux-2.6 63000 1048576 link > > > nvfs 4.7s > > > ext4 dax 5.6s > > > xfs dax 7.8s > > > > > > dir-test /mnt/test/linux-2.6 63000 1048576 dir > > > nvfs 8.2s > > > ext4 dax 15.1s > > > xfs dax 11.8s > > > > > > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's not a > > > huge difference - it's not orders of magnitude faster. > > > > If I increase the size of the test directory, NVFS is an order of magnitude > > faster: > > > > time dir-test /mnt/test/ 200 200 > > NVFS: 0m29,395s > > XFS: 1m59,523s > > EXT4: 1m14,176s > > What happened to NVFS there? The runtime went up by a factor of 5, > even though the number of ops performed only doubled. This test is from a different machine (K10 Opteron) than the above test (Skylake Xeon). I borrowed the Xeon for a short time and I no longer have access to it. > > time dir-test /mnt/test/ 800 800 > > NVFS: 2m13,507s > > XFS: 14m31,261s > > EXT4: reports "file 1976882 can't be created: No space left on device", > > (although there are free blocks and inodes) > > Is it a bug or expected behavior? > > Exponential increase in runtime for a workload like this indicates > the XFS journal is too small to run large scale operations. I'm > guessing you're just testing on a small device? In this test, the pmem device had 64GiB. I've created a 1TiB ramdisk, formatted it with XFS and ran dir-test 800 on it; however, it wasn't much better - it took 14m8,824s. > In which case, you'd get a 16MB log for XFS, which is tiny and most > definitely will limit performance of any large scale metadata > operation. Performance should improve significantly for large scale > operations with a much larger log, and that should bring the XFS > runtimes down significantly. Is there an mkfs.xfs option that can increase the log size?
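[Editorial note] For reference, mkfs.xfs does let you size the internal log at format time via the -l option; a sketch (the device name is an example, and the internal log is capped at 2 GiB):

```shell
# Format with a 512 MiB internal log instead of the computed default:
mkfs.xfs -f -l size=512m /dev/pmem0

# After mounting, verify the log geometry that was actually created:
xfs_info /mnt/test | grep '^log'
```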
> > If you think that the lack of journaling is show-stopper, I can implement > > it. > > I did not say that. My comments are about the requirement for > atomicity of object changes, not journalling. Journalling is an > -implementation that can provide change atomicity-, it is not a > design constraint for metadata modification algorithms. > > Really, you can chose how to do object update however you want. What > I want to review is the design documentation and a correctness proof > for whatever mechanism you choose to use. Without that information, > we have absolutely no chance of reviewing the filesystem > implementation for correctness. We don't need a proof for something > that uses journalling (because we all know how that works), but for > something that uses soft updates we most definitely need the proof > of correctness for the update algorithm before we can determine if > the implementation is good... > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com I am thinking about this: I can implement lightweight journaling that will journal just a few writes - I'll allocate some small per-cpu intent log for that. For example, in nvfs_rename, we call nvfs_delete_de and nvfs_finish_add - these functions are very simple, both of them write just one word - so we can add these two words to the intent log. The same for setattr requesting simultaneous uid/gid/mode change - they are small, so they'll fit into the intent log well. Regarding verifiability, I can do this - the writes to pmem are wrapped in a macro nv_store. So, I can modify this macro so that it logs all modifications. Then I take the log, cut it at random time, reorder the entries (to simulate reordering in the CPU write-combining buffers), replay it, run nvfsck on it and mount it. This way, we can verify that no matter where the crash happened, either an old file or a new file is present in a directory. Do you agree with that? Mikulas
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Wed, 23 Sep 2020, Jan Kara wrote: > On Tue 22-09-20 12:46:05, Mikulas Patocka wrote: > > > mapping 2^21 blocks requires a 5 level indirect tree. Which one is going > > > to be faster to truncate away - a single record or 2 million individual > > > blocks? > > > > > > IOWs, we can take afford to take an extra cacheline miss or two on a > > > tree block search, because we're accessing and managing orders of > > > magnitude fewer records in the mapping tree than an indirect block > > > tree. > > > > > > PMEM doesn't change this: extents are more time and space efficient > > > at scale for mapping trees than indirect block trees regardless > > > of the storage medium in use. > > > > PMEM doesn't have to be read linearly, so the attempts to allocate large > > linear space are not needed. They won't harm but they won't help either. > > > > That's why NVFS has very simple block allocation algorithm - it uses a > > per-cpu pointer and tries to allocate by a bit scan from this pointer. If > > the group is full, it tries a random group with above-average number of > > free blocks. > > I agree with Dave here. People are interested in 2MB or 1GB contiguous > allocations for DAX so that files can be mapped at PMD or even PUD levels > thus saving a lot of CPU time on page faults and TLB. NVFS has an upper limit on block size of 1MB. So, should I raise it to 2MB? Will 2MB blocks be useful to someone? Is there an API by which userspace can ask the kernel for an aligned allocation? fallocate() doesn't seem to offer an option for alignment. > > EXT4 uses bit scan for allocations and people haven't complained that it's > > inefficient, so it is probably OK. > > Yes, it is more or less OK but once you get to 1TB filesystem size and > larger, the number of block groups grows enough that it isn't that great > anymore. We are actually considering new allocation schemes for ext4 for > these large filesystems...
NVFS can run with block size larger than page size, so you can reduce the number of block groups by increasing block size. (ext4 also has the bigalloc feature that will do it) > > If you think that the lack of journaling is show-stopper, I can implement > > it. But then, I'll have something that has complexity of EXT4 and > > performance of EXT4. So that there will no longer be any reason why to use > > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may > > attract some users who want good performance and who don't care about GID > > and UID being updated atomically, etc. > I'd hope that your filesystem offers more performance benefits than just > what you can get from a lack of journalling :). ext4 can be configured to I also don't know how to implement journaling on persistent memory :) On EXT4 or XFS you can pin dirty buffers in memory until the journal is flushed. This is obviously impossible on persistent memory. So, I'm considering implementing only some lightweight journaling that will guarantee atomicity between just a few writes. > run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does > significantly improve performance for some workloads but you have to have > some way to recover from crashes so it's mostly used for scratch > filesystems (e.g. in build systems, Google uses this feature a lot for some > of their infrastructure as well). > > Honza > -- > Jan Kara > SUSE Labs, CR I've run "dir-test /mnt/test/ 800 800" and the result is: EXT4 with journal - 5m54,019s EXT4 without journal - 4m4,444s NVFS - 2m9,482s Mikulas
Re: A bug in ext4 with big directories (was: NVFS XFS metadata)
On Wed, 23 Sep 2020, Jan Kara wrote: > Hi! > > On Wed 23-09-20 05:20:55, Mikulas Patocka wrote: > > There seems to be a bug in ext4 - when I create very large directory, ext4 > > fails with -ENOSPC despite the fact that there is plenty of free space and > > free inodes on the filesystem. > > > > How to reproduce: > > download the program dir-test: > > http://people.redhat.com/~mpatocka/benchmarks/dir-test.c > > > > # modprobe brd rd_size=67108864 > > # mkfs.ext4 /dev/ram0 > > # mount -t ext4 /dev/ram0 /mnt/test > > # dir-test /mnt/test/ 800 800 > > deleting: 7999000 > > 254 > > file 2515327 can't be created: No space left on device > > # df /mnt/test > > /dev/ram0 65531436 633752 61525860 2% /mnt/test > > # df -i /mnt/test > > /dev/ram0 4194304 1881547 2312757 45% /mnt/test > > Yeah, you likely run out of space in ext4 directory h-tree. You can enable > higher depth h-trees with large_dir feature (mkfs.ext4 -O large_dir). Does > that help? Yes, this helps. Mikulas > > Honza > > -- > Jan Kara > SUSE Labs, CR
A bug in ext4 with big directories (was: NVFS XFS metadata)
Hi There seems to be a bug in ext4 - when I create a very large directory, ext4 fails with -ENOSPC despite the fact that there is plenty of free space and free inodes on the filesystem. How to reproduce: download the program dir-test: http://people.redhat.com/~mpatocka/benchmarks/dir-test.c # modprobe brd rd_size=67108864 # mkfs.ext4 /dev/ram0 # mount -t ext4 /dev/ram0 /mnt/test # dir-test /mnt/test/ 800 800 deleting: 7999000 254 file 2515327 can't be created: No space left on device # df /mnt/test /dev/ram0 65531436 633752 61525860 2% /mnt/test # df -i /mnt/test /dev/ram0 4194304 1881547 2312757 45% /mnt/test (I tried to increase journal size, but it has no effect on this bug) Mikulas
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
Hi Thanks for reviewing NVFS. On Tue, 22 Sep 2020, Dave Chinner wrote: > Hi Mikulas, > > I'll say up front that I think you're barking up the wrong tree > trying to knock down XFS and ext4 to justify NVFS. NVFS will stand > or fall on it's own merits, not on how you think it's better than > other filesystems... > > I have some fundamental concerns about the NVFS integrity model, > they came out as I wrote this response to your characterisations of > XFS and journalling filesysetms. Maybe I'm missing something about > NVFS that isn't clearly explained > > On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote: > > On Wed, 16 Sep 2020, Mikulas Patocka wrote: > > > On Wed, 16 Sep 2020, Dan Williams wrote: > > > > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka > > > > wrote: > > > > > > > > > > > My first question about nvfs is how it compares to a daxfs with > > > > > > executables and other binaries configured to use page cache with the > > > > > > new per-file dax facility? > > > > > > > > > > nvfs is faster than dax-based filesystems on metadata-heavy operations > > > > > because it doesn't have the overhead of the buffer cache and bios. See > > > > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS > > > > > > > > ...and that metadata problem is intractable upstream? Christoph poked > > > > at bypassing the block layer for xfs metadata operations [1], I just > > > > have not had time to carry that further. > > > > > > > > [1]: "xfs: use dax_direct_access for log writes", although it seems > > > > he's dropped that branch from his xfs.git > > > > > > XFS is very big. I wanted to create something small. > > > > And the another difference is that XFS metadata are optimized for disks > > and SSDs. > > Ah, that old chestnut. :) > > > On disks and SSDs, reading one byte is as costly as reading a full block. > > So we must put as much information to a block as possible. 
XFS uses > > b+trees for file block mapping and for directories - it is reasonable > > decision because b+trees minimize the number of disk accesses. > > Actually, no, that wasn't why XFS was implemented using btrees. The > btrees are an implementation detail, not a design requirement to > minimise the number of disk accesses. > > XFS was intended for huge disk arrays (hundreds to thousands on > individual spindles) and so no attempt was made in the design to > minimise disk accesses. There was -always- going to be a huge number > of IOPS available in the intended large scale deployments, so > concurrency and *efficiency at scale* was far more important at the > design level than minimising the number of disk ops for any given > operation. > > To that end, simulations were done that showed that extent based > trees were much more CPU, memory and IO efficient than bitmaps, > hybrid tree-bitmaps or multi-layer indirect block referencing when > trying to index and access large amounts of data. > > To be efficient at scale, all operations need to be O(log N) or > better, and extent based encoding is much, more compact than direct > block indexing. Extent trees are also much more effective for finding > exact fits, identifying large contiguous spaces, and manipulating > large ranges of indexed data than other filesystem indexing > mechanisms. They are also not bound by alignment restrictions like > hierarchical binary/buddy bitmap schemes are, and their maximum size > is bounded and can be calculated at runtime. > > IOWs, extent based trees were chosen because of scalability, > efficiency, and flexibility reasons before the actual tree structure > that it would be implemented with was decided on. b+trees were used > in the implementation because one tree implementation could do > everything as all that needed to change btree trees was the pointer > and record format. I agree that the b+tree were a good choice for XFS. 
In RAM-based maps, red-black trees or AVL trees are often used. In disk-based maps, btrees or b+trees are used. That's because in RAM, you are optimizing for the number of cache lines accessed, and on the disk, you are optimizing for the number of blocks accessed. > The result of this is that we have made -zero- changes to the XFS > structure and algorithms for SSDs. We don't do different things > based on the blkdev rotational flag, or anything like that. XFS > behaves exactly the same on spinning disks as it does SSDs as it > does PMEM and it performs well on all of them. And that performance > doesn't
Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Tue, 22 Sep 2020, Matthew Wilcox wrote: > On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote: > > The same for directories - NVFS hashes the file name and uses radix-tree > > to locate a directory page where the directory entry is located. XFS > > b+trees would result in much more accesses than the radix-tree. > > What? Radix trees behave _horribly_ badly when indexed by a hash. > If you have a 64-bit hash and use 8 bits per level of the tree, you have > to traverse 8 pointers to get to your destination. You might as well > use a linked list! In NVFS, radix trees are cut off - they have only as many internal levels as are needed to disambiguate the directory entries. Read this document: http://people.redhat.com/~mpatocka/nvfs/INTERNALS - the section "DIRECTORIES". Perhaps I should call it differently than "radix-trees", but I don't really know what the official name for this data structure is. Mikulas
NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)
On Wed, 16 Sep 2020, Mikulas Patocka wrote: > > > On Wed, 16 Sep 2020, Dan Williams wrote: > > > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka > > wrote: > > > > > > > My first question about nvfs is how it compares to a daxfs with > > > > executables and other binaries configured to use page cache with the > > > > new per-file dax facility? > > > > > > nvfs is faster than dax-based filesystems on metadata-heavy operations > > > because it doesn't have the overhead of the buffer cache and bios. See > > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS > > > > ...and that metadata problem is intractable upstream? Christoph poked > > at bypassing the block layer for xfs metadata operations [1], I just > > have not had time to carry that further. > > > > [1]: "xfs: use dax_direct_access for log writes", although it seems > > he's dropped that branch from his xfs.git > > XFS is very big. I wanted to create something small. And another difference is that XFS metadata are optimized for disks and SSDs. On disks and SSDs, reading one byte is as costly as reading a full block. So we must put as much information into a block as possible. XFS uses b+trees for file block mapping and for directories - it is a reasonable decision because b+trees minimize the number of disk accesses. On persistent memory, each access has its own cost, so NVFS uses metadata structures that minimize the number of cache lines accessed (rather than the number of blocks accessed). For block mapping, NVFS uses the classic Unix direct/indirect blocks - if a file block is mapped by a third-level indirect block, we do just three memory accesses and we are done. If we used b+trees, the number of accesses would be much larger than 3 (we would have to do binary search in the b+tree nodes). The same for directories - NVFS hashes the file name and uses radix-tree to locate a directory page where the directory entry is located. XFS b+trees would result in much more accesses than the radix-tree.
Regarding journaling - NVFS doesn't do it because persistent memory is so fast that we can just check it in the case of crash. NVFS has a multithreaded fsck that can do 3 million inodes per second. XFS does journaling (it was reasonable decision for disks where fsck took hours) and it will cause overhead for all the filesystem operations. Mikulas
Re: [RFC] nvfs: a filesystem for persistent memory
On Tue, 15 Sep 2020, Dan Williams wrote: > > TODO: > > > > - programs run approximately 4% slower when running from Optane-based > > persistent memory. Therefore, programs and libraries should use page cache > > and not DAX mapping. > > This needs to be based on platform firmware data (ACPI HMAT) for the > relative performance of a PMEM range vs DRAM. For example, this > tradeoff should not exist with battery backed DRAM, or virtio-pmem. Hi I have implemented this functionality - if we mmap a file with (vma->vm_flags & VM_DENYWRITE), then it is assumed that this is an executable file mapping - the flag S_DAX on the inode is cleared and the inode will use the normal page cache. Is there a way to test whether we are using an Optane-based module (where this optimization should be applied) or battery backed DRAM (where it should not)? I've added mount options dax=never, dax=auto, dax=always, so that the user can override the automatic behavior. Mikulas
the "read" syscall sees partial effects of the "write" syscall
Hi I'd like to ask about this problem: when we write to a file, the kernel takes the write inode lock. When we read from a file, no lock is taken - thus the read syscall can read data that are halfway modified by the write syscall. The standard specifies the effects of the write syscall are atomic - see this: https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07 > 2.9.7 Thread Interactions with Regular File Operations > > All of the following functions shall be atomic with respect to each > other in the effects specified in POSIX.1-2017 when they operate on > regular files or symbolic links: > > chmod()fchownat() lseek() readv() unlink() > chown()fcntl() lstat() pwrite()unlinkat() > close()fstat() open() rename()utime() > creat()fstatat() openat() renameat() utimensat() > dup2() ftruncate() pread() stat() utimes() > fchmod() lchown()read() symlink() write() > fchmodat() link() readlink() symlinkat() writev() > fchown() linkat()readlinkat() truncate() > > If two threads each call one of these functions, each call shall either > see all of the specified effects of the other call, or none of them. The > requirement on the close() function shall also apply whenever a file > descriptor is successfully closed, however caused (for example, as a > consequence of calling close(), calling dup2(), or of process > termination). Should the read call take the read inode lock to make it atomic w.r.t. the write syscall? (I know - taking the read lock causes big performance hit due to cache line bouncing) I've created this program to test it - it has two threads, one writing and the other reading and verifying. When I run it on OpenBSD or FreeBSD, it passes, on Linux it fails with "we read modified bytes". 
Mikulas

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>

#define L 65536

static int h;
static pthread_barrier_t barrier;
static pthread_t thr;
static char rpattern[L];
static char wpattern[L];

static void *reader(__attribute__((unused)) void *ptr)
{
	while (1) {
		int r;
		size_t i;
		r = pthread_barrier_wait(&barrier);
		if (r > 0)
			fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pread(h, rpattern, L, 0);
		if (r != L)
			perror("pread"), exit(1);
		for (i = 0; i < L; i++) {
			if (rpattern[i] != rpattern[0])
				fprintf(stderr, "we read modified bytes\n"), exit(1);
		}
	}
	return NULL;
}

int main(__attribute__((unused)) int argc, char *argv[])
{
	int r;
	h = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (h < 0)
		perror("open"), exit(1);
	r = pwrite(h, wpattern, L, 0);
	if (r != L)
		perror("pwrite"), exit(1);
	r = pthread_barrier_init(&barrier, NULL, 2);
	if (r)
		fprintf(stderr, "pthread_barrier_init: %s\n", strerror(r)), exit(1);
	r = pthread_create(&thr, NULL, reader, NULL);
	if (r)
		fprintf(stderr, "pthread_create: %s\n", strerror(r)), exit(1);
	while (1) {
		size_t i;
		for (i = 0; i < L; i++)
			wpattern[i]++;
		r = pthread_barrier_wait(&barrier);
		if (r > 0)
			fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pwrite(h, wpattern, L, 0);
		if (r != L)
			perror("pwrite"), exit(1);
	}
}
Re: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache
On Wed, 16 Sep 2020, Dan Williams wrote: > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka wrote: > > > > > > > > I'm submitting this patch that adds the required exports (so that we could > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue > > it for the next merge window. > > Why? This should go with the first user, and it's not clear that it > needs to be relative to the current dax_operations export scheme. Before nvfs gets included in the kernel, I need to distribute it as a module. So, it would make my maintenance easier. But if you don't want to export it now, no problem, I can just copy __copy_user_flushcache from the kernel to the module. > My first question about nvfs is how it compares to a daxfs with > executables and other binaries configured to use page cache with the > new per-file dax facility? nvfs is faster than dax-based filesystems on metadata-heavy operations because it doesn't have the overhead of the buffer cache and bios. See this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS Mikulas
[PATCH] pmem: fix __copy_user_flushcache
On Wed, 16 Sep 2020, Dan Williams wrote: > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka wrote: > > > > > > > > On Wed, 16 Sep 2020, Dan Williams wrote: > > > > > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka > > > wrote: > > > > > > > > > > > > > > > > I'm submitting this patch that adds the required exports (so that we > > > > could > > > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, > > > > queue > > > > it for the next merge window. > > > > > > Why? This should go with the first user, and it's not clear that it > > > needs to be relative to the current dax_operations export scheme. > > > > Before nvfs gets included in the kernel, I need to distribute it as a > > module. So, it would make my maintenance easier. But if you don't want to > > export it now, no problem, I can just copy __copy_user_flushcache from the > > kernel to the module. > > That sounds a better plan than exporting symbols with no in-kernel consumer. BTW, this function is buggy. Here I'm submitting the patch.

From: Mikulas Patocka

If we copy less than 8 bytes and if the destination crosses a cache line, __copy_user_flushcache would invalidate only the first cache line. This patch makes it invalidate the second cache line as well.

Signed-off-by: Mikulas Patocka
Cc: sta...@vger.kernel.org
---
 arch/x86/lib/usercopy_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===================================================================
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c	2020-09-05 10:01:27.0 +0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c	2020-09-16 20:48:31.0 +0200
@@ -120,7 +120,7 @@ long __copy_user_flushcache(void *dst, c
 	 */
 	if (size < 8) {
 		if (!IS_ALIGNED(dest, 4) || size != 4)
-			clean_cache_range(dst, 1);
+			clean_cache_range(dst, size);
 	} else {
 		if (!IS_ALIGNED(dest, 8)) {
 			dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
Re: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache
On Wed, 16 Sep 2020, Dan Williams wrote: > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka wrote: > > > > > My first question about nvfs is how it compares to a daxfs with > > > executables and other binaries configured to use page cache with the > > > new per-file dax facility? > > > > nvfs is faster than dax-based filesystems on metadata-heavy operations > > because it doesn't have the overhead of the buffer cache and bios. See > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS > > ...and that metadata problem is intractable upstream? Christoph poked > at bypassing the block layer for xfs metadata operations [1], I just > have not had time to carry that further. > > [1]: "xfs: use dax_direct_access for log writes", although it seems > he's dropped that branch from his xfs.git XFS is very big. I wanted to create something small. Mikulas
[PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache
On Tue, 15 Sep 2020, Mikulas Patocka wrote: > > > On Tue, 15 Sep 2020, Mikulas Patocka wrote: > > > > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and > > > > trailing bytes. > > > > > > You want copy_user_flushcache(). See how fs/dax.c arranges for > > > dax_copy_from_iter() to route to pmem_copy_from_iter(). > > > > Is it something new for the kernel 5.10? I see only __copy_user_flushcache > > that is implemented just for x86 and arm64. > > > > There is __copy_from_user_flushcache implemented for x86, arm64 and power. > > It is used in lib/iov_iter.c under > > #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this? > > > > Mikulas > > ... and __copy_user_flushcache is not exported for modules. So, I am stuck > with __copy_from_user_inatomic_nocache. > > Mikulas I'm submitting this patch that adds the required exports (so that we could use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue it for the next merge window. From: Mikulas Patocka Export the symbols __copy_user_flushcache and __copy_from_user_flushcache, so that modules can use this functionality. 
Signed-off-by: Mikulas Patocka --- arch/arm64/lib/uaccess_flushcache.c |1 + arch/powerpc/lib/pmem.c |1 + arch/x86/lib/usercopy_64.c |1 + 3 files changed, 3 insertions(+) Index: linux-2.6/arch/arm64/lib/uaccess_flushcache.c === --- linux-2.6.orig/arch/arm64/lib/uaccess_flushcache.c 2020-09-16 12:44:15.068038000 +0200 +++ linux-2.6/arch/arm64/lib/uaccess_flushcache.c 2020-09-16 12:44:15.068038000 +0200 @@ -38,3 +38,4 @@ unsigned long __copy_user_flushcache(voi __clean_dcache_area_pop(to, n - rc); return rc; } +EXPORT_SYMBOL_GPL(__copy_user_flushcache); Index: linux-2.6/arch/x86/lib/usercopy_64.c === --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-09-16 12:44:15.068038000 +0200 +++ linux-2.6/arch/x86/lib/usercopy_64.c2020-09-16 12:44:15.068038000 +0200 @@ -134,6 +134,7 @@ long __copy_user_flushcache(void *dst, c return rc; } +EXPORT_SYMBOL_GPL(__copy_user_flushcache); void __memcpy_flushcache(void *_dst, const void *_src, size_t size) { Index: linux-2.6/arch/powerpc/lib/pmem.c === --- linux-2.6.orig/arch/powerpc/lib/pmem.c 2020-09-16 12:44:15.068038000 +0200 +++ linux-2.6/arch/powerpc/lib/pmem.c 2020-09-16 12:44:15.068038000 +0200 @@ -75,6 +75,7 @@ long __copy_from_user_flushcache(void *d return copied; } +EXPORT_SYMBOL_GPL(__copy_from_user_flushcache); void memcpy_flushcache(void *dest, const void *src, size_t size) {
Re: [RFC] nvfs: a filesystem for persistent memory
On Tue, 15 Sep 2020, Matthew Wilcox wrote: > On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote: > > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses > > buffer cache for the mapping. The buffer cache slows down fsck by a factor > > of 5 to 10. Would it be possible to change the kernel so that it maps DAX > > based block devices directly? > Oh, because fs/block_dev.c has: > .mmap = generic_file_mmap, > > I don't see why we shouldn't have a blkdev_mmap modelled after > ext2_file_mmap() with the corresponding blkdev_dax_vm_ops. Yes, that's possible - and we would also have to rewrite the read_iter and write_iter methods on DAX block devices, so that they are coherent with mmap. Mikulas
Re: [RFC] nvfs: a filesystem for persistent memory
On Tue, 15 Sep 2020, Dan Williams wrote: > > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses > > buffer cache for the mapping. The buffer cache slows down fsck by a factor > > of 5 to 10. Would it be possible to change the kernel so that it maps DAX > > based block devices directly? > > We've been down this path before. > > 5a023cdba50c block: enable dax for raw block devices > 9f4736fe7ca8 block: revert runtime dax control of the raw block device > acc93d30d7d4 Revert "block: enable dax for raw block devices" It says "The functionality is superseded by the new 'Device DAX' facility". But the fsck tool can't change an fsdax device into a devdax device just for checking. Or can it? > EXT2/4 metadata buffer management depends on the page cache and we > eliminated a class of bugs by removing that support. The problems are > likely tractable, but there was not a straightforward fix visible at > the time. Thinking about it - it isn't as easy as it looks... Suppose that the user mounts an ext2 filesystem and then uses the tune2fs tool on the mounted block device. The tune2fs tool reads and writes the mounted superblock directly. So, read/write must be coherent with the buffer cache (otherwise the kernel would not see the changes written by tune2fs). And mmap must be coherent with read/write. So, if we want to map the pmem device directly, we could add a new flag MAP_DAX. Or we could test if the fd has the O_DIRECT flag and map it directly in this case. But the default must be to map it coherently in order not to break existing programs. > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and > > trailing bytes. > > You want copy_user_flushcache(). See how fs/dax.c arranges for > dax_copy_from_iter() to route to pmem_copy_from_iter(). Is it something new in kernel 5.10? I see only __copy_user_flushcache, which is implemented just for x86 and arm64. There is __copy_from_user_flushcache implemented for x86, arm64 and power.
It is used in lib/iov_iter.c under #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this? Mikulas
Re: [RFC] nvfs: a filesystem for persistent memory
On Tue, 15 Sep 2020, Mikulas Patocka wrote: > > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and > > > trailing bytes. > > > > You want copy_user_flushcache(). See how fs/dax.c arranges for > > dax_copy_from_iter() to route to pmem_copy_from_iter(). > > Is it something new for the kernel 5.10? I see only __copy_user_flushcache > that is implemented just for x86 and arm64. > > There is __copy_from_user_flushcache implemented for x86, arm64 and power. > It is used in lib/iov_iter.c under > #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this? > > Mikulas ... and __copy_user_flushcache is not exported for modules. So, I am stuck with __copy_from_user_inatomic_nocache. Mikulas
[RFC] nvfs: a filesystem for persistent memory
Hi I am developing a new filesystem suitable for persistent memory - nvfs. The goal is to have a small and fast filesystem that can be used on DAX-based devices. Nvfs maps the whole device into a linear address space and completely bypasses the overhead of the block layer and buffer cache. In the past, there was the NOVA filesystem for pmem, but it was abandoned a year ago (the last version is for kernel 5.1 - https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better. The design of nvfs is similar to ext2/ext4, so that it fits into the VFS layer naturally, without too much glue code. I'd like to ask you to review it. tarballs: http://people.redhat.com/~mpatocka/nvfs/ git: git://leontynka.twibright.com/nvfs.git the description of filesystem internals: http://people.redhat.com/~mpatocka/nvfs/INTERNALS benchmarks: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS TODO: - programs run approximately 4% slower when running from Optane-based persistent memory. Therefore, programs and libraries should use page cache and not DAX mapping. - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses buffer cache for the mapping. The buffer cache slows down fsck by a factor of 5 to 10. Would it be possible to change the kernel so that it maps DAX based block devices directly? - __copy_from_user_inatomic_nocache doesn't flush cache for leading and trailing bytes. Mikulas
Re: [PATCH 2/2] xfs: don't update mtime on COW faults
On Wed, 9 Sep 2020, Darrick J. Wong wrote: > On Sat, Sep 05, 2020 at 01:02:33PM -0400, Mikulas Patocka wrote: > > > > > > > > I've written this program that tests it - you can integrate it into your > > testsuite. > > I don't get it. You're a filesystem maintainer too, which means you're > a regular contributor. Do you: > > (a) not use fstests? If you don't, I really hope you use something else > to QA hpfs. I don't use xfstests on HPFS. I was testing it just by using it. Now I use it just a little, but I don't modify it much. > (b) really think that it's my problem to integrate and submit your > regression tests for you? > > and (c) what do you want me to do with a piece of code that has no > signoff tag, no copyright, and no license? This is your patch, and > therefore your responsibility to develop enough of an appropriate > regression test in a proper form that the rest of us can easily > determine we have the rights to contribute to it. If you want a full patch, here it is (I copied the script from test 313). 
Mikulas From: Mikulas Patocka Subject: [PATCH] check ctime and mtime vs mmap Check ctime and mtime are not updated on COW faults and that they are updated on shared faults Signed-off-by: Mikulas Patocka --- src/Makefile |3 +- src/mmap-timestamp.c | 53 ++ tests/generic/609 | 40 + tests/generic/609.out |2 + tests/generic/group |1 5 files changed, 98 insertions(+), 1 deletion(-) Index: xfstests-dev/src/Makefile === --- xfstests-dev.orig/src/Makefile 2020-09-06 12:38:40.0 +0200 +++ xfstests-dev/src/Makefile 2020-09-11 17:39:04.0 +0200 @@ -17,7 +17,8 @@ TARGETS = dirstress fill fill2 getpagesi t_mmap_cow_race t_mmap_fallocate fsync-err t_mmap_write_ro \ t_ext4_dax_journal_corruption t_ext4_dax_inline_corruption \ t_ofd_locks t_mmap_collision mmap-write-concurrent \ - t_get_file_time t_create_short_dirs t_create_long_dirs + t_get_file_time t_create_short_dirs t_create_long_dirs \ + mmap-timestamp LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \ preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \ Index: xfstests-dev/src/mmap-timestamp.c === --- /dev/null 1970-01-01 00:00:00.0 + +++ xfstests-dev/src/mmap-timestamp.c 2020-09-11 18:21:40.0 +0200 @@ -0,0 +1,53 @@ +// SPDX-License-Identifier: GPL-2.0 +// Copyright (C) 2020 Red Hat, Inc. All Rights reserved. 
+ +#include <stdio.h> +#include <stdlib.h> +#include <unistd.h> +#include <fcntl.h> +#include <sys/types.h> +#include <sys/stat.h> +#include <sys/mman.h> + +#define FILE_NAME argv[1] + +static struct stat st1, st2; + +int main(int argc, char *argv[]) +{ + int h, r; + char *map; + unlink(FILE_NAME); + h = creat(FILE_NAME, 0600); + if (h == -1) perror("creat"), exit(1); + r = write(h, "x", 1); + if (r != 1) perror("write"), exit(1); + if (close(h)) perror("close"), exit(1); + h = open(FILE_NAME, O_RDWR); + if (h == -1) perror("open"), exit(1); + + map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, h, 0); + if (map == MAP_FAILED) perror("mmap"), exit(1); + if (fstat(h, &st1)) perror("fstat"), exit(1); + sleep(2); + *map = 'y'; + if (fstat(h, &st2)) perror("fstat"), exit(1); + if (st1.st_mtime != st2.st_mtime) fprintf(stderr, "BUG: COW fault changed mtime!\n"), exit(1); + if (st1.st_ctime != st2.st_ctime) fprintf(stderr, "BUG: COW fault changed ctime!\n"), exit(1); + if (munmap(map, 4096)) perror("munmap"), exit(1); + + map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, h, 0); + if (map == MAP_FAILED) perror("mmap"), exit(1); + if (fstat(h, &st1)) perror("fstat"), exit(1); + sleep(2); + *map = 'z'; + if (msync(map, 4096, MS_SYNC)) perror("msync"), exit(1); + if (fstat(h, &st2)) perror("fstat"), exit(1); + if (st1.st_mtime == st2.st_mtime) fprintf(stderr, "BUG: Shared fault did not change mtime!\n"), exit(1); + if (st1.st_ctime == st2.st_ctime) fprintf(stderr, "BUG: Shared fault did not change ctime!\n"), exit(1); + if (munmap(map, 4096)) perror("munmap"), exit(1); + + if (close(h)) perror("close"), exit(1); + if (unlink(FILE_NAME)) perror("unlink"), exit(1); + return 0; +} Index: xfstests-dev/tests/generic/609 === --- /dev/null 1970-01-01 00:00:00.0 + +++ xfs
[PATCH 2/2 v2] xfs: don't update mtime on COW faults
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted xfs filesystem 2. run make clean 3. run make -j12 4. run make -j12 - at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org --- fs/xfs/xfs_file.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) Index: linux-2.6/fs/xfs/xfs_file.c === --- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 18:48:54.0 +0200 +++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 18:56:48.0 +0200 @@ -1223,14 +1223,21 @@ __xfs_filemap_fault( return ret; } +static inline bool +xfs_is_shared_dax_write_fault( + struct vm_fault *vmf) +{ + return IS_DAX(file_inode(vmf->vma->vm_file)) && + vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED; +} + static vm_fault_t xfs_filemap_fault( struct vm_fault *vmf) { /* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, PE_SIZE_PTE, - IS_DAX(file_inode(vmf->vma->vm_file)) && - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_shared_dax_write_fault(vmf)); } static vm_fault_t @@ -1243,7 +1250,7 @@ xfs_filemap_huge_fault( /* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, pe_size, - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_shared_dax_write_fault(vmf)); } static vm_fault_t
Re: [PATCH 2/2] xfs: don't update mtime on COW faults
On Sat, 5 Sep 2020, Darrick J. Wong wrote: > On Sat, Sep 05, 2020 at 08:13:02AM -0400, Mikulas Patocka wrote: > > When running in a dax mode, if the user maps a page with MAP_PRIVATE and > > PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime > > when the user hits a COW fault. > > > > This breaks building of the Linux kernel. > > How to reproduce: > > 1. extract the Linux kernel tree on dax-mounted xfs filesystem > > 2. run make clean > > 3. run make -j12 > > 4. run make -j12 > > - at step 4, make would incorrectly rebuild the whole kernel (although it > > was already built in step 3). > > > > The reason for the breakage is that almost all object files depend on > > objtool. When we run objtool, it takes COW page fault on its .data > > section, and these faults will incorrectly update the timestamp of the > > objtool binary. The updated timestamp causes make to rebuild the whole > > tree. > > > > Signed-off-by: Mikulas Patocka > > Cc: sta...@vger.kernel.org > > > > --- > > fs/xfs/xfs_file.c | 11 +-- > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > Index: linux-2.6/fs/xfs/xfs_file.c > > === > > --- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 10:01:42.0 > > +0200 > > +++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 13:59:12.0 +0200 > > @@ -1223,6 +1223,13 @@ __xfs_filemap_fault( > > return ret; > > } > > > > +static bool > > +xfs_is_write_fault( > > Call this xfs_is_shared_dax_write_fault, and throw in the IS_DAX() test? > > You might as well make it a static inline. Yes, it is possible. I'll send a second version. > > + struct vm_fault *vmf) > > +{ > > + return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED; > > Also, is "shortcutting the normal fault path" the reason for ext2 and > xfs both being broken? > > /me puzzles over why write_fault is always true for page_mkwrite and > pfn_mkwrite, but not for fault and huge_fault... 
> > Also: Can you please turn this (checking for timestamp update behavior > wrt shared and private mapping write faults) into an fstest so we don't > mess this up again? I've written this program that tests it - you can integrate it into your testsuite. Mikulas

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define FILE_NAME "test.txt"

static struct stat st1, st2;

int main(void)
{
	int h, r;
	char *map;
	unlink(FILE_NAME);
	h = creat(FILE_NAME, 0600);
	if (h == -1) perror("creat"), exit(1);
	r = write(h, "x", 1);
	if (r != 1) perror("write"), exit(1);
	if (close(h)) perror("close"), exit(1);
	h = open(FILE_NAME, O_RDWR);
	if (h == -1) perror("open"), exit(1);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, h, 0);
	if (map == MAP_FAILED) perror("mmap"), exit(1);
	if (fstat(h, &st1)) perror("fstat"), exit(1);
	sleep(2);
	*map = 'y';
	if (fstat(h, &st2)) perror("fstat"), exit(1);
	if (memcmp(&st1, &st2, sizeof(struct stat))) fprintf(stderr, "BUG: COW fault changed time!\n"), exit(1);
	if (munmap(map, 4096)) perror("munmap"), exit(1);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, h, 0);
	if (map == MAP_FAILED) perror("mmap"), exit(1);
	if (fstat(h, &st1)) perror("fstat"), exit(1);
	sleep(2);
	*map = 'z';
	if (fstat(h, &st2)) perror("fstat"), exit(1);
	if (st1.st_mtime == st2.st_mtime) fprintf(stderr, "BUG: Shared fault did not change mtime!\n"), exit(1);
	if (st1.st_ctime == st2.st_ctime) fprintf(stderr, "BUG: Shared fault did not change ctime!\n"), exit(1);
	if (munmap(map, 4096)) perror("munmap"), exit(1);

	if (close(h)) perror("close"), exit(1);
	if (unlink(FILE_NAME)) perror("unlink"), exit(1);
	return 0;
}
[PATCH 2/2] xfs: don't update mtime on COW faults
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted xfs filesystem 2. run make clean 3. run make -j12 4. run make -j12 - at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org --- fs/xfs/xfs_file.c | 11 +-- 1 file changed, 9 insertions(+), 2 deletions(-) Index: linux-2.6/fs/xfs/xfs_file.c === --- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 10:01:42.0 +0200 +++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 13:59:12.0 +0200 @@ -1223,6 +1223,13 @@ __xfs_filemap_fault( return ret; } +static bool +xfs_is_write_fault( + struct vm_fault *vmf) +{ + return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED; +} + static vm_fault_t xfs_filemap_fault( struct vm_fault *vmf) @@ -1230,7 +1237,7 @@ xfs_filemap_fault( /* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, PE_SIZE_PTE, IS_DAX(file_inode(vmf->vma->vm_file)) && - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_write_fault(vmf)); } static vm_fault_t @@ -1243,7 +1250,7 @@ xfs_filemap_huge_fault( /* DAX can shortcut the normal fault path on write faults! */ return __xfs_filemap_fault(vmf, pe_size, - (vmf->flags & FAULT_FLAG_WRITE)); + xfs_is_write_fault(vmf)); } static vm_fault_t
[PATCH 1/2] ext2: don't update mtime on COW faults
When running in a dax mode, if the user maps a page with MAP_PRIVATE and PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime when the user hits a COW fault. This breaks building of the Linux kernel. How to reproduce: 1. extract the Linux kernel tree on dax-mounted ext2 filesystem 2. run make clean 3. run make -j12 4. run make -j12 - at step 4, make would incorrectly rebuild the whole kernel (although it was already built in step 3). The reason for the breakage is that almost all object files depend on objtool. When we run objtool, it takes COW page fault on its .data section, and these faults will incorrectly update the timestamp of the objtool binary. The updated timestamp causes make to rebuild the whole tree. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org --- fs/ext2/file.c |6 -- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6/fs/ext2/file.c === --- linux-2.6.orig/fs/ext2/file.c 2020-09-05 10:01:41.0 +0200 +++ linux-2.6/fs/ext2/file.c2020-09-05 13:09:50.0 +0200 @@ -93,8 +93,10 @@ static vm_fault_t ext2_dax_fault(struct struct inode *inode = file_inode(vmf->vma->vm_file); struct ext2_inode_info *ei = EXT2_I(inode); vm_fault_t ret; + bool write = (vmf->flags & FAULT_FLAG_WRITE) && + (vmf->vma->vm_flags & VM_SHARED); - if (vmf->flags & FAULT_FLAG_WRITE) { + if (write) { sb_start_pagefault(inode->i_sb); file_update_time(vmf->vma->vm_file); } @@ -103,7 +105,7 @@ static vm_fault_t ext2_dax_fault(struct ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, _iomap_ops); up_read(>dax_sem); - if (vmf->flags & FAULT_FLAG_WRITE) + if (write) sb_end_pagefault(inode->i_sb); return ret; }
Re: make misbehavior on ext2 in dax mode (was: a crash when running strace from persistent memory)
On Fri, 4 Sep 2020, Mikulas Patocka wrote: > Hmm, so I've found another bug in dax mode. > > If you extract the Linux kernel tree on a dax-based ext2 filesystem (use the > real ext2 driver, not ext4), and then you run make twice, the second > invocation will rebuild everything. It seems like a problem with > timestamps. > > mount -t ext2 -o dax /dev/pmem0 /mnt/ext2/ > cd /mnt/ext2/usr/src/git/linux-2.6 > make clean > make -j12 > make -j12 <--- this rebuilds the whole tree, although it shouldn't > > I wasn't able to bisect it because this bug seems to be present in every > kernel I tried (back to 4.16.0). Ext4 doesn't seem to have this bug. > > Mikulas I've found the root cause of this bug (XFS has it too) and I'm sending patches. Mikulas
make misbehavior on ext2 in dax mode (was: a crash when running strace from persistent memory)
On Thu, 3 Sep 2020, Mikulas Patocka wrote: > Hi > > There's a bug when you run strace from a dax-based filesystem. Hmm, so I've found another bug in dax mode. If you extract the Linux kernel tree on a dax-based ext2 filesystem (use the real ext2 driver, not ext4), and then you run make twice, the second invocation will rebuild everything. It seems like a problem with timestamps. mount -t ext2 -o dax /dev/pmem0 /mnt/ext2/ cd /mnt/ext2/usr/src/git/linux-2.6 make clean make -j12 make -j12 <--- this rebuilds the whole tree, although it shouldn't I wasn't able to bisect it because this bug seems to be present in every kernel I tried (back to 4.16.0). Ext4 doesn't seem to have this bug. Mikulas
Re: a crash when running strace from persistent memory
On Thu, 3 Sep 2020, Linus Torvalds wrote: > On Thu, Sep 3, 2020 at 12:24 PM Mikulas Patocka wrote: > > > > There's a bug when you run strace from dax-based filesystem. > > > > -- create real or emulated persistent memory device (/dev/pmem0) > > mkfs.ext2 /dev/pmem0 > > -- mount it > > mount -t ext2 -o dax /dev/pmem0 /mnt/test > > -- copy the system to it (well, you can copy just a few files that are > >needed for running strace and ls) > > cp -ax / /mnt/test > > -- bind the system directories > > mount --bind /dev /mnt/test/dev > > mount --bind /proc /mnt/test/proc > > mount --bind /sys /mnt/test/sys > > -- run strace on the ls command > > chroot /mnt/test/ strace /bin/ls > > > > You get this warning and ls is killed with SIGSEGV. > > > > I bisected the problem and it is caused by the commit > > 17839856fd588f4ab6b789f482ed3ffd7c403e1f (gup: document and work around > > "COW can break either way" issue). When I revert the patch (on the kernel > > 5.9-rc3), the bug goes away. > > Funky. I really don't see how it could cause that, but we have the > UDDF issue too, so I'm guessing I will have to fix it the radical way > with Peter Xu's series based on my "rip out COW special cases" patch. > > Or maybe I'm just using that as an excuse for really wanting to apply > that series.. Because we can't just revert that GUP commit due to > security concerns. > > > [ 84.191504] WARNING: CPU: 6 PID: 1350 at mm/memory.c:2486 > > wp_page_copy.cold+0xdb/0xf6 > > I'm assuming this is the WARN_ON_ONCE(1) on line 2482, and you have > some extra debug patch that causes that line to be off by 4? Because > at least for me, line 2486 is actually an empty line in v5.9-rc3. Yes, that's it. I added a few printk to look at the control flow. > That said, I really think this is a pre-existing race, and all the > "COW can break either way" patch does is change the timing (presumably > due to the actual pattern of actually doing the COW changing). 
> > See commit c3e5ea6ee574 ("mm: avoid data corruption on CoW fault into > PFN-mapped VMA") for background. > > Mikulas, can you check that everything works ok for that case if you > apply Peter's series? See > > https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/ I applied these four patches and strace works well. There is no longer any warning or crash. Mikulas > or if you have 'b4' installed, use > > b4 am 20200821234958.7896-1-pet...@redhat.com > > to get the series.. > > Linus >
a crash when running strace from persistent memory
Hi There's a bug when you run strace from dax-based filesystem. -- create real or emulated persistent memory device (/dev/pmem0) mkfs.ext2 /dev/pmem0 -- mount it mount -t ext2 -o dax /dev/pmem0 /mnt/test -- copy the system to it (well, you can copy just a few files that are needed for running strace and ls) cp -ax / /mnt/test -- bind the system directories mount --bind /dev /mnt/test/dev mount --bind /proc /mnt/test/proc mount --bind /sys /mnt/test/sys -- run strace on the ls command chroot /mnt/test/ strace /bin/ls You get this warning and ls is killed with SIGSEGV. I bisected the problem and it is caused by the commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f (gup: document and work around "COW can break either way" issue). When I revert the patch (on the kernel 5.9-rc3), the bug goes away. Mikulas [ 84.190961] [ cut here ] [ 84.191504] WARNING: CPU: 6 PID: 1350 at mm/memory.c:2486 wp_page_copy.cold+0xdb/0xf6 [ 84.192398] Modules linked in: ext2 uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet mousedev virtio_balloon virtio_rng evdev rng_core pcspkr button raid10 raid456 async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx libcrc32c raid1 raid0 md_mod sd_mod t10_pi virtio_scsi virtio_net psmouse net_failover scsi_mod failover [ 84.196301] CPU: 6 PID: 1350 Comm: strace Not tainted 5.9.0-rc3 #6 [ 84.197020] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 84.197685] RIP: 0010:wp_page_copy.cold+0xdb/0xf6 [ 84.198231] Code: ff ff ff 0f 00 eb 8e 48 8b 3c 24 48 8b 74 24 08 ba 00 10 00 00 e8 33 87 1f 00 85 c0 74 1f 48 c7 c7 a7 2b ba 81 e8 cc b6 f0 ff <0f> 0b 48 8b 3c 24 e8 08 82 1f 00 41 be 01 00 00 00 eb ae 41 be 01 [ 84.200410] RSP: 0018:88940c1dba58 EFLAGS: 00010282 [ 84.201035] RAX: 0006 RBX: 88940c1dbb00 RCX: [ 84.201842] RDX: 0003 RSI: 81b9c0fa RDI: [ 84.202650] RBP: ea004f0e4d80 R08: R09: [ 84.203460] R10: 0046 R11: R12: [ 84.204265] R13: 88940aa86318 R14: f7fac000 R15: 8893c8db3c40 [ 
84.205083] FS: 7fd8a8320740() GS:88940fb8() knlGS: [ 84.206000] CS: 0010 DS: ES: CR0: 80050033 [ 84.206664] CR2: f7fac000 CR3: 0013c93a6000 CR4: 06a0 [ 84.207481] Call Trace: [ 84.207883] do_wp_page+0x172/0x6a0 [ 84.208285] handle_mm_fault+0xd0b/0x1540 [ 84.208753] __get_user_pages+0x21a/0x6c0 [ 84.209213] __get_user_pages_remote+0xc8/0x2a0 [ 84.209735] process_vm_rw_core.isra.0+0x1ac/0x440 [ 84.210318] ? __might_fault+0x26/0x40 [ 84.210758] ? _copy_from_user+0x6a/0xa0 [ 84.211208] ? __might_fault+0x26/0x40 [ 84.211642] ? _copy_from_user+0x6a/0xa0 [ 84.212091] process_vm_rw+0xd1/0x100 [ 84.212511] ? _copy_to_user+0x69/0x80 [ 84.212946] ? ptrace_get_syscall_info+0x9b/0x180 [ 84.213484] ? find_held_lock+0x2b/0x80 [ 84.213926] ? __x64_sys_ptrace+0x106/0x140 [ 84.214405] ? fpregs_assert_state_consistent+0x19/0x40 [ 84.215002] ? exit_to_user_mode_prepare+0x2d/0x120 [ 84.215556] __x64_sys_process_vm_readv+0x22/0x40 [ 84.216103] do_syscall_64+0x2d/0x80 [ 84.216518] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 84.217098] RIP: 0033:0x7fd8a84896da [ 84.217512] Code: 48 8b 15 b9 f7 0b 00 f7 d8 64 89 02 b8 ff ff ff ff eb d2 e8 18 f0 00 00 0f 1f 84 00 00 00 00 00 49 89 ca b8 36 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 81 f7 0b 00 f7 [ 84.219618] RSP: 002b:7ffd08c3c678 EFLAGS: 0246 ORIG_RAX: 0136 [ 84.220563] RAX: ffda RBX: f7fac000 RCX: 7fd8a84896da [ 84.221380] RDX: 0001 RSI: 7ffd08c3c680 RDI: 0549 [ 84.222194] RBP: 7ffd08c3c760 R08: 0001 R09: [ 84.222999] R10: 7ffd08c3c690 R11: 0246 R12: f7faca80 [ 84.223804] R13: 0580 R14: 0549 R15: 5589a50eee80 [ 84.223804] R13: 0580 R14: 0549 R15: 5589a50eee80 [ 84.224612] ---[ end trace d8dbf2da5dc1b7ca ]---
[PATCH] dm-bufio: do cleanup from a workqueue
On Tue, 30 Jun 2020, Dave Chinner wrote: > https://lore.kernel.org/linux-fsdevel/20190809215733.gz7...@dread.disaster.area/ > > If you did that when I suggested it, this problem would be solved. > i.e. The only way to fix this problem once adn for all is to stop > using the shrinker as a mechanism to issue and wait on IO. If you > need background writeback of dirty buffers, do it from a > WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim > path and so can issue writeback and block safely from a GFP_KERNEL > context. Kick the workqueue from the shrinker context, but get rid > of the IO submission and waiting from the shrinker and all the > GFP_NOFS memory reclaim recursion problems go away. > > Cheers, > > Dave. > -- > Dave Chinner > da...@fromorbit.com Hi This is a patch that moves buffer cleanup to a workqueue. Please review it. Mikulas From: Mikulas Patocka kswapd should not block because it degrades system performance. So, move reclaim of buffers to a workqueue. Signed-off-by: Mikulas Patocka --- drivers/md/dm-bufio.c | 60 ++ 1 file changed, 41 insertions(+), 19 deletions(-) Index: linux-2.6/drivers/md/dm-bufio.c === --- linux-2.6.orig/drivers/md/dm-bufio.c2020-07-03 14:07:43.0 +0200 +++ linux-2.6/drivers/md/dm-bufio.c 2020-07-03 15:35:23.0 +0200 @@ -108,7 +108,10 @@ struct dm_bufio_client { int async_write_error; struct list_head client_list; + struct shrinker shrinker; + struct work_struct shrink_work; + atomic_long_t need_shrink; }; /* @@ -1634,8 +1637,7 @@ static unsigned long get_retain_buffers( return retain_bytes; } -static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan, - gfp_t gfp_mask) +static void __scan(struct dm_bufio_client *c) { int l; struct dm_buffer *b, *tmp; @@ -1646,42 +1648,58 @@ static unsigned long __scan(struct dm_bu for (l = 0; l < LIST_SIZE; l++) { list_for_each_entry_safe_reverse(b, tmp, >lru[l], lru_list) { - if (__try_evict_buffer(b, gfp_mask)) + if (count - freed <= retain_target) + 
atomic_long_set(>need_shrink, 0); + if (!atomic_long_read(>need_shrink)) + return; + if (__try_evict_buffer(b, GFP_KERNEL)) { + atomic_long_dec(>need_shrink); freed++; - if (!--nr_to_scan || ((count - freed) <= retain_target)) - return freed; + } cond_resched(); } } - return freed; } -static unsigned long -dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) +static void shrink_work(struct work_struct *w) +{ + struct dm_bufio_client *c = container_of(w, struct dm_bufio_client, shrink_work); + + dm_bufio_lock(c); + __scan(c); + dm_bufio_unlock(c); +} + +static unsigned long dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc) { struct dm_bufio_client *c; - unsigned long freed; c = container_of(shrink, struct dm_bufio_client, shrinker); - if (sc->gfp_mask & __GFP_FS) - dm_bufio_lock(c); - else if (!dm_bufio_trylock(c)) - return SHRINK_STOP; + atomic_long_add(sc->nr_to_scan, >need_shrink); + queue_work(dm_bufio_wq, >shrink_work); - freed = __scan(c, sc->nr_to_scan, sc->gfp_mask); - dm_bufio_unlock(c); - return freed; + return sc->nr_to_scan; } -static unsigned long -dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc) +static unsigned long dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc) { struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker); unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) + READ_ONCE(c->n_buffers[LIST_DIRTY]); unsigned long retain_target = get_retain_buffers(c); + unsigned long queued_for_cleanup = atomic_long_read(>need_shrink); + + if (unlikely(count < retain_target)) + count = 0; + else + count -= retain_target; - return (count < retain_target) ? 0 : (count - retain_target); + if (unlikely(count < queued_for_cleanup)) + count = 0; + else + count -= queued_for_cleanup; + + return count; } /* @@ -1772,6 +1790,9 @@ struct dm_bufio_client *dm_bufio_client_ __free_buffer_wa
[PATCH 1/3 v6] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka

---
 crypto/adiantum.c         |    8 +---
 crypto/authenc.c          |    7 ---
 crypto/authencesn.c       |    7 ---
 crypto/ccm.c              |    8 
 crypto/chacha20poly1305.c |    7 ---
 crypto/cryptd.c           |   16 +++-
 crypto/ctr.c              |    4 ++--
 crypto/cts.c              |    4 ++--
 crypto/essiv.c            |    4 ++--
 crypto/gcm.c              |   15 ---
 crypto/geniv.c            |    4 ++--
 crypto/lrw.c              |    4 ++--
 crypto/pcrypt.c           |   14 +-
 crypto/rsa-pkcs1pad.c     |    4 ++--
 crypto/seqiv.c            |    2 ++
 crypto/xts.c              |    4 ++--
 include/crypto/algapi.h   |   10 ++
 include/linux/crypto.h    |   15 +++
 18 files changed, 86 insertions(+), 51 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===================================================================
--- linux-2.6.orig/include/linux/crypto.h	2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/include/linux/crypto.h	2020-06-29 16:03:07.336417000 +0200
@@ -102,6 +102,21 @@
 #define CRYPTO_NOLOAD		0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY	0x0001
+
+/*
+ * When an algorithm uses another algorithm (e.g., if it's an instance of a
+ * template), these are the flags that always get set on the "outer" algorithm
+ * if any "inner" algorithm has them set. In some cases other flags are
+ * inherited too; these are just the flags that are *always* inherited.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS	(CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY		0x0001

Index: linux-2.6/crypto/authenc.c
===================================================================
--- linux-2.6.orig/crypto/authenc.c	2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authenc.c	2020-06-30 15:47:56.516417000 +0200
@@ -388,7 +388,7 @@ static int crypto_authenc_create(struct
 	if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
 		return -EINVAL;
 
-	mask = crypto_requires_sync(algt->type, algt->mask);
+	mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
 	inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
 	if (!inst)
@@ -423,8 +423,9 @@ static int crypto_authenc_create(struct
 			    enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
 		goto err_free_inst;
 
-	inst->alg.base.cra_flags = (auth_base->cra_flags |
-				    enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+	inst->alg.base.cra_flags =
+		(auth_base->cra_flags | enc->base.cra_flags) &
+		CRYPTO_ALG_INHERITED_FLAGS;
 	inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
 				      auth_base->cra_priority;
 	inst->alg.base.cra_blocksize = enc->base.cra_blocksize;

Index: linux-2.6/crypto/authencesn.c
===================================================================
--- linux-2.6.orig/crypto/authencesn.c	2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authencesn.c	2020-06-30 15:48:11.996417000 +0200
@@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str
 	if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
 		return -EINVAL;
 
-	mask = crypto_requires_sync(algt->type, algt->mask);
+	mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
 	inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
 	if (!inst)
@@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str
 			    enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
 		goto err_free_inst;
 
-	inst->alg.base.cra_flags = (auth_base->cra_flags |
-				    enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+	inst->alg.base.cra_flags =
+		(auth_base->cra_flags | enc->base.cra_flags) &
+		CRYPTO_ALG_INHERITED_FLAGS;
 	inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
Re: [dm-devel] [PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Tue, 30 Jun 2020, Eric Biggers wrote: > On Tue, Jun 30, 2020 at 01:01:16PM -0400, Mikulas Patocka wrote: > > > diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c > > > index 7240e8bbdebb..643f7f1cc91c 100644 > > > --- a/crypto/pcrypt.c > > > +++ b/crypto/pcrypt.c > > > @@ -232,12 +232,15 @@ static int pcrypt_create_aead(struct > > > crypto_template *tmpl, struct rtattr **tb, > > > struct crypto_attr_type *algt; > > > struct aead_instance *inst; > > > struct aead_alg *alg; > > > + u32 mask; > > > int err; > > > > > > algt = crypto_get_attr_type(tb); > > > if (IS_ERR(algt)) > > > return PTR_ERR(algt); > > > > > > + mask = crypto_alg_inherited_mask(algt->type, algt->mask); > > > + > > > inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); > > > if (!inst) > > > return -ENOMEM; > > > @@ -254,7 +257,7 @@ static int pcrypt_create_aead(struct crypto_template > > > *tmpl, struct rtattr **tb, > > > goto err_free_inst; > > > > > > err = crypto_grab_aead(>spawn, aead_crypto_instance(inst), > > > -crypto_attr_alg_name(tb[1]), 0, 0); > > > +crypto_attr_alg_name(tb[1]), 0, mask); > > > if (err) > > > goto err_free_inst; > > > > > > > I added "mask" there - but there is still a "mask" argument that is > > unused - is it a bug to have two "mask" variables? > > Right, I didn't see that algt->type and algt->mask are already being passed > to > pcrypt_create_aead(). It's redundant because pcrypt_create_aead() has access > to > those via crypto_get_attr_type() anyway. > > How about just removing those two arguments for now? > > - Eric Yes. Mikulas
[PATCH 1/3 v5] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the crypto stack. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |8 +--- crypto/authenc.c |7 --- crypto/authencesn.c |7 --- crypto/ccm.c |8 crypto/chacha20poly1305.c |7 --- crypto/cryptd.c | 16 +++- crypto/ctr.c |4 ++-- crypto/cts.c |4 ++-- crypto/essiv.c|4 ++-- crypto/gcm.c | 15 --- crypto/geniv.c|4 ++-- crypto/lrw.c |4 ++-- crypto/pcrypt.c |9 +++-- crypto/rsa-pkcs1pad.c |4 ++-- crypto/seqiv.c|2 ++ crypto/xts.c |4 ++-- include/crypto/algapi.h | 10 ++ include/linux/crypto.h| 15 +++ 18 files changed, 84 insertions(+), 48 deletions(-) Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/include/linux/crypto.h2020-06-29 16:03:07.336417000 +0200 @@ -102,6 +102,21 @@ #define CRYPTO_NOLOAD 0x8000 /* + * The driver may allocate memory during request processing, so it shouldn't be + * used in cases where memory allocation failures aren't acceptable, such as + * during block device encryption. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001 + +/* + * When an algorithm uses another algorithm (e.g., if it's an instance of a + * template), these are the flags that always get set on the "outer" algorithm + * if any "inner" algorithm has them set. In some cases other flags are + * inherited too; these are just the flags that are *always* inherited. + */ +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY) + +/* * Transform masks and values (for crt_flags). 
*/ #define CRYPTO_TFM_NEED_KEY0x0001 Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/crypto/authenc.c 2020-06-30 15:47:56.516417000 +0200 @@ -388,7 +388,7 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_alg_inherited_mask(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -423,8 +423,9 @@ static int crypto_authenc_create(struct enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = + (auth_base->cra_flags | enc->base.cra_flags) & + CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/crypto/authencesn.c 2020-06-30 15:48:11.996417000 +0200 @@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_alg_inherited_mask(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = + (auth_base->cra_flags | enc->base.cra_flags) & + CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
Re: [dm-devel] [PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Tue, 30 Jun 2020, Eric Biggers wrote: > On Tue, Jun 30, 2020 at 09:56:22AM -0400, Mikulas Patocka wrote: > > Index: linux-2.6/crypto/cryptd.c > > === > > --- linux-2.6.orig/crypto/cryptd.c 2020-06-29 16:03:07.346417000 +0200 > > +++ linux-2.6/crypto/cryptd.c 2020-06-30 15:49:04.206417000 +0200 > > @@ -202,6 +202,7 @@ static inline void cryptd_check_internal > > > > *type |= algt->type & CRYPTO_ALG_INTERNAL; > > *mask |= algt->mask & CRYPTO_ALG_INTERNAL; > > + *mask |= algt->mask & CRYPTO_ALG_INHERITED_FLAGS; > > } > > This needs to use the correct logic (crypto_alg_inherited_mask) to decide > which > inherited flags to set in 'mask'. I added "*mask |= crypto_alg_inherited_mask(algt->type, algt->mask);" there. > > --- linux-2.6.orig/crypto/adiantum.c2020-06-29 16:03:07.346417000 > > +0200 > > +++ linux-2.6/crypto/adiantum.c 2020-06-29 16:03:07.346417000 +0200 > > @@ -507,7 +507,7 @@ static int adiantum_create(struct crypto > > if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask) > > return -EINVAL; > > > > - mask = crypto_requires_sync(algt->type, algt->mask); > > + mask = crypto_alg_inherited_mask(algt->type, algt->mask); > > > > inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL); > > if (!inst) > > This is still missing setting the flags correctly on the template instance > being > constructed. The flags need to be inherited from all "inner" algorithms: > > inst->alg.base.cra_flags = (streamcipher_alg->base.cra_flags | > blockcipher_alg->cra_flags | > hash_alg->base.cra_flags) & > CRYPTO_ALG_INHERITED_FLAGS; > > If we don't do that, the template instance may allocate memory but not have > CRYPTO_ALG_ALLOCATES_MEMORY set. 
OK > > Index: linux-2.6/crypto/pcrypt.c > > === > > --- linux-2.6.orig/crypto/pcrypt.c 2020-06-29 16:03:07.346417000 +0200 > > +++ linux-2.6/crypto/pcrypt.c 2020-06-30 15:47:32.776417000 +0200 > > @@ -263,7 +263,9 @@ static int pcrypt_create_aead(struct cry > > if (err) > > goto err_free_inst; > > > > - inst->alg.base.cra_flags = CRYPTO_ALG_ASYNC; > > + inst->alg.base.cra_flags = > > + CRYPTO_ALG_ASYNC | > > + (alg->base.cra_flags & CRYPTO_ALG_INHERITED_FLAGS); > > > > inst->alg.ivsize = crypto_aead_alg_ivsize(alg); > > inst->alg.maxauthsize = crypto_aead_alg_maxauthsize(alg); What's wrong with this? > And this is still missing setting the mask: > > diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c > index 7240e8bbdebb..643f7f1cc91c 100644 > --- a/crypto/pcrypt.c > +++ b/crypto/pcrypt.c > @@ -232,12 +232,15 @@ static int pcrypt_create_aead(struct crypto_template > *tmpl, struct rtattr **tb, > struct crypto_attr_type *algt; > struct aead_instance *inst; > struct aead_alg *alg; > + u32 mask; > int err; > > algt = crypto_get_attr_type(tb); > if (IS_ERR(algt)) > return PTR_ERR(algt); > > + mask = crypto_alg_inherited_mask(algt->type, algt->mask); > + > inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); > if (!inst) > return -ENOMEM; > @@ -254,7 +257,7 @@ static int pcrypt_create_aead(struct crypto_template > *tmpl, struct rtattr **tb, > goto err_free_inst; > > err = crypto_grab_aead(>spawn, aead_crypto_instance(inst), > -crypto_attr_alg_name(tb[1]), 0, 0); > +crypto_attr_alg_name(tb[1]), 0, mask); > if (err) > goto err_free_inst; > I added "mask" there - but there is still a "mask" argument that is unused - is it a bug to have two "mask" variables? > Can you please think logically about what you're trying to accomplish? I am not an expert in crypto code. > In particular, if someone requests an algorithm that doesn't allocate memory > (which is indicated by type=0, mask=CRYPTO_ALG_ALLOCATES_MEMORY), that request > needs to be honored. > > - Eric Mikulas
Re: [PATCH v3] dm writecache: reject asynchronous pmem.
On Tue, 30 Jun 2020, Mikulas Patocka wrote: > > > On Tue, 30 Jun 2020, Michal Suchanek wrote: > > > The writecache driver does not handle asynchronous pmem. Reject it when > > supplied as cache. > > > > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/ > > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver") > > > > Signed-off-by: Michal Suchanek > > Acked-by: Mikulas Patocka BTW, we should also add Cc: sta...@vger.kernel.org # v5.3+ > > --- > > drivers/md/dm-writecache.c | 6 ++ > > 1 file changed, 6 insertions(+) > > > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > > index 30505d70f423..5358894bb9fd 100644 > > --- a/drivers/md/dm-writecache.c > > +++ b/drivers/md/dm-writecache.c > > @@ -2266,6 +2266,12 @@ static int writecache_ctr(struct dm_target *ti, > > unsigned argc, char **argv) > > } > > > > if (WC_MODE_PMEM(wc)) { > > + if (!dax_synchronous(wc->ssd_dev->dax_dev)) { > > + r = -EOPNOTSUPP; > > + ti->error = "Asynchronous persistent memory not > > supported as pmem cache"; > > + goto bad; > > + } > > + > > r = persistent_memory_claim(wc); > > if (r) { > > ti->error = "Unable to map persistent memory for cache"; > > -- > > 2.26.2 > > >
Re: [PATCH v3] dm writecache: reject asynchronous pmem.
On Tue, 30 Jun 2020, Michal Suchanek wrote: > The writecache driver does not handle asynchronous pmem. Reject it when > supplied as cache. > > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/ > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver") > > Signed-off-by: Michal Suchanek Acked-by: Mikulas Patocka > --- > drivers/md/dm-writecache.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 30505d70f423..5358894bb9fd 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -2266,6 +2266,12 @@ static int writecache_ctr(struct dm_target *ti, > unsigned argc, char **argv) > } > > if (WC_MODE_PMEM(wc)) { > + if (!dax_synchronous(wc->ssd_dev->dax_dev)) { > + r = -EOPNOTSUPP; > + ti->error = "Asynchronous persistent memory not > supported as pmem cache"; > + goto bad; > + } > + > r = persistent_memory_claim(wc); > if (r) { > ti->error = "Unable to map persistent memory for cache"; > -- > 2.26.2 >
Re: [PATCH v2] dm writecache: reject asynchronous pmem.
On Tue, 30 Jun 2020, Michal Suchanek wrote: > The writecache driver does not handle asynchronous pmem. Reject it when > supplied as cache. > > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/ > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver") > > Signed-off-by: Michal Suchanek OK. I suggest to move this test before persistent_memory_claim (so that you don't try to map the whole device). Mikulas > --- > drivers/md/dm-writecache.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 30505d70f423..1e4f37249e28 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -2271,6 +2271,12 @@ static int writecache_ctr(struct dm_target *ti, > unsigned argc, char **argv) > ti->error = "Unable to map persistent memory for cache"; > goto bad; > } > + > + if (!dax_synchronous(wc->ssd_dev->dax_dev)) { > + r = -EOPNOTSUPP; > + ti->error = "Asynchronous persistent memory not > supported as pmem cache"; > + goto bad; > + } > } else { > size_t n_blocks, n_metadata_blocks; > uint64_t n_bitmap_bits; > -- > 2.26.2 >
[PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the crypto stack. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |2 +- crypto/authenc.c |7 --- crypto/authencesn.c |7 --- crypto/ccm.c |8 crypto/chacha20poly1305.c |7 --- crypto/cryptd.c | 16 +++- crypto/ctr.c |4 ++-- crypto/cts.c |4 ++-- crypto/essiv.c|4 ++-- crypto/gcm.c | 15 --- crypto/geniv.c|4 ++-- crypto/lrw.c |4 ++-- crypto/pcrypt.c |4 +++- crypto/rsa-pkcs1pad.c |4 ++-- crypto/seqiv.c|2 ++ crypto/xts.c |4 ++-- include/crypto/algapi.h | 10 ++ include/linux/crypto.h| 15 +++ 18 files changed, 76 insertions(+), 45 deletions(-) Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/include/linux/crypto.h2020-06-29 16:03:07.336417000 +0200 @@ -102,6 +102,21 @@ #define CRYPTO_NOLOAD 0x8000 /* + * The driver may allocate memory during request processing, so it shouldn't be + * used in cases where memory allocation failures aren't acceptable, such as + * during block device encryption. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001 + +/* + * When an algorithm uses another algorithm (e.g., if it's an instance of a + * template), these are the flags that always get set on the "outer" algorithm + * if any "inner" algorithm has them set. In some cases other flags are + * inherited too; these are just the flags that are *always* inherited. + */ +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY) + +/* * Transform masks and values (for crt_flags). 
*/ #define CRYPTO_TFM_NEED_KEY0x0001 Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/crypto/authenc.c 2020-06-30 15:47:56.516417000 +0200 @@ -388,7 +388,7 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_alg_inherited_mask(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -423,8 +423,9 @@ static int crypto_authenc_create(struct enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = + (auth_base->cra_flags | enc->base.cra_flags) & + CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c 2020-06-29 16:03:07.346417000 +0200 +++ linux-2.6/crypto/authencesn.c 2020-06-30 15:48:11.996417000 +0200 @@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_alg_inherited_mask(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = + (auth_base->cra_flags | enc->base.cra_flags) & + CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Sun, 28 Jun 2020, Eric Biggers wrote: > On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote: > > > > > Also, "seqiv" instances can be created without > > > CRYPTO_ALG_ALLOCATES_MEMORY set, > > > despite seqiv_aead_encrypt() allocating memory. > > > > > This comment wasn't addressed. > > - Eric I've sent version 4 of the patch that adds CRYPTO_ALG_ALLOCATES_MEMORY to seqiv. Mikulas
Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Mon, 29 Jun 2020, Mikulas Patocka wrote: > On Sun, 28 Jun 2020, Eric Biggers wrote: > > > On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote: > > > > > > > > cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), > > > > and > > > > crypto_rfc4309_create() are also missing setting the mask. > > > > > > > > pcrypt_create_aead() is missing both setting the mask and inheriting > > > > the flags. pcrypt_create_aead doesn't use "mask" and "type" arguments at all. Mikulas
Re: [PATCH] dm writecache: reject asynchronous pmem.
On Tue, 30 Jun 2020, Michal Suchanek wrote:

> The writecache driver does not handle asynchronous pmem. Reject it when
> supplied as cache.
> 
> Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> 
> Signed-off-by: Michal Suchanek
> ---
>  drivers/md/dm-writecache.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 30505d70f423..57b0a972f6fd 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, unsigned argc, char **argv)
> 
>  		wc->memory_map_size -= (uint64_t)wc->start_sector << SECTOR_SHIFT;
> 
> +		if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> +			r = -EOPNOTSUPP;
> +			ti->error = "Asynchronous persistent memory not supported as pmem cache";
> +			goto bad;
> +		}
> +
>  		bio_list_init(&wc->flush_list);
>  		wc->flush_thread = kthread_create(writecache_flush_thread, wc, "dm_writecache_flush");
>  		if (IS_ERR(wc->flush_thread)) {
> --

Hi

Shouldn't this be in the "if (WC_MODE_PMEM(wc))" block?

WC_MODE_PMEM(wc) returns true if we are using persistent memory as a
cache device, otherwise we are using generic block device as a cache
device.

Mikulas
Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Sun, 28 Jun 2020, Eric Biggers wrote: > On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote: > > > > > > cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), and > > > crypto_rfc4309_create() are also missing setting the mask. > > > > > > pcrypt_create_aead() is missing both setting the mask and inheriting the > > > flags. > > > > I added CRYPTO_ALG_ALLOCATES_MEMORY there. > > I don't see where the cryptd request processing functions allocate memory. > > It seems that cryptd should just inherit the flag, like most other templates. > > Likewise for pcrypt. > > And also likewise for rfc4309. > > Where are you seeing the memory allocations that would require > CRYPTO_ALG_ALLOCATES_MEMORY to always be enabled for these? > > - Eric This was some misunderstanding. You said "cryptd_create_skcipher ... is missing both setting the mask and inheriting the flags.", so I understood it so that it should inherit CRYPTO_ALG_INHERITED_FLAGS and set CRYPTO_ALG_ALLOCATES_MEMORY unconditionally. Mikulas
Re: [PATCH 0/6] Overhaul memalloc_no*
On Mon, 29 Jun 2020, Dave Chinner wrote:

> On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Sat, 27 Jun 2020, Dave Chinner wrote:
> > 
> > > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > > Hi
> > > > 
> > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that
> > > > prevents both filesystem recursion and i/o recursion.
> > > > 
> > > > Note that any I/O can recurse into a filesystem via the loop device, thus
> > > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set
> > > > and PF_MEMALLOC_NOIO is not set.
> > > 
> > > Correct me if I'm wrong, but I think that will prevent swapping from
> > > GFP_NOFS memory reclaim contexts.
> > 
> > Yes.
> > 
> > > IOWs, this will substantially
> > > change the behaviour of the memory reclaim system under sustained
> > > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > > quite common, so I really don't think we want to telling memory
> > > reclaim "you can't do IO at all" when all we are trying to do is
> > > prevent recursion back into the same filesystem.
> > 
> > So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.
> 
> Uh, why?
> 
> Exactly what problem are you trying to solve here?

This:

1. The filesystem does a GFP_NOFS allocation.

2. The allocation calls directly a dm-bufio shrinker.

3. The dm-bufio shrinker sees that there is __GFP_IO set, so it assumes
that it can do I/O. It selects some dirty buffers, writes them back and
waits for the I/O to finish.

4. The dirty buffers belong to a loop device.

5. The loop device thread calls the filesystem that did the GFP_NOFS
allocation in step 1 (and that is still waiting for the allocation to
succeed).

Note that setting PF_MEMALLOC_NOIO on the loop thread won't help with
this deadlock.

Do you argue that this is a bug in dm-bufio? Or a bug in the kernel? Or
that it can't happen?

> > I saw this deadlock in the past in the dm-bufio subsystem - see the commit
> > 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.
> 
> 2014?
> 
> /me looks closer.
> 
> Hmmm. Only sent to dm-devel, no comments, no review, just merged.
> No surprise that nobody else actually knows about this commit. Well,
> time to review it ~6 years after it was merged
> 
> | dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
> | device that makes calls into the filesystem. If __GFP_IO is present and
> | __GFP_FS isn't, dm-bufio could still block on filesystem operations if it
> | runs on a loop block device.
> 
> OK, so from an architectural POV, this commit is fundamentally
> broken - block/device layer allocation should not allow reclaim
> recursion into filesystems because filesystems are dependent on
> the block layer making forwards progress. This commit is trying to
> work around the loop device doing GFP_KERNEL/GFP_NOFS context
> allocation back end IO path of the loop device. This part of the
> loop device is a block device, so needs to run under GFP_NOIO
> context.

I agree that it is broken, but it fixes the above deadlock.

Mikulas
Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Fri, 26 Jun 2020, Eric Biggers wrote: > On Fri, Jun 26, 2020 at 09:46:17AM -0700, Eric Biggers wrote: > > On Fri, Jun 26, 2020 at 12:16:33PM -0400, Mikulas Patocka wrote: > > > +/* > > > + * Pass these flags down through the crypto API. > > > + */ > > > +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | > > > CRYPTO_ALG_ALLOCATES_MEMORY) > > > > This comment is useless. How about: > > > > /* > > * When an algorithm uses another algorithm (e.g., if it's an instance of a > > * template), these are the flags that always get set on the "outer" > > algorithm > > * if any "inner" algorithm has them set. In some cases other flags are > > * inherited too; these are just the flags that are *always* inherited. > > */ > > #define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | > > CRYPTO_ALG_ALLOCATES_MEMORY) > > > > Also I wonder about the case where the inner algorithm is a fallback rather > > than > > part of a template instance. This patch only handles templates, not > > fallbacks. > > Is that intentional? Isn't that technically a bug? > > Also is CRYPTO_ALG_ALLOCATES_MEMORY meant to apply for algorithms of type > "cipher" and "shash"? The code doesn't handle those, so presumably not? > > What about "akcipher"? Yes - the patch should apply for these cases, but I don't know how to do it. Please, do it. 
> > > Index: linux-2.6/crypto/xts.c > > > === > > > --- linux-2.6.orig/crypto/xts.c 2020-06-26 17:24:03.566417000 +0200 > > > +++ linux-2.6/crypto/xts.c2020-06-26 17:24:03.566417000 +0200 > > > @@ -415,7 +415,7 @@ static int create(struct crypto_template > > > } else > > > goto err_free_inst; > > > > > > - inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC; > > > + inst->alg.base.cra_flags = alg->base.cra_flags & > > > CRYPTO_ALG_INHERITED_FLAGS; > > > inst->alg.base.cra_priority = alg->base.cra_priority; > > > inst->alg.base.cra_blocksize = XTS_BLOCK_SIZE; > > > inst->alg.base.cra_alignmask = alg->base.cra_alignmask | > > > > Need to set the mask correctly in this file. > > cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), and > crypto_rfc4309_create() are also missing setting the mask. > > pcrypt_create_aead() is missing both setting the mask and inheriting the > flags. I added CRYPTO_ALG_ALLOCATES_MEMORY there. > Also, "seqiv" instances can be created without CRYPTO_ALG_ALLOCATES_MEMORY > set, > despite seqiv_aead_encrypt() allocating memory. > > - Eric Mikulas
[PATCH 1/3 v3] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the crypto stack. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |2 +- crypto/authenc.c |4 ++-- crypto/authencesn.c |4 ++-- crypto/ccm.c |9 + crypto/chacha20poly1305.c |4 ++-- crypto/cryptd.c | 12 +--- crypto/ctr.c |4 ++-- crypto/cts.c |4 ++-- crypto/essiv.c|4 ++-- crypto/gcm.c | 12 ++-- crypto/geniv.c|4 ++-- crypto/lrw.c |4 ++-- crypto/pcrypt.c |4 +++- crypto/rsa-pkcs1pad.c |4 ++-- crypto/xts.c |2 +- include/crypto/algapi.h | 10 ++ include/linux/crypto.h| 15 +++ 17 files changed, 64 insertions(+), 38 deletions(-) Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/include/linux/crypto.h2020-06-28 17:55:21.976417000 +0200 @@ -102,6 +102,21 @@ #define CRYPTO_NOLOAD 0x8000 /* + * The driver may allocate memory during request processing, so it shouldn't be + * used in cases where memory allocation failures aren't acceptable, such as + * during block device encryption. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001 + +/* + * When an algorithm uses another algorithm (e.g., if it's an instance of a + * template), these are the flags that always get set on the "outer" algorithm + * if any "inner" algorithm has them set. In some cases other flags are + * inherited too; these are just the flags that are *always* inherited. + */ +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY) + +/* * Transform masks and values (for crt_flags). 
*/ #define CRYPTO_TFM_NEED_KEY0x0001 Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/authenc.c 2020-06-28 20:33:29.136417000 +0200 @@ -388,7 +388,7 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_inherited(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -424,7 +424,7 @@ static int crypto_authenc_create(struct goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/authencesn.c 2020-06-28 20:33:32.556417000 +0200 @@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_inherited(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -438,7 +438,7 @@ static int crypto_authenc_esn_create(str goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/ccm.c === --- linux-2.6.orig/crypto/ccm.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/ccm.c 2020-06-28 20:43:42.456417000 +0200 @@ -462,7 +462,7 @@
Re: [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Fri, 26 Jun 2020, Eric Biggers wrote: > On Fri, Jun 26, 2020 at 12:16:33PM -0400, Mikulas Patocka wrote: > > +/* > > + * Pass these flags down through the crypto API. > > + */ > > +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | > > CRYPTO_ALG_ALLOCATES_MEMORY) > > This comment is useless. How about: > > /* > * When an algorithm uses another algorithm (e.g., if it's an instance of a > * template), these are the flags that always get set on the "outer" algorithm > * if any "inner" algorithm has them set. In some cases other flags are > * inherited too; these are just the flags that are *always* inherited. > */ > #define CRYPTO_ALG_INHERITED_FLAGS(CRYPTO_ALG_ASYNC | > CRYPTO_ALG_ALLOCATES_MEMORY) > > Also I wonder about the case where the inner algorithm is a fallback rather > than > part of a template instance. This patch only handles templates, not > fallbacks. > Is that intentional? Isn't that technically a bug? I'm not an expert in crypto internals, so I don't know. I'll send version 3 of this patch and I'd like to ask you or Herbert to fix it. > > + > > +/* > > * Transform masks and values (for crt_flags). > > */ > > #define CRYPTO_TFM_NEED_KEY0x0001 > > Index: linux-2.6/crypto/authenc.c > > === > > --- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200 > > +++ linux-2.6/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200 > > @@ -388,7 +388,8 @@ static int crypto_authenc_create(struct > > if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) > > return -EINVAL; > > > > - mask = crypto_requires_sync(algt->type, algt->mask); > > + mask = crypto_requires_sync(algt->type, algt->mask) | > > + crypto_requires_nomem(algt->type, algt->mask); > > As I suggested earlier, shouldn't there be a function that returns the mask > for > all inherited flags, rather than handling each flag individually? Yes - I've created crypto_requires_inherited for this purpose. 
> > > > inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); > > if (!inst) > > @@ -424,7 +425,7 @@ static int crypto_authenc_create(struct > > goto err_free_inst; > > > > inst->alg.base.cra_flags = (auth_base->cra_flags | > > - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; > > + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; > > Strange indentation here. Likewise in most of the other files. I was told that the code should be 80 characters wide. > > Index: linux-2.6/crypto/xts.c > > === > > --- linux-2.6.orig/crypto/xts.c 2020-06-26 17:24:03.566417000 +0200 > > +++ linux-2.6/crypto/xts.c 2020-06-26 17:24:03.566417000 +0200 > > @@ -415,7 +415,7 @@ static int create(struct crypto_template > > } else > > goto err_free_inst; > > > > - inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC; > > + inst->alg.base.cra_flags = alg->base.cra_flags & > > CRYPTO_ALG_INHERITED_FLAGS; > > inst->alg.base.cra_priority = alg->base.cra_priority; > > inst->alg.base.cra_blocksize = XTS_BLOCK_SIZE; > > inst->alg.base.cra_alignmask = alg->base.cra_alignmask | > > Need to set the mask correctly in this file. I don't know what you mean. > > Index: linux-2.6/crypto/adiantum.c > > === > > --- linux-2.6.orig/crypto/adiantum.c2020-06-26 17:24:03.566417000 > > +0200 > > +++ linux-2.6/crypto/adiantum.c 2020-06-26 17:24:03.566417000 +0200 > > @@ -507,7 +508,8 @@ static int adiantum_create(struct crypto > > if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask) > > return -EINVAL; > > > > - mask = crypto_requires_sync(algt->type, algt->mask); > > + mask = crypto_requires_sync(algt->type, algt->mask) | > > + crypto_requires_nomem(algt->type, algt->mask); > > > > inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL); > > if (!inst) > > Need to use CRYPTO_ALG_INHERITED_FLAGS in this file. OK. > - Eric Mikulas
Re: [PATCH 0/6] Overhaul memalloc_no*
On Sat, 27 Jun 2020, Dave Chinner wrote: > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote: > > Hi > > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag that > > prevents both filesystem recursion and i/o recursion. > > > > Note that any I/O can recurse into a filesystem via the loop device, thus > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set > > and PF_MEMALLOC_NOIO is not set. > > Correct me if I'm wrong, but I think that will prevent swapping from > GFP_NOFS memory reclaim contexts. Yes. > IOWs, this will substantially > change the behaviour of the memory reclaim system under sustained > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is > quite common, so I really don't think we want to telling memory > reclaim "you can't do IO at all" when all we are trying to do is > prevent recursion back into the same filesystem. So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO. > Given that the loop device IO path already operates under > memalloc_noio context, (i.e. the recursion restriction is applied in > only the context that needs is) I see no reason for making that a > global reclaim limitation. I think this is a problem. Suppose that a filesystem does a GFP_NOFS allocation, the allocation triggers an I/O and waits for it to finish, and the loop device driver redirects that I/O to the same filesystem that did the GFP_NOFS allocation. I saw this deadlock in the past in the dm-bufio subsystem - see the commit 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it. Other subsystems that do IO in GFP_NOFS context may deadlock just like bufio. Mikulas
[PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the crypto stack. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |3 ++- crypto/authenc.c |5 +++-- crypto/authencesn.c |5 +++-- crypto/ccm.c |7 --- crypto/chacha20poly1305.c |5 +++-- crypto/cryptd.c |9 ++--- crypto/ctr.c |3 ++- crypto/cts.c |5 +++-- crypto/essiv.c|5 +++-- crypto/gcm.c | 15 +-- crypto/geniv.c|3 ++- crypto/lrw.c |5 +++-- crypto/rsa-pkcs1pad.c |5 +++-- crypto/xts.c |2 +- include/crypto/algapi.h |9 + include/linux/crypto.h| 12 16 files changed, 68 insertions(+), 30 deletions(-) Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/include/linux/crypto.h2020-06-26 17:25:28.066417000 +0200 @@ -102,6 +102,18 @@ #define CRYPTO_NOLOAD 0x8000 /* + * The driver may allocate memory during request processing, so it shouldn't be + * used in cases where memory allocation failures aren't acceptable, such as + * during block device encryption. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001 + +/* + * Pass these flags down through the crypto API. + */ +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY) + +/* * Transform masks and values (for crt_flags). 
*/ #define CRYPTO_TFM_NEED_KEY0x0001 Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200 @@ -388,7 +388,8 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -424,7 +425,7 @@ static int crypto_authenc_create(struct goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/authencesn.c 2020-06-26 17:24:03.566417000 +0200 @@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/ccm.c === --- linux-2.6.orig/crypto/ccm.c 2020-06-26 17:24:03.566417000 +0200 +++ linux-2.6/crypto/ccm.c 2020-06-26 
17:24:03.566417000 +0200 @@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->t
Re: [PATCH 1/3] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY
On Fri, 26 Jun 2020, Herbert Xu wrote: > On Wed, Jun 17, 2020 at 11:09:28AM -0400, Mikulas Patocka wrote: > > > > Index: linux-2.6/include/linux/crypto.h > > === > > --- linux-2.6.orig/include/linux/crypto.h > > +++ linux-2.6/include/linux/crypto.h > > @@ -97,9 +97,18 @@ > > #define CRYPTO_ALG_OPTIONAL_KEY0x4000 > > > > /* > > + * The driver may allocate memory during request processing, so it > > shouldn't be > > + * used in cases where memory allocation failures aren't acceptable, such > > as > > + * during block device encryption. > > + */ > > +#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000 > > + > > +/* > > * Don't trigger module loading > > */ > > -#define CRYPTO_NOLOAD 0x8000 > > +#define CRYPTO_NOLOAD 0x0001 > > + > > +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | > > CRYPTO_ALG_ALLOCATES_MEMORY) > > Any reason why you need to renumber NOLOAD? If not please keep > the existing values. There is no hard reason for that. CRYPTO_NOLOAD is a "virtual" flag that can only be present in crypto algorithm requests, and I thought that the intention was that virtual flags go after the real flags. If you don't want to change existing flags, there is no problem with that. Mikulas > Thanks, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt >
Re: [PATCH 0/6] Overhaul memalloc_no*
Hi I suggest to join memalloc_noio and memalloc_nofs into just one flag that prevents both filesystem recursion and i/o recursion. Note that any I/O can recurse into a filesystem via the loop device, thus it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set and PF_MEMALLOC_NOIO is not set. Mikulas On Thu, 25 Jun 2020, Matthew Wilcox (Oracle) wrote: > I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs > for an upcoming patch series, and Jens also wants it for non-blocking > io_uring. It turns out we already have dm-bufio which could benefit > from memalloc_nowait, so it may as well go into the tree now. > > The biggest problem is that we're basically out of PF_ flags, so we need > to find somewhere else to store the PF_MEMALLOC_NOWAIT flag. It turns > out the PF_ flags are really supposed to be used for flags which are > accessed from other tasks, and the MEMALLOC flags are only going to > be used by this task. So shuffling everything around frees up some PF > flags and generally makes the world a better place. > > Patch series also available from > http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc > > Matthew Wilcox (Oracle) (6): > mm: Replace PF_MEMALLOC_NOIO with memalloc_noio > mm: Add become_kswapd and restore_kswapd > xfs: Convert to memalloc_nofs_save > mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs > mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma > mm: Add memalloc_nowait > > drivers/block/loop.c | 3 +- > drivers/md/dm-bufio.c | 30 > drivers/md/dm-zoned-metadata.c | 5 +- > fs/iomap/buffered-io.c | 2 +- > fs/xfs/kmem.c | 2 +- > fs/xfs/libxfs/xfs_btree.c | 14 +++--- > fs/xfs/xfs_aops.c | 4 +- > fs/xfs/xfs_buf.c | 2 +- > fs/xfs/xfs_linux.h | 6 --- > fs/xfs/xfs_trans.c | 14 +++--- > fs/xfs/xfs_trans.h | 2 +- > include/linux/sched.h | 7 +-- > include/linux/sched/mm.h | 84 ++ > kernel/sys.c | 8 ++-- > mm/vmscan.c| 16 +-- > 15 files changed, 105 insertions(+), 94 deletions(-) > > -- > 2.27.0 >
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
On Sat, 20 Jun 2020, Herbert Xu wrote: > On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote: > > > > I'm looking at this and I'd like to know why does the crypto API fail in > > hard-irq context and why does it work in tasklet context. What's the exact > > reason behind this? > > You're not supposed to do any real work in IRQ handlers. All > the substantial work should be postponed to softirq context. I see. BTW - should it also warn if it is running in a process context with interrupts disabled? Mikulas > Why do you need to do work in hard IRQ context? > > Cheers, > -- > Email: Herbert Xu > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt >
Re: [RFC PATCH 0/1] dm-crypt excessive overhead
On Fri, 19 Jun 2020, Mike Snitzer wrote: > On Fri, Jun 19 2020 at 12:41pm -0400, > Ignat Korchagin wrote: > > > This is a follow up from the long-forgotten [1], but with some more > > convincing > > evidence. Consider the following script: > > > > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html > > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/ > > > > Ignat Korchagin (1): > > Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target > > > > drivers/md/dm-crypt.c | 55 +-- > > 1 file changed, 43 insertions(+), 12 deletions(-) > > > > -- > > 2.20.1 > > > > Hi, > > I saw [2] and have been expecting something from cloudflare ever since. > Nice to see this submission. > > There is useful context in your 0th patch header. I'll likely merge > parts of this patch header with the more terse 1/1 header (reality is > there only needed to be a single patch submission). > > Will review and stage accordingly if all looks fine to me. Mikulas, > please have a look too. > > Thanks, > Mike

+	if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+		if (in_irq()) {
+			/* Crypto API will fail hard in hard IRQ context */
+			tasklet_init(&io->tasklet, kcryptd_crypt_tasklet, (unsigned long)&io->work);
+			tasklet_schedule(&io->tasklet);
+		} else
+			kcryptd_crypt(&io->work);
+	} else {
+		INIT_WORK(&io->work, kcryptd_crypt);
+		queue_work(cc->crypt_queue, &io->work);
+	}

I'm looking at this and I'd like to know why the crypto API fails in hard-irq context and why it works in tasklet context. What's the exact reason behind this? Mikulas
[PATCH 3/3] dm-crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY set. These drivers allocate memory and thus they are unsuitable for block I/O processing. Signed-off-by: Mikulas Patocka --- drivers/md/dm-crypt.c | 17 +++-- 1 file changed, 11 insertions(+), 6 deletions(-) Index: linux-2.6/drivers/md/dm-crypt.c === --- linux-2.6.orig/drivers/md/dm-crypt.c +++ linux-2.6/drivers/md/dm-crypt.c @@ -419,7 +419,8 @@ static int crypt_iv_lmk_ctr(struct crypt return -EINVAL; } - lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0); + lmk->hash_tfm = crypto_alloc_shash("md5", 0, + CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(lmk->hash_tfm)) { ti->error = "Error initializing LMK hash"; return PTR_ERR(lmk->hash_tfm); @@ -581,7 +582,8 @@ static int crypt_iv_tcw_ctr(struct crypt return -EINVAL; } - tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0); + tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, + CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(tcw->crc32_tfm)) { ti->error = "Error initializing CRC32 in TCW"; return PTR_ERR(tcw->crc32_tfm); @@ -768,7 +770,8 @@ static int crypt_iv_elephant_ctr(struct struct iv_elephant_private *elephant = >iv_gen_private.elephant; int r; - elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0); + elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, + CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(elephant->tfm)) { r = PTR_ERR(elephant->tfm); elephant->tfm = NULL; @@ -2088,7 +2091,8 @@ static int crypt_alloc_tfms_skcipher(str return -ENOMEM; for (i = 0; i < cc->tfms_count; i++) { - cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0); + cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, + CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(cc->cipher_tfm.tfms[i])) { err = PTR_ERR(cc->cipher_tfm.tfms[i]); crypt_free_tfms(cc); @@ -2114,7 +2118,8 @@ static int crypt_alloc_tfms_aead(struct if (!cc->cipher_tfm.tfms) return -ENOMEM; - cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, 0); + cc->cipher_tfm.tfms_aead[0] = 
crypto_alloc_aead(ciphermode, 0, + CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(cc->cipher_tfm.tfms_aead[0])) { err = PTR_ERR(cc->cipher_tfm.tfms_aead[0]); crypt_free_tfms(cc); @@ -2565,7 +2570,7 @@ static int crypt_ctr_auth_cipher(struct return -ENOMEM; strncpy(mac_alg, start, end - start); - mac = crypto_alloc_ahash(mac_alg, 0, 0); + mac = crypto_alloc_ahash(mac_alg, 0, CRYPTO_ALG_ALLOCATES_MEMORY); kfree(mac_alg); if (IS_ERR(mac))
[PATCH 2/3] crypto: set the flag CRYPTO_ALG_ALLOCATES_MEMORY
Set the flag CRYPTO_ALG_ALLOCATES_MEMORY in the crypto drivers that allocate memory. drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c: sun8i_ce_cipher drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c: sun8i_ss_cipher drivers/crypto/amlogic/amlogic-gxl-core.c: meson_cipher drivers/crypto/axis/artpec6_crypto.c: artpec6_crypto_common_init drivers/crypto/bcm/cipher.c: spu_skcipher_rx_sg_create drivers/crypto/caam/caamalg.c: aead_edesc_alloc drivers/crypto/caam/caamalg_qi.c: aead_edesc_alloc drivers/crypto/caam/caamalg_qi2.c: aead_edesc_alloc drivers/crypto/caam/caamhash.c: hash_digest_key drivers/crypto/cavium/cpt/cptvf_algs.c: process_request drivers/crypto/cavium/nitrox/nitrox_aead.c: nitrox_process_se_request drivers/crypto/cavium/nitrox/nitrox_skcipher.c: nitrox_process_se_request drivers/crypto/ccp/ccp-crypto-aes-cmac.c: ccp_do_cmac_update drivers/crypto/ccp/ccp-crypto-aes-galois.c: ccp_crypto_enqueue_request drivers/crypto/ccp/ccp-crypto-aes-xts.c: ccp_crypto_enqueue_request drivers/crypto/ccp/ccp-crypto-aes.c: ccp_crypto_enqueue_request drivers/crypto/ccp/ccp-crypto-des3.c: ccp_crypto_enqueue_request drivers/crypto/ccp/ccp-crypto-sha.c: ccp_crypto_enqueue_request drivers/crypto/chelsio/chcr_algo.c: create_cipher_wr drivers/crypto/hisilicon/sec/sec_algs.c: sec_alloc_and_fill_hw_sgl drivers/crypto/hisilicon/sec2/sec_crypto.c: sec_alloc_req_id drivers/crypto/inside-secure/safexcel_cipher.c: safexcel_queue_req drivers/crypto/inside-secure/safexcel_hash.c: safexcel_ahash_enqueue drivers/crypto/ixp4xx_crypto.c: ablk_perform drivers/crypto/marvell/cesa/cipher.c: mv_cesa_skcipher_dma_req_init drivers/crypto/marvell/cesa/hash.c: mv_cesa_ahash_dma_req_init drivers/crypto/marvell/octeontx/otx_cptvf_algs.c: create_ctx_hdr drivers/crypto/n2_core.c: n2_compute_chunks drivers/crypto/picoxcell_crypto.c: spacc_sg_to_ddt drivers/crypto/qat/qat_common/qat_algs.c: qat_alg_skcipher_encrypt drivers/crypto/qce/skcipher.c: qce_skcipher_async_req_handle drivers/crypto/talitos.c : 
talitos_edesc_alloc drivers/crypto/virtio/virtio_crypto_algs.c: __virtio_crypto_skcipher_do_req drivers/crypto/xilinx/zynqmp-aes-gcm.c: zynqmp_aes_aead_cipher Signed-off-by: Mikulas Patocka --- drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c |8 +- drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c |8 +- drivers/crypto/amlogic/amlogic-gxl-core.c |4 - drivers/crypto/axis/artpec6_crypto.c | 20 +++-- drivers/crypto/bcm/cipher.c | 38 +- drivers/crypto/caam/caamalg.c |4 - drivers/crypto/caam/caamalg_qi.c |4 - drivers/crypto/caam/caamalg_qi2.c |6 - drivers/crypto/caam/caamhash.c|2 drivers/crypto/cavium/cpt/cptvf_algs.c| 12 +-- drivers/crypto/cavium/nitrox/nitrox_aead.c|4 - drivers/crypto/cavium/nitrox/nitrox_skcipher.c| 16 ++-- drivers/crypto/ccp/ccp-crypto-aes-cmac.c |1 drivers/crypto/ccp/ccp-crypto-aes-galois.c|1 drivers/crypto/ccp/ccp-crypto-aes-xts.c |1 drivers/crypto/ccp/ccp-crypto-aes.c |2 drivers/crypto/ccp/ccp-crypto-des3.c |1 drivers/crypto/ccp/ccp-crypto-sha.c |1 drivers/crypto/chelsio/chcr_algo.c|6 - drivers/crypto/hisilicon/sec/sec_algs.c | 16 ++-- drivers/crypto/hisilicon/sec2/sec_crypto.c|4 - drivers/crypto/inside-secure/safexcel_cipher.c| 47 + drivers/crypto/inside-secure/safexcel_hash.c | 18 + drivers/crypto/ixp4xx_crypto.c|6 + drivers/crypto/marvell/cesa/cipher.c | 12 +-- drivers/crypto/marvell/cesa/hash.c|6 + drivers/crypto/marvell/octeontx/otx_cptvf_algs.c | 30 drivers/crypto/n2_core.c |2 drivers/crypto/picoxcell_crypto.c | 17 drivers/crypto/qat/qat_common/qat_algs.c | 12 +-- drivers/crypto/qce/sha.c |2 drivers/crypto/qce/skcipher.c |1 drivers/crypto/talitos.c | 78 +++--- drivers/crypto/virtio/virtio_crypto_algs.c|2 drivers/crypto/xilinx/zynqmp-aes-gcm.c|1 35 files changed, 248 insertions(+), 145 deletions(-) Index: linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c === --- linux-2.6.orig/drivers/crypto/qat/qat_common/qat_algs.c +++ linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c @@ -1267,7 +1267,7 @@ static struct aead_alg qat_aeads[] = { { 
.cra_name = "authenc(hmac(sha1),cbc(aes))", .cra_driver_name = "qat_aes_cbc_hmac_sha1", .cra_priority = 4001, - .cra_flags = CRYPTO_ALG_ASYNC, + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_
[PATCH 1/3] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the crypto stack. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |3 ++- crypto/authenc.c |5 +++-- crypto/authencesn.c |5 +++-- crypto/ccm.c |7 --- crypto/chacha20poly1305.c |5 +++-- crypto/cryptd.c |9 ++--- crypto/ctr.c |3 ++- crypto/cts.c |5 +++-- crypto/essiv.c|5 +++-- crypto/gcm.c | 15 +-- crypto/geniv.c|3 ++- crypto/lrw.c |5 +++-- crypto/rsa-pkcs1pad.c |5 +++-- crypto/xts.c |2 +- include/crypto/algapi.h |9 + include/linux/crypto.h| 11 ++- 16 files changed, 66 insertions(+), 31 deletions(-) Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h +++ linux-2.6/include/linux/crypto.h @@ -97,9 +97,18 @@ #define CRYPTO_ALG_OPTIONAL_KEY0x4000 /* + * The driver may allocate memory during request processing, so it shouldn't be + * used in cases where memory allocation failures aren't acceptable, such as + * during block device encryption. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000 + +/* * Don't trigger module loading */ -#define CRYPTO_NOLOAD 0x8000 +#define CRYPTO_NOLOAD 0x0001 + +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY) /* * Transform masks and values (for crt_flags). 
Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c +++ linux-2.6/crypto/authenc.c @@ -388,7 +388,8 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -424,7 +425,7 @@ static int crypto_authenc_create(struct goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c +++ linux-2.6/crypto/authencesn.c @@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS; inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/ccm.c === --- linux-2.6.orig/crypto/ccm.c +++ linux-2.6/crypto/ccm.c @@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, 
algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL); if (!inst) @@ -507,7 +508,7 @@ static int crypto_ccm_create_common(stru mac->base.cra_drive
Re: [dm-devel] [PATCH 1/4] crypto: introduce CRYPTO_ALG_ALLOCATES_MEMORY
I'm resending the patches with your comments resolved... Mikulas On Tue, 16 Jun 2020, Eric Biggers wrote: > On Tue, Jun 16, 2020 at 11:01:31AM -0400, Mikulas Patocka wrote: > > Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and modify dm-crypt, so > > that it uses only drivers without this flag. > > > > If the flag is set, then the crypto driver allocates memory in its request > > routine. Such drivers are not suitable for disk encryption because > > GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and > > GFP_KERNEL allocation can recurse into the block layer, causing a > > deadlock. > > > > Signed-off-by: Mikulas Patocka > > > > Index: linux-2.6/include/linux/crypto.h > > === > > --- linux-2.6.orig/include/linux/crypto.h > > +++ linux-2.6/include/linux/crypto.h > > @@ -97,9 +97,15 @@ > > #define CRYPTO_ALG_OPTIONAL_KEY0x4000 > > > > /* > > + * The driver is allocating emmory in its encrypt or decrypt callback, > > + * so it should not be used to encrypt block devices. > > + */ > > "is allocating emmory" => "may allocate memory" > > "so it should not be used to encrypt block devices" => > "so it shouldn't be used in cases where memory allocation failures aren't > acceptable, such as during block device encryption". > > Also, which types of algorithms does this flag apply to? E.g. if it applies > to > hash algorithms too, it's not sufficient to say "encrypt or decrypt callback". > > How about: > > /* > * The driver may allocate memory during request processing, so it shouldn't > be > * used in cases where memory allocation failures aren't acceptable, such as > * during block device encryption. > */ > > > +#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000 > > + > > +/* > > * Don't trigger module loading > > */ > > -#define CRYPTO_NOLOAD 0x8000 > > +#define CRYPTO_NOLOAD 0x0001 > > > > /* > > * Transform masks and values (for crt_flags). 
> > Index: linux-2.6/drivers/md/dm-crypt.c > > === > > This would better belong as two separate patches: one to introduce > CRYPTO_ALG_ALLOCATES_MEMORY, and one to make dm-crypt use it. > > > --- linux-2.6.orig/drivers/md/dm-crypt.c > > +++ linux-2.6/drivers/md/dm-crypt.c > > @@ -419,7 +419,7 @@ static int crypt_iv_lmk_ctr(struct crypt > > return -EINVAL; > > } > > > > - lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0); > > + lmk->hash_tfm = crypto_alloc_shash("md5", 0, > > CRYPTO_ALG_ALLOCATES_MEMORY); > > if (IS_ERR(lmk->hash_tfm)) { > > ti->error = "Error initializing LMK hash"; > > return PTR_ERR(lmk->hash_tfm); > > @@ -581,7 +581,7 @@ static int crypt_iv_tcw_ctr(struct crypt > > return -EINVAL; > > } > > > > - tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0); > > + tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, > > CRYPTO_ALG_ALLOCATES_MEMORY); > > if (IS_ERR(tcw->crc32_tfm)) { > > ti->error = "Error initializing CRC32 in TCW"; > > return PTR_ERR(tcw->crc32_tfm); > > @@ -768,7 +768,7 @@ static int crypt_iv_elephant_ctr(struct > > struct iv_elephant_private *elephant = >iv_gen_private.elephant; > > int r; > > > > - elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0); > > + elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, > > CRYPTO_ALG_ALLOCATES_MEMORY); > > if (IS_ERR(elephant->tfm)) { > > r = PTR_ERR(elephant->tfm); > > elephant->tfm = NULL; > > @@ -2088,7 +2088,7 @@ static int crypt_alloc_tfms_skcipher(str > > return -ENOMEM; > > > > for (i = 0; i < cc->tfms_count; i++) { > > - cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, > > 0); > > + cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, > > CRYPTO_ALG_ALLOCATES_MEMORY); > > Despite the recent relaxation in rules, the preferred length of a line is > still > 80 columns. > > - Eric >
[PATCH 2/2] hisilicon-crypto: don't sleep if CRYPTO_TFM_REQ_MAY_SLEEP was not specified
There is this call chain: sec_alg_skcipher_encrypt -> sec_alg_skcipher_crypto -> sec_alg_alloc_and_calc_split_sizes -> kcalloc where we call sleeping allocator function even if CRYPTO_TFM_REQ_MAY_SLEEP was not specified. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org # v4.19+ Fixes: 915e4e8413da ("crypto: hisilicon - SEC security accelerator driver") --- drivers/crypto/hisilicon/sec/sec_algs.c | 34 1 file changed, 18 insertions(+), 16 deletions(-) Index: linux-2.6/drivers/crypto/hisilicon/sec/sec_algs.c === --- linux-2.6.orig/drivers/crypto/hisilicon/sec/sec_algs.c +++ linux-2.6/drivers/crypto/hisilicon/sec/sec_algs.c @@ -175,7 +175,8 @@ static int sec_alloc_and_fill_hw_sgl(str dma_addr_t *psec_sgl, struct scatterlist *sgl, int count, -struct sec_dev_info *info) +struct sec_dev_info *info, +gfp_t gfp) { struct sec_hw_sgl *sgl_current = NULL; struct sec_hw_sgl *sgl_next; @@ -190,7 +191,7 @@ static int sec_alloc_and_fill_hw_sgl(str sge_index = i % SEC_MAX_SGE_NUM; if (sge_index == 0) { sgl_next = dma_pool_zalloc(info->hw_sgl_pool, - GFP_KERNEL, _next_dma); + gfp, _next_dma); if (!sgl_next) { ret = -ENOMEM; goto err_free_hw_sgls; @@ -545,14 +546,14 @@ void sec_alg_callback(struct sec_bd_info } static int sec_alg_alloc_and_calc_split_sizes(int length, size_t **split_sizes, - int *steps) + int *steps, gfp_t gfp) { size_t *sizes; int i; /* Split into suitable sized blocks */ *steps = roundup(length, SEC_REQ_LIMIT) / SEC_REQ_LIMIT; - sizes = kcalloc(*steps, sizeof(*sizes), GFP_KERNEL); + sizes = kcalloc(*steps, sizeof(*sizes), gfp); if (!sizes) return -ENOMEM; @@ -568,7 +569,7 @@ static int sec_map_and_split_sg(struct s int steps, struct scatterlist ***splits, int **splits_nents, int sgl_len_in, - struct device *dev) + struct device *dev, gfp_t gfp) { int ret, count; @@ -576,12 +577,12 @@ static int sec_map_and_split_sg(struct s if (!count) return -EINVAL; - *splits = kcalloc(steps, sizeof(struct scatterlist *), GFP_KERNEL); + *splits = kcalloc(steps, 
sizeof(struct scatterlist *), gfp); if (!*splits) { ret = -ENOMEM; goto err_unmap_sg; } - *splits_nents = kcalloc(steps, sizeof(int), GFP_KERNEL); + *splits_nents = kcalloc(steps, sizeof(int), gfp); if (!*splits_nents) { ret = -ENOMEM; goto err_free_splits; @@ -589,7 +590,7 @@ static int sec_map_and_split_sg(struct s /* output the scatter list before and after this */ ret = sg_split(sgl, count, 0, steps, split_sizes, - *splits, *splits_nents, GFP_KERNEL); + *splits, *splits_nents, gfp); if (ret) { ret = -ENOMEM; goto err_free_splits_nents; @@ -630,13 +631,13 @@ static struct sec_request_el int el_size, bool different_dest, struct scatterlist *sgl_in, int n_ents_in, struct scatterlist *sgl_out, int n_ents_out, - struct sec_dev_info *info) + struct sec_dev_info *info, gfp_t gfp) { struct sec_request_el *el; struct sec_bd_info *req; int ret; - el = kzalloc(sizeof(*el), GFP_KERNEL); + el = kzalloc(sizeof(*el), gfp); if (!el) return ERR_PTR(-ENOMEM); el->el_length = el_size; @@ -668,7 +669,7 @@ static struct sec_request_el el->sgl_in = sgl_in; ret = sec_alloc_and_fill_hw_sgl(>in, >dma_in, el->sgl_in, - n_ents_in, info); + n_ents_in, info, gfp); if (ret) goto err_free_el; @@ -679,7 +680,7 @@ static struct sec_request_el el->sgl_out = sgl_out; ret = sec_alloc_and_fill_hw_sgl(>out, >dma_out, el->sgl_out, - n_ents_out, info); +
[PATCH 1/2] cpt-crypto: don't sleep if CRYPTO_TFM_REQ_MAY_SLEEP was not specified
There is this call chain: cvm_encrypt -> cvm_enc_dec -> cptvf_do_request -> process_request -> kzalloc where we call sleeping allocator function even if CRYPTO_TFM_REQ_MAY_SLEEP was not specified. Signed-off-by: Mikulas Patocka Cc: sta...@vger.kernel.org # v4.11+ Fixes: c694b233295b ("crypto: cavium - Add the Virtual Function driver for CPT") --- drivers/crypto/cavium/cpt/cptvf_algs.c |1 + drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 ++-- drivers/crypto/cavium/cpt/request_manager.h |2 ++ 3 files changed, 9 insertions(+), 6 deletions(-) Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c === --- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_algs.c +++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c @@ -200,6 +200,7 @@ static inline int cvm_enc_dec(struct skc int status; memset(req_info, 0, sizeof(struct cpt_request_info)); + req_info->may_sleep = (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) != 0; memset(fctx, 0, sizeof(struct fc_context)); create_input_list(req, enc, enc_iv_len); create_output_list(req, enc_iv_len); Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c === --- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_reqmanager.c +++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c @@ -133,7 +133,7 @@ static inline int setup_sgio_list(struct /* Setup gather (input) components */ g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component); - info->gather_components = kzalloc(g_sz_bytes, GFP_KERNEL); + info->gather_components = kzalloc(g_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->gather_components) { ret = -ENOMEM; goto scatter_gather_clean; @@ -150,7 +150,7 @@ static inline int setup_sgio_list(struct /* Setup scatter (output) components */ s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component); - info->scatter_components = kzalloc(s_sz_bytes, GFP_KERNEL); + info->scatter_components = kzalloc(s_sz_bytes, req->may_sleep ? 
GFP_KERNEL : GFP_ATOMIC); if (!info->scatter_components) { ret = -ENOMEM; goto scatter_gather_clean; @@ -167,7 +167,7 @@ static inline int setup_sgio_list(struct /* Create and initialize DPTR */ info->dlen = g_sz_bytes + s_sz_bytes + SG_LIST_HDR_SIZE; - info->in_buffer = kzalloc(info->dlen, GFP_KERNEL); + info->in_buffer = kzalloc(info->dlen, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->in_buffer) { ret = -ENOMEM; goto scatter_gather_clean; @@ -195,7 +195,7 @@ static inline int setup_sgio_list(struct } /* Create and initialize RPTR */ - info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, GFP_KERNEL); + info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->out_buffer) { ret = -ENOMEM; goto scatter_gather_clean; @@ -421,7 +421,7 @@ int process_request(struct cpt_vf *cptvf struct cpt_vq_command vq_cmd; union cpt_inst_s cptinst; - info = kzalloc(sizeof(*info), GFP_KERNEL); + info = kzalloc(sizeof(*info), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (unlikely(!info)) { dev_err(>dev, "Unable to allocate memory for info_buffer\n"); return -ENOMEM; @@ -443,7 +443,7 @@ int process_request(struct cpt_vf *cptvf * Get buffer for union cpt_res_s response * structure and its physical address */ - info->completion_addr = kzalloc(sizeof(union cpt_res_s), GFP_KERNEL); + info->completion_addr = kzalloc(sizeof(union cpt_res_s), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (unlikely(!info->completion_addr)) { dev_err(>dev, "Unable to allocate memory for completion_addr\n"); ret = -ENOMEM; Index: linux-2.6/drivers/crypto/cavium/cpt/request_manager.h === --- linux-2.6.orig/drivers/crypto/cavium/cpt/request_manager.h +++ linux-2.6/drivers/crypto/cavium/cpt/request_manager.h @@ -62,6 +62,8 @@ struct cpt_request_info { union ctrl_info ctrl; /* User control information */ struct cptvf_request req; /* Request Information (Core specific) */ + bool may_sleep; + struct buf_ptr in[MAX_BUF_CNT]; struct buf_ptr out[MAX_BUF_CNT];
Re: [dm-devel] [PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP
On Tue, 16 Jun 2020, Eric Biggers wrote: > On Tue, Jun 16, 2020 at 02:18:17PM -0400, Mikulas Patocka wrote: > > > > > > On Tue, 16 Jun 2020, Eric Biggers wrote: > > > > > On Tue, Jun 16, 2020 at 11:02:50AM -0400, Mikulas Patocka wrote: > > > > Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If > > > > CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation > > > > that sleeps. > > > > > > > > Signed-off-by: Mikulas Patocka > > > > > > I think you need to split this up per driver with a proper explanation > > > and a > > > "Fixes:" tag for each driver. > > > > > > Also, these bugs should have been detected by the crypto self-tests > > > already, > > > since they test having preemption disabled and CRYPTO_TFM_REQ_MAY_SLEEP > > > cleared. > > > Can you double check whether these are all valid fixes? One thing to > > > watch out > > > > > > for is that CRYPTO_TFM_REQ_MAY_SLEEP only applies to the function call > > > like > > > crypto_skcipher_encrypt() itself. If the implementation is asynchronous > > > and the > > > request gets processed in the background (i.e. if > > > crypto_skcipher_encrypt() > > > returns -EINPROGRESS), the background work doesn't have to honor > > > CRYPTO_TFM_REQ_MAY_SLEEP. > > > > > > - Eric > > > > I can only compile-test this patch. I don't have the hardware. > > > > I'm just asking for you to check the code extra carefully. The fact that the > self-tests should have been detecting this type of bug implies that these > might > not actually be valid fixes. I've checked it more thoroughly and found out that 3 out of 5 drivers do the GFP_KERNEL allocation from crypto-engine callback. So, it is supposedly OK. > However, we do know that not all crypto drivers are being actively tested with > the latest self-tests and with kernel debugging options enabled. So it's > expected that some are indeed broken. 
> > - Eric The broken ones are drivers/crypto/cavium/cpt/ and drivers/crypto/hisilicon/sec/ I'm sending patches for them. Mikulas
Re: [dm-devel] [PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP
On Tue, 16 Jun 2020, Eric Biggers wrote: > On Tue, Jun 16, 2020 at 11:02:50AM -0400, Mikulas Patocka wrote: > > Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If > > CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation > > that sleeps. > > > > Signed-off-by: Mikulas Patocka > > I think you need to split this up per driver with a proper explanation and a > "Fixes:" tag for each driver. > > Also, these bugs should have been detected by the crypto self-tests already, > since they test having preemption disabled and CRYPTO_TFM_REQ_MAY_SLEEP > cleared. > Can you double check whether these are all valid fixes? One thing to watch > out > for is that CRYPTO_TFM_REQ_MAY_SLEEP only applies to the function call like > crypto_skcipher_encrypt() itself. If the implementation is asynchronous and > the > request gets processed in the background (i.e. if crypto_skcipher_encrypt() > returns -EINPROGRESS), the background work doesn't have to honor > CRYPTO_TFM_REQ_MAY_SLEEP. > > - Eric I can only compile-test this patch. I don't have the hardware. Mikulas
[PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP
Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation that sleeps. Signed-off-by: Mikulas Patocka --- drivers/crypto/amlogic/amlogic-gxl-cipher.c |5 ++- drivers/crypto/cavium/cpt/cptvf_algs.c |1 drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 - drivers/crypto/cavium/cpt/request_manager.h |2 + drivers/crypto/hisilicon/sec/sec_algs.c | 34 ++- drivers/crypto/virtio/virtio_crypto_algs.c |4 +-- drivers/crypto/xilinx/zynqmp-aes-gcm.c |4 +-- 7 files changed, 34 insertions(+), 28 deletions(-) Index: linux-2.6/drivers/crypto/virtio/virtio_crypto_algs.c === --- linux-2.6.orig/drivers/crypto/virtio/virtio_crypto_algs.c +++ linux-2.6/drivers/crypto/virtio/virtio_crypto_algs.c @@ -364,12 +364,12 @@ __virtio_crypto_skcipher_do_req(struct v /* Why 3? outhdr + iv + inhdr */ sg_total = src_nents + dst_nents + 3; - sgs = kcalloc_node(sg_total, sizeof(*sgs), GFP_KERNEL, + sgs = kcalloc_node(sg_total, sizeof(*sgs), req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL : GFP_ATOMIC, dev_to_node(>vdev->dev)); if (!sgs) return -ENOMEM; - req_data = kzalloc_node(sizeof(*req_data), GFP_KERNEL, + req_data = kzalloc_node(sizeof(*req_data), req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? 
GFP_KERNEL : GFP_ATOMIC, dev_to_node(>vdev->dev)); if (!req_data) { kfree(sgs); Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c === --- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_algs.c +++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c @@ -200,6 +200,7 @@ static inline int cvm_enc_dec(struct skc int status; memset(req_info, 0, sizeof(struct cpt_request_info)); + req_info->may_sleep = (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) != 0; memset(fctx, 0, sizeof(struct fc_context)); create_input_list(req, enc, enc_iv_len); create_output_list(req, enc_iv_len); Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c === --- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_reqmanager.c +++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c @@ -133,7 +133,7 @@ static inline int setup_sgio_list(struct /* Setup gather (input) components */ g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component); - info->gather_components = kzalloc(g_sz_bytes, GFP_KERNEL); + info->gather_components = kzalloc(g_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->gather_components) { ret = -ENOMEM; goto scatter_gather_clean; @@ -150,7 +150,7 @@ static inline int setup_sgio_list(struct /* Setup scatter (output) components */ s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component); - info->scatter_components = kzalloc(s_sz_bytes, GFP_KERNEL); + info->scatter_components = kzalloc(s_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->scatter_components) { ret = -ENOMEM; goto scatter_gather_clean; @@ -167,7 +167,7 @@ static inline int setup_sgio_list(struct /* Create and initialize DPTR */ info->dlen = g_sz_bytes + s_sz_bytes + SG_LIST_HDR_SIZE; - info->in_buffer = kzalloc(info->dlen, GFP_KERNEL); + info->in_buffer = kzalloc(info->dlen, req->may_sleep ? 
GFP_KERNEL : GFP_ATOMIC); if (!info->in_buffer) { ret = -ENOMEM; goto scatter_gather_clean; @@ -195,7 +195,7 @@ static inline int setup_sgio_list(struct } /* Create and initialize RPTR */ - info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, GFP_KERNEL); + info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (!info->out_buffer) { ret = -ENOMEM; goto scatter_gather_clean; @@ -421,7 +421,7 @@ int process_request(struct cpt_vf *cptvf struct cpt_vq_command vq_cmd; union cpt_inst_s cptinst; - info = kzalloc(sizeof(*info), GFP_KERNEL); + info = kzalloc(sizeof(*info), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC); if (unlikely(!info)) { dev_err(>dev, "Unable to allocate memory for info_buffer\n"); return -ENOMEM; @@ -443,7 +443,7 @@ int process_request(struct cpt_vf *cptvf * Get buffer for union cpt_res_s response * structure and its physical address */ - info-&g
[PATCH 3/4] crypto: set the flag CRYPTO_ALG_ALLOCATES_MEMORY
Set the flag CRYPTO_ALG_ALLOCATES_MEMORY in the crypto drivers that allocate memory. Signed-off-by: Mikulas Patocka --- drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c |8 +- drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c |8 +- drivers/crypto/amlogic/amlogic-gxl-core.c |4 - drivers/crypto/axis/artpec6_crypto.c | 20 +++-- drivers/crypto/bcm/cipher.c | 38 +- drivers/crypto/caam/caamalg.c |4 - drivers/crypto/caam/caamalg_qi.c |4 - drivers/crypto/caam/caamalg_qi2.c |6 - drivers/crypto/caam/caamhash.c|2 drivers/crypto/cavium/cpt/cptvf_algs.c| 12 +-- drivers/crypto/cavium/nitrox/nitrox_aead.c|4 - drivers/crypto/cavium/nitrox/nitrox_skcipher.c| 16 ++-- drivers/crypto/ccp/ccp-crypto-aes-cmac.c |1 drivers/crypto/ccp/ccp-crypto-aes-galois.c|1 drivers/crypto/ccp/ccp-crypto-aes-xts.c |1 drivers/crypto/ccp/ccp-crypto-aes.c |2 drivers/crypto/ccp/ccp-crypto-des3.c |1 drivers/crypto/ccp/ccp-crypto-sha.c |1 drivers/crypto/chelsio/chcr_algo.c|6 - drivers/crypto/hisilicon/sec/sec_algs.c | 16 ++-- drivers/crypto/hisilicon/sec2/sec_crypto.c|4 - drivers/crypto/inside-secure/safexcel_cipher.c| 47 + drivers/crypto/inside-secure/safexcel_hash.c | 18 + drivers/crypto/ixp4xx_crypto.c|6 + drivers/crypto/marvell/cesa/cipher.c | 12 +-- drivers/crypto/marvell/cesa/hash.c|6 + drivers/crypto/marvell/octeontx/otx_cptvf_algs.c | 30 drivers/crypto/n2_core.c |2 drivers/crypto/picoxcell_crypto.c | 17 drivers/crypto/qat/qat_common/qat_algs.c | 12 +-- drivers/crypto/qce/sha.c |2 drivers/crypto/qce/skcipher.c |1 drivers/crypto/talitos.c | 78 +++--- drivers/crypto/virtio/virtio_crypto_algs.c|2 drivers/crypto/xilinx/zynqmp-aes-gcm.c|1 35 files changed, 248 insertions(+), 145 deletions(-) Index: linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c === --- linux-2.6.orig/drivers/crypto/qat/qat_common/qat_algs.c +++ linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c @@ -1267,7 +1267,7 @@ static struct aead_alg qat_aeads[] = { { .cra_name = "authenc(hmac(sha1),cbc(aes))", .cra_driver_name = 
"qat_aes_cbc_hmac_sha1", .cra_priority = 4001, - .cra_flags = CRYPTO_ALG_ASYNC, + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize = sizeof(struct qat_alg_aead_ctx), .cra_module = THIS_MODULE, @@ -1284,7 +1284,7 @@ static struct aead_alg qat_aeads[] = { { .cra_name = "authenc(hmac(sha256),cbc(aes))", .cra_driver_name = "qat_aes_cbc_hmac_sha256", .cra_priority = 4001, - .cra_flags = CRYPTO_ALG_ASYNC, + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize = sizeof(struct qat_alg_aead_ctx), .cra_module = THIS_MODULE, @@ -1301,7 +1301,7 @@ static struct aead_alg qat_aeads[] = { { .cra_name = "authenc(hmac(sha512),cbc(aes))", .cra_driver_name = "qat_aes_cbc_hmac_sha512", .cra_priority = 4001, - .cra_flags = CRYPTO_ALG_ASYNC, + .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY, .cra_blocksize = AES_BLOCK_SIZE, .cra_ctxsize = sizeof(struct qat_alg_aead_ctx), .cra_module = THIS_MODULE, @@ -1319,7 +1319,7 @@ static struct skcipher_alg qat_skciphers .base.cra_name = "cbc(aes)", .base.cra_driver_name = "qat_aes_cbc", .base.cra_priority = 4001, - .base.cra_flags = CRYPTO_ALG_ASYNC, + .base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY, .base.cra_blocksize = AES_BLOCK_SIZE, .base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx), .base.cra_alignmask = 0, @@ -1337,7 +1337,7 @@ static struct skcipher_alg qat_skciphers .base.cra_name = "ctr(aes)", .base.cra_driver_name = "qat_aes_ctr", .base.cra_priority = 4001, - .base.cra_flags = CRYPTO_ALG_ASYNC, + .base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY, .base.cra_blocksize = 1, .base.cra_ctxsize
[PATCH 1/4] crypto: introduce CRYPTO_ALG_ALLOCATES_MEMORY
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and modify dm-crypt, so that it uses only drivers without this flag. If the flag is set, then the crypto driver allocates memory in its request routine. Such drivers are not suitable for disk encryption because GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and GFP_KERNEL allocation can recurse into the block layer, causing a deadlock. Signed-off-by: Mikulas Patocka Index: linux-2.6/include/linux/crypto.h === --- linux-2.6.orig/include/linux/crypto.h +++ linux-2.6/include/linux/crypto.h @@ -97,9 +97,15 @@ #define CRYPTO_ALG_OPTIONAL_KEY 0x00004000 /* + * The driver is allocating memory in its encrypt or decrypt callback, + * so it should not be used to encrypt block devices. + */ +#define CRYPTO_ALG_ALLOCATES_MEMORY 0x00008000 + +/* * Don't trigger module loading */ -#define CRYPTO_NOLOAD 0x00008000 +#define CRYPTO_NOLOAD 0x00010000 /* * Transform masks and values (for crt_flags). Index: linux-2.6/drivers/md/dm-crypt.c === --- linux-2.6.orig/drivers/md/dm-crypt.c +++ linux-2.6/drivers/md/dm-crypt.c @@ -419,7 +419,7 @@ static int crypt_iv_lmk_ctr(struct crypt return -EINVAL; } - lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0); + lmk->hash_tfm = crypto_alloc_shash("md5", 0, CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(lmk->hash_tfm)) { ti->error = "Error initializing LMK hash"; return PTR_ERR(lmk->hash_tfm); @@ -581,7 +581,7 @@ static int crypt_iv_tcw_ctr(struct crypt return -EINVAL; } - tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0); + tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(tcw->crc32_tfm)) { ti->error = "Error initializing CRC32 in TCW"; return PTR_ERR(tcw->crc32_tfm); @@ -768,7 +768,7 @@ static int crypt_iv_elephant_ctr(struct struct iv_elephant_private *elephant = &cc->iv_gen_private.elephant; int r; - elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0); + elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(elephant->tfm))
{ r = PTR_ERR(elephant->tfm); elephant->tfm = NULL; @@ -2088,7 +2088,7 @@ static int crypt_alloc_tfms_skcipher(str return -ENOMEM; for (i = 0; i < cc->tfms_count; i++) { - cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0); + cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(cc->cipher_tfm.tfms[i])) { err = PTR_ERR(cc->cipher_tfm.tfms[i]); crypt_free_tfms(cc); @@ -2114,7 +2114,7 @@ static int crypt_alloc_tfms_aead(struct if (!cc->cipher_tfm.tfms) return -ENOMEM; - cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, 0); + cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, CRYPTO_ALG_ALLOCATES_MEMORY); if (IS_ERR(cc->cipher_tfm.tfms_aead[0])) { err = PTR_ERR(cc->cipher_tfm.tfms_aead[0]); crypt_free_tfms(cc); @@ -2565,7 +2565,7 @@ static int crypt_ctr_auth_cipher(struct return -ENOMEM; strncpy(mac_alg, start, end - start); - mac = crypto_alloc_ahash(mac_alg, 0, 0); + mac = crypto_alloc_ahash(mac_alg, 0, CRYPTO_ALG_ALLOCATES_MEMORY); kfree(mac_alg); if (IS_ERR(mac))
[PATCH 2/4] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY
Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API. Signed-off-by: Mikulas Patocka --- crypto/adiantum.c |3 ++- crypto/authenc.c |5 +++-- crypto/authencesn.c |5 +++-- crypto/ccm.c |7 --- crypto/chacha20poly1305.c |5 +++-- crypto/cryptd.c |7 +-- crypto/ctr.c |3 ++- crypto/cts.c |5 +++-- crypto/essiv.c|5 +++-- crypto/gcm.c | 15 +-- crypto/geniv.c|3 ++- crypto/lrw.c |5 +++-- crypto/rsa-pkcs1pad.c |5 +++-- crypto/xts.c |2 +- include/crypto/algapi.h |9 + 15 files changed, 55 insertions(+), 29 deletions(-) Index: linux-2.6/crypto/authenc.c === --- linux-2.6.orig/crypto/authenc.c +++ linux-2.6/crypto/authenc.c @@ -388,7 +388,8 @@ static int crypto_authenc_create(struct if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -424,7 +425,7 @@ static int crypto_authenc_create(struct goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY); inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/authencesn.c === --- linux-2.6.orig/crypto/authencesn.c +++ linux-2.6/crypto/authencesn.c @@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL); if (!inst) @@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str goto err_free_inst; inst->alg.base.cra_flags = (auth_base->cra_flags | - 
enc->base.cra_flags) & CRYPTO_ALG_ASYNC; + enc->base.cra_flags) & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY); inst->alg.base.cra_priority = enc->base.cra_priority * 10 + auth_base->cra_priority; inst->alg.base.cra_blocksize = enc->base.cra_blocksize; Index: linux-2.6/crypto/ccm.c === --- linux-2.6.orig/crypto/ccm.c +++ linux-2.6/crypto/ccm.c @@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask) return -EINVAL; - mask = crypto_requires_sync(algt->type, algt->mask); + mask = crypto_requires_sync(algt->type, algt->mask) | + crypto_requires_nomem(algt->type, algt->mask); inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL); if (!inst) @@ -507,7 +508,7 @@ static int crypto_ccm_create_common(stru mac->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = ctr->base.cra_flags & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = ctr->base.cra_flags & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY); inst->alg.base.cra_priority = (mac->base.cra_priority + ctr->base.cra_priority) / 2; inst->alg.base.cra_blocksize = 1; @@ -759,7 +760,7 @@ static int crypto_rfc4309_create(struct CRYPTO_MAX_ALG_NAME) goto err_free_inst; - inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC; + inst->alg.base.cra_flags = alg->base.cra_flags & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY); inst->alg.base.cra_priority = alg->base.cra_priority; inst->alg.base.cra_blocksize = 1; inst->alg.base.cra_alignmask = alg->base.cra_alignmask; Index: linux-2.6/crypto/chacha20poly1305.c === ---
Re: crypto API and GFP_ATOMIC
On Wed, 10 Jun 2020, Herbert Xu wrote: > On Wed, Jun 10, 2020 at 08:02:23AM -0400, Mikulas Patocka wrote: > > > > Yes, fixing the drivers would be the best - but you can hardly find any > > person who has all the crypto hardware and who is willing to rewrite all > > the drivers for it. > > We don't have to rewrite them straight away. We could mark the > known broken ones (or the known working ones) and then dm-crypt > can allocate only those using the types/mask to crypto_alloc. > > Cheers, I triaged the drivers in drivers/crypto and unfortunately, most of them do memory allocation in the encryption routine. Some of them do GFP_KERNEL allocation even in the absence of CRYPTO_TFM_REQ_MAY_SLEEP. I'm sending the patches: The first patch adds a new flag CRYPTO_ALG_ALLOCATES_MEMORY. The second patch passes CRYPTO_ALG_ALLOCATES_MEMORY through the crypto API stack (please check it - I am not an expert in this area). The third patch sets CRYPTO_ALG_ALLOCATES_MEMORY on drivers that allocate memory in the encrypt/decrypt routine. The fourth patch fixes the drivers that use GFP_KERNEL in non-blocking context. Mikulas
Re: [PATCH] dm writecache: skip writecache_wait when using pmem mode
On Fri, 12 Jun 2020, Huaisheng Ye wrote: > From: Huaisheng Ye > > The array bio_in_progress is only used with ssd mode. So skip > writecache_wait_for_ios in writecache_discard when pmem mode. > > Signed-off-by: Huaisheng Ye Acked-by: Mikulas Patocka > --- > drivers/md/dm-writecache.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 66f3a3b..4367cc7 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -849,8 +849,10 @@ static void writecache_discard(struct dm_writecache *wc, > sector_t start, sector_ > > if (likely(!e->write_in_progress)) { > if (!discarded_something) { > - writecache_wait_for_ios(wc, READ); > - writecache_wait_for_ios(wc, WRITE); > + if (!WC_MODE_PMEM(wc)) { > + writecache_wait_for_ios(wc, READ); > + writecache_wait_for_ios(wc, WRITE); > + } > discarded_something = true; > } > writecache_free_entry(wc, e); > -- > 1.8.3.1 >
Re: [PATCH] dm writecache: correct uncommitted_block when discarding uncommitted entry
On Fri, 12 Jun 2020, Huaisheng Ye wrote: > From: Huaisheng Ye > > When uncommitted entry has been discarded, correct wc->uncommitted_block > for getting the exact number. > > Signed-off-by: Huaisheng Ye Acked-by: Mikulas Patocka Also, add: Cc: sta...@vger.kernel.org > --- > drivers/md/dm-writecache.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 4367cc7..64b4527 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -855,6 +855,8 @@ static void writecache_discard(struct dm_writecache *wc, > sector_t start, sector_ > } > discarded_something = true; > } > + if (!writecache_entry_is_committed(wc, e)) > + wc->uncommitted_blocks--; > writecache_free_entry(wc, e); > } > > -- > 1.8.3.1 >
Re: crypto API and GFP_ATOMIC
On Wed, 10 Jun 2020, Herbert Xu wrote: > On Tue, Jun 09, 2020 at 01:11:05PM -0400, Mikulas Patocka wrote: > > > > Do you have another idea how to solve this problem? > > I think the better approach would be to modify the drivers to not > allocate any memory. In general, any memory needed by the driver > to fulfil a request *should* be allocated within the crypto request > object. That's why we have the reqsize field to indicate how much > memory could be needed per request. > > Thanks, Yes, fixing the drivers would be the best - but you can hardly find any person who has all the crypto hardware and who is willing to rewrite all the drivers for it. Another possibility - I was thinking about setting CRYPTO_TFM_REQ_MAY_SLEEP in dm-crypt and calling the crypto function under memalloc_noio_save. But there are some drivers that do GFP_ATOMIC allocation regardless of CRYPTO_TFM_REQ_MAY_SLEEP. Mikulas
crypto API and GFP_ATOMIC
Hi I've found out that a lot of hardware crypto drivers use GFP_ATOMIC. Some of them switch between GFP_ATOMIC and GFP_KERNEL based on the flag CRYPTO_TFM_REQ_MAY_SLEEP. dm-crypt and dm-integrity don't use CRYPTO_TFM_REQ_MAY_SLEEP (because GFP_KERNEL allocation requests can recurse back to the block device drivers and cause deadlocks). So, basically, the crypto requests submitted by dm-crypt and dm-integrity can fail anytime. I'd like to ask, how to handle these random -ENOMEM return codes. If we pass -ENOMEM back to the block device stack, it could cause random I/O errors and data corruption. The question is - if the crypto request returns -ENOMEM, could we sleep and retry it? I thought about it - the problem could be, if the crypto request proceeds halfway through and then returns -ENOMEM, and if we retried it, it would cause data corruption, because part of the data would be decrypted twice. Is it safe to assume that when we get -ENOMEM, the crypto driver didn't modify anything? Do you have another idea how to solve this problem? Mikulas
Re: [PATCH] dm writecache: reinitialize lru in writeback instead of endio
On Sat, 30 May 2020, Huaisheng Ye wrote: > From: Huaisheng Ye > > When wc_entry has been removed from wbl->list in writeback, it will > be not used again except waiting to be set free in writecache_free_entry. > > That is a little of annoying, it has to reinitialize lru of wc_entry > in endio before calling writecache_free_entry. > > Using list_del_init instead of list_del in writeback for simpler code. > > Signed-off-by: Huaisheng Ye This patch doesn't fix anything, so I think we don't need it. Actually, it's better to keep the list entry uninitialized, because it helps us catch bugs where this uninitialized list entry could be used improperly. Mikulas > --- > drivers/md/dm-writecache.c | 10 -- > 1 file changed, 4 insertions(+), 6 deletions(-) > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > index 7bbc21b..66f3a3b 100644 > --- a/drivers/md/dm-writecache.c > +++ b/drivers/md/dm-writecache.c > @@ -1519,7 +1519,6 @@ static void __writecache_endio_pmem(struct > dm_writecache *wc, struct list_head * > e = wb->wc_list[i]; > BUG_ON(!e->write_in_progress); > e->write_in_progress = false; > - INIT_LIST_HEAD(>lru); > if (!writecache_has_error(wc)) > writecache_free_entry(wc, e); > BUG_ON(!wc->writeback_size); > @@ -1555,7 +1554,6 @@ static void __writecache_endio_ssd(struct dm_writecache > *wc, struct list_head *l > do { > BUG_ON(!e->write_in_progress); > e->write_in_progress = false; > - INIT_LIST_HEAD(>lru); > if (!writecache_has_error(wc)) > writecache_free_entry(wc, e); > > @@ -1654,7 +1652,7 @@ static void __writecache_writeback_pmem(struct > dm_writecache *wc, struct writeba > while (wbl->size) { > wbl->size--; > e = container_of(wbl->list.prev, struct wc_entry, lru); > - list_del(>lru); > + list_del_init(>lru); > > max_pages = e->wc_list_contiguous; > > @@ -1685,7 +1683,7 @@ static void __writecache_writeback_pmem(struct > dm_writecache *wc, struct writeba > if (!wc_add_block(wb, f, GFP_NOWAIT | __GFP_NOWARN)) > break; > wbl->size--; > - 
list_del(>lru); > + list_del_init(>lru); > wb->wc_list[wb->wc_list_n++] = f; > e = f; > } > @@ -1712,7 +1710,7 @@ static void __writecache_writeback_ssd(struct > dm_writecache *wc, struct writebac > > wbl->size--; > e = container_of(wbl->list.prev, struct wc_entry, lru); > - list_del(>lru); > + list_del_init(>lru); > > n_sectors = e->wc_list_contiguous << (wc->block_size_bits - > SECTOR_SHIFT); > > @@ -1732,7 +1730,7 @@ static void __writecache_writeback_ssd(struct > dm_writecache *wc, struct writebac > wbl->size--; > f = container_of(wbl->list.prev, struct wc_entry, lru); > BUG_ON(f != e + 1); > - list_del(>lru); > + list_del_init(>lru); > e = f; > } > > -- > 1.8.3.1 >
Re: [PATCH] tty:vt: Add check the return value of kzalloc to avoid oops
On Thu, 19 Sep 2019, Greg KH wrote: > On Thu, Sep 19, 2019 at 05:18:15PM +0800, Xiaoming Ni wrote: > > Using kzalloc() to allocate memory in function con_init(), but not > > checking the return value, there is a risk of null pointer references > > oops. > > > > Signed-off-by: Xiaoming Ni > > We keep having this be "reported" :( > > > --- > > drivers/tty/vt/vt.c | 18 ++ > > 1 file changed, 18 insertions(+) > > > > diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c > > index 34aa39d..db83e52 100644 > > --- a/drivers/tty/vt/vt.c > > +++ b/drivers/tty/vt/vt.c > > @@ -3357,15 +3357,33 @@ static int __init con_init(void) > > > > for (currcons = 0; currcons < MIN_NR_CONSOLES; currcons++) { > > vc_cons[currcons].d = vc = kzalloc(sizeof(struct vc_data), > > GFP_NOWAIT); > > + if (unlikely(!vc)) { > > + pr_warn("%s:failed to allocate memory for the %u vc\n", > > + __func__, currcons); > > + break; > > + } > > At init, this really can not happen. Have you see it ever happen? > > > INIT_WORK(_cons[currcons].SAK_work, vc_SAK); > > tty_port_init(>port); > > visual_init(vc, currcons, 1); > > vc->vc_screenbuf = kzalloc(vc->vc_screenbuf_size, GFP_NOWAIT); > > + if (unlikely(!vc->vc_screenbuf)) { > > Never use likely/unlikely unless you can actually measure the speed > difference. For something like this, the compiler will always get it > right without you having to do anything. > > And again, how can this fail? Have you seen it fail? > > > + pr_warn("%s:failed to allocate memory for the %u > > vc_screenbuf\n", > > + __func__, currcons); > > + visual_deinit(vc); > > + tty_port_destroy(>port); > > + kfree(vc); > > + vc_cons[currcons].d = NULL; > > + break; > > + } > > vc_init(vc, vc->vc_rows, vc->vc_cols, > > currcons || !vc->vc_sw->con_save_screen); > > } > > currcons = fg_console = 0; > > master_display_fg = vc = vc_cons[currcons].d; > > + if (unlikely(!vc)) { > > Again, never use likely/unlikely unless you can measure it. 
> > thanks, > > greg k-h Why does it use GFP_NOWAIT and not GFP_KERNEL? Is there some problem with GFP_KERNEL during initialization? Mikulas
RE: [External] Re: [PATCH] dm writecache: skip writecache_wait for pmem mode
On Wed, 4 Sep 2019, Huaisheng HS1 Ye wrote: > > -Original Message- > > From: Mikulas Patocka > > Sent: Wednesday, September 4, 2019 4:49 PM > > On Mon, 2 Sep 2019, Huaisheng Ye wrote: > > > > > From: Huaisheng Ye > > > > > > The array bio_in_progress[2] only have chance to be increased and > > > decreased with ssd mode. For pmem mode, they are not involved at all. > > > So skip writecache_wait_for_ios in writecache_flush for pmem. > > > > > > Suggested-by: Doris Yu > > > Signed-off-by: Huaisheng Ye > > > --- > > > drivers/md/dm-writecache.c | 3 ++- > > > 1 file changed, 2 insertions(+), 1 deletion(-) > > > > > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c > > > index c481947..d06b8aa 100644 > > > --- a/drivers/md/dm-writecache.c > > > +++ b/drivers/md/dm-writecache.c > > > @@ -726,7 +726,8 @@ static void writecache_flush(struct dm_writecache *wc) > > > } > > > writecache_commit_flushed(wc); > > > > > > - writecache_wait_for_ios(wc, WRITE); > > > + if (!WC_MODE_PMEM(wc)) > > > + writecache_wait_for_ios(wc, WRITE); > > > > > > wc->seq_count++; > > > pmem_assign(sb(wc)->seq_count, cpu_to_le64(wc->seq_count)); > > > -- > > > 1.8.3.1 > > > > I think this is not needed - wait_event in writecache_wait_for_ios exits > > immediatelly if the condition is true. > > > > This code path is not so hot that we would need microoptimizations like > > this to > > avoid function calls. > > Hi Mikulas, > > Thanks for your reply, I see what you mean, but I can't agree with you. > > For pmem mode, this code path (writecache_flush) is much more hot than > SSD mode. Because in the code, the AUTOCOMMIT_BLOCKS_PMEM has been > defined to 64, which means if more than 64 blocks have been inserted to > cache device, also called uncommitted, writecache_flush would be called. > Otherwise, there is a timer callback function will be called every 1000 > milliseconds. 
>
> #define AUTOCOMMIT_BLOCKS_SSD		65536
> #define AUTOCOMMIT_BLOCKS_PMEM		64
> #define AUTOCOMMIT_MSEC			1000
>
> So when dm-writecache is running in working mode and continuous
> WRITE operations are being mapped to writecache_map, writecache_flush
> will be used much more often than in SSD mode.
>
> Cheers,
> Huaisheng Ye

So, you save one instruction cache line for every 64*4096 bytes written to persistent memory.

If you insist on it, I can acknowledge it, but I think it is really an over-optimization.

Acked-By: Mikulas Patocka

Mikulas
[PATCH 2/2] dm: make dm_table_find_target return NULL
Currently, if we pass a sector number that is too high to dm_table_find_target, it returns a zeroed dm_target structure and the callers test whether the structure is zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns NULL and its callers test the returned value for NULL. The macro dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an extra zeroed target.

Signed-off-by: Mikulas Patocka

---
 drivers/md/dm-ioctl.c |    2 +-
 drivers/md/dm-table.c |    8 +++-----
 drivers/md/dm.c       |    8 ++++----
 drivers/md/dm.h       |    5 -----
 4 files changed, 8 insertions(+), 15 deletions(-)

Index: linux-2.6/drivers/md/dm.c
===================================================================
--- linux-2.6.orig/drivers/md/dm.c	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm.c	2019-08-23 15:45:46.0 +0200
@@ -457,7 +457,7 @@ static int dm_blk_report_zones(struct ge
 		return -EIO;
 
 	tgt = dm_table_find_target(map, sector);
-	if (!dm_target_is_valid(tgt)) {
+	if (!tgt) {
 		ret = -EIO;
 		goto out;
 	}
@@ -1072,7 +1072,7 @@ static struct dm_target *dm_dax_get_live
 		return NULL;
 
 	ti = dm_table_find_target(map, sector);
-	if (!dm_target_is_valid(ti))
+	if (!ti)
 		return NULL;
 
 	return ti;
@@ -1572,7 +1572,7 @@ static int __split_and_process_non_flush
 	int r;
 
 	ti = dm_table_find_target(ci->map, ci->sector);
-	if (!dm_target_is_valid(ti))
+	if (!ti)
 		return -EIO;
 
 	if (__process_abnormal_io(ci, ti, &r))
@@ -1748,7 +1748,7 @@ static blk_qc_t dm_process_bio(struct ma
 	if (!ti) {
 		ti = dm_table_find_target(map, bio->bi_iter.bi_sector);
-		if (unlikely(!ti || !dm_target_is_valid(ti))) {
+		if (unlikely(!ti)) {
 			bio_io_error(bio);
 			return ret;
 		}
Index: linux-2.6/drivers/md/dm-ioctl.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-ioctl.c	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm-ioctl.c	2019-08-23 15:45:46.0 +0200
@@ -1592,7 +1592,7 @@ static int target_message(struct file *f
 	}
 
 	ti = dm_table_find_target(table, tmsg->sector);
-	if (!dm_target_is_valid(ti)) {
+	if (!ti) {
 		DMWARN("Target message sector outside device.");
 		r = -EINVAL;
 	} else if (ti->type->message)
Index: linux-2.6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-table.c	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm-table.c	2019-08-23 15:47:18.0 +0200
@@ -163,10 +163,8 @@ static int alloc_targets(struct dm_table
 
 	/*
 	 * Allocate both the target array and offset array at once.
-	 * Append an empty entry to catch sectors beyond the end of
-	 * the device.
 	 */
-	n_highs = (sector_t *) dm_vcalloc(num + 1, sizeof(struct dm_target) +
+	n_highs = (sector_t *) dm_vcalloc(num, sizeof(struct dm_target) +
 					  sizeof(sector_t));
 	if (!n_highs)
 		return -ENOMEM;
@@ -1359,7 +1357,7 @@ struct dm_target *dm_table_get_target(st
 /*
  * Search the btree for the correct target.
  *
- * Caller should check returned pointer with dm_target_is_valid()
+ * Caller should check returned pointer for NULL
  * to trap I/O beyond end of device.
  */
 struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
@@ -1368,7 +1366,7 @@ struct dm_target *dm_table_find_target(s
 	sector_t *node;
 
 	if (unlikely(sector >= dm_table_get_size(t)))
-		return &t->targets[t->num_targets];
+		return NULL;
 
 	for (l = 0; l < t->depth; l++) {
 		n = get_child(n, k);
Index: linux-2.6/drivers/md/dm.h
===================================================================
--- linux-2.6.orig/drivers/md/dm.h	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm.h	2019-08-23 15:45:46.0 +0200
@@ -86,11 +86,6 @@ struct target_type *dm_get_immutable_tar
 int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 
 /*
- * To check the return value from dm_table_find_target().
- */
-#define dm_target_is_valid(t) ((t)->table)
-
-/*
  * To check whether the target type is bio-based or not (request-based).
  */
 #define dm_target_bio_based(t) ((t)->type->map != NULL)
[PATCH 1/2] dm table: fix invalid memory accesses with too high sector number
If the sector number is too high, dm_table_find_target should return a pointer to a zeroed dm_target structure (the caller should test it with dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target that performs btree lookup will access out-of-bounds memory structures.

This patch fixes the bug by testing the sector number at the beginning of dm_table_find_target. We add the "inline" keyword to the function dm_table_get_size because this is a hot path.

Signed-off-by: Mikulas Patocka
Reported-by: Zhang Tao
Fixes: 512875bd9661 ("dm: table detect io beyond device")
Cc: sta...@vger.kernel.org

---
 drivers/md/dm-table.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/md/dm-table.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-table.c	2019-08-23 13:40:51.0 +0200
+++ linux-2.6/drivers/md/dm-table.c	2019-08-23 15:43:19.0 +0200
@@ -1342,7 +1342,7 @@ void dm_table_event(struct dm_table *t)
 }
 EXPORT_SYMBOL(dm_table_event);
 
-sector_t dm_table_get_size(struct dm_table *t)
+inline sector_t dm_table_get_size(struct dm_table *t)
 {
 	return t->num_targets ? (t->highs[t->num_targets - 1] + 1) : 0;
 }
@@ -1367,6 +1367,9 @@ struct dm_target *dm_table_find_target(s
 	unsigned int l, n = 0, k = 0;
 	sector_t *node;
 
+	if (unlikely(sector >= dm_table_get_size(t)))
+		return &t->targets[t->num_targets];
+
 	for (l = 0; l < t->depth; l++) {
 		n = get_child(n, k);
 		node = get_node(t, l, n);
Re: [dm-devel] [PATCH] dm table: fix a potential array out of bounds
Hi

I tested it and the bug is real - with some table sizes, dm_table_find_target will access memory out of bounds if the sector argument is beyond limit. Your patch fixes some of these cases, but not all of them.

I used this script to test all possible table sizes:

#!/bin/bash -e
sync
dmsetup remove_all || true
rmmod dm_mod || true
>t.txt
for i in `seq 1 1`; do
	echo $i
	echo $((i-1)) 1 error >>t.txt
	dmsetup create error

From: Zhang Tao
>
> allocate num + 1 for target and offset array, n_highs need num + 1
> elements, the last element will be used for node lookup in function
> dm_table_find_target.
>
> Signed-off-by: Zhang Tao
> ---
>  drivers/md/dm-table.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 7b6c3ee..fd7f604 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -160,20 +160,22 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
>  {
>  	sector_t *n_highs;
>  	struct dm_target *n_targets;
> +	unsigned int alloc_num;
>
>  	/*
>  	 * Allocate both the target array and offset array at once.
>  	 * Append an empty entry to catch sectors beyond the end of
>  	 * the device.
>  	 */
> -	n_highs = (sector_t *) dm_vcalloc(num + 1, sizeof(struct dm_target) +
> +	alloc_num = num + 1;
> +	n_highs = (sector_t *) dm_vcalloc(alloc_num, sizeof(struct dm_target) +
>  					  sizeof(sector_t));
>  	if (!n_highs)
>  		return -ENOMEM;
>
> -	n_targets = (struct dm_target *) (n_highs + num);
> +	n_targets = (struct dm_target *) (n_highs + alloc_num);
>
> -	memset(n_highs, -1, sizeof(*n_highs) * num);
> +	memset(n_highs, -1, sizeof(*n_highs) * alloc_num);
>  	vfree(t->highs);
>
>  	t->num_allocated = num;
> --
> 1.8.3.1
>
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
Re: [PATCH v3 0/5] Optimize writecache when using pmem as cache
On Thu, 31 Jan 2019, Huaisheng Ye wrote:

> From: Huaisheng Ye
>
> This patch set could be used for dm-writecache when use persistent
> memory as cache data device.
>
> Patch 1 and 2 go towards removing unused parameter and codes which
> actually doesn't really work.

I agree that there are some unused variables and code, but I would let it be as it is.

The processors can write data to persistent memory either by using non-temporal stores (the movnti instruction) or by normal stores followed by the clwb instruction. Currently, the movnti instruction is faster - however, it may be possible that with some newer processors, the clwb instruction could be faster - and in that case, we need the code that you have removed.

I would like to keep both flush strategies around (movnti and clwb), so that we can test how they perform on various processors. Unfortunately, some upstream developers hate code with #ifdefs :-(

Note that compiler optimizations already remove the unused parameter and the impossible code behind "if (WC_MODE_PMEM(wc)) if (!WC_MODE_PMEM(wc))".

> Patch 3 and 4 are targeted at solving problem fn ctr failed to work
> due to invalid magic or version, which is caused by the super block
> of pmem has messy data stored.

LVM zeros the beginning of new logical volumes, so there should be no problem with it. If the user wants to use the writecache target without LVM, he should zero the superblock with
dd if=/dev/zero of=/dev/pmem0 bs=4k count=1

Note that other device mapper targets also follow this policy - for example see drivers/md/dm-snap-persistent.c:

	if (le32_to_cpu(dh->magic) == 0) {
		*new_snapshot = 1;
		return 0;
	}

	if (le32_to_cpu(dh->magic) != SNAP_MAGIC) {
		DMWARN("Invalid or corrupt snapshot");
		r = -ENXIO;
		goto bad;
	}

So, I think there is no need for these patches - dm-writecache just does what other targets do.

> Patch 5 is used for getting the status of seq_count.

It may be accepted if other LVM team members find some use for this value.
Mikulas > Changes Since v2: > - seq_count is important for flush operations, output it within status > for debugging and analyzing code behavior. > [1]: https://lkml.org/lkml/2019/1/3/43 > [2]: https://lkml.org/lkml/2019/1/9/6 > > Huaisheng Ye (5): > dm-writecache: remove unused size to writecache_flush_region > dm-writecache: get rid of memory_data flush to writecache_flush_entry > dm-writecache: expand pmem_reinit for struct dm_writecache > Documentation/device-mapper: add optional parameter reinit > dm-writecache: output seq_count within status > > Documentation/device-mapper/writecache.txt | 4 > drivers/md/dm-writecache.c | 23 +-- > 2 files changed, 17 insertions(+), 10 deletions(-) > > -- > 1.8.3.1 > >
Re: Sleeping from invalid context in udlfb
On Thu, 2 Aug 2018, David Airlie wrote: > > I'm pretty sure udlkms handles this already. > > Dave. But it crashes on unplug :-) Mikulas > On Wed, Aug 1, 2018 at 11:34 PM, Mikulas Patocka wrote: > > > On Wed, 1 Aug 2018, Geert Uytterhoeven wrote: > > > Hi Mikulas, > > > > On Wed, Aug 1, 2018 at 12:59 PM Mikulas Patocka > wrote: > > > On Wed, 1 Aug 2018, Geert Uytterhoeven wrote: > > > > On Tue, Jul 31, 2018 at 5:23 PM Mikulas Patocka > wrote: > > > > > BTW when using the udlfb driver as a console, I've got this > warning. > > > > > vt_console_print takes a spinlock and then calls the > framebuffer driver > > > > > that sleeps. > > > > > > > > > > The question is - whose fault is this? Could the console code > somehow be > > > > > told to print characters without holding a spinlock? Or does it > mean that > > > > > framebuffer drivers can't sleep? > > > > > > > > > > udlfb communicates through USB, so the sleeping is inevitable. > > > > > > > > > > Mikulas > > > > > > > > > > > > > > > BUG: sleeping function called from invalid context at > mm/slab.h:421 > > > > > in_atomic(): 1, irqs_disabled(): 0, pid: 430, name: kworker/2:3 > > > > > 6 locks held by kworker/2:3/430: > > > > > #0: 1301127e ( (wq_completion)"events"){} , at: > process_one_work+0x17c/0x3a8 > > > > > #1: beacc951 ( > (work_completion)(&(>init_framebuffer_work)->work)){} , at: > process_one_work+0x17c/0x3a8 > > > > > #2: a402f826 ( registration_lock){} , at: > register_framebuffer+0x28/0x2c0 [fb] > > > > > #3: 21cbe902 ( console_lock){} , at: > register_framebuffer+0x258/0x2c0 [fb] > > > > > #4: 96d51735 ( console_owner){} , at: > console_unlock+0x174/0x500 > > > > > #5: faa7f206 ( printing_lock){} , at: > vt_console_print+0x60/0x3a0 > > > > > Preemption disabled at: [] > vt_console_print+0x60/0x3a0 > > > > > CPU: 2 PID: 430 Comm: kworker/2:3 Not tainted 4.17.10-debug #3 > > > > > Hardware name: Marvell Armada 8040 MacchiatoBin/Armada 8040 > MacchiatoBin, BIOS EDK II Jul 30 2018 > > > > > Workqueue: 
events dlfb_init_framebuffer_work [udlfb] > > > > > Call trace: > > > > > dump_backtrace+0x0/0x150 > > > > > show_stack+0x14/0x20 > > > > > dump_stack+0x8c/0xac > > > > > ___might_sleep+0x140/0x170 > > > > > __might_sleep+0x50/0x88 > > > > > __kmalloc+0x1b0/0x270 > > > > > xhci_urb_enqueue+0xa8/0x460 [xhci_hcd] > > > > > usb_hcd_submit_urb+0xc0/0x998 [usbcore] > > > > > usb_submit_urb+0x1e0/0x518 [usbcore] > > > > > dlfb_submit_urb+0x38/0x98 [udlfb] > > > > > dlfb_handle_damage.isra.4+0x1e0/0x210 [udlfb] > > > > > dlfb_ops_imageblit+0x28/0x38 [udlfb] > > > > > soft_cursor+0x15c/0x1d8 [fb] > > > > > bit_cursor+0x324/0x510 [fb] > > > > > fbcon_cursor+0x144/0x1a0 [fb] > > > > > hide_cursor+0x38/0xa0 > > > > > vt_console_print+0x334/0x3a0 > > > > > console_unlock+0x274/0x500 > > > > > register_framebuffer+0x22c/0x2c0 [fb] > > > > > dlfb_init_framebuffer_work+0x1ec/0x2fc [udlfb] > > > > > process_one_work+0x1e8/0x3a8 > > > > > worker_thread+0x44/0x418 > > > > > kthread+0x11c/0x120 > > > > > ret_from_fork+0x10/0x18 > > > > > > > > This is sort of expected: you cannot do USB transfers from > printk(). > > > > > > > > Gr{oetje,eeting}s, > > > > > > > > Geert > > > > > > So, should there be a framebuffer flag that prevents the console > from > > > binding to it? > > > > > > If I start the kernel with "console=ttyS0,115200", it doesn't try > to bind > > > to the udlfb driver, but if I start it without this flag, it does > and > > > crashes :-( > > > > Your frame buffer driver should offload tasks that may sleep to e.g. a > > workqueue. > > > > Gr{oetje,eeting}s, > > > > Geert > > I can try to do this - but - taking a spinlock and copying 8MB framebuffer > would damage scheduling latency even for PCI framebuffer drivers. > > Mikulas > > > >
Re: [LKP] d50d82faa0 [ 33.671845] WARNING: possible circular locking dependency detected
On Wed, 7 Nov 2018, Andrew Morton wrote: > On Wed, 7 Nov 2018 15:43:36 -0800 Andrew Morton > wrote: > > > On Tue, 23 Oct 2018 08:30:04 +0800 kernel test robot > > wrote: > > > > > Greetings, > > > > > > 0day kernel testing robot got the below dmesg and the first bad commit is > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master > > > > > > commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0 > > > Author: Mikulas Patocka > > > AuthorDate: Wed Jun 27 23:26:09 2018 -0700 > > > Commit: Linus Torvalds > > > CommitDate: Thu Jun 28 11:16:44 2018 -0700 > > > > > > slub: fix failure when we delete and create a slab cache > > > > This is ugly. Is there an alternative way of fixing the race which > > Mikulas attempted to address? Possibly cancel the work and reuse the > > existing sysfs file, or is that too stupid to live? > > > > 3b7b314053d021 ("slub: make sysfs file removal asynchronous") was > > pretty lame, really. As mentioned, > > > > : It'd be the cleanest to deal with the issue by removing sysfs files > > : without holding slab_mutex before the rest of shutdown; however, given > > : the current code structure, it is pretty difficult to do so. > > > > Would be a preferable approach. > > > > > > > > This uncovered a bug in the slub subsystem - if we delete a cache and > > > immediatelly create another cache with the same attributes, it fails > > > because of duplicate filename in /sys/kernel/slab/. The slub > > > subsystem > > > offloads freeing the cache to a workqueue - and if we create the new > > > cache before the workqueue runs, it complains because of duplicate > > > filename in sysfs. > > Alternatively, could we flush the workqueue before attempting to > (re)create the sysfs file? What if someone creates the slab cache from the workqueue? > Extra points for only doing this if the > first (re)creation attempt returned -EEXIST? If it returns -EEXIST, it has already written the warning to the log. Mikulas
[PATCH] vt: fix broken display when running aptitude
If you run aptitude on the framebuffer console, the display is corrupted.

The corruption is caused by the commit d8ae7242. The patch adds "offset" to "start" when calling scr_memsetw, but it forgets to do the same addition on a subsequent call to do_update_region.

Signed-off-by: Mikulas Patocka
Fixes: d8ae72427187 ("vt: preserve unicode values corresponding to screen characters")
Cc: sta...@vger.kernel.org	# 4.19

---
 drivers/tty/vt/vt.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/tty/vt/vt.c
===================================================================
--- linux-2.6.orig/drivers/tty/vt/vt.c	2018-10-23 16:15:08.0 +0200
+++ linux-2.6/drivers/tty/vt/vt.c	2018-10-23 16:29:20.0 +0200
@@ -1551,7 +1551,7 @@ static void csi_K(struct vc_data *vc, in
 	scr_memsetw(start + offset, vc->vc_video_erase_char, 2 * count);
 	vc->vc_need_wrap = 0;
 	if (con_should_update(vc))
-		do_update_region(vc, (unsigned long) start, count);
+		do_update_region(vc, (unsigned long)(start + offset), count);
 }
 
 static void csi_X(struct vc_data *vc, int vpar) /* erase the following vpar positions */
Re: [PATCHv5 0/7] tty: Hold write ldisc sem in tty_reopen()
On Wed, 19 Sep 2018, Dmitry Safonov wrote: > On Wed, 2018-09-19 at 16:03 -0400, Mikulas Patocka wrote: > > > > On Wed, 19 Sep 2018, Dmitry Safonov wrote: > > > Thanks much for the testing, Mikulas. > > > Could you try to bisect which of the patches causes it? > > > The most important are 1,2,3 - probably, one of them caused it.. > > > I'll stare a bit into the code. > > > > The patch 3 causes it. > > > > The hangs during reboot take either 80 seconds, 3 minutes or 3:25. > > Thanks much for that again. > Any chance you have sysrq enabled and could print: locks held in system > and kernel backtraces for applications? Stacktrace doesn't work on parisc. It used to work on 4.18, but the kernel 4.19 reportedly requires patched binutils. Mikulas > I suppose, userspace hangs in n_tty_receive_buf_common(), which is > better of cause than null-ptr dereference, but I believe we can improve > it by stopping a reader earlier if there is any writer pending. > > -- > Thanks, > Dmitry >