Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Mikulas Patocka



On Wed, 7 Apr 2021, Linus Torvalds wrote:

> On Wed, Apr 7, 2021 at 9:33 AM Suren Baghdasaryan  wrote:
> >
> > Trying my hand at backporting the patchsets Peter mentioned proved
> > this to be far from easy with many dependencies. Let me look into
> > Vlastimil's suggestion to backport only 17839856fd58 and it sounds
> > like 5.4 already followed that path.
> 
> Well, in many ways 17839856fd58 was the "simple and obvious" fix, and
> I do think it's easily backportable.
> 
> But it *did* cause problems too. Those problems may not be issues on
> those old kernels, though.
> 
> In particular, commit 17839856fd58 caused uffd-wp to stop working
> right, and it caused some issues with debugging (I forget the exact
> details, but I think it was strace accessing PROT_NONE or write-only
> pages or something like that, and COW failed).
> 
> But yes, in many ways that commit is a much simpler and more
> straightforward one (which is why I tried it once - we ended up with
> the much more subtle and far-reaching fixes after the UFFD issues
> crept up).
> 
> The issues that 17839856fd58 caused may be entire non-events in old
> kernels. In fact, the uffd writeprotect API was added fairly recently
> (see commit 63b2d4174c4a that made it into v5.7), so the uffd-wp issue
> that was triggered probably cannot happen in the old kernels.
> 
> The strace issue might not be relevant either, but I forget what the
> details were. Mikulas should know.
> 
> See
> 
>   
> https://lore.kernel.org/lkml/alpine.lrh.2.02.2009031328040.6...@file01.intranet.prod.int.rdu2.redhat.com/
> 
> for Mikulas' report. I never looked into it in detail, because by then
> the uffd-wp issue had already come up, so it was just another nail in
> the coffin for that simpler approach.
> 
> Mikulas, do you remember?
> 
> Linus

Hi

I think that we never found the root cause of this bug. I was testing 
whether the whole system can run from persistent memory and found out that 
strace didn't work. I bisected it, reported it, and when I received Peter 
Xu's patches (which fixed it), I stopped investigating it.

So, we fixed it, but we don't know why.

Peter Xu's patchset that fixed it is here: 
https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/

Mikulas



[RFC v3] nvfs: a filesystem for persistent memory

2021-02-15 Thread Mikulas Patocka
Hi

I announce a new version of NVFS - a filesystem for persistent memory.
http://people.redhat.com/~mpatocka/nvfs/
git://leontynka.twibright.com/nvfs.git

Changes since the last release:

I reworked file read/write handling:

* the functions nvfs_read and nvfs_write were deleted because it's 
  unlikely that the upstream kernel will allow them.

* the functions nvfs_read_iter and nvfs_write_iter have a fast path if 
  there is just one segment in iov_iter - they will call nvfs_read_locked 
  and nvfs_write_locked directly. This improves performance by 3% on the 
  read path and 1% on the write path.

* read_iter_locked uses copy_to_iter as suggested by Al Viro.

* write_iter_locked doesn't use copy_from_iter_flushcache, because we need 
  copy that doesn't advance the iter (the "copy" and "advance" must be two 
  separate operations). So, I added new operations "iov_iter_map" and 
  "iov_iter_unmap" - iov_iter_map will map the first segment of iov and 
  iov_iter_unmap will unmap it.

Do you think that introducing "iov_iter_map" and "iov_iter_unmap" is 
appropriate? Do you have some other idea how to handle it?
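For reference, here is a userspace sketch of what such a pair of primitives could do. The names iov_iter_map/iov_iter_unmap and their semantics are only proposed in this mail, so everything below is a hypothetical model, not kernel code:

```c
#include <stddef.h>

/* Minimal userspace model of an iovec-backed iterator; the kernel's
 * struct iov_iter carries more state than this. */
struct seg { char *base; size_t len; };
struct iter {
	struct seg *segs;
	size_t nr_segs;
	size_t seg_off;		/* offset into segs[0] */
};

/* Proposed iov_iter_map(): expose the first segment without advancing. */
static void *iter_map(struct iter *it, size_t *len)
{
	if (!it->nr_segs)
		return NULL;
	*len = it->segs[0].len - it->seg_off;
	return it->segs[0].base + it->seg_off;
}

/* Proposed iov_iter_unmap(): a no-op here; the kernel would unmap. */
static void iter_unmap(struct iter *it, void *addr)
{
	(void)it; (void)addr;
}

/* Advancing stays a separate operation - the point of the proposal:
 * copy into the mapping first, advance only by what actually succeeded. */
static void iter_advance(struct iter *it, size_t bytes)
{
	while (bytes && it->nr_segs) {
		size_t avail = it->segs[0].len - it->seg_off;
		size_t step = bytes < avail ? bytes : avail;
		it->seg_off += step;
		bytes -= step;
		if (it->seg_off == it->segs[0].len) {
			it->segs++;
			it->nr_segs--;
			it->seg_off = 0;
		}
	}
}
```

The write path would then map a segment, do the flushcache copy into it, unmap, and advance by the number of bytes the copy reported - keeping "copy" and "advance" decoupled as described above.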

Mikulas



Re: Expense of read_iter

2021-01-21 Thread Mikulas Patocka



On Thu, 21 Jan 2021, Matthew Wilcox wrote:

> On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> > Do you have some idea how to optimize the generic code that calls 
> > ->read_iter?
> 
> Yes.
> 
> It might be better to maintain an f_iocb_flags in the
> struct file and just copy that unconditionally.  We'd need to remember
> to update it in fcntl(F_SETFL), but I think that's the only place.
> 
> Want to give that a try?

Yes - send me the patch and I'll benchmark it.
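The suggestion quoted above amounts to caching the f_flags-to-IOCB translation in the file structure instead of redoing it per read. A userspace model of the idea (flag values and names here are simplified stand-ins, not the kernel's):

```c
/* Simplified flag bits; the real kernel values differ. */
#define O_APPEND_M	0x1
#define O_DIRECT_M	0x2
#define IOCB_APPEND_M	0x10
#define IOCB_DIRECT_M	0x20

struct file_model {
	int f_flags;
	int f_iocb_flags;	/* cached translation - the suggested field */
};

/* The per-call translation the sync-read wrapper currently redoes. */
static int iocb_flags(int f_flags)
{
	int res = 0;
	if (f_flags & O_APPEND_M)
		res |= IOCB_APPEND_M;
	if (f_flags & O_DIRECT_M)
		res |= IOCB_DIRECT_M;
	return res;
}

/* fcntl(F_SETFL) is the one place f_flags changes, so the cache is
 * refreshed there and nowhere else. */
static void setfl(struct file_model *f, int new_flags)
{
	f->f_flags = new_flags;
	f->f_iocb_flags = iocb_flags(new_flags);
}

/* The read path then copies the cached value unconditionally. */
static int init_kiocb_flags(const struct file_model *f)
{
	return f->f_iocb_flags;
}
```

The saving is that the hot path becomes a single load instead of a chain of flag tests, at the cost of one extra field per open file.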

Mikulas



Re: Expense of read_iter

2021-01-20 Thread Mikulas Patocka



On Wed, 20 Jan 2021, Jan Kara wrote:

> Yeah, I agree. I'm against ext4 private solution for this read problem. And
> I'm also against duplicating ->read_iter functionality in ->read handler.
> The maintenance burden of this code duplication is IMHO just too big. We
> rather need to improve the generic code so that the fast path is faster.
> And every filesystem will benefit because this is not ext4 specific
> problem.
> 
>   Honza

Do you have some idea how to optimize the generic code that calls 
->read_iter?

vfs_read calls ->read if it is present. If not, it calls new_sync_read. 
new_sync_read's frame size is 128 bytes - it holds the structures iovec, 
kiocb and iov_iter. new_sync_read calls ->read_iter.

I have found out that the cost of calling new_sync_read is 3.3%; Zhongwei 
found out 3.9%. (The benchmark repeatedly reads the same 4k page.)

I don't see any way how to optimize new_sync_read or how to reduce its 
frame size. Do you?
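For illustration, a userspace model of the wrapper's shape - the per-call packaging of iovec, kiocb and iov_iter that a plain ->read method skips. The structures here are cut-down stand-ins, much smaller than the real kernel ones:

```c
#include <stddef.h>
#include <string.h>

/* Userspace stand-ins for the three structures new_sync_read keeps on
 * its stack frame; the real ones are larger and kernel-internal. */
struct kiocb_m { void *filp; long long ki_pos; int ki_flags; };
struct iovec_m { void *iov_base; size_t iov_len; };
struct iter_m { struct iovec_m *iov; size_t nr_segs; size_t count; };

typedef long (*read_iter_fn)(struct kiocb_m *, struct iter_m *);

/* Mirrors the shape of new_sync_read(): build all three structures on
 * the stack for every call, then hand them to ->read_iter. */
static long model_new_sync_read(void *filp, read_iter_fn read_iter,
				char *buf, size_t len, long long *ppos)
{
	struct iovec_m iov = { buf, len };
	struct kiocb_m kiocb = { filp, *ppos, 0 };
	struct iter_m iter = { &iov, 1, len };
	long ret = read_iter(&kiocb, &iter);

	if (ret > 0)
		*ppos = kiocb.ki_pos;
	return ret;
}

/* A trivial ->read_iter for demonstration: fill the buffer with 'x'. */
static long demo_read_iter(struct kiocb_m *kiocb, struct iter_m *iter)
{
	memset(iter->iov[0].iov_base, 'x', iter->count);
	kiocb->ki_pos += (long long)iter->count;
	return (long)iter->count;
}
```

Even in this toy version, every read pays for initializing three structures and copying the position back out - the overhead being discussed above.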

Mikulas



Re: Expense of read_iter

2021-01-13 Thread Mikulas Patocka



On Tue, 12 Jan 2021, Zhongwei Cai wrote:

> 
> I'm working with Mingkai on optimizations for Ext4-dax.

What specific patch are you working on? Please, post it somewhere.

> We think that optimizing the read-iter method cannot achieve the
> same performance as the read method for Ext4-dax. 
> We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
> results are listed below:
> 
> Overall time of 2^26 4KB read.
> 
> Method   Time
> read 26.782s
> read-iter36.477s

What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
- detect in the "read_iter" method that there is just one segment and 
treat it like a "read" method. I think that it should improve performance 
for your case.
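The trick referenced above can be sketched in userspace like this - the function names are illustrative, not the actual ext4 or nvfs entry points:

```c
#include <stddef.h>
#include <string.h>

struct iovec_m { char *iov_base; size_t iov_len; };
struct iter_m {
	struct iovec_m *iov;
	size_t nr_segs;
	size_t iov_offset;
	size_t count;
};

static int fast_path_hits;

/* The byte-based path a ->read method would use: no iterator walk. */
static long read_locked(char *dst, size_t len)
{
	memset(dst, 'd', len);	/* stands in for the copy from pmem */
	fast_path_hits++;
	return (long)len;
}

/* The generic path that walks the iterator segment by segment. */
static long read_iter_generic(struct iter_m *it)
{
	size_t done = 0, i;

	for (i = 0; i < it->nr_segs; i++) {
		memset(it->iov[i].iov_base, 'd', it->iov[i].iov_len);
		done += it->iov[i].iov_len;
	}
	return (long)done;
}

/* The trick: one plain segment degenerates to the cheap byte path. */
static long model_read_iter(struct iter_m *it)
{
	if (it->nr_segs == 1 && it->iov_offset == 0)
		return read_locked(it->iov[0].iov_base, it->count);
	return read_iter_generic(it);
}
```

Since read() and pread() always arrive as a single segment, the common case takes the cheap branch while readv() and friends still work through the generic path.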

Mikulas



Re: Expense of read_iter

2021-01-11 Thread Mikulas Patocka



On Mon, 11 Jan 2021, Matthew Wilcox wrote:

> On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> > I put counters into vfs_read and vfs_readv.
> > 
> > After a fresh boot of the virtual machine, the counters show "13385 4". 
> > After a kernel compilation they show "4475220 8".
> > 
> > So, the readv path is almost unused.
> > 
> > My reasoning was that we should optimize for the "read" path and glue the 
> > "readv" path on the top of that. Currently, the kernel is doing the 
> > opposite - optimizing for "readv" and glueing "read" on the top of it.
> 
> But it's not about optimising for read vs readv.  read_iter handles
> a host of other cases, such as pread(), preadv(), AIO reads, splice,
> and reads to in-kernel buffers.

These things are used rarely compared to "read" and "pread". (BTW. "pread" 
could be handled by the read method too).

What's the reason why do you think that the "read" syscall should use the 
"read_iter" code path? Is it because duplicating the logic is discouraged? 
Or because of code size? Or something else?

> Some device drivers abused read() vs readv() to actually return different 
> information, depending which you called.  That's why there's now a
> prohibition against both.
> 
> So let's figure out how to make iter_read() perform well for sys_read().

I've got another idea - in nvfs_read_iter, test if the iovec contains just 
one entry and call nvfs_read_locked if it does.

diff --git a/file.c b/file.c
index f4b8a1a..e4d87b2 100644
--- a/file.c
+++ b/file.c
@@ -460,6 +460,10 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 	if (!IS_DAX(&nmi->vfs_inode)) {
 		r = generic_file_read_iter(iocb, iov);
 	} else {
+		if (likely(iter_is_iovec(iov)) && likely(!iov->iov_offset) && likely(iov->nr_segs == 1)) {
+			r = nvfs_read_locked(nmi, iocb->ki_filp, iov->iov->iov_base, iov->count, true, &iocb->ki_pos);
+			goto unlock_ret;
+		}
 #if 1
 		r = nvfs_rw_iter_locked(iocb, iov, false);
 #else
@@ -467,6 +471,7 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 #endif
 	}
 
+unlock_ret:
 	inode_unlock_shared(&nmi->vfs_inode);
 
 	return r;



The result is:

nvfs_read_iter  - 7.307s
Al Viro's read_iter_locked  - 7.147s
test for just one entry - 7.010s
the read method - 6.782s

So far, this is the best way how to do it, but it's still 3.3% worse than 
the read method. There's not anything more that could be optimized on the 
filesystem level - the rest of optimizations must be done in the VFS.

Mikulas



RE: [RFC v2] nvfs: a filesystem for persistent memory

2021-01-11 Thread Mikulas Patocka



On Mon, 11 Jan 2021, David Laight wrote:

> From: Al Viro  On Behalf Of Al Viro
> > Sent: 10 January 2021 16:20
> > 
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > Hi
> > >
> > > I announce a new version of NVFS - a filesystem for persistent memory.
> > >   http://people.redhat.com/~mpatocka/nvfs/
> > Utilities, AFAICS
> > 
> > >   git://leontynka.twibright.com/nvfs.git
> > Seems to hang on git pull at the moment...  Do you have it anywhere else?
> > 
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> > 
> > Apples and oranges...  What happens if you take
> > 
> > ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t 
> > *ppos)
> > {
> > struct inode *inode = file_inode(file);
> > struct nvfs_memory_inode *nmi = i_to_nmi(inode);
> > struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
> > ssize_t total = 0;
> > loff_t pos = *ppos;
> > int r;
> > int shift = nvs->log2_page_size;
> > size_t i_size;
> > 
> > i_size = inode->i_size;
> > if (pos >= i_size)
> > return 0;
> > iov_iter_truncate(to, i_size - pos);
> > 
> > while (iov_iter_count(to)) {
> > void *blk, *ptr;
> > size_t page_mask = (1UL << shift) - 1;
> > unsigned page_offset = pos & page_mask;
> > unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
> > unsigned size;
> > 
> > blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
> > if (unlikely(IS_ERR(blk))) {
> > r = PTR_ERR(blk);
> > goto ret_r;
> > }
> > size = ((size_t)prealloc << shift) - page_offset;
> > ptr = blk + page_offset;
> > if (unlikely(!blk)) {
> > size = min(size, (unsigned)PAGE_SIZE);
> > ptr = empty_zero_page;
> > }
> > size = copy_to_iter(to, ptr, size);
> > if (unlikely(!size)) {
> > r = -EFAULT;
> > goto ret_r;
> > }
> > 
> > pos += size;
> > total += size;
> > } while (iov_iter_count(to));
> 
> That isn't the best formed loop!
> 
>   David

I removed the second "while" statement and fixed the arguments to 
copy_to_iter - other than that, Al's function works.

Mikulas



Re: [RFC v2] nvfs: a filesystem for persistent memory

2021-01-11 Thread Mikulas Patocka



On Sun, 10 Jan 2021, Al Viro wrote:

> On Sun, Jan 10, 2021 at 04:14:55PM -0500, Mikulas Patocka wrote:
> 
> > That's a good point. I split nvfs_rw_iter to separate functions 
> > nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into 
> > both of them. It improved performance by 1.3%.
> > 
> > > Not that it had been more useful on the write side, really,
> > > but that's another story (nvfs_write_pages() handling of
> > > copyin is... interesting).  Let's figure out what's going
> > > on with the read overhead first...
> > > 
> > > lib/iov_iter.c primitives certainly could use massage for
> > > better code generation, but let's find out how much of the
> > > PITA is due to those and how much comes from you fighting
> > > the damn thing instead of using it sanely...
> > 
> > The results are:
> > 
> > read:   6.744s
> > read_iter:  7.417s
> > read_iter - separate read and write path:   7.321s
> > Al's read_iter: 7.182s
> > Al's read_iter with _copy_to_iter:  7.181s
> 
> So
>   * overhead of hardening stuff is noise here
>   * switching to more straightforward ->read_iter() cuts
> the overhead by about 1/3.
> 
>   Interesting...  I wonder how much of that is spent in
> iterate_and_advance() glue inside copy_to_iter() here.  There's
> certainly quite a bit of optimizations possible in those
> primitives and your usecase makes a decent test for that...
> 
>   Could you profile that and see where is it spending
> the time, on instruction level?

This is the read method profile:

time9.056s
52.69%  pread[kernel.vmlinux]  [k] copy_user_generic_string
 6.24%  pread[kernel.vmlinux]  [k] current_time
 6.22%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64
 4.88%  preadlibc-2.31.so  [.] __libc_pread
 3.75%  pread[kernel.vmlinux]  [k] syscall_return_via_sysret
 3.63%  pread[nvfs][k] nvfs_read
 2.83%  pread[nvfs][k] nvfs_bmap
 2.81%  pread[kernel.vmlinux]  [k] vfs_read
 2.63%  pread[kernel.vmlinux]  [k] __x64_sys_pread64
 2.27%  pread[kernel.vmlinux]  [k] __fsnotify_parent
 2.19%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
 1.55%  pread[kernel.vmlinux]  [k] atime_needs_update
 1.17%  pread[kernel.vmlinux]  [k] syscall_enter_from_user_mode
 1.15%  pread[kernel.vmlinux]  [k] touch_atime
 0.84%  pread[kernel.vmlinux]  [k] down_read
 0.82%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode
 0.71%  pread[kernel.vmlinux]  [k] do_syscall_64
 0.68%  pread[kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
 0.66%  pread[kernel.vmlinux]  [k] __fget_light
 0.53%  pread[kernel.vmlinux]  [k] exit_to_user_mode_prepare
 0.45%  pread[kernel.vmlinux]  [k] up_read
 0.44%  preadpread [.] main
 0.44%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
 0.26%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
 0.12%  preadpread [.] pread@plt
 0.07%  pread[kernel.vmlinux]  [k] __fdget
 0.00%  perf [kernel.vmlinux]  [k] x86_pmu_enable_all


This is the profile of "read_iter - separate read and write path":

time10.058s
53.05%  pread[kernel.vmlinux]  [k] copy_user_generic_string
 6.82%  pread[kernel.vmlinux]  [k] current_time
 6.27%  pread[nvfs][k] nvfs_read_iter
 4.70%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64
 3.20%  preadlibc-2.31.so  [.] __libc_pread
 2.77%  pread[kernel.vmlinux]  [k] syscall_return_via_sysret
 2.31%  pread[kernel.vmlinux]  [k] vfs_read
 2.15%  pread[kernel.vmlinux]  [k] new_sync_read
 2.06%  pread[kernel.vmlinux]  [k] __fsnotify_parent
 2.02%  pread[nvfs][k] nvfs_bmap
 1.87%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
 1.86%  pread[kernel.vmlinux]  [k] iov_iter_advance
 1.62%  pread[kernel.vmlinux]  [k] __x64_sys_pread64
 1.40%  pread[kernel.vmlinux]  [k] atime_needs_update
 0.99%  pread[kernel.vmlinux]  [k] syscall_enter_from_user_mode
 0.85%  pread[kernel.vmlinux]  [k] touch_atime
 0.85%  pread[kernel.vmlinux]  [k] down_read
 0.84%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode
 0.78%  pread[kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
 0.65%  pread[kernel.vmlinux]  [k] __fget_light
 0.57%  pread[kernel.vmlinux]  [k] exit_to_user_mode_prepare
 0.53%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
 0.45%  pread  

Re: Expense of read_iter

2021-01-10 Thread Mikulas Patocka



On Sun, 10 Jan 2021, Matthew Wilcox wrote:

> > That is the reason for that 10% degradation with read_iter.
> 
> You seem to be focusing on your argument for "let's just permit
> filesystems to implement both ->read and ->read_iter".  My suggestion
> is that we need to optimise the ->read_iter path, but to do that we need
> to know what's expensive.
> 
> nvfs_rw_iter_locked() looks very complicated.  I suspect it can
> be simplified.

I split it to a separate read and write function and it improved 
performance by 1.3%. Using Al Viro's read_iter improves performance by 3%.

> Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

I put counters into vfs_read and vfs_readv.

After a fresh boot of the virtual machine, the counters show "13385 4". 
After a kernel compilation they show "4475220 8".

So, the readv path is almost unused.

My reasoning was that we should optimize for the "read" path and glue the 
"readv" path on the top of that. Currently, the kernel is doing the 
opposite - optimizing for "readv" and glueing "read" on the top of it.

Mikulas



Re: [RFC v2] nvfs: a filesystem for persistent memory

2021-01-10 Thread Mikulas Patocka



On Sun, 10 Jan 2021, Al Viro wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > Hi
> > 
> > I announce a new version of NVFS - a filesystem for persistent memory.
> > http://people.redhat.com/~mpatocka/nvfs/
> Utilities, AFAICS
> 
> > git://leontynka.twibright.com/nvfs.git
> Seems to hang on git pull at the moment...  Do you have it anywhere else?

I saw some errors 'git-daemon: fatal: the remote end hung up unexpectedly' 
in syslog. I don't know what's causing them.

> > I found out that on NVFS, reading a file with the read method has 10% 
> > better performance than the read_iter method. The benchmark just reads the 
> > same 4k page over and over again - and the cost of creating and parsing 
> > the kiocb and iov_iter structures is just that high.
> 
> Apples and oranges...  What happens if you take
> 
> ssize_t read_iter_locked(struct file *file, struct iov_iter *to, loff_t *ppos)
> {
>   struct inode *inode = file_inode(file);
>   struct nvfs_memory_inode *nmi = i_to_nmi(inode);
>   struct nvfs_superblock *nvs = inode->i_sb->s_fs_info;
>   ssize_t total = 0;
>   loff_t pos = *ppos;
>   int r;
>   int shift = nvs->log2_page_size;
>   size_t i_size;
> 
>   i_size = inode->i_size;
>   if (pos >= i_size)
>   return 0;
>   iov_iter_truncate(to, i_size - pos);
> 
>   while (iov_iter_count(to)) {
>   void *blk, *ptr;
>   size_t page_mask = (1UL << shift) - 1;
>   unsigned page_offset = pos & page_mask;
>   unsigned prealloc = (iov_iter_count(to) + page_mask) >> shift;
>   unsigned size;
> 
>   blk = nvfs_bmap(nmi, pos >> shift, &prealloc, NULL, NULL, NULL);
>   if (unlikely(IS_ERR(blk))) {
>   r = PTR_ERR(blk);
>   goto ret_r;
>   }
>   size = ((size_t)prealloc << shift) - page_offset;
>   ptr = blk + page_offset;
>   if (unlikely(!blk)) {
>   size = min(size, (unsigned)PAGE_SIZE);
>   ptr = empty_zero_page;
>   }
>   size = copy_to_iter(to, ptr, size);
>   if (unlikely(!size)) {
>   r = -EFAULT;
>   goto ret_r;
>   }
> 
>   pos += size;
>   total += size;
>   } while (iov_iter_count(to));
> 
>   r = 0;
> 
> ret_r:
>   *ppos = pos;
> 
>   if (file)
>   file_accessed(file);
> 
>   return total ? total : r;
> }
> 
> and use that instead of your nvfs_rw_iter_locked() in your
> ->read_iter() for DAX read case?  Then the same with
> s/copy_to_iter/_copy_to_iter/, to see how much of that is
> "hardening" overhead.
> 
> Incidentally, what's the point of sharing nvfs_rw_iter() for
> read and write cases?  They have practically no overlap -
> count the lines common for wr and !wr cases.  And if you
> do the same in nvfs_rw_iter_locked(), you'll see that the
> shared parts _there_ are bloody pointless on the read side.

That's a good point. I split nvfs_rw_iter to separate functions 
nvfs_read_iter and nvfs_write_iter - and inlined nvfs_rw_iter_locked into 
both of them. It improved performance by 1.3%.

> Not that it had been more useful on the write side, really,
> but that's another story (nvfs_write_pages() handling of
> copyin is... interesting).  Let's figure out what's going
> on with the read overhead first...
> 
> lib/iov_iter.c primitives certainly could use massage for
> better code generation, but let's find out how much of the
> PITA is due to those and how much comes from you fighting
> the damn thing instead of using it sanely...

The results are:

read:   6.744s
read_iter:  7.417s
read_iter - separate read and write path:   7.321s
Al's read_iter: 7.182s
Al's read_iter with _copy_to_iter:  7.181s

Mikulas



Re: Expense of read_iter

2021-01-07 Thread Mikulas Patocka



On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > return warn_unsupported...
> > and __kernel_write:
> > if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > return warn_unsupported...
> > 
> > - It exits with an error if both read_iter and read or write_iter and 
> > write are present.
> > 
> > I found out that on NVFS, reading a file with the read method has 10% 
> > better performance than the read_iter method. The benchmark just reads the 
> > same 4k page over and over again - and the cost of creating and parsing 
> > the kiocb and iov_iter structures is just that high.
> 
> Which part of it is so expensive?

The read_iter path is much bigger:
vfs_read- 0x160 bytes
new_sync_read   - 0x160 bytes
nvfs_rw_iter- 0x100 bytes
nvfs_rw_iter_locked - 0x4a0 bytes
iov_iter_advance- 0x300 bytes

If we go with the "read" method, there's just:
vfs_read- 0x160 bytes
nvfs_read   - 0x200 bytes

> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
> 
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>  
>  enum iter_type {
> /* iter types */
> +   ITER_UBUF = 2,
> ITER_IOVEC = 4,
> ITER_KVEC = 8,
> ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
> size_t iov_offset;
> size_t count;
> union {
> +   void __user *buf;
> const struct iovec *iov;
> const struct kvec *kvec;
> const struct bio_vec *bvec;
> 
> and then doing all the appropriate changes to make that work.
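The quoted ITER_UBUF proposal can be modeled in userspace like this - the types and the single-segment shortcut are sketched, not taken from the actual lib/iov_iter.c implementation:

```c
#include <stddef.h>
#include <string.h>

enum iter_type_m { ITER_UBUF_M, ITER_IOVEC_M };

struct iovec_m { char *iov_base; size_t iov_len; };

/* Modeled on the quoted patch: the union gains a bare buffer pointer,
 * so a single-buffer read never touches an iovec array at all. */
struct iter_m {
	enum iter_type_m type;
	size_t iov_offset;
	size_t count;
	union {
		char *buf;		/* ITER_UBUF */
		struct iovec_m *iov;	/* ITER_IOVEC */
	};
};

static size_t copy_to_iter_m(struct iter_m *it, const char *src, size_t len)
{
	if (len > it->count)
		len = it->count;
	if (it->type == ITER_UBUF_M) {
		/* One memcpy, no segment walk, no per-segment bookkeeping. */
		memcpy(it->buf + it->iov_offset, src, len);
		it->iov_offset += len;
		it->count -= len;
		return len;
	}
	/* IOVEC case reduced to the first segment for brevity. */
	memcpy(it->iov[0].iov_base + it->iov_offset, src, len);
	it->iov_offset += len;
	it->count -= len;
	return len;
}
```

The copy primitives can then special-case the UBUF type up front, which is exactly the single-buffer common case that sys_read() and sys_pread() generate.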


I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
unsigned long i;
unsigned long l = 1UL << 38;
unsigned s = 4096;
void *a = valloc(s);
if (!a) perror("malloc"), exit(1);
for (i = 0; i < l; i += s) {
if (pread(0, a, s, 0) != s) perror("read"), exit(1);
}
return 0;
}


Result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only 
options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object Symbol   
#   ...    .
#
47.32%  pread[kernel.vmlinux]  [k] copy_user_generic_string
 7.83%  pread[kernel.vmlinux]  [k] current_time
 6.57%  pread[nvfs][k] nvfs_rw_iter_locked
 5.59%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64
 4.23%  preadlibc-2.31.so  [.] __libc_pread
 3.51%  pread[kernel.vmlinux]  [k] syscall_return_via_sysret
 2.34%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
 2.34%  pread[kernel.vmlinux]  [k] vfs_read
 2.34%  pread[kernel.vmlinux]  [k] __fsnotify_parent
 2.31%  pread[kernel.vmlinux]  [k] new_sync_read
 2.21%  pread[nvfs][k] nvfs_bmap
 1.89%  pread[kernel.vmlinux]  [k] iov_iter_advance
 1.71%  pread[kernel.vmlinux]  [k] __x64_sys_pread64
 1.59%  pread[kernel.vmlinux]  [k] atime_needs_update
 1.24%  pread[nvfs][k] nvfs_rw_iter
 0.94%  pread[kernel.vmlinux]  [k] touch_atime
 0.75%  pread[kernel.vmlinux]  [k] syscall_enter_from_user_mode
 0.72%  pread[kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
 0.68%  pread[kernel.vmlinux]  [k] down_read
 0.62%  pread[kernel.vmlinux]  [k] exit_to_user_mode_prepare
 0.52%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode
 0.49%  pread[kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
 0.47%  pread[kernel.vmlinux]  [k] __fget_light
 0.46%  pread[kernel.vmlinux]  [k] do_syscall_64
 0.42%  preadpread [.] main
 0.33%  pread[kernel.vmlinux]  [k] up_read
 0.29%  pread[kernel.vmlinux]  [k] iov_iter_init
 0.16%  pread[kernel.vmlinux]  [k] __fdget
 0.10%  pread[kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
 0.03%  preadpread [.] pread@plt
 0.00%  perf [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: Use --symfs  if your symbol files are in non-standard locations)
#



Result, using the read method:

# To display the perf.data header info, please use --header/--header-only 
options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of even

[RFC v2] nvfs: a filesystem for persistent memory

2021-01-07 Thread Mikulas Patocka
Hi

I announce a new version of NVFS - a filesystem for persistent memory.
http://people.redhat.com/~mpatocka/nvfs/
git://leontynka.twibright.com/nvfs.git

Changes since the last release:

* I added a microjournal to the filesystem; it can hold up to 16 entries. 
  Each CPU has its own journal, so that there is no lock contention. The 
  journal is used to provide atomicity of rename() and extended attribute 
  replace.
  (note that file creation or deletion doesn't use the journal, because 
  these operations can be deterministically cleaned up by fsck)

* I created a framework that can be used to verify the filesystem driver. 
  It logs all writes and memory barriers to a file, the entries in the 
  file are randomly reordered (to simulate reordering in the CPU 
  write-combining buffers), the sequence is cut at a random point (to 
  simulate a system crash) and the result is replayed on a filesystem 
  image.
  With this framework, we can for example check that if a crash happens 
  during rename(), either old file or new file will be present in a 
  directory.
  This framework helped to find a few bugs in sequencing the writes.

* If we map an executable image, we turn off the DAX flag on the inode 
  (because executables run 4% slower from persistent memory). There is 
  also a switch that can turn DAX always off or always on.
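A minimal userspace model of the verification framework described above - log every write and barrier, shuffle writes only within their barrier window, cut the log at a random point, replay. All names and the single-byte write granularity are illustrative:

```c
#include <stdlib.h>

/* Log of byte writes; a barrier closes the current reordering window. */
struct wlog_ent { int addr; unsigned char val; int epoch; };

#define LOG_MAX 64
static struct wlog_ent wlog[LOG_MAX];
static int log_n, cur_epoch;

static void log_write(int addr, unsigned char val)
{
	wlog[log_n].addr = addr;
	wlog[log_n].val = val;
	wlog[log_n].epoch = cur_epoch;
	log_n++;
}

static void log_barrier(void)
{
	cur_epoch++;
}

/* Shuffle entries, but only within their barrier epoch (simulating CPU
 * write-combining reorder), then cut the log at a random point (the
 * simulated crash) and replay the surviving prefix onto the image. */
static void replay_crash(unsigned char *img, unsigned seed)
{
	int i, j, cut;

	srand(seed);
	for (i = 0; i < log_n; i++) {
		j = i + rand() % (log_n - i);
		if (wlog[i].epoch == wlog[j].epoch) {
			struct wlog_ent t = wlog[i];
			wlog[i] = wlog[j];
			wlog[j] = t;
		}
	}
	cut = rand() % (log_n + 1);
	for (i = 0; i < cut; i++)
		img[wlog[i].addr] = wlog[i].val;
}
```

Because entries stay sorted by epoch, a write issued after a barrier can never land in the replayed image unless every write before that barrier landed too - which is the invariant a checker (like the rename() check mentioned above) can then assert on the resulting image.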




I'd like to ask about this piece of code in __kernel_read:
if (unlikely(!file->f_op->read_iter || file->f_op->read))
return warn_unsupported...
and __kernel_write:
if (unlikely(!file->f_op->write_iter || file->f_op->write))
return warn_unsupported...

- It exits with an error if both read_iter and read or write_iter and 
write are present.

I found out that on NVFS, reading a file with the read method has 10% 
better performance than the read_iter method. The benchmark just reads the 
same 4k page over and over again - and the cost of creating and parsing 
the kiocb and iov_iter structures is just that high.

So, I'd like to have both read and read_iter methods. Could the above 
conditions be changed, so that they don't fail with an error if the "read" 
or "write" method is present?

Mikulas



Re: [PATCH v3 0/2] dm crypt: some fixes to support dm-crypt running in softirq context

2021-01-04 Thread Mikulas Patocka



On Mon, 4 Jan 2021, Ignat Korchagin wrote:

> Changes from v1:
>   * 0001: handle memory allocation failure for GFP_ATOMIC
> 
> Changes from v2:
>   * reordered patches
>   * 0002: crypt_convert will be retried from a workqueue, when a crypto 
> request
> allocation fails with GFP_ATOMIC instead of just returning an IO error to
> the user
> 
> Ignat Korchagin (2):
>   dm crypt: do not wait for backlogged crypto request completion in
> softirq
>   dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
> 
>  drivers/md/dm-crypt.c | 138 +-
>  1 file changed, 123 insertions(+), 15 deletions(-)
> 
> -- 
> 2.20.1
> 

Acked-by: Mikulas Patocka 



Re: [PATCH v2 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2021-01-01 Thread Mikulas Patocka



On Wed, 30 Dec 2020, Ignat Korchagin wrote:

> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 53791138d78b..e4fd690c70e1 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1539,7 +1549,10 @@ static blk_status_t crypt_convert(struct crypt_config *cc,
>  
>   while (ctx->iter_in.bi_size && ctx->iter_out.bi_size) {
>  
> - crypt_alloc_req(cc, ctx);
> + r = crypt_alloc_req(cc, ctx);
> + if (r)
> + return BLK_STS_RESOURCE;
> +
>   atomic_inc(&ctx->cc_pending);
>  
>   if (crypt_integrity_aead(cc))
> -- 
> 2.20.1

I'm not quite convinced that returning BLK_STS_RESOURCE will help. The 
block layer will convert this value back to -ENOMEM and return it to the 
caller, resulting in an I/O error.

Note that GFP_ATOMIC allocations may fail anytime and you must handle 
allocation failure gracefully - i.e. process the request without any 
error.

An acceptable solution would be to punt the request to a workqueue and do 
GFP_NOIO allocation from the workqueue. Or add the request to some list 
and process the list when some other request completes.

You should write a test that simulates allocation failure and verify that 
the kernel handles it gracefully without any I/O error.

Mikulas



Re: [PATCH 1/2] dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq

2020-12-30 Thread Mikulas Patocka
Hi

This patch doesn't handle allocation failure gracefully.

Mikulas



On Tue, 29 Dec 2020, Ignat Korchagin wrote:

> Commit 39d42fa96ba1b7d2544db3f8ed5da8fb0d5cb877 made it possible for some code
> paths in dm-crypt to be executed in softirq context, when the underlying 
> driver
> processes IO requests in interrupt/softirq context.
> 
> In this case sometimes when allocating a new crypto request we may get a
> stacktrace like below:
> 
> [  210.103008][C0] BUG: sleeping function called from invalid context at 
> mm/mempool.c:381
> [  210.104746][C0] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 
> 2602, name: fio
> [  210.106599][C0] CPU: 0 PID: 2602 Comm: fio Tainted: GW 
> 5.10.0+ #50
> [  210.108331][C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 0.0.0 02/06/2015
> [  210.110212][C0] Call Trace:
> [  210.110921][C0]  <IRQ>
> [  210.111527][C0]  dump_stack+0x7d/0xa3
> [  210.112411][C0]  ___might_sleep.cold+0x122/0x151
> [  210.113527][C0]  mempool_alloc+0x16b/0x2f0
> [  210.114524][C0]  ? __queue_work+0x515/0xde0
> [  210.115553][C0]  ? mempool_resize+0x700/0x700
> [  210.116586][C0]  ? crypt_endio+0x91/0x180
> [  210.117479][C0]  ? blk_update_request+0x757/0x1150
> [  210.118513][C0]  ? blk_mq_end_request+0x4b/0x480
> [  210.119572][C0]  ? blk_done_softirq+0x21d/0x340
> [  210.120628][C0]  ? __do_softirq+0x190/0x611
> [  210.121626][C0]  crypt_convert+0x29f9/0x4c00
> [  210.122668][C0]  ? _raw_spin_lock_irqsave+0x87/0xe0
> [  210.123824][C0]  ? kasan_set_track+0x1c/0x30
> [  210.124858][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
> [  210.125930][C0]  ? kmem_cache_free+0x104/0x470
> [  210.126973][C0]  ? crypt_endio+0x91/0x180
> [  210.127947][C0]  kcryptd_crypt_read_convert+0x30e/0x420
> [  210.129165][C0]  blk_update_request+0x757/0x1150
> [  210.130231][C0]  blk_mq_end_request+0x4b/0x480
> [  210.131294][C0]  blk_done_softirq+0x21d/0x340
> [  210.132332][C0]  ? _raw_spin_lock+0x81/0xd0
> [  210.133289][C0]  ? blk_mq_stop_hw_queue+0x30/0x30
> [  210.134399][C0]  ? _raw_read_lock_irq+0x40/0x40
> [  210.135458][C0]  __do_softirq+0x190/0x611
> [  210.136409][C0]  ? handle_edge_irq+0x221/0xb60
> [  210.137447][C0]  asm_call_irq_on_stack+0x12/0x20
> [  210.138507][C0]  </IRQ>
> [  210.139118][C0]  do_softirq_own_stack+0x37/0x40
> [  210.140191][C0]  irq_exit_rcu+0x110/0x1b0
> [  210.141151][C0]  common_interrupt+0x74/0x120
> [  210.142171][C0]  asm_common_interrupt+0x1e/0x40
> [  210.143206][C0] RIP: 0010:_aesni_enc1+0x65/0xb0
> [  210.144313][C0] Code: 38 dc c2 41 0f 28 52 d0 66 0f 38 dc c2 41 0f 28 
> 52 e0 66 0f 38 dc c2 41 0f 28 52 f0 66 0f 38 dc c2 41 0f 28 12 66 0f 38 dc c2 
> <41> 0f 28 52 10 66 0f 38 dc c2 41 0f 28 52 20 66 0f 38 dc c2 41 0f
> [  210.148542][C0] RSP: 0018:88810dbe6db0 EFLAGS: 0286
> [  210.149842][C0] RAX: 9a90cdc0 RBX:  RCX: 
> 0200
> [  210.151576][C0] RDX: 888101e5f240 RSI: 888101e5f240 RDI: 
> 888b5020
> [  210.153339][C0] RBP: 88810dbe6e20 R08:  R09: 
> 0020
> [  210.155063][C0] R10: 888b5090 R11: 111021b7cdcc R12: 
> 9e87cd40
> [  210.156791][C0] R13: 888b5210 R14: 888101e5f0d8 R15: 
> 
> [  210.158497][C0]  ? aesni_set_key+0x1e0/0x1e0
> [  210.159523][C0]  ? aesni_enc+0xf/0x20
> [  210.160408][C0]  ? glue_xts_req_128bit+0x181/0x6f0
> [  210.161571][C0]  ? aesni_set_key+0x1e0/0x1e0
> [  210.162560][C0]  ? glue_ctr_req_128bit+0x630/0x630
> [  210.163706][C0]  ? kasan_save_stack+0x37/0x50
> [  210.164761][C0]  ? kasan_save_stack+0x20/0x50
> [  210.165786][C0]  ? get_page_from_freelist+0x2052/0x36a0
> [  210.167024][C0]  ? __blkdev_direct_IO_simple+0x43b/0x7e0
> [  210.168288][C0]  ? blkdev_direct_IO+0xd16/0x1020
> [  210.169396][C0]  ? generic_file_direct_write+0x1a3/0x480
> [  210.170648][C0]  ? __generic_file_write_iter+0x1d9/0x530
> [  210.171882][C0]  ? blkdev_write_iter+0x20d/0x3e0
> [  210.172954][C0]  ? vfs_write+0x524/0x770
> [  210.173889][C0]  ? do_syscall_64+0x33/0x40
> [  210.174859][C0]  ? __zone_watermark_ok+0x340/0x340
> [  210.175977][C0]  ? crypt_convert+0x28b6/0x4c00
> [  210.177079][C0]  ? mempool_alloc+0x107/0x2f0
> [  210.178096][C0]  ? crypt_iv_tcw_ctr+0x4a0/0x4a0
> [  210.179193][C0]  ? bio_add_page+0x111/0x170
> [  210.180251][C0]  ? __bio_try_merge_page+0x480/0x480
> [  210.181446][C0]  ? bio_associate_blkg+0x6d/0x100
> [  210.182558][C0]  ? kcryptd_crypt_write_convert+0x5ea/0x980
> [  210.183852][C0]  ? crypt_map+0x5bf/0xc80
> [  210.184838][C0]  ? bio_clone_blkg_association+0x10e/0x2c0
> [  210.186125][C0]  ? __map_bio.isra.0+0x109/0x3f0
> [  210.187204][C0]  ? 

Re: [RFC PATCH v3 1/2] block: add simple copy support

2020-12-11 Thread Mikulas Patocka



On Fri, 11 Dec 2020, Johannes Thumshirn wrote:

> On 11/12/2020 15:57, SelvaKumar S wrote:
> [...] 
> > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload 
> > *payload,
> > +   gfp_t gfp_mask)
> > +{
> > +   struct request_queue *q = bdev_get_queue(bdev);
> > +   struct bio *bio;
> > +   void *buf = NULL;
> > +   int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> > +
> > +   nr_srcs = payload->copy_range;
> > +   max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;
> > +   cur_dest = payload->dest;
> > +   buf = kvmalloc(max_range_len, GFP_ATOMIC);
> 
> Why GFP_ATOMIC and not the passed in gfp_mask? Especially as this is a 
> kvmalloc()
> which has the potential to grow quite big.

You are right, this is confusing.

There's this piece of code at the top of kvmalloc_node:
	if ((flags & GFP_KERNEL) != GFP_KERNEL)
		return kmalloc_node(size, flags, node);

So, when you use the GFP_ATOMIC flag, it will always fall back to kmalloc.

Mikulas



Re: md: dm-writeback: add __noreturn to BUG-ging function

2020-11-18 Thread Mikulas Patocka



On Wed, 18 Nov 2020, Mike Snitzer wrote:

> On Wed, Nov 18 2020 at 10:49am -0500,
> Mike Snitzer  wrote:
> 
> > I don't think my suggestion will help.. given it'd still leave
> > persistent_memory_claim() without a return statement.
> > 
> > Think it worthwhile to just add a dummy 'return 0;' after the BUG().
> 
> Decided to go with this, now staged for 5.11:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-5.11&id=a1e4865b4dda7071f3707f7e551289ead66e38b1

Hi

I would just use "return -EOPNOTSUPP;" and drop the "#ifdef 
DM_WRITECACHE_HAS_PMEM" that you added.

That BUG/return -EOPNOTSUPP code can't happen at all - if 
DM_WRITECACHE_HAS_PMEM is not defined, WC_MODE_PMEM(wc) always returns 
false - so persistent_memory_claim and BUG() can't ever be called. And if 
it can't be called, you don't need to add code that prints an error in 
that case.

If we don't have DM_WRITECACHE_HAS_PMEM, the compiler optimizer will 
remove all the code guarded with if (WC_MODE_PMEM(wc)) as unreachable.

Mikulas


From: Mikulas Patocka 
Subject: [PATCH] dm writecache: remove BUG() and fail gracefully instead

Building on arch/s390/ results in this build error:

cc1: some warnings being treated as errors
../drivers/md/dm-writecache.c: In function 'persistent_memory_claim':
../drivers/md/dm-writecache.c:323:1: error: no return statement in function 
returning non-void [-Werror=return-type]

Fix this by replacing the BUG() with a -EOPNOTSUPP return.

Fixes: 48debafe4f2f ("dm: add writecache target")
Cc: sta...@vger.kernel.org  # v4.18+
Reported-by: Randy Dunlap 
Signed-off-by: Mikulas Patocka 

Index: linux-2.6/drivers/md/dm-writecache.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-writecache.c
+++ linux-2.6/drivers/md/dm-writecache.c
@@ -319,7 +319,7 @@ err1:
 #else
 static int persistent_memory_claim(struct dm_writecache *wc)
 {
-   BUG();
+   return -EOPNOTSUPP;
 }
 #endif
 



Re: [PATCH] Revert "dm cache: fix arm link errors with inline"

2020-11-11 Thread Mikulas Patocka
Acked-by: Mikulas Patocka 



On Tue, 10 Nov 2020, Nick Desaulniers wrote:

> This reverts commit 43aeaa29573924df76f44eda2bbd94ca36e407b5.
> 
> Since
> commit 0bddd227f3dc ("Documentation: update for gcc 4.9 requirement")
> the minimum supported version of GCC is gcc-4.9. It's now safe to remove
> this code.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/427
> Signed-off-by: Nick Desaulniers 
> ---
>  drivers/md/dm-cache-target.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
> index 9644424591da..4bc453f5bbaa 100644
> --- a/drivers/md/dm-cache-target.c
> +++ b/drivers/md/dm-cache-target.c
> @@ -712,10 +712,6 @@ static bool block_size_is_power_of_two(struct cache 
> *cache)
>   return cache->sectors_per_block_shift >= 0;
>  }
>  
> -/* gcc on ARM generates spurious references to __udivdi3 and __umoddi3 */
> -#if defined(CONFIG_ARM) && __GNUC__ == 4 && __GNUC_MINOR__ <= 6
> -__always_inline
> -#endif
>  static dm_block_t block_div(dm_block_t b, uint32_t n)
>  {
>   do_div(b, n);
> -- 
> 2.29.2.222.g5d2a92d10f8-goog
> 



Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-28 Thread Mikulas Patocka



On Thu, 24 Sep 2020, Mikulas Patocka wrote:

> On Tue, 22 Sep 2020, Matthew Wilcox wrote:
> 
> > > There is a small window when renamed inode is neither in source nor in 
> > > target directory. Fsck will reclaim such inode and add it to lost+found - 
> > > just like on EXT2.
> > 
> > ... ouch.  If you have to choose, it'd be better to link it to the second
> > directory then unlink it from the first one.  Then your fsck can detect
> > it has the wrong count and fix up the count (ie link it into both
> > directories rather than neither).
> 
> I admit that this is lame and I'll fix it. Rename is not so 
> performance-critical, so I can add a small journal for this.

Hi

I have implemented transactions in nvfs and I use them for rename, 
setattr, atomic xattr replacement and for RENAME_EXCHANGE.

You can download the current version here:
git://leontynka.twibright.com/nvfs.git

Mikulas



Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-24 Thread Mikulas Patocka



On Tue, 22 Sep 2020, Matthew Wilcox wrote:

> > > The NVFS indirect block tree has a fan-out of 16,
> > 
> > No. The top level in the inode contains 16 blocks (11 direct and 5 
> > indirect). And each indirect block can have 512 pointers (4096/8). You can 
> > format the device with larger block size and this increases the fanout 
> > (the NVFS block size must be greater or equal than the system page size).
> > 
> > 2 levels can map 1GiB (4096*512^2), 3 levels can map 512 GiB, 4 levels can 
> > map 256 TiB and 5 levels can map 128 PiB.
> 
> But compare to an unfragmented file ... you can map the entire thing with
> a single entry.  Even if you have to use a leaf node, you can get four
> extents in a single cacheline (and that's a fairly naive leaf node layout;
> I don't know exactly what XFS uses)

But the benchmarks show that it is comparable to extent-based filesystems.

> > > Rename is another operation that has specific "operation has atomic
> > > behaviour" expectations. I haven't looked at how you've
> > > implementated that yet, but I suspect it also is extremely difficult
> > > to implement in an atomic manner using direct pmem updates to the
> > > directory structures.
> > 
> > There is a small window when renamed inode is neither in source nor in 
> > target directory. Fsck will reclaim such inode and add it to lost+found - 
> > just like on EXT2.
> 
> ... ouch.  If you have to choose, it'd be better to link it to the second
> directory then unlink it from the first one.  Then your fsck can detect
> it has the wrong count and fix up the count (ie link it into both
> directories rather than neither).

I admit that this is lame and I'll fix it. Rename is not so 
performance-critical, so I can add a small journal for this.

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it. But then, I'll have something that has complexity of EXT4 and 
> > performance of EXT4. So that there will no longer be any reason why to use 
> > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> > attract some users who want good performance and who don't care about GID 
> > and UID being updated atomically, etc.
> 
> Well, what's your intent with nvfs?  Do you already have customers in mind
> who want to use this in production, or is this somewhere to play with and
> develop concepts that might make it into one of the longer-established
> filesystems?

I'm developing it just because I thought it might be interesting. So far, it 
doesn't have any serious users (the physical format is still changing). I 
hope that it could be useable as a general purpose root filesystem when 
Optane DIMMs become common.

Mikulas



Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-23 Thread Mikulas Patocka



On Wed, 23 Sep 2020, Dave Chinner wrote:

> > > dir-test /mnt/test/linux-2.6 63000 1048576
> > > nvfs  6.6s
> > > ext4 dax  8.4s
> > > xfs dax   12.2s
> > > 
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 link
> > > nvfs  4.7s
> > > ext4 dax  5.6s
> > > xfs dax   7.8s
> > > 
> > > dir-test /mnt/test/linux-2.6 63000 1048576 dir
> > > nvfs  8.2s
> > > ext4 dax  15.1s
> > > xfs dax   11.8s
> > > 
> > > Yes, nvfs is faster than both ext4 and XFS on DAX, but it's  not a
> > > huge difference - it's not orders of magnitude faster.
> > 
> > If I increase the size of the test directory, NVFS is order of magnitude 
> > faster:
> > 
> > time dir-test /mnt/test/ 200 200
> > NVFS: 0m29,395s
> > XFS:  1m59,523s
> > EXT4: 1m14,176s
> 
> What happened to NVFS there? The runtime went up by a factor of 5,
> even though the number of ops performed only doubled.

This test is from a different machine (K10 Opteron) than the above test 
(Skylake Xeon). I borrowed the Xeon for a short time and I no longer have 
access to it.

> > time dir-test /mnt/test/ 800 800
> > NVFS: 2m13,507s
> > XFS: 14m31,261s
> > EXT4: reports "file 1976882 can't be created: No space left on device", 
> > (although there are free blocks and inodes)
> > Is it a bug or expected behavior?
> 
> Exponential increase in runtime for a workload like this indicates
> the XFS journal is too small to run large scale operations. I'm
> guessing you're just testing on a small device?

In this test, the pmem device had 64GiB.

I've created a 1TiB ramdisk, formatted it with XFS and ran dir-test 800 
on it, however it wasn't much better - it took 14m8,824s.

> In which case, you'd get a 16MB log for XFS, which is tiny and most
> definitely will limit performance of any large scale metadta
> operation. Performance should improve significantly for large scale
> operations with a much larger log, and that should bring the XFS
> runtimes down significantly.

Is there some mkfs.xfs option that can increase log size?

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it.
> 
> I did not say that. My comments are about the requirement for
> atomicity of object changes, not journalling. Journalling is an
> -implementation that can provide change atomicity-, it is not a
> design constraint for metadata modification algorithms.
> 
> Really, you can chose how to do object update however you want. What
> I want to review is the design documentation and a correctness proof
> for whatever mechanism you choose to use. Without that information,
> we have absolutely no chance of reviewing the filesystem
> implementation for correctness. We don't need a proof for something
> that uses journalling (because we all know how that works), but for
> something that uses soft updates we most definitely need the proof
> of correctness for the update algorithm before we can determine if
> the implementation is good...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com

I am thinking about this: I can implement lightweight journaling that will 
journal just a few writes - I'll allocate some small per-cpu intent log 
for that.

For example, in nvfs_rename, we call nvfs_delete_de and nvfs_finish_add - 
these functions are very simple, both of them write just one word - so we 
can add these two words to the intent log. The same for setattr requesting 
simultaneous uid/gid/mode change - they are small, so they'll fit into the 
intent log well.

Regarding verifiability, I can do this - the writes to pmem are wrapped in 
a macro nv_store. So, I can modify this macro so that it logs all 
modifications. Then I take the log, cut it at random time, reorder the 
entries (to simulate reordering in the CPU write-combining buffers), 
replay it, run nvfsck on it and mount it. This way, we can verify that no 
matter where the crash happened, either an old file or a new file is 
present in a directory.

Do you agree with that?

Mikulas



Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-23 Thread Mikulas Patocka



On Wed, 23 Sep 2020, Jan Kara wrote:

> On Tue 22-09-20 12:46:05, Mikulas Patocka wrote:
> > > mapping 2^21 blocks requires a 5 level indirect tree. Which one if going 
> > > to be faster to truncate away - a single record or 2 million individual 
> > > blocks?
> > > 
> > > IOWs, we can take afford to take an extra cacheline miss or two on a
> > > tree block search, because we're accessing and managing orders of
> > > magnitude fewer records in the mapping tree than an indirect block
> > > tree.
> > > 
> > > PMEM doesn't change this: extents are more time and space efficient
> > > at scale for mapping trees than indirect block trees regardless
> > > of the storage medium in use.
> > 
> > PMEM doesn't have to be read linearly, so the attempts to allocate large 
> > linear space are not needed. They won't harm but they won't help either.
> > 
> > That's why NVFS has very simple block allocation alrogithm - it uses a 
> > per-cpu pointer and tries to allocate by a bit scan from this pointer. If 
> > the group is full, it tries a random group with above-average number of 
> > free blocks.
> 
> I agree with Dave here. People are interested in 2MB or 1GB contiguous
> allocations for DAX so that files can be mapped at PMD or event PUD levels
> thus saving a lot of CPU time on page faults and TLB.

NVFS has an upper limit of 1MB on block size. So, should I raise it to 
2MB? Will 2MB blocks be useful to someone?

Is there some API by which userspace can ask the kernel for an aligned 
allocation? fallocate() doesn't seem to offer an option for alignment.

> > EXT4 uses bit scan for allocations and people haven't complained that it's 
> > inefficient, so it is probably OK.
> 
> Yes, it is more or less OK but once you get to 1TB filesystem size and
> larger, the number of block groups grows enough that it isn't that great
> anymore. We are actually considering new allocation schemes for ext4 for
> this large filesystems...

NVFS can run with block size larger than page size, so you can reduce the 
number of block groups by increasing block size.

(ext4 also has bigalloc feature that will do it)

> > If you think that the lack of journaling is show-stopper, I can implement 
> > it. But then, I'll have something that has complexity of EXT4 and 
> > performance of EXT4. So that there will no longer be any reason why to use 
> > NVFS over EXT4. Without journaling, it will be faster than EXT4 and it may 
> > attract some users who want good performance and who don't care about GID 
> > and UID being updated atomically, etc.
> 
> I'd hope that your filesystem offers more performance benefits than just
> what you can get from a lack of journalling :). ext4 can be configured to

I also don't know how to implement journaling on persistent memory :) On 
EXT4 or XFS you can pin dirty buffers in memory until the journal is 
flushed. This is obviously impossible on persistent memory. So, I'm 
considering implementing only some lightweight journaling that will 
guarantee atomicity between just a few writes.

> run without a journal as well - mkfs.ext4 -O ^has_journal. And yes, it does
> significantly improve performance for some workloads but you have to have
> some way to recover from crashes so it's mostly used for scratch
> filesystems (e.g. in build systems, Google uses this feature a lot for some
> of their infrastructure as well).
> 
>   Honza
> -- 
> Jan Kara 
> SUSE Labs, CR

I've run "dir-test /mnt/test/ 800 800" and the result is:
EXT4 with journal   - 5m54,019s
EXT4 without journal- 4m4,444s
NVFS- 2m9,482s

Mikulas



Re: A bug in ext4 with big directories (was: NVFS XFS metadata)

2020-09-23 Thread Mikulas Patocka



On Wed, 23 Sep 2020, Jan Kara wrote:

> Hi!
> 
> On Wed 23-09-20 05:20:55, Mikulas Patocka wrote:
> > There seems to be a bug in ext4 - when I create very large directory, ext4 
> > fails with -ENOSPC despite the fact that there is plenty of free space and 
> > free inodes on the filesystem.
> > 
> > How to reproduce:
> > download the program dir-test: 
> > http://people.redhat.com/~mpatocka/benchmarks/dir-test.c
> > 
> > # modprobe brd rd_size=67108864
> > # mkfs.ext4 /dev/ram0
> > # mount -t ext4 /dev/ram0 /mnt/test
> > # dir-test /mnt/test/ 800 800
> > deleting: 7999000
> > 254
> > file 2515327 can't be created: No space left on device
> > # df /mnt/test
> > /dev/ram0  65531436  633752  61525860   2% /mnt/test
> > # df -i /mnt/test
> > /dev/ram0   4194304 1881547   2312757  45% /mnt/test
> 
> Yeah, you likely run out of space in ext4 directory h-tree. You can enable
> higher depth h-trees with large_dir feature (mkfs.ext4 -O large_dir). Does
> that help?

Yes, this helps.

Mikulas

> 
>   Honza
> 
> -- 
> Jan Kara 
> SUSE Labs, CR



A bug in ext4 with big directories (was: NVFS XFS metadata)

2020-09-23 Thread Mikulas Patocka
Hi

There seems to be a bug in ext4 - when I create very large directory, ext4 
fails with -ENOSPC despite the fact that there is plenty of free space and 
free inodes on the filesystem.

How to reproduce:
download the program dir-test: 
http://people.redhat.com/~mpatocka/benchmarks/dir-test.c

# modprobe brd rd_size=67108864
# mkfs.ext4 /dev/ram0
# mount -t ext4 /dev/ram0 /mnt/test
# dir-test /mnt/test/ 800 800
deleting: 7999000
254
file 2515327 can't be created: No space left on device
# df /mnt/test
/dev/ram0  65531436  633752  61525860   2% /mnt/test
# df -i /mnt/test
/dev/ram0   4194304 1881547   2312757  45% /mnt/test

(I tried to increase journal size, but it has no effect on this bug)

Mikulas



Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-22 Thread Mikulas Patocka
Hi

Thanks for reviewing NVFS.


On Tue, 22 Sep 2020, Dave Chinner wrote:

> Hi Mikulas,
> 
> I'll say up front that I think you're barking up the wrong tree
> trying to knock down XFS and ext4 to justify NVFS. NVFS will stand
> or fall on it's own merits, not on how you think it's better than
> other filesystems...
> 
> I have some fundamental concerns about the NVFS integrity model,
> they came out as I wrote this response to your characterisations of
> XFS and journalling filesysetms. Maybe I'm missing something about
> NVFS that isn't clearly explained
> 
> On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> > On Wed, 16 Sep 2020, Mikulas Patocka wrote:
> > > On Wed, 16 Sep 2020, Dan Williams wrote:
> > > > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka  
> > > > wrote:
> > > > >
> > > > > > My first question about nvfs is how it compares to a daxfs with
> > > > > > executables and other binaries configured to use page cache with the
> > > > > > new per-file dax facility?
> > > > >
> > > > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > > > because it doesn't have the overhead of the buffer cache and bios. See
> > > > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > > > 
> > > > ...and that metadata problem is intractable upstream? Christoph poked
> > > > at bypassing the block layer for xfs metadata operations [1], I just
> > > > have not had time to carry that further.
> > > > 
> > > > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > > > he's dropped that branch from his xfs.git
> > > 
> > > XFS is very big. I wanted to create something small.
> > 
> > And the another difference is that XFS metadata are optimized for disks 
> > and SSDs.
> 
> Ah, that old chestnut. :)
> 
> > On disks and SSDs, reading one byte is as costly as reading a full block. 
> > So we must put as much information to a block as possible. XFS uses 
> > b+trees for file block mapping and for directories - it is reasonable 
> > decision because b+trees minimize the number of disk accesses.
> 
> Actually, no, that wasn't why XFS was implemented using btrees. The
> btrees are an implementation detail, not a design requirement to
> minimise the number of disk accesses.
> 
> XFS was intended for huge disk arrays (hundreds to thousands on
> individual spindles) and so no attempt was made in the design to
> minimise disk accesses. There was -always- going to be a huge number
> of IOPS available in the intended large scale deployments, so
> concurrency and *efficiency at scale* was far more important at the
> design level than minimising the number of disk ops for any given
> operation.
> 
> To that end, simulations were done that showed that extent based
> trees were much more CPU, memory and IO efficient than bitmaps,
> hybrid tree-bitmaps or multi-layer indirect block referencing when
> trying to index and access large amounts of data.
> 
> To be efficient at scale, all operations need to be O(log N) or
> better, and extent based encoding is much, more compact than direct
> block indexing. Extent trees are also much more effective for finding
> exact fits, identifying large contiguous spaces, and manipulating
> large ranges of indexed data than other filesystem indexing
> mechanisms.  They are also not bound by alignment restrictions like
> hierarchical binary/buddy bitmap schemes are, and their maximum size
> is bounded and can be calculated at runtime.
> 
> IOWs, extent based trees were chosen because of scalability,
> efficiency, and flexibility reasons before the actual tree structure
> that it would be implemented with was decided on.  b+trees were used
> in the implementation because one tree implementation could do
> everything as all that needed to change btree trees was the pointer
> and record format.

I agree that the b+tree were a good choice for XFS.

In RAM-based maps, red-black trees or AVL trees are often used. In 
disk-based maps, btrees or b+trees are used. That's because in RAM you 
are optimizing for the number of cache lines accessed, and on disk you 
are optimizing for the number of blocks accessed.

> The result of this is that we have made -zero- changes to the XFS
> structure and algorithms for SSDs. We don't do different things
> based on the blkdev rotational flag, or anything like that. XFS
> behaves exactly the same on spinning disks as it does SSDs as it
> does PMEM and it performs well on all of them. And that performance
> doesn't 

Re: NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-22 Thread Mikulas Patocka



On Tue, 22 Sep 2020, Matthew Wilcox wrote:

> On Mon, Sep 21, 2020 at 12:20:42PM -0400, Mikulas Patocka wrote:
> > The same for directories - NVFS hashes the file name and uses radix-tree 
> > to locate a directory page where the directory entry is located. XFS 
> > b+trees would result in much more accesses than the radix-tree.
> 
> What?  Radix trees behave _horribly_ badly when indexed by a hash.
> If you have a 64-bit hash and use 8 bits per level of the tree, you have
> to traverse 8 pointers to get to your destination.  You might as well
> use a linked list!

In NVFS, radix trees are cut off - they have only as many internal 
levels as are needed to disambiguate the directory entries.

Read this document: http://people.redhat.com/~mpatocka/nvfs/INTERNALS
the section "DIRECTORIES".

Perhaps I should call it something other than a "radix tree", but I don't 
know the official name for this data structure.

Mikulas



NVFS XFS metadata (was: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache)

2020-09-21 Thread Mikulas Patocka



On Wed, 16 Sep 2020, Mikulas Patocka wrote:

> 
> 
> On Wed, 16 Sep 2020, Dan Williams wrote:
> 
> > On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka  
> > wrote:
> > >
> > > > My first question about nvfs is how it compares to a daxfs with
> > > > executables and other binaries configured to use page cache with the
> > > > new per-file dax facility?
> > >
> > > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > > because it doesn't have the overhead of the buffer cache and bios. See
> > > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> > 
> > ...and that metadata problem is intractable upstream? Christoph poked
> > at bypassing the block layer for xfs metadata operations [1], I just
> > have not had time to carry that further.
> > 
> > [1]: "xfs: use dax_direct_access for log writes", although it seems
> > he's dropped that branch from his xfs.git
> 
> XFS is very big. I wanted to create something small.

And the another difference is that XFS metadata are optimized for disks 
and SSDs.

On disks and SSDs, reading one byte is as costly as reading a full block. 
So we must put as much information into a block as possible. XFS uses 
b+trees for file block mapping and for directories - a reasonable 
decision because b+trees minimize the number of disk accesses.

On persistent memory, each access has its own cost, so NVFS uses metadata 
structures that minimize the number of cache lines accessed (rather than 
the number of blocks accessed). For block mapping, NVFS uses the classic 
unix direct/indirect blocks - if a file block is mapped by a 3rd-level 
indirect block, we do just three memory accesses and we are done. If we 
used b+trees, the number of accesses would be much larger than 3 (we would 
have to do binary search in the b+tree nodes).

The same for directories - NVFS hashes the file name and uses radix-tree 
to locate a directory page where the directory entry is located. XFS 
b+trees would result in much more accesses than the radix-tree.

Regarding journaling - NVFS doesn't do it because persistent memory is so 
fast that we can just check it in the case of a crash. NVFS has a 
multithreaded fsck that can do 3 million inodes per second. XFS does 
journaling (a reasonable decision for disks, where fsck took hours) 
and it will cause overhead for all the filesystem operations.

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-21 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Dan Williams wrote:

> > TODO:
> >
> > - programs run approximately 4% slower when running from Optane-based
> > persistent memory. Therefore, programs and libraries should use page cache
> > and not DAX mapping.
> 
> This needs to be based on platform firmware data (ACPI HMAT) for the
> relative performance of a PMEM range vs DRAM. For example, this
> tradeoff should not exist with battery backed DRAM, or virtio-pmem.

Hi

I have implemented this functionality - if we mmap a file with 
(vma->vm_flags & VM_DENYWRITE), then it is assumed that this is an 
executable file mapping - the S_DAX flag on the inode is cleared and the 
inode will use the normal page cache.

Is there some way how to test if we are using Optane-based module (where 
this optimization should be applied) or battery backed DRAM (where it 
should not)?

I've added mount options dax=never, dax=auto, dax=always, so that the user 
can override the automatic behavior.

Mikulas



the "read" syscall sees partial effects of the "write" syscall

2020-09-18 Thread Mikulas Patocka
Hi

I'd like to ask about this problem: when we write to a file, the kernel 
takes the write inode lock. When we read from a file, no lock is taken - 
thus the read syscall can read data that are halfway modified by the write 
syscall.

The standard specifies the effects of the write syscall are atomic - see 
this:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07

> 2.9.7 Thread Interactions with Regular File Operations
> 
> All of the following functions shall be atomic with respect to each 
> other in the effects specified in POSIX.1-2017 when they operate on 
> regular files or symbolic links:
> 
> chmod()fchownat()  lseek()  readv() unlink()
> chown()fcntl() lstat()  pwrite()unlinkat()
> close()fstat() open()   rename()utime()
> creat()fstatat()   openat() renameat()  utimensat()
> dup2() ftruncate() pread()  stat()  utimes()
> fchmod()   lchown()read()   symlink()   write()
> fchmodat() link()  readlink()   symlinkat() writev()
> fchown()   linkat()readlinkat() truncate()
> 
> If two threads each call one of these functions, each call shall either 
> see all of the specified effects of the other call, or none of them. The 
> requirement on the close() function shall also apply whenever a file 
> descriptor is successfully closed, however caused (for example, as a 
> consequence of calling close(), calling dup2(), or of process 
> termination).

Should the read call take the read inode lock to make it atomic w.r.t. the 
write syscall? (I know - taking the read lock causes big performance hit 
due to cache line bouncing)

I've created this program to test it - it has two threads, one writing and 
the other reading and verifying. When I run it on OpenBSD or FreeBSD, it 
passes, on Linux it fails with "we read modified bytes".

Mikulas



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <pthread.h>

#define L	65536

static int h;
static pthread_barrier_t barrier;
static pthread_t thr;

static char rpattern[L];
static char wpattern[L];

static void *reader(__attribute__((unused)) void *ptr)
{
	while (1) {
		int r;
		size_t i;
		r = pthread_barrier_wait(&barrier);
		if (r > 0) fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pread(h, rpattern, L, 0);
		if (r != L) perror("pread"), exit(1);
		for (i = 0; i < L; i++) {
			if (rpattern[i] != rpattern[0])
				fprintf(stderr, "we read modified bytes\n"), exit(1);
		}
	}
	return NULL;
}

int main(__attribute__((unused)) int argc, char *argv[])
{
	int r;
	h = open(argv[1], O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (h < 0) perror("open"), exit(1);
	r = pwrite(h, wpattern, L, 0);
	if (r != L) perror("pwrite"), exit(1);
	r = pthread_barrier_init(&barrier, NULL, 2);
	if (r) fprintf(stderr, "pthread_barrier_init: %s\n", strerror(r)), exit(1);
	r = pthread_create(&thr, NULL, reader, NULL);
	if (r) fprintf(stderr, "pthread_create: %s\n", strerror(r)), exit(1);
	while (1) {
		size_t i;
		for (i = 0; i < L; i++)
			wpattern[i]++;
		r = pthread_barrier_wait(&barrier);
		if (r > 0) fprintf(stderr, "pthread_barrier_wait: %s\n", strerror(r)), exit(1);
		r = pwrite(h, wpattern, L, 0);
		if (r != L) perror("pwrite"), exit(1);
	}
}



Re: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache

2020-09-16 Thread Mikulas Patocka



On Wed, 16 Sep 2020, Dan Williams wrote:

> On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka  wrote:
> >
> >
> >
> > I'm submitting this patch that adds the required exports (so that we could
> > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue
> > it for the next merge window.
> 
> Why? This should go with the first user, and it's not clear that it
> needs to be relative to the current dax_operations export scheme.

Before nvfs gets included in the kernel, I need to distribute it as a 
module. So, it would make my maintenance easier. But if you don't want to 
export it now, no problem, I can just copy __copy_user_flushcache from the 
kernel to the module.

> My first question about nvfs is how it compares to a daxfs with
> executables and other binaries configured to use page cache with the
> new per-file dax facility?

nvfs is faster than dax-based filesystems on metadata-heavy operations 
because it doesn't have the overhead of the buffer cache and bios. See 
this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS

Mikulas



[PATCH] pmem: fix __copy_user_flushcache

2020-09-16 Thread Mikulas Patocka



On Wed, 16 Sep 2020, Dan Williams wrote:

> On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka  wrote:
> >
> >
> >
> > On Wed, 16 Sep 2020, Dan Williams wrote:
> >
> > > On Wed, Sep 16, 2020 at 3:57 AM Mikulas Patocka  
> > > wrote:
> > > >
> > > >
> > > >
> > > > I'm submitting this patch that adds the required exports (so that we 
> > > > could
> > > > use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, 
> > > > queue
> > > > it for the next merge window.
> > >
> > > Why? This should go with the first user, and it's not clear that it
> > > needs to be relative to the current dax_operations export scheme.
> >
> > Before nvfs gets included in the kernel, I need to distribute it as a
> > module. So, it would make my maintenance easier. But if you don't want to
> > export it now, no problem, I can just copy __copy_user_flushcache from the
> > kernel to the module.
> 
> That sounds a better plan than exporting symbols with no in-kernel consumer.

BTW, this function is buggy. Here I'm submitting the patch.



From: Mikulas Patocka 

If we copy less than 8 bytes and if the destination crosses a cache line,
__copy_user_flushcache would invalidate only the first cache line. This
patch makes it invalidate the second cache line as well.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 arch/x86/lib/usercopy_64.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/arch/x86/lib/usercopy_64.c
===
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-09-05 10:01:27.0 
+0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c2020-09-16 20:48:31.0 
+0200
@@ -120,7 +120,7 @@ long __copy_user_flushcache(void *dst, c
 */
if (size < 8) {
if (!IS_ALIGNED(dest, 4) || size != 4)
-   clean_cache_range(dst, 1);
+   clean_cache_range(dst, size);
} else {
if (!IS_ALIGNED(dest, 8)) {
dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);



Re: [PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache

2020-09-16 Thread Mikulas Patocka



On Wed, 16 Sep 2020, Dan Williams wrote:

> On Wed, Sep 16, 2020 at 10:24 AM Mikulas Patocka  wrote:
> >
> > > My first question about nvfs is how it compares to a daxfs with
> > > executables and other binaries configured to use page cache with the
> > > new per-file dax facility?
> >
> > nvfs is faster than dax-based filesystems on metadata-heavy operations
> > because it doesn't have the overhead of the buffer cache and bios. See
> > this: http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS
> 
> ...and that metadata problem is intractable upstream? Christoph poked
> at bypassing the block layer for xfs metadata operations [1], I just
> have not had time to carry that further.
> 
> [1]: "xfs: use dax_direct_access for log writes", although it seems
> he's dropped that branch from his xfs.git

XFS is very big. I wanted to create something small.

Mikulas



[PATCH] pmem: export the symbols __copy_user_flushcache and __copy_from_user_flushcache

2020-09-16 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Mikulas Patocka wrote:

> 
> 
> On Tue, 15 Sep 2020, Mikulas Patocka wrote:
> 
> > > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > > > trailing bytes.
> > > 
> > > You want copy_user_flushcache(). See how fs/dax.c arranges for
> > > dax_copy_from_iter() to route to pmem_copy_from_iter().
> > 
> > Is it something new for the kernel 5.10? I see only __copy_user_flushcache 
> > that is implemented just for x86 and arm64.
> > 
> > There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
> > It is used in lib/iov_iter.c under
> > #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
> > 
> > Mikulas
> 
> ... and __copy_user_flushcache is not exported for modules. So, I am stuck 
> with __copy_from_user_inatomic_nocache.
> 
> Mikulas

I'm submitting this patch that adds the required exports (so that we could 
use __copy_from_user_flushcache on x86, arm64 and powerpc). Please, queue 
it for the next merge window.


From: Mikulas Patocka 

Export the symbols __copy_user_flushcache and __copy_from_user_flushcache,
so that modules can use this functionality.

Signed-off-by: Mikulas Patocka 

---
 arch/arm64/lib/uaccess_flushcache.c |1 +
 arch/powerpc/lib/pmem.c |1 +
 arch/x86/lib/usercopy_64.c  |1 +
 3 files changed, 3 insertions(+)

Index: linux-2.6/arch/arm64/lib/uaccess_flushcache.c
===
--- linux-2.6.orig/arch/arm64/lib/uaccess_flushcache.c  2020-09-16 
12:44:15.068038000 +0200
+++ linux-2.6/arch/arm64/lib/uaccess_flushcache.c   2020-09-16 
12:44:15.068038000 +0200
@@ -38,3 +38,4 @@ unsigned long __copy_user_flushcache(voi
__clean_dcache_area_pop(to, n - rc);
return rc;
 }
+EXPORT_SYMBOL_GPL(__copy_user_flushcache);
Index: linux-2.6/arch/x86/lib/usercopy_64.c
===
--- linux-2.6.orig/arch/x86/lib/usercopy_64.c   2020-09-16 12:44:15.068038000 
+0200
+++ linux-2.6/arch/x86/lib/usercopy_64.c2020-09-16 12:44:15.068038000 
+0200
@@ -134,6 +134,7 @@ long __copy_user_flushcache(void *dst, c
 
return rc;
 }
+EXPORT_SYMBOL_GPL(__copy_user_flushcache);
 
 void __memcpy_flushcache(void *_dst, const void *_src, size_t size)
 {
Index: linux-2.6/arch/powerpc/lib/pmem.c
===
--- linux-2.6.orig/arch/powerpc/lib/pmem.c  2020-09-16 12:44:15.068038000 
+0200
+++ linux-2.6/arch/powerpc/lib/pmem.c   2020-09-16 12:44:15.068038000 +0200
@@ -75,6 +75,7 @@ long __copy_from_user_flushcache(void *d
 
return copied;
 }
+EXPORT_SYMBOL_GPL(__copy_from_user_flushcache);
 
 void memcpy_flushcache(void *dest, const void *src, size_t size)
 {



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Matthew Wilcox wrote:

> On Tue, Sep 15, 2020 at 08:34:41AM -0400, Mikulas Patocka wrote:
> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor 
> > of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
> > based block devices directly?
> 
> Oh, because fs/block_dev.c has:
> .mmap   = generic_file_mmap,
> 
> I don't see why we shouldn't have a blkdev_mmap modelled after
> ext2_file_mmap() with the corresponding blkdev_dax_vm_ops.

Yes, that's possible - and we would also have to rewrite the read_iter 
and write_iter methods on DAX block devices, so that they are coherent 
with mmap.

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Dan Williams wrote:

> > - when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses
> > buffer cache for the mapping. The buffer cache slows down fsck by a factor
> > of 5 to 10. Could it be possible to change the kernel so that it maps DAX
> > based block devices directly?
> 
> We've been down this path before.
> 
> 5a023cdba50c block: enable dax for raw block devices
> 9f4736fe7ca8 block: revert runtime dax control of the raw block device
> acc93d30d7d4 Revert "block: enable dax for raw block devices"

It says "The functionality is superseded by the new 'Device DAX' 
facility". But the fsck tool can't change a fsdax device into a devdax 
device just for checking. Or can it?

> EXT2/4 metadata buffer management depends on the page cache and we
> eliminated a class of bugs by removing that support. The problems are
> likely tractable, but there was not a straightforward fix visible at
> the time.

Thinking about it - it isn't as easy as it looks...

Suppose that the user mounts an ext2 filesystem and then uses the tune2fs 
tool on the mounted block device. The tune2fs tool reads and writes the 
mounted superblock directly.

So, read/write must be coherent with the buffer cache (otherwise the 
kernel would not see the changes written by tune2fs). And mmap must be 
coherent with read/write.

So, if we want to map the pmem device directly, we could add a new flag 
MAP_DAX. Or we could test if the fd has O_DIRECT flag and map it directly 
in this case. But the default must be to map it coherently in order to not 
break existing programs.

> > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > trailing bytes.
> 
> You want copy_user_flushcache(). See how fs/dax.c arranges for
> dax_copy_from_iter() to route to pmem_copy_from_iter().

Is it something new for the kernel 5.10? I see only __copy_user_flushcache 
that is implemented just for x86 and arm64.

There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
It is used in lib/iov_iter.c under
#ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?

Mikulas



Re: [RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka



On Tue, 15 Sep 2020, Mikulas Patocka wrote:

> > > - __copy_from_user_inatomic_nocache doesn't flush cache for leading and
> > > trailing bytes.
> > 
> > You want copy_user_flushcache(). See how fs/dax.c arranges for
> > dax_copy_from_iter() to route to pmem_copy_from_iter().
> 
> Is it something new for the kernel 5.10? I see only __copy_user_flushcache 
> that is implemented just for x86 and arm64.
> 
> There is __copy_from_user_flushcache implemented for x86, arm64 and power. 
> It is used in lib/iov_iter.c under
> #ifdef CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE - so should I use this?
> 
> Mikulas

... and __copy_user_flushcache is not exported for modules. So, I am stuck 
with __copy_from_user_inatomic_nocache.

Mikulas



[RFC] nvfs: a filesystem for persistent memory

2020-09-15 Thread Mikulas Patocka
Hi

I am developing a new filesystem suitable for persistent memory - nvfs. 
The goal is to have a small and fast filesystem that can be used on 
DAX-based devices. Nvfs maps the whole device into linear address space 
and it completely bypasses the overhead of the block layer and buffer 
cache.

In the past, there was the NOVA filesystem for pmem, but it was abandoned a 
year ago (the last version is for kernel 5.1 - 
https://github.com/NVSL/linux-nova ). Nvfs is smaller and performs better.

The design of nvfs is similar to ext2/ext4, so that it fits into the VFS 
layer naturally, without too much glue code.

I'd like to ask you to review it.


tarballs:
http://people.redhat.com/~mpatocka/nvfs/
git:
git://leontynka.twibright.com/nvfs.git
the description of filesystem internals:
http://people.redhat.com/~mpatocka/nvfs/INTERNALS
benchmarks:
http://people.redhat.com/~mpatocka/nvfs/BENCHMARKS


TODO:

- programs run approximately 4% slower when running from Optane-based 
persistent memory. Therefore, programs and libraries should use page cache 
and not DAX mapping.

- when the fsck.nvfs tool mmaps the device /dev/pmem0, the kernel uses 
buffer cache for the mapping. The buffer cache slows down fsck by a factor 
of 5 to 10. Could it be possible to change the kernel so that it maps DAX 
based block devices directly?

- __copy_from_user_inatomic_nocache doesn't flush cache for leading and 
trailing bytes.

Mikulas



Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-11 Thread Mikulas Patocka



On Wed, 9 Sep 2020, Darrick J. Wong wrote:

> On Sat, Sep 05, 2020 at 01:02:33PM -0400, Mikulas Patocka wrote:
> > > > 
> > 
> > I've written this program that tests it - you can integrate it into your 
> > testsuite.
> 
> I don't get it.  You're a filesystem maintainer too, which means you're
> a regular contributor.  Do you:
> 
> (a) not use fstests?  If you don't, I really hope you use something else
> to QA hpfs.

I don't use xfstests on HPFS. I was testing it just by using it. Now I use 
it just a little, but I don't modify it much.

> (b) really think that it's my problem to integrate and submit your
> regression tests for you?
> 
> and (c) what do you want me to do with a piece of code that has no 
> signoff tag, no copyright, and no license?  This is your patch, and 
> therefore your responsibility to develop enough of an appropriate 
> regression test in a proper form that the rest of us can easily 
> determine we have the rights to contribute to it.


If you want a full patch (I copied the script from test 313), I send it 
here.

Mikulas


From: Mikulas Patocka 
Subject: [PATCH] check ctime and mtime vs mmap

Check ctime and mtime are not updated on COW faults
and that they are updated on shared faults

Signed-off-by: Mikulas Patocka 

---
 src/Makefile  |3 +-
 src/mmap-timestamp.c  |   53 ++
 tests/generic/609 |   40 +
 tests/generic/609.out |2 +
 tests/generic/group   |1 
 5 files changed, 98 insertions(+), 1 deletion(-)

Index: xfstests-dev/src/Makefile
===
--- xfstests-dev.orig/src/Makefile  2020-09-06 12:38:40.0 +0200
+++ xfstests-dev/src/Makefile   2020-09-11 17:39:04.0 +0200
@@ -17,7 +17,8 @@ TARGETS = dirstress fill fill2 getpagesi
t_mmap_cow_race t_mmap_fallocate fsync-err t_mmap_write_ro \
t_ext4_dax_journal_corruption t_ext4_dax_inline_corruption \
t_ofd_locks t_mmap_collision mmap-write-concurrent \
-   t_get_file_time t_create_short_dirs t_create_long_dirs
+   t_get_file_time t_create_short_dirs t_create_long_dirs \
+   mmap-timestamp
 
 LINUX_TARGETS = xfsctl bstat t_mtab getdevicesize preallo_rw_pattern_reader \
preallo_rw_pattern_writer ftrunc trunc fs_perms testx looptest \
Index: xfstests-dev/src/mmap-timestamp.c
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ xfstests-dev/src/mmap-timestamp.c   2020-09-11 18:21:40.0 +0200
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+// Copyright (C) 2020 Red Hat, Inc. All Rights reserved.
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/mman.h>
+
+#define FILE_NAME  argv[1]
+
+static struct stat st1, st2;
+
+int main(int argc, char *argv[])
+{
+   int h, r;
+   char *map;
+   unlink(FILE_NAME);
+   h = creat(FILE_NAME, 0600);
+   if (h == -1) perror("creat"), exit(1);
+   r = write(h, "x", 1);
+   if (r != 1) perror("write"), exit(1);
+   if (close(h)) perror("close"), exit(1);
+   h = open(FILE_NAME, O_RDWR);
+   if (h == -1) perror("open"), exit(1);
+
+   map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, h, 0);
+   if (map == MAP_FAILED) perror("mmap"), exit(1);
+   if (fstat(h, &st1)) perror("fstat"), exit(1);
+   sleep(2);
+   *map = 'y';
+   if (fstat(h, &st2)) perror("fstat"), exit(1);
+   if (st1.st_mtime != st2.st_mtime) fprintf(stderr, "BUG: COW fault 
changed mtime!\n"), exit(1);
+   if (st1.st_ctime != st2.st_ctime) fprintf(stderr, "BUG: COW fault 
changed ctime!\n"), exit(1);
+   if (munmap(map, 4096)) perror("munmap"), exit(1);
+
+   map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, h, 0);
+   if (map == MAP_FAILED) perror("mmap"), exit(1);
+   if (fstat(h, &st1)) perror("fstat"), exit(1);
+   sleep(2);
+   *map = 'z';
+   if (msync(map, 4096, MS_SYNC)) perror("msync"), exit(1);
+   if (fstat(h, &st2)) perror("fstat"), exit(1);
+   if (st1.st_mtime == st2.st_mtime) fprintf(stderr, "BUG: Shared fault 
did not change mtime!\n"), exit(1);
+   if (st1.st_ctime == st2.st_ctime) fprintf(stderr, "BUG: Shared fault 
did not change ctime!\n"), exit(1);
+   if (munmap(map, 4096)) perror("munmap"), exit(1);
+
+   if (close(h)) perror("close"), exit(1);
+   if (unlink(FILE_NAME)) perror("unlink"), exit(1);
+   return 0;
+}
Index: xfstests-dev/tests/generic/609
===
--- /dev/null   1970-01-01 00:00:00.0 +
+++ xfs

[PATCH 2/2 v2] xfs: don't update mtime on COW faults

2020-09-05 Thread Mikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and
PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
when the user hits a COW fault.

This breaks building of the Linux kernel.
How to reproduce:
1. extract the Linux kernel tree on dax-mounted xfs filesystem
2. run make clean
3. run make -j12
4. run make -j12
- at step 4, make would incorrectly rebuild the whole kernel (although it
  was already built in step 3).

The reason for the breakage is that almost all object files depend on
objtool. When we run objtool, it takes COW page fault on its .data
section, and these faults will incorrectly update the timestamp of the
objtool binary. The updated timestamp causes make to rebuild the whole
tree.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 fs/xfs/xfs_file.c |   13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux-2.6/fs/xfs/xfs_file.c
===
--- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 18:48:54.0 +0200
+++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 18:56:48.0 +0200
@@ -1223,14 +1223,21 @@ __xfs_filemap_fault(
return ret;
 }
 
+static inline bool
+xfs_is_shared_dax_write_fault(
+   struct vm_fault *vmf)
+{
+   return IS_DAX(file_inode(vmf->vma->vm_file)) &&
+  vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;
+}
+
 static vm_fault_t
 xfs_filemap_fault(
struct vm_fault *vmf)
 {
/* DAX can shortcut the normal fault path on write faults! */
return __xfs_filemap_fault(vmf, PE_SIZE_PTE,
-   IS_DAX(file_inode(vmf->vma->vm_file)) &&
-   (vmf->flags & FAULT_FLAG_WRITE));
+   xfs_is_shared_dax_write_fault(vmf));
 }
 
 static vm_fault_t
@@ -1243,7 +1250,7 @@ xfs_filemap_huge_fault(
 
/* DAX can shortcut the normal fault path on write faults! */
return __xfs_filemap_fault(vmf, pe_size,
-   (vmf->flags & FAULT_FLAG_WRITE));
+   xfs_is_shared_dax_write_fault(vmf));
 }
 
 static vm_fault_t



Re: [PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-05 Thread Mikulas Patocka



On Sat, 5 Sep 2020, Darrick J. Wong wrote:

> On Sat, Sep 05, 2020 at 08:13:02AM -0400, Mikulas Patocka wrote:
> > When running in a dax mode, if the user maps a page with MAP_PRIVATE and
> > PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
> > when the user hits a COW fault.
> > 
> > This breaks building of the Linux kernel.
> > How to reproduce:
> > 1. extract the Linux kernel tree on dax-mounted xfs filesystem
> > 2. run make clean
> > 3. run make -j12
> > 4. run make -j12
> > - at step 4, make would incorrectly rebuild the whole kernel (although it
> >   was already built in step 3).
> > 
> > The reason for the breakage is that almost all object files depend on
> > objtool. When we run objtool, it takes COW page fault on its .data
> > section, and these faults will incorrectly update the timestamp of the
> > objtool binary. The updated timestamp causes make to rebuild the whole
> > tree.
> > 
> > Signed-off-by: Mikulas Patocka 
> > Cc: sta...@vger.kernel.org
> > 
> > ---
> >  fs/xfs/xfs_file.c |   11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > Index: linux-2.6/fs/xfs/xfs_file.c
> > ===
> > --- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 10:01:42.0 
> > +0200
> > +++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 13:59:12.0 +0200
> > @@ -1223,6 +1223,13 @@ __xfs_filemap_fault(
> > return ret;
> >  }
> >  
> > +static bool
> > +xfs_is_write_fault(
> 
> Call this xfs_is_shared_dax_write_fault, and throw in the IS_DAX() test?
> 
> You might as well make it a static inline.

Yes, it is possible. I'll send a second version.

> > +   struct vm_fault *vmf)
> > +{
> > +   return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;
> 
> Also, is "shortcutting the normal fault path" the reason for ext2 and
> xfs both being broken?
> 
> /me puzzles over why write_fault is always true for page_mkwrite and
> pfn_mkwrite, but not for fault and huge_fault...
> 
> Also: Can you please turn this (checking for timestamp update behavior
> wrt shared and private mapping write faults) into an fstest so we don't
> mess this up again?

I've written this program that tests it - you can integrate it into your 
testsuite.

Mikulas


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define FILE_NAME	"test.txt"

static struct stat st1, st2;

int main(void)
{
	int h, r;
	char *map;
	unlink(FILE_NAME);
	h = creat(FILE_NAME, 0600);
	if (h == -1) perror("creat"), exit(1);
	r = write(h, "x", 1);
	if (r != 1) perror("write"), exit(1);
	if (close(h)) perror("close"), exit(1);
	h = open(FILE_NAME, O_RDWR);
	if (h == -1) perror("open"), exit(1);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, h, 0);
	if (map == MAP_FAILED) perror("mmap"), exit(1);
	if (fstat(h, &st1)) perror("fstat"), exit(1);
	sleep(2);
	*map = 'y';
	if (fstat(h, &st2)) perror("fstat"), exit(1);
	if (memcmp(&st1, &st2, sizeof(struct stat))) fprintf(stderr, "BUG: COW fault changed time!\n"), exit(1);
	if (munmap(map, 4096)) perror("munmap"), exit(1);

	map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, h, 0);
	if (map == MAP_FAILED) perror("mmap"), exit(1);
	if (fstat(h, &st1)) perror("fstat"), exit(1);
	sleep(2);
	*map = 'z';
	if (fstat(h, &st2)) perror("fstat"), exit(1);
	if (st1.st_mtime == st2.st_mtime) fprintf(stderr, "BUG: Shared fault did not change mtime!\n"), exit(1);
	if (st1.st_ctime == st2.st_ctime) fprintf(stderr, "BUG: Shared fault did not change ctime!\n"), exit(1);
	if (munmap(map, 4096)) perror("munmap"), exit(1);

	if (close(h)) perror("close"), exit(1);
	if (unlink(FILE_NAME)) perror("unlink"), exit(1);
	return 0;
}



[PATCH 2/2] xfs: don't update mtime on COW faults

2020-09-05 Thread Mikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and
PROT_WRITE, the xfs filesystem would incorrectly update ctime and mtime
when the user hits a COW fault.

This breaks building of the Linux kernel.
How to reproduce:
1. extract the Linux kernel tree on dax-mounted xfs filesystem
2. run make clean
3. run make -j12
4. run make -j12
- at step 4, make would incorrectly rebuild the whole kernel (although it
  was already built in step 3).

The reason for the breakage is that almost all object files depend on
objtool. When we run objtool, it takes COW page fault on its .data
section, and these faults will incorrectly update the timestamp of the
objtool binary. The updated timestamp causes make to rebuild the whole
tree.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 fs/xfs/xfs_file.c |   11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/xfs/xfs_file.c
===
--- linux-2.6.orig/fs/xfs/xfs_file.c2020-09-05 10:01:42.0 +0200
+++ linux-2.6/fs/xfs/xfs_file.c 2020-09-05 13:59:12.0 +0200
@@ -1223,6 +1223,13 @@ __xfs_filemap_fault(
return ret;
 }
 
+static bool
+xfs_is_write_fault(
+   struct vm_fault *vmf)
+{
+   return vmf->flags & FAULT_FLAG_WRITE && vmf->vma->vm_flags & VM_SHARED;
+}
+
 static vm_fault_t
 xfs_filemap_fault(
struct vm_fault *vmf)
@@ -1230,7 +1237,7 @@ xfs_filemap_fault(
/* DAX can shortcut the normal fault path on write faults! */
return __xfs_filemap_fault(vmf, PE_SIZE_PTE,
IS_DAX(file_inode(vmf->vma->vm_file)) &&
-   (vmf->flags & FAULT_FLAG_WRITE));
+   xfs_is_write_fault(vmf));
 }
 
 static vm_fault_t
@@ -1243,7 +1250,7 @@ xfs_filemap_huge_fault(
 
/* DAX can shortcut the normal fault path on write faults! */
return __xfs_filemap_fault(vmf, pe_size,
-   (vmf->flags & FAULT_FLAG_WRITE));
+   xfs_is_write_fault(vmf));
 }
 
 static vm_fault_t



[PATCH 1/2] ext2: don't update mtime on COW faults

2020-09-05 Thread Mikulas Patocka
When running in a dax mode, if the user maps a page with MAP_PRIVATE and
PROT_WRITE, the ext2 filesystem would incorrectly update ctime and mtime
when the user hits a COW fault.

This breaks building of the Linux kernel.
How to reproduce:
1. extract the Linux kernel tree on dax-mounted ext2 filesystem
2. run make clean
3. run make -j12
4. run make -j12
- at step 4, make would incorrectly rebuild the whole kernel (although it
  was already built in step 3).

The reason for the breakage is that almost all object files depend on
objtool. When we run objtool, it takes COW page fault on its .data
section, and these faults will incorrectly update the timestamp of the
objtool binary. The updated timestamp causes make to rebuild the whole
tree.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org

---
 fs/ext2/file.c |6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: linux-2.6/fs/ext2/file.c
===
--- linux-2.6.orig/fs/ext2/file.c   2020-09-05 10:01:41.0 +0200
+++ linux-2.6/fs/ext2/file.c2020-09-05 13:09:50.0 +0200
@@ -93,8 +93,10 @@ static vm_fault_t ext2_dax_fault(struct
struct inode *inode = file_inode(vmf->vma->vm_file);
struct ext2_inode_info *ei = EXT2_I(inode);
vm_fault_t ret;
+   bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
+   (vmf->vma->vm_flags & VM_SHARED);
 
-   if (vmf->flags & FAULT_FLAG_WRITE) {
+   if (write) {
sb_start_pagefault(inode->i_sb);
file_update_time(vmf->vma->vm_file);
}
@@ -103,7 +105,7 @@ static vm_fault_t ext2_dax_fault(struct
	ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, NULL, &ext2_iomap_ops);
 
	up_read(&ei->dax_sem);
-   if (vmf->flags & FAULT_FLAG_WRITE)
+   if (write)
sb_end_pagefault(inode->i_sb);
return ret;
 }



Re: make misbehavior on ext2 in dax mode (was: a crash when running strace from persistent memory)

2020-09-05 Thread Mikulas Patocka



On Fri, 4 Sep 2020, Mikulas Patocka wrote:

> Hmm, so I've found another bug in dax mode.
> 
> If you extract the Linux kernel tree on dax-based ext2 filesystem (use the 
> real ext2 driver, not ext4), and then you run make twice, the second 
> invocation will rebuild everything. It seems like a problem with 
> timestamps.
> 
> mount -t ext2 -o dax /dev/pmem0 /mnt/ext2/
> cd /mnt/ext2/usr/src/git/linux-2.6
> make clean
> make -j12
> make -j12 <--- this rebuilds the whole tree, although it shouldn't
> 
> I wasn't able to bisect it because this bug seems to be present in every 
> kernel I tried (back to 4.16.0). Ext4 doesn't seem to have this bug.
> 
> Mikulas

I've found out the root cause for this bug (XFS has it too) and I'm 
sending patches.

Mikulas



make misbehavior on ext2 in dax mode (was: a crash when running strace from persistent memory)

2020-09-04 Thread Mikulas Patocka



On Thu, 3 Sep 2020, Mikulas Patocka wrote:

> Hi
> 
> There's a bug when you run strace from dax-based filesystem.

Hmm, so I've found another bug in dax mode.

If you extract the Linux kernel tree on dax-based ext2 filesystem (use the 
real ext2 driver, not ext4), and then you run make twice, the second 
invocation will rebuild everything. It seems like a problem with 
timestamps.

mount -t ext2 -o dax /dev/pmem0 /mnt/ext2/
cd /mnt/ext2/usr/src/git/linux-2.6
make clean
make -j12
make -j12   <--- this rebuilds the whole tree, although it shouldn't

I wasn't able to bisect it because this bug seems to be present in every 
kernel I tried (back to 4.16.0). Ext4 doesn't seem to have this bug.

Mikulas



Re: a crash when running strace from persistent memory

2020-09-04 Thread Mikulas Patocka



On Thu, 3 Sep 2020, Linus Torvalds wrote:

> On Thu, Sep 3, 2020 at 12:24 PM Mikulas Patocka  wrote:
> >
> > There's a bug when you run strace from dax-based filesystem.
> >
> > -- create real or emulated persistent memory device (/dev/pmem0)
> > mkfs.ext2 /dev/pmem0
> > -- mount it
> > mount -t ext2 -o dax /dev/pmem0 /mnt/test
> > -- copy the system to it (well, you can copy just a few files that are
> >needed for running strace and ls)
> > cp -ax / /mnt/test
> > -- bind the system directories
> > mount --bind /dev /mnt/test/dev
> > mount --bind /proc /mnt/test/proc
> > mount --bind /sys /mnt/test/sys
> > -- run strace on the ls command
> > chroot /mnt/test/ strace /bin/ls
> >
> > You get this warning and ls is killed with SIGSEGV.
> >
> > I bisected the problem and it is caused by the commit
> > 17839856fd588f4ab6b789f482ed3ffd7c403e1f (gup: document and work around
> > "COW can break either way" issue). When I revert the patch (on the kernel
> > 5.9-rc3), the bug goes away.
> 
> Funky. I really don't see how it could cause that, but we have the
> UDDF issue too, so I'm guessing I will have to fix it the radical way
> with Peter Xu's series based on my "rip out COW special cases" patch.
> 
> Or maybe I'm just using that as an excuse for really wanting to apply
> that series.. Because we can't just revert that GUP commit due to
> security concerns.
> 
> > [   84.191504] WARNING: CPU: 6 PID: 1350 at mm/memory.c:2486 
> > wp_page_copy.cold+0xdb/0xf6
> 
> I'm assuming this is the WARN_ON_ONCE(1) on line 2482, and you have
> some extra debug patch that causes that line to be off by 4? Because
> at least for me, line 2486 is actually an empty line in v5.9-rc3.

Yes, that's it. I added a few printk to look at the control flow.

> That said, I really think this is a pre-existing race, and all the
> "COW can break either way" patch does is change the timing (presumably
> due to the actual pattern of actually doing the COW changing).
> 
> See commit c3e5ea6ee574 ("mm: avoid data corruption on CoW fault into
> PFN-mapped VMA") for background.
> 
> Mikulas, can you check that everything works ok for that case if you
> apply Peter's series? See
> 
> https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/

I applied these four patches and strace works well. There is no longer any 
warning or crash.

Mikulas

> or if you have 'b4' installed, use
> 
> b4 am 20200821234958.7896-1-pet...@redhat.com
> 
> to get the series..
> 
>  Linus
> 



a crash when running strace from persistent memory

2020-09-03 Thread Mikulas Patocka
Hi

There's a bug when you run strace from dax-based filesystem.

-- create real or emulated persistent memory device (/dev/pmem0)
mkfs.ext2 /dev/pmem0
-- mount it
mount -t ext2 -o dax /dev/pmem0 /mnt/test
-- copy the system to it (well, you can copy just a few files that are 
   needed for running strace and ls)
cp -ax / /mnt/test
-- bind the system directories
mount --bind /dev /mnt/test/dev
mount --bind /proc /mnt/test/proc
mount --bind /sys /mnt/test/sys
-- run strace on the ls command
chroot /mnt/test/ strace /bin/ls

You get this warning and ls is killed with SIGSEGV.

I bisected the problem and it is caused by the commit 
17839856fd588f4ab6b789f482ed3ffd7c403e1f (gup: document and work around 
"COW can break either way" issue). When I revert the patch (on the kernel 
5.9-rc3), the bug goes away.

Mikulas


[   84.190961] [ cut here ]
[   84.191504] WARNING: CPU: 6 PID: 1350 at mm/memory.c:2486 
wp_page_copy.cold+0xdb/0xf6
[   84.192398] Modules linked in: ext2 uvesafb cfbfillrect cfbimgblt cn 
cfbcopyarea fb fbdev ipv6 tun autofs4 binfmt_misc configfs af_packet mousedev 
virtio_balloon virtio_rng evdev rng_core pcspkr button raid10 raid456 
async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx 
libcrc32c raid1 raid0 md_mod sd_mod t10_pi virtio_scsi virtio_net psmouse 
net_failover scsi_mod failover
[   84.196301] CPU: 6 PID: 1350 Comm: strace Not tainted 5.9.0-rc3 #6
[   84.197020] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   84.197685] RIP: 0010:wp_page_copy.cold+0xdb/0xf6
[   84.198231] Code: ff ff ff 0f 00 eb 8e 48 8b 3c 24 48 8b 74 24 08 ba 00 10 
00 00 e8 33 87 1f 00 85 c0 74 1f 48 c7 c7 a7 2b ba 81 e8 cc b6 f0 ff <0f> 0b 48 
8b 3c 24 e8 08 82 1f 00 41 be 01 00 00 00 eb ae 41 be 01
[   84.200410] RSP: 0018:88940c1dba58 EFLAGS: 00010282
[   84.201035] RAX: 0006 RBX: 88940c1dbb00 RCX: 
[   84.201842] RDX: 0003 RSI: 81b9c0fa RDI: 
[   84.202650] RBP: ea004f0e4d80 R08:  R09: 
[   84.203460] R10: 0046 R11:  R12: 
[   84.204265] R13: 88940aa86318 R14: f7fac000 R15: 8893c8db3c40
[   84.205083] FS:  7fd8a8320740() GS:88940fb8() 
knlGS:
[   84.206000] CS:  0010 DS:  ES:  CR0: 80050033
[   84.206664] CR2: f7fac000 CR3: 0013c93a6000 CR4: 06a0
[   84.207481] Call Trace:
[   84.207883]  do_wp_page+0x172/0x6a0
[   84.208285]  handle_mm_fault+0xd0b/0x1540
[   84.208753]  __get_user_pages+0x21a/0x6c0
[   84.209213]  __get_user_pages_remote+0xc8/0x2a0
[   84.209735]  process_vm_rw_core.isra.0+0x1ac/0x440
[   84.210318]  ? __might_fault+0x26/0x40
[   84.210758]  ? _copy_from_user+0x6a/0xa0
[   84.211208]  ? __might_fault+0x26/0x40
[   84.211642]  ? _copy_from_user+0x6a/0xa0
[   84.212091]  process_vm_rw+0xd1/0x100
[   84.212511]  ? _copy_to_user+0x69/0x80
[   84.212946]  ? ptrace_get_syscall_info+0x9b/0x180
[   84.213484]  ? find_held_lock+0x2b/0x80
[   84.213926]  ? __x64_sys_ptrace+0x106/0x140
[   84.214405]  ? fpregs_assert_state_consistent+0x19/0x40
[   84.215002]  ? exit_to_user_mode_prepare+0x2d/0x120
[   84.215556]  __x64_sys_process_vm_readv+0x22/0x40
[   84.216103]  do_syscall_64+0x2d/0x80
[   84.216518]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   84.217098] RIP: 0033:0x7fd8a84896da
[   84.217512] Code: 48 8b 15 b9 f7 0b 00 f7 d8 64 89 02 b8 ff ff ff ff eb d2 
e8 18 f0 00 00 0f 1f 84 00 00 00 00 00 49 89 ca b8 36 01 00 00 0f 05 <48> 3d 00 
f0 ff ff 77 06 c3 0f 1f 44 00 00 48 8b 15 81 f7 0b 00 f7
[   84.219618] RSP: 002b:7ffd08c3c678 EFLAGS: 0246 ORIG_RAX: 0136
[   84.220563] RAX: ffda RBX: f7fac000 RCX: 7fd8a84896da
[   84.221380] RDX: 0001 RSI: 7ffd08c3c680 RDI: 0549
[   84.222194] RBP: 7ffd08c3c760 R08: 0001 R09: 
[   84.222999] R10: 7ffd08c3c690 R11: 0246 R12: f7faca80
[   84.223804] R13: 0580 R14: 0549 R15: 5589a50eee80
[   84.224612] ---[ end trace d8dbf2da5dc1b7ca ]---



[PATCH] dm-bufio: do cleanup from a workqueue

2020-07-03 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Dave Chinner wrote:

> https://lore.kernel.org/linux-fsdevel/20190809215733.gz7...@dread.disaster.area/
> 
> If you did that when I suggested it, this problem would be solved.
> i.e. The only way to fix this problem once and for all is to stop
> using the shrinker as a mechanism to issue and wait on IO. If you
> need background writeback of dirty buffers, do it from a
> WQ_MEM_RECLAIM workqueue that isn't directly in the memory reclaim
> path and so can issue writeback and block safely from a GFP_KERNEL
> context. Kick the workqueue from the shrinker context, but get rid
> of the IO submission and waiting from the shrinker and all the
> GFP_NOFS memory reclaim recursion problems go away.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com

Hi

This is a patch that moves buffer cleanup to a workqueue. Please review 
it.

Mikulas



From: Mikulas Patocka 

kswapd should not block, because blocking it degrades system performance.
So, move the reclaim of buffers to a workqueue.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-bufio.c |   60 ++
 1 file changed, 41 insertions(+), 19 deletions(-)

Index: linux-2.6/drivers/md/dm-bufio.c
===
--- linux-2.6.orig/drivers/md/dm-bufio.c2020-07-03 14:07:43.0 
+0200
+++ linux-2.6/drivers/md/dm-bufio.c 2020-07-03 15:35:23.0 +0200
@@ -108,7 +108,10 @@ struct dm_bufio_client {
int async_write_error;
 
struct list_head client_list;
+
struct shrinker shrinker;
+   struct work_struct shrink_work;
+   atomic_long_t need_shrink;
 };
 
 /*
@@ -1634,8 +1637,7 @@ static unsigned long get_retain_buffers(
return retain_bytes;
 }
 
-static unsigned long __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
-   gfp_t gfp_mask)
+static void __scan(struct dm_bufio_client *c)
 {
int l;
struct dm_buffer *b, *tmp;
@@ -1646,42 +1648,58 @@ static unsigned long __scan(struct dm_bu
 
for (l = 0; l < LIST_SIZE; l++) {
list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list) {
-   if (__try_evict_buffer(b, gfp_mask))
+   if (count - freed <= retain_target)
+   atomic_long_set(&c->need_shrink, 0);
+   if (!atomic_long_read(&c->need_shrink))
+   return;
+   if (__try_evict_buffer(b, GFP_KERNEL)) {
+   atomic_long_dec(&c->need_shrink);
freed++;
-   if (!--nr_to_scan || ((count - freed) <= retain_target))
-   return freed;
+   }
cond_resched();
}
}
-   return freed;
 }
 
-static unsigned long
-dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+static void shrink_work(struct work_struct *w)
+{
+   struct dm_bufio_client *c = container_of(w, struct dm_bufio_client, shrink_work);
+
+   dm_bufio_lock(c);
+   __scan(c);
+   dm_bufio_unlock(c);
+}
+
+static unsigned long dm_bufio_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
struct dm_bufio_client *c;
-   unsigned long freed;
 
c = container_of(shrink, struct dm_bufio_client, shrinker);
-   if (sc->gfp_mask & __GFP_FS)
-   dm_bufio_lock(c);
-   else if (!dm_bufio_trylock(c))
-   return SHRINK_STOP;
+   atomic_long_add(sc->nr_to_scan, &c->need_shrink);
+   queue_work(dm_bufio_wq, &c->shrink_work);
 
-   freed  = __scan(c, sc->nr_to_scan, sc->gfp_mask);
-   dm_bufio_unlock(c);
-   return freed;
+   return sc->nr_to_scan;
 }
 
-static unsigned long
-dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+static unsigned long dm_bufio_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 {
	struct dm_bufio_client *c = container_of(shrink, struct dm_bufio_client, shrinker);
unsigned long count = READ_ONCE(c->n_buffers[LIST_CLEAN]) +
  READ_ONCE(c->n_buffers[LIST_DIRTY]);
unsigned long retain_target = get_retain_buffers(c);
+   unsigned long queued_for_cleanup = atomic_long_read(&c->need_shrink);
+
+   if (unlikely(count < retain_target))
+   count = 0;
+   else
+   count -= retain_target;
 
-   return (count < retain_target) ? 0 : (count - retain_target);
+   if (unlikely(count < queued_for_cleanup))
+   count = 0;
+   else
+   count -= queued_for_cleanup;
+
+   return count;
 }
 
 /*
@@ -1772,6 +1790,9 @@ struct dm_bufio_client *dm_bufio_client_
__free_buffer_wa
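
The arithmetic in the new dm_bufio_shrink_count() above is two saturating 
subtractions: first drop the retain target, then drop the buffers already 
queued for cleanup, clamping to zero at each step. A userspace sketch of just 
that arithmetic (names are illustrative, not the kernel's):

```c
/* Mirror of the counting logic in dm_bufio_shrink_count() from the
 * patch above: report only buffers beyond the retain target that are
 * not already queued for cleanup.  Unsigned subtraction must be
 * clamped explicitly, or a small count would wrap around to a huge
 * value. */
static unsigned long shrink_count_estimate(unsigned long count,
					   unsigned long retain_target,
					   unsigned long queued_for_cleanup)
{
	count = (count < retain_target) ? 0 : count - retain_target;
	count = (count < queued_for_cleanup) ? 0 : count - queued_for_cleanup;
	return count;
}
```

Reporting a shrunken count this way stops the shrinker core from repeatedly 
calling scan for work that has already been handed to the workqueue.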

[PATCH 1/3 v6] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |8 +---
 crypto/authenc.c  |7 ---
 crypto/authencesn.c   |7 ---
 crypto/ccm.c  |8 
 crypto/chacha20poly1305.c |7 ---
 crypto/cryptd.c   |   16 +++-
 crypto/ctr.c  |4 ++--
 crypto/cts.c  |4 ++--
 crypto/essiv.c|4 ++--
 crypto/gcm.c  |   15 ---
 crypto/geniv.c|4 ++--
 crypto/lrw.c  |4 ++--
 crypto/pcrypt.c   |   14 +-
 crypto/rsa-pkcs1pad.c |4 ++--
 crypto/seqiv.c|2 ++
 crypto/xts.c  |4 ++--
 include/crypto/algapi.h   |   10 ++
 include/linux/crypto.h|   15 +++
 18 files changed, 86 insertions(+), 51 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h   2020-06-29 16:03:07.346417000 
+0200
+++ linux-2.6/include/linux/crypto.h2020-06-29 16:03:07.336417000 +0200
@@ -102,6 +102,21 @@
 #define CRYPTO_NOLOAD  0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY	0x0001
+
+/*
+ * When an algorithm uses another algorithm (e.g., if it's an instance of a
+ * template), these are the flags that always get set on the "outer" algorithm
+ * if any "inner" algorithm has them set.  In some cases other flags are
+ * inherited too; these are just the flags that are *always* inherited.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS	(CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY	0x0001
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c 2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authenc.c  2020-06-30 15:47:56.516417000 +0200
@@ -388,7 +388,7 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -423,8 +423,9 @@ static int crypto_authenc_create(struct
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c  2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authencesn.c   2020-06-30 15:48:11.996417000 +0200
@@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
   

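The flag-inheritance rule the patch above applies in every template is a 
single bitwise expression: OR the inner algorithms' cra_flags together and 
mask with CRYPTO_ALG_INHERITED_FLAGS. A userspace sketch with made-up flag 
values (the real definitions live in include/linux/crypto.h; these constants 
are assumptions for illustration):

```c
/* Illustrative stand-ins for the kernel's flag bits. */
#define ALG_ASYNC            0x00000080u
#define ALG_ALLOCATES_MEMORY 0x00010000u
#define ALG_INHERITED_FLAGS  (ALG_ASYNC | ALG_ALLOCATES_MEMORY)

/* An "outer" template instance (e.g. authenc) inherits these flags
 * from each "inner" algorithm it wraps, as in the patch's
 *   inst->alg.base.cra_flags = (auth_base->cra_flags |
 *                               enc->base.cra_flags) &
 *                              CRYPTO_ALG_INHERITED_FLAGS; */
static unsigned int inherited_cra_flags(unsigned int auth_flags,
					unsigned int enc_flags)
{
	return (auth_flags | enc_flags) & ALG_INHERITED_FLAGS;
}
```

Non-inheritable bits of the inner algorithms are masked out, so only the 
"always inherited" properties propagate to the outer instance.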
Re: [dm-devel] [PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Eric Biggers wrote:

> On Tue, Jun 30, 2020 at 01:01:16PM -0400, Mikulas Patocka wrote:
> > > diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
> > > index 7240e8bbdebb..643f7f1cc91c 100644
> > > --- a/crypto/pcrypt.c
> > > +++ b/crypto/pcrypt.c
> > > @@ -232,12 +232,15 @@ static int pcrypt_create_aead(struct 
> > > crypto_template *tmpl, struct rtattr **tb,
> > >   struct crypto_attr_type *algt;
> > >   struct aead_instance *inst;
> > >   struct aead_alg *alg;
> > > + u32 mask;
> > >   int err;
> > >  
> > >   algt = crypto_get_attr_type(tb);
> > >   if (IS_ERR(algt))
> > >   return PTR_ERR(algt);
> > >  
> > > + mask = crypto_alg_inherited_mask(algt->type, algt->mask);
> > > +
> > >   inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
> > >   if (!inst)
> > >   return -ENOMEM;
> > > @@ -254,7 +257,7 @@ static int pcrypt_create_aead(struct crypto_template 
> > > *tmpl, struct rtattr **tb,
> > >   goto err_free_inst;
> > >  
> > >   err = crypto_grab_aead(&ctx->spawn, aead_crypto_instance(inst),
> > > -crypto_attr_alg_name(tb[1]), 0, 0);
> > > +crypto_attr_alg_name(tb[1]), 0, mask);
> > >   if (err)
> > >   goto err_free_inst;
> > >  
> > 
> > I added "mask" there - but there is still a "mask" argument that is 
> > unused - is it a bug to have two "mask" variables?
> 
> Right, I didn't see that algt->type and algt->mask are already being passed 
> to 
> pcrypt_create_aead().  It's redundant because pcrypt_create_aead() has access 
> to
> those via crypto_get_attr_type() anyway.
> 
> How about just removing those two arguments for now?
> 
> - Eric

Yes.

Mikulas



[PATCH 1/3 v5] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |8 +---
 crypto/authenc.c  |7 ---
 crypto/authencesn.c   |7 ---
 crypto/ccm.c  |8 
 crypto/chacha20poly1305.c |7 ---
 crypto/cryptd.c   |   16 +++-
 crypto/ctr.c  |4 ++--
 crypto/cts.c  |4 ++--
 crypto/essiv.c|4 ++--
 crypto/gcm.c  |   15 ---
 crypto/geniv.c|4 ++--
 crypto/lrw.c  |4 ++--
 crypto/pcrypt.c   |9 +++--
 crypto/rsa-pkcs1pad.c |4 ++--
 crypto/seqiv.c|2 ++
 crypto/xts.c  |4 ++--
 include/crypto/algapi.h   |   10 ++
 include/linux/crypto.h|   15 +++
 18 files changed, 84 insertions(+), 48 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h   2020-06-29 16:03:07.346417000 
+0200
+++ linux-2.6/include/linux/crypto.h2020-06-29 16:03:07.336417000 +0200
@@ -102,6 +102,21 @@
 #define CRYPTO_NOLOAD  0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY	0x0001
+
+/*
+ * When an algorithm uses another algorithm (e.g., if it's an instance of a
+ * template), these are the flags that always get set on the "outer" algorithm
+ * if any "inner" algorithm has them set.  In some cases other flags are
+ * inherited too; these are just the flags that are *always* inherited.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS	(CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY	0x0001
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c 2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authenc.c  2020-06-30 15:47:56.516417000 +0200
@@ -388,7 +388,7 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -423,8 +423,9 @@ static int crypto_authenc_create(struct
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c  2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authencesn.c   2020-06-30 15:48:11.996417000 +0200
@@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
   

Re: [dm-devel] [PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Eric Biggers wrote:

> On Tue, Jun 30, 2020 at 09:56:22AM -0400, Mikulas Patocka wrote:
> > Index: linux-2.6/crypto/cryptd.c
> > ===
> > --- linux-2.6.orig/crypto/cryptd.c  2020-06-29 16:03:07.346417000 +0200
> > +++ linux-2.6/crypto/cryptd.c   2020-06-30 15:49:04.206417000 +0200
> > @@ -202,6 +202,7 @@ static inline void cryptd_check_internal
> >  
> > *type |= algt->type & CRYPTO_ALG_INTERNAL;
> > *mask |= algt->mask & CRYPTO_ALG_INTERNAL;
> > +   *mask |= algt->mask & CRYPTO_ALG_INHERITED_FLAGS;
> >  }
> 
> This needs to use the correct logic (crypto_alg_inherited_mask) to decide 
> which
> inherited flags to set in 'mask'.

I added "*mask |= crypto_alg_inherited_mask(algt->type, algt->mask);" 
there.


> > --- linux-2.6.orig/crypto/adiantum.c2020-06-29 16:03:07.346417000 
> > +0200
> > +++ linux-2.6/crypto/adiantum.c 2020-06-29 16:03:07.346417000 +0200
> > @@ -507,7 +507,7 @@ static int adiantum_create(struct crypto
> > if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
> > return -EINVAL;
> >  
> > -   mask = crypto_requires_sync(algt->type, algt->mask);
> > +   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
> >  
> > inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL);
> > if (!inst)
> 
> This is still missing setting the flags correctly on the template instance 
> being
> constructed.  The flags need to be inherited from all "inner" algorithms:
> 
>   inst->alg.base.cra_flags = (streamcipher_alg->base.cra_flags |
>   blockcipher_alg->cra_flags |
>   hash_alg->base.cra_flags) &
>  CRYPTO_ALG_INHERITED_FLAGS;
> 
> If we don't do that, the template instance may allocate memory but not have
> CRYPTO_ALG_ALLOCATES_MEMORY set.

OK


> > Index: linux-2.6/crypto/pcrypt.c
> > ===
> > --- linux-2.6.orig/crypto/pcrypt.c  2020-06-29 16:03:07.346417000 +0200
> > +++ linux-2.6/crypto/pcrypt.c   2020-06-30 15:47:32.776417000 +0200
> > @@ -263,7 +263,9 @@ static int pcrypt_create_aead(struct cry
> > if (err)
> > goto err_free_inst;
> >  
> > -   inst->alg.base.cra_flags = CRYPTO_ALG_ASYNC;
> > +   inst->alg.base.cra_flags =
> > +   CRYPTO_ALG_ASYNC |
> > +   (alg->base.cra_flags & CRYPTO_ALG_INHERITED_FLAGS);
> >  
> > inst->alg.ivsize = crypto_aead_alg_ivsize(alg);
> > inst->alg.maxauthsize = crypto_aead_alg_maxauthsize(alg);

What's wrong with this?


> And this is still missing setting the mask:
> 
> diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
> index 7240e8bbdebb..643f7f1cc91c 100644
> --- a/crypto/pcrypt.c
> +++ b/crypto/pcrypt.c
> @@ -232,12 +232,15 @@ static int pcrypt_create_aead(struct crypto_template 
> *tmpl, struct rtattr **tb,
>   struct crypto_attr_type *algt;
>   struct aead_instance *inst;
>   struct aead_alg *alg;
> + u32 mask;
>   int err;
>  
>   algt = crypto_get_attr_type(tb);
>   if (IS_ERR(algt))
>   return PTR_ERR(algt);
>  
> + mask = crypto_alg_inherited_mask(algt->type, algt->mask);
> +
>   inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
>   if (!inst)
>   return -ENOMEM;
> @@ -254,7 +257,7 @@ static int pcrypt_create_aead(struct crypto_template 
> *tmpl, struct rtattr **tb,
>   goto err_free_inst;
>  
>   err = crypto_grab_aead(&ctx->spawn, aead_crypto_instance(inst),
> -crypto_attr_alg_name(tb[1]), 0, 0);
> +crypto_attr_alg_name(tb[1]), 0, mask);
>   if (err)
>   goto err_free_inst;
>  

I added "mask" there - but there is still a "mask" argument that is 
unused - is it a bug to have two "mask" variables?


> Can you please think logically about what you're trying to accomplish?

I am not an expert in crypto code.

> In particular, if someone requests an algorithm that doesn't allocate memory
> (which is indicated by type=0, mask=CRYPTO_ALG_ALLOCATES_MEMORY), that request
> needs to be honored.
> 
> - Eric

Mikulas
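
Eric's closing point -- that a request with type = 0 and 
mask = CRYPTO_ALG_ALLOCATES_MEMORY must yield an algorithm that does not 
allocate memory -- rests on how the crypto API matches an algorithm against a 
(type, mask) pair: the candidate's flags must agree with 'type' on every bit 
selected by 'mask'. A sketch of that test with an illustrative flag value:

```c
#define ALG_ALLOCATES_MEMORY 0x00010000u	/* illustrative value */

/* Match rule used throughout the crypto API (cf. the
 * "(algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask" checks in the
 * patches above): the algorithm's flags must equal 'type' on every
 * bit selected by 'mask'. */
static int alg_matches(unsigned int alg_flags, unsigned int type,
		       unsigned int mask)
{
	return ((alg_flags ^ type) & mask) == 0;
}
```

With type = 0 and mask = ALG_ALLOCATES_MEMORY, only algorithms with the flag 
clear match -- which is why a template instance that allocates memory but 
fails to set the flag breaks the contract.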



Re: [PATCH v3] dm writecache: reject asynchronous pmem.

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Mikulas Patocka wrote:

> 
> 
> On Tue, 30 Jun 2020, Michal Suchanek wrote:
> 
> > The writecache driver does not handle asynchronous pmem. Reject it when
> > supplied as cache.
> > 
> > Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> > Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> > 
> > Signed-off-by: Michal Suchanek 
> 
> Acked-by: Mikulas Patocka 

BTW, we should also add

Cc: sta...@vger.kernel.org  # v5.3+

> > ---
> >  drivers/md/dm-writecache.c | 6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > index 30505d70f423..5358894bb9fd 100644
> > --- a/drivers/md/dm-writecache.c
> > +++ b/drivers/md/dm-writecache.c
> > @@ -2266,6 +2266,12 @@ static int writecache_ctr(struct dm_target *ti, 
> > unsigned argc, char **argv)
> > }
> >  
> > if (WC_MODE_PMEM(wc)) {
> > +   if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> > +   r = -EOPNOTSUPP;
> > +   ti->error = "Asynchronous persistent memory not 
> > supported as pmem cache";
> > +   goto bad;
> > +   }
> > +
> > r = persistent_memory_claim(wc);
> > if (r) {
> > ti->error = "Unable to map persistent memory for cache";
> > -- 
> > 2.26.2
> > 
> 



Re: [PATCH v3] dm writecache: reject asynchronous pmem.

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Michal Suchanek wrote:

> The writecache driver does not handle asynchronous pmem. Reject it when
> supplied as cache.
> 
> Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> 
> Signed-off-by: Michal Suchanek 

Acked-by: Mikulas Patocka 

> ---
>  drivers/md/dm-writecache.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 30505d70f423..5358894bb9fd 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -2266,6 +2266,12 @@ static int writecache_ctr(struct dm_target *ti, 
> unsigned argc, char **argv)
>   }
>  
>   if (WC_MODE_PMEM(wc)) {
> + if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> + r = -EOPNOTSUPP;
> + ti->error = "Asynchronous persistent memory not 
> supported as pmem cache";
> + goto bad;
> + }
> +
>   r = persistent_memory_claim(wc);
>   if (r) {
>   ti->error = "Unable to map persistent memory for cache";
> -- 
> 2.26.2
> 



Re: [PATCH v2] dm writecache: reject asynchronous pmem.

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Michal Suchanek wrote:

> The writecache driver does not handle asynchronous pmem. Reject it when
> supplied as cache.
> 
> Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> 
> Signed-off-by: Michal Suchanek 

OK. I suggest moving this test before persistent_memory_claim() (so that 
you don't try to map the whole device first).

Mikulas

> ---
>  drivers/md/dm-writecache.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 30505d70f423..1e4f37249e28 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -2271,6 +2271,12 @@ static int writecache_ctr(struct dm_target *ti, 
> unsigned argc, char **argv)
>   ti->error = "Unable to map persistent memory for cache";
>   goto bad;
>   }
> +
> + if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> + r = -EOPNOTSUPP;
> + ti->error = "Asynchronous persistent memory not 
> supported as pmem cache";
> + goto bad;
> + }
>   } else {
>   size_t n_blocks, n_metadata_blocks;
>   uint64_t n_bitmap_bits;
> -- 
> 2.26.2
> 



[PATCH 1/3 v4] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |2 +-
 crypto/authenc.c  |7 ---
 crypto/authencesn.c   |7 ---
 crypto/ccm.c  |8 
 crypto/chacha20poly1305.c |7 ---
 crypto/cryptd.c   |   16 +++-
 crypto/ctr.c  |4 ++--
 crypto/cts.c  |4 ++--
 crypto/essiv.c|4 ++--
 crypto/gcm.c  |   15 ---
 crypto/geniv.c|4 ++--
 crypto/lrw.c  |4 ++--
 crypto/pcrypt.c   |4 +++-
 crypto/rsa-pkcs1pad.c |4 ++--
 crypto/seqiv.c|2 ++
 crypto/xts.c  |4 ++--
 include/crypto/algapi.h   |   10 ++
 include/linux/crypto.h|   15 +++
 18 files changed, 76 insertions(+), 45 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h   2020-06-29 16:03:07.346417000 
+0200
+++ linux-2.6/include/linux/crypto.h2020-06-29 16:03:07.336417000 +0200
@@ -102,6 +102,21 @@
 #define CRYPTO_NOLOAD  0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY	0x0001
+
+/*
+ * When an algorithm uses another algorithm (e.g., if it's an instance of a
+ * template), these are the flags that always get set on the "outer" algorithm
+ * if any "inner" algorithm has them set.  In some cases other flags are
+ * inherited too; these are just the flags that are *always* inherited.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS	(CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY	0x0001
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c 2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authenc.c  2020-06-30 15:47:56.516417000 +0200
@@ -388,7 +388,7 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -423,8 +423,9 @@ static int crypto_authenc_create(struct
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c  2020-06-29 16:03:07.346417000 +0200
+++ linux-2.6/crypto/authencesn.c   2020-06-30 15:48:11.996417000 +0200
@@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_alg_inherited_mask(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -437,8 +437,9 @@ static int crypto_authenc_esn_create(str
 enc->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags =
+   (auth_base->cra_flags | enc->base.cra_flags) &
+   CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
   

Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka



On Sun, 28 Jun 2020, Eric Biggers wrote:

> On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote:
> > 
> > > Also, "seqiv" instances can be created without 
> > > CRYPTO_ALG_ALLOCATES_MEMORY set,
> > > despite seqiv_aead_encrypt() allocating memory.
> > > 
> 
> This comment wasn't addressed.
> 
> - Eric

I've sent version 4 of the patch that adds CRYPTO_ALG_ALLOCATES_MEMORY to 
seqiv.

Mikulas



Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-30 Thread Mikulas Patocka



On Mon, 29 Jun 2020, Mikulas Patocka wrote:

> On Sun, 28 Jun 2020, Eric Biggers wrote:
> 
> > On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote:
> > > > 
> > > > cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), 
> > > > and
> > > > crypto_rfc4309_create() are also missing setting the mask.
> > > > 
> > > > pcrypt_create_aead() is missing both setting the mask and inheriting 
> > > > the flags.

pcrypt_create_aead() doesn't use its "mask" and "type" arguments at all.

Mikulas



Re: [PATCH] dm writecache: reject asynchronous pmem.

2020-06-30 Thread Mikulas Patocka



On Tue, 30 Jun 2020, Michal Suchanek wrote:

> The writecache driver does not handle asynchronous pmem. Reject it when
> supplied as cache.
> 
> Link: https://lore.kernel.org/linux-nvdimm/87lfk5hahc@linux.ibm.com/
> Fixes: 6e84200c0a29 ("virtio-pmem: Add virtio pmem driver")
> 
> Signed-off-by: Michal Suchanek 
> ---
>  drivers/md/dm-writecache.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 30505d70f423..57b0a972f6fd 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -2277,6 +2277,12 @@ static int writecache_ctr(struct dm_target *ti, 
> unsigned argc, char **argv)
>  
>   wc->memory_map_size -= (uint64_t)wc->start_sector << 
> SECTOR_SHIFT;
>  
> + if (!dax_synchronous(wc->ssd_dev->dax_dev)) {
> + r = -EOPNOTSUPP;
> + ti->error = "Asynchronous persistent memory not 
> supported as pmem cache";
> + goto bad;
> + }
> +
>   bio_list_init(>flush_list);
>   wc->flush_thread = kthread_create(writecache_flush_thread, wc, 
> "dm_writecache_flush");
>   if (IS_ERR(wc->flush_thread)) {
> -- 

Hi

Shouldn't this be in the "if (WC_MODE_PMEM(wc))" block?

WC_MODE_PMEM(wc) returns true if we are using persistent memory as the 
cache device; otherwise we are using a generic block device as the cache 
device.

Mikulas



Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-29 Thread Mikulas Patocka



On Sun, 28 Jun 2020, Eric Biggers wrote:

> On Sun, Jun 28, 2020 at 03:07:49PM -0400, Mikulas Patocka wrote:
> > > 
> > > cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), and
> > > crypto_rfc4309_create() are also missing setting the mask.
> > > 
> > > pcrypt_create_aead() is missing both setting the mask and inheriting the 
> > > flags.
> > 
> > I added CRYPTO_ALG_ALLOCATES_MEMORY there.
> 
> I don't see where the cryptd request processing functions allocate memory.
> 
> It seems that cryptd should just inherit the flag, like most other templates.
> 
> Likewise for pcrypt.
> 
> And also likewise for rfc4309.
> 
> Where are you seeing the memory allocations that would require
> CRYPTO_ALG_ALLOCATES_MEMORY to always be enabled for these?
> 
> - Eric

This was a misunderstanding. You said "cryptd_create_skcipher ... is 
missing both setting the mask and inheriting the flags", so I understood 
it to mean that it should inherit CRYPTO_ALG_INHERITED_FLAGS and set 
CRYPTO_ALG_ALLOCATES_MEMORY unconditionally.

Mikulas



Re: [PATCH 0/6] Overhaul memalloc_no*

2020-06-29 Thread Mikulas Patocka



On Mon, 29 Jun 2020, Dave Chinner wrote:

> On Sat, Jun 27, 2020 at 09:09:09AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Sat, 27 Jun 2020, Dave Chinner wrote:
> > 
> > > On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > > > Hi
> > > > 
> > > > I suggest to join memalloc_noio and memalloc_nofs into just one flag 
> > > > that 
> > > > prevents both filesystem recursion and i/o recursion.
> > > > 
> > > > Note that any I/O can recurse into a filesystem via the loop device, 
> > > > thus 
> > > > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is 
> > > > set 
> > > > and PF_MEMALLOC_NOIO is not set.
> > > 
> > > Correct me if I'm wrong, but I think that will prevent swapping from
> > > GFP_NOFS memory reclaim contexts.
> > 
> > Yes.
> > 
> > > IOWs, this will substantially
> > > change the behaviour of the memory reclaim system under sustained
> > > GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> > > quite common, so I really don't think we want to telling memory
> > > reclaim "you can't do IO at all" when all we are trying to do is
> > > prevent recursion back into the same filesystem.
> > 
> > So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.
> 
> Uh, why?
> 
> Exactly what problem are you trying to solve here?

This:

1. The filesystem does a GFP_NOFS allocation.
2. The allocation calls directly a dm-bufio shrinker.
3. The dm-bufio shrinker sees that there is __GFP_IO set, so it assumes 
   that it can do I/O. It selects some dirty buffers, writes them back and 
   waits for the I/O to finish.
4. The dirty buffers belong to a loop device.
5. The loop device thread calls the filesystem that did the GFP_NOFS 
   allocation in step 1 (and that is still waiting for the allocation to 
   succeed).

Note that setting PF_MEMALLOC_NOIO on the loop thread won't help with this 
deadlock.

Do you argue that this is a bug in dm-bufio? Or a bug in the kernel? Or 
that it can't happen?

> > I saw this deadlock in the past in the dm-bufio subsystem - see the commit 
> > 9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.
> 
> 2014?
> 
> /me looks closer.
> 
> Hmmm. Only sent to dm-devel, no comments, no review, just merged.
> No surprise that nobody else actually knows about this commit. Well,
> time to review it ~6 years after it was merged
> 
> | dm-bufio tested for __GFP_IO. However, dm-bufio can run on a loop block
> | device that makes calls into the filesystem. If __GFP_IO is present and
> | __GFP_FS isn't, dm-bufio could still block on filesystem operations if it
> | runs on a loop block device.
> 
> OK, so from an architectural POV, this commit is fundamentally
> broken - block/device layer allocation should not allow reclaim
> recursion into filesystems because filesystems are dependent on
> the block layer making forwards progress. This commit is trying to
> work around the loop device doing GFP_KERNEL/GFP_NOFS context
> allocation in the back-end IO path of the loop device. This part of the
> loop device is a block device, so needs to run under GFP_NOIO
> context.

I agree that it is broken, but it fixes the above deadlock.

Mikulas



Re: [dm-devel] [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-28 Thread Mikulas Patocka



On Fri, 26 Jun 2020, Eric Biggers wrote:

> On Fri, Jun 26, 2020 at 09:46:17AM -0700, Eric Biggers wrote:
> > On Fri, Jun 26, 2020 at 12:16:33PM -0400, Mikulas Patocka wrote:
> > > +/*
> > > + * Pass these flags down through the crypto API.
> > > + */
> > > +#define CRYPTO_ALG_INHERITED_FLAGS   (CRYPTO_ALG_ASYNC | 
> > > CRYPTO_ALG_ALLOCATES_MEMORY)
> > 
> > This comment is useless.  How about:
> > 
> > /*
> >  * When an algorithm uses another algorithm (e.g., if it's an instance of a
> >  * template), these are the flags that always get set on the "outer" 
> > algorithm
> >  * if any "inner" algorithm has them set.  In some cases other flags are
> >  * inherited too; these are just the flags that are *always* inherited.
> >  */
> > #define CRYPTO_ALG_INHERITED_FLAGS  (CRYPTO_ALG_ASYNC | 
> > CRYPTO_ALG_ALLOCATES_MEMORY)
> > 
> > Also I wonder about the case where the inner algorithm is a fallback rather 
> > than
> > part of a template instance.  This patch only handles templates, not 
> > fallbacks.
> > Is that intentional?  Isn't that technically a bug?
> 
> Also is CRYPTO_ALG_ALLOCATES_MEMORY meant to apply for algorithms of type
> "cipher" and "shash"?  The code doesn't handle those, so presumably not?
> 
> What about "akcipher"?

Yes - the patch should apply to these cases too, but I don't know how to 
do it. Please do it.

> > > Index: linux-2.6/crypto/xts.c
> > > ===
> > > --- linux-2.6.orig/crypto/xts.c   2020-06-26 17:24:03.566417000 +0200
> > > +++ linux-2.6/crypto/xts.c2020-06-26 17:24:03.566417000 +0200
> > > @@ -415,7 +415,7 @@ static int create(struct crypto_template
> > >   } else
> > >   goto err_free_inst;
> > >  
> > > - inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
> > > + inst->alg.base.cra_flags = alg->base.cra_flags & 
> > > CRYPTO_ALG_INHERITED_FLAGS;
> > >   inst->alg.base.cra_priority = alg->base.cra_priority;
> > >   inst->alg.base.cra_blocksize = XTS_BLOCK_SIZE;
> > >   inst->alg.base.cra_alignmask = alg->base.cra_alignmask |
> > 
> > Need to set the mask correctly in this file.
> 
> cryptd_create_skcipher(), cryptd_create_hash(), cryptd_create_aead(), and
> crypto_rfc4309_create() are also missing setting the mask.
> 
> pcrypt_create_aead() is missing both setting the mask and inheriting the 
> flags.

I added CRYPTO_ALG_ALLOCATES_MEMORY there.

> Also, "seqiv" instances can be created without CRYPTO_ALG_ALLOCATES_MEMORY 
> set,
> despite seqiv_aead_encrypt() allocating memory.
> 
> - Eric

Mikulas



[PATCH 1/3 v3] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-28 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |2 +-
 crypto/authenc.c  |4 ++--
 crypto/authencesn.c   |4 ++--
 crypto/ccm.c  |9 +
 crypto/chacha20poly1305.c |4 ++--
 crypto/cryptd.c   |   12 +---
 crypto/ctr.c  |4 ++--
 crypto/cts.c  |4 ++--
 crypto/essiv.c|4 ++--
 crypto/gcm.c  |   12 ++--
 crypto/geniv.c|4 ++--
 crypto/lrw.c  |4 ++--
 crypto/pcrypt.c   |4 +++-
 crypto/rsa-pkcs1pad.c |4 ++--
 crypto/xts.c  |2 +-
 include/crypto/algapi.h   |   10 ++
 include/linux/crypto.h|   15 +++
 17 files changed, 64 insertions(+), 38 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h   2020-06-26 17:24:03.566417000 
+0200
+++ linux-2.6/include/linux/crypto.h2020-06-28 17:55:21.976417000 +0200
@@ -102,6 +102,21 @@
 #define CRYPTO_NOLOAD  0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001
+
+/*
+ * When an algorithm uses another algorithm (e.g., if it's an instance of a
+ * template), these are the flags that always get set on the "outer" algorithm
+ * if any "inner" algorithm has them set.  In some cases other flags are
+ * inherited too; these are just the flags that are *always* inherited.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | 
CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY0x0001
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/authenc.c  2020-06-28 20:33:29.136417000 +0200
@@ -388,7 +388,7 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_inherited(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -424,7 +424,7 @@ static int crypto_authenc_create(struct
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c  2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/authencesn.c   2020-06-28 20:33:32.556417000 +0200
@@ -406,7 +406,7 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_inherited(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -438,7 +438,7 @@ static int crypto_authenc_esn_create(str
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/ccm.c
===
--- linux-2.6.orig/crypto/ccm.c 2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/ccm.c  2020-06-28 20:43:42.456417000 +0200
@@ -462,7 +462,7 @@

Re: [PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-28 Thread Mikulas Patocka



On Fri, 26 Jun 2020, Eric Biggers wrote:

> On Fri, Jun 26, 2020 at 12:16:33PM -0400, Mikulas Patocka wrote:
> > +/*
> > + * Pass these flags down through the crypto API.
> > + */
> > +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | 
> > CRYPTO_ALG_ALLOCATES_MEMORY)
> 
> This comment is useless.  How about:
> 
> /*
>  * When an algorithm uses another algorithm (e.g., if it's an instance of a
>  * template), these are the flags that always get set on the "outer" algorithm
>  * if any "inner" algorithm has them set.  In some cases other flags are
>  * inherited too; these are just the flags that are *always* inherited.
>  */
> #define CRYPTO_ALG_INHERITED_FLAGS(CRYPTO_ALG_ASYNC | 
> CRYPTO_ALG_ALLOCATES_MEMORY)
> 
> Also I wonder about the case where the inner algorithm is a fallback rather 
> than
> part of a template instance.  This patch only handles templates, not 
> fallbacks.
> Is that intentional?  Isn't that technically a bug?

I'm not an expert in crypto internals, so I don't know. I'll send version 
3 of this patch and I'd like to ask you or Herbert to fix it.

> > +
> > +/*
> >   * Transform masks and values (for crt_flags).
> >   */
> >  #define CRYPTO_TFM_NEED_KEY0x0001
> > Index: linux-2.6/crypto/authenc.c
> > ===
> > --- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200
> > +++ linux-2.6/crypto/authenc.c  2020-06-26 17:24:03.566417000 +0200
> > @@ -388,7 +388,8 @@ static int crypto_authenc_create(struct
> > if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
> > return -EINVAL;
> >  
> > -   mask = crypto_requires_sync(algt->type, algt->mask);
> > +   mask = crypto_requires_sync(algt->type, algt->mask) |
> > +  crypto_requires_nomem(algt->type, algt->mask);
> 
> As I suggested earlier, shouldn't there be a function that returns the mask 
> for
> all inherited flags, rather than handling each flag individually?

Yes - I've created crypto_requires_inherited for this purpose.

> >  
> > inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
> > if (!inst)
> > @@ -424,7 +425,7 @@ static int crypto_authenc_create(struct
> > goto err_free_inst;
> >  
> > inst->alg.base.cra_flags = (auth_base->cra_flags |
> > -   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
> > +   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
> 
> Strange indentation here.  Likewise in most of the other files.

I was told that the code should be at most 80 characters wide.

> > Index: linux-2.6/crypto/xts.c
> > ===
> > --- linux-2.6.orig/crypto/xts.c 2020-06-26 17:24:03.566417000 +0200
> > +++ linux-2.6/crypto/xts.c  2020-06-26 17:24:03.566417000 +0200
> > @@ -415,7 +415,7 @@ static int create(struct crypto_template
> > } else
> > goto err_free_inst;
> >  
> > -   inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
> > +   inst->alg.base.cra_flags = alg->base.cra_flags & 
> > CRYPTO_ALG_INHERITED_FLAGS;
> > inst->alg.base.cra_priority = alg->base.cra_priority;
> > inst->alg.base.cra_blocksize = XTS_BLOCK_SIZE;
> > inst->alg.base.cra_alignmask = alg->base.cra_alignmask |
> 
> Need to set the mask correctly in this file.

I don't know what you mean.

> > Index: linux-2.6/crypto/adiantum.c
> > ===
> > --- linux-2.6.orig/crypto/adiantum.c2020-06-26 17:24:03.566417000 
> > +0200
> > +++ linux-2.6/crypto/adiantum.c 2020-06-26 17:24:03.566417000 +0200
> > @@ -507,7 +507,8 @@ static int adiantum_create(struct crypto
> > if ((algt->type ^ CRYPTO_ALG_TYPE_SKCIPHER) & algt->mask)
> > return -EINVAL;
> >  
> > -   mask = crypto_requires_sync(algt->type, algt->mask);
> > +   mask = crypto_requires_sync(algt->type, algt->mask) |
> > +  crypto_requires_nomem(algt->type, algt->mask);
> >  
> > inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL);
> > if (!inst)
> 
> Need to use CRYPTO_ALG_INHERITED_FLAGS in this file.

OK.

> - Eric

Mikulas



Re: [PATCH 0/6] Overhaul memalloc_no*

2020-06-27 Thread Mikulas Patocka



On Sat, 27 Jun 2020, Dave Chinner wrote:

> On Fri, Jun 26, 2020 at 11:02:19AM -0400, Mikulas Patocka wrote:
> > Hi
> > 
> > I suggest to join memalloc_noio and memalloc_nofs into just one flag that 
> > prevents both filesystem recursion and i/o recursion.
> > 
> > Note that any I/O can recurse into a filesystem via the loop device, thus 
> > it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
> > and PF_MEMALLOC_NOIO is not set.
> 
> Correct me if I'm wrong, but I think that will prevent swapping from
> GFP_NOFS memory reclaim contexts.

Yes.

> IOWs, this will substantially
> change the behaviour of the memory reclaim system under sustained
> GFP_NOFS memory pressure. Sustained GFP_NOFS memory pressure is
> quite common, so I really don't think we want to telling memory
> reclaim "you can't do IO at all" when all we are trying to do is
> prevent recursion back into the same filesystem.

So, we can define __GFP_ONLY_SWAP_IO and __GFP_IO.

> Given that the loop device IO path already operates under
> memalloc_noio context, (i.e. the recursion restriction is applied in
> only the context that needs is) I see no reason for making that a
> global reclaim limitation

I think this is a problem.

Suppose that a filesystem does a GFP_NOFS allocation, the allocation 
triggers an I/O and waits for it to finish, and the loop device driver 
redirects that I/O to the same filesystem that did the GFP_NOFS allocation.

I saw this deadlock in the past in the dm-bufio subsystem - see the commit 
9d28eb12447ee08bb5d1e8bb3195cf20e1ecd1c0 that fixed it.

Other subsystems that do IO in GFP_NOFS context may deadlock just like 
bufio.

Mikulas



[PATCH 1/3 v2] crypto: introduce the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-26 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |3 ++-
 crypto/authenc.c  |5 +++--
 crypto/authencesn.c   |5 +++--
 crypto/ccm.c  |7 ---
 crypto/chacha20poly1305.c |5 +++--
 crypto/cryptd.c   |9 ++---
 crypto/ctr.c  |3 ++-
 crypto/cts.c  |5 +++--
 crypto/essiv.c|5 +++--
 crypto/gcm.c  |   15 +--
 crypto/geniv.c|3 ++-
 crypto/lrw.c  |5 +++--
 crypto/rsa-pkcs1pad.c |5 +++--
 crypto/xts.c  |2 +-
 include/crypto/algapi.h   |9 +
 include/linux/crypto.h|   12 
 16 files changed, 68 insertions(+), 30 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h   2020-06-26 17:24:03.566417000 
+0200
+++ linux-2.6/include/linux/crypto.h2020-06-26 17:25:28.066417000 +0200
@@ -102,6 +102,18 @@
 #define CRYPTO_NOLOAD  0x8000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY0x0001
+
+/*
+ * Pass these flags down through the crypto API.
+ */
+#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | 
CRYPTO_ALG_ALLOCATES_MEMORY)
+
+/*
  * Transform masks and values (for crt_flags).
  */
 #define CRYPTO_TFM_NEED_KEY0x0001
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c 2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/authenc.c  2020-06-26 17:24:03.566417000 +0200
@@ -388,7 +388,8 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -424,7 +425,7 @@ static int crypto_authenc_create(struct
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c  2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/authencesn.c   2020-06-26 17:24:03.566417000 +0200
@@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/ccm.c
===
--- linux-2.6.orig/crypto/ccm.c 2020-06-26 17:24:03.566417000 +0200
+++ linux-2.6/crypto/ccm.c  2020-06-26 17:24:03.566417000 +0200
@@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->t

Re: [PATCH 1/3] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-26 Thread Mikulas Patocka



On Fri, 26 Jun 2020, Herbert Xu wrote:

> On Wed, Jun 17, 2020 at 11:09:28AM -0400, Mikulas Patocka wrote:
> >
> > Index: linux-2.6/include/linux/crypto.h
> > ===
> > --- linux-2.6.orig/include/linux/crypto.h
> > +++ linux-2.6/include/linux/crypto.h
> > @@ -97,9 +97,18 @@
> >  #define CRYPTO_ALG_OPTIONAL_KEY0x4000
> >  
> >  /*
> > + * The driver may allocate memory during request processing, so it 
> > shouldn't be
> > + * used in cases where memory allocation failures aren't acceptable, such 
> > as
> > + * during block device encryption.
> > + */
> > +#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000
> > +
> > +/*
> >   * Don't trigger module loading
> >   */
> > -#define CRYPTO_NOLOAD  0x8000
> > +#define CRYPTO_NOLOAD  0x0001
> > +
> > +#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | 
> > CRYPTO_ALG_ALLOCATES_MEMORY)
> 
> Any reason why you need to renumber NOLOAD? If not please keep
> the existing values.

There is no hard reason for that. CRYPTO_NOLOAD is a "virtual" flag that 
can only be present in crypto algorithm requests, and I thought the 
intention was that virtual flags go after real flags. If you don't want 
to change the existing flags, that's no problem.

Mikulas

> Thanks,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 



Re: [PATCH 0/6] Overhaul memalloc_no*

2020-06-26 Thread Mikulas Patocka
Hi

I suggest joining memalloc_noio and memalloc_nofs into a single flag that 
prevents both filesystem recursion and i/o recursion.

Note that any I/O can recurse into a filesystem via the loop device, thus 
it doesn't make much sense to have a context where PF_MEMALLOC_NOFS is set 
and PF_MEMALLOC_NOIO is not set.

Mikulas

On Thu, 25 Jun 2020, Matthew Wilcox (Oracle) wrote:

> I want a memalloc_nowait like we have memalloc_noio and memalloc_nofs
> for an upcoming patch series, and Jens also wants it for non-blocking
> io_uring.  It turns out we already have dm-bufio which could benefit
> from memalloc_nowait, so it may as well go into the tree now.
> 
> The biggest problem is that we're basically out of PF_ flags, so we need
> to find somewhere else to store the PF_MEMALLOC_NOWAIT flag.  It turns
> out the PF_ flags are really supposed to be used for flags which are
> accessed from other tasks, and the MEMALLOC flags are only going to
> be used by this task.  So shuffling everything around frees up some PF
> flags and generally makes the world a better place.
> 
> Patch series also available from
> http://git.infradead.org/users/willy/linux.git/shortlog/refs/heads/memalloc
> 
> Matthew Wilcox (Oracle) (6):
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_noio
>   mm: Add become_kswapd and restore_kswapd
>   xfs: Convert to memalloc_nofs_save
>   mm: Replace PF_MEMALLOC_NOFS with memalloc_nofs
>   mm: Replace PF_MEMALLOC_NOIO with memalloc_nocma
>   mm: Add memalloc_nowait
> 
>  drivers/block/loop.c   |  3 +-
>  drivers/md/dm-bufio.c  | 30 
>  drivers/md/dm-zoned-metadata.c |  5 +-
>  fs/iomap/buffered-io.c |  2 +-
>  fs/xfs/kmem.c  |  2 +-
>  fs/xfs/libxfs/xfs_btree.c  | 14 +++---
>  fs/xfs/xfs_aops.c  |  4 +-
>  fs/xfs/xfs_buf.c   |  2 +-
>  fs/xfs/xfs_linux.h |  6 ---
>  fs/xfs/xfs_trans.c | 14 +++---
>  fs/xfs/xfs_trans.h |  2 +-
>  include/linux/sched.h  |  7 +--
>  include/linux/sched/mm.h   | 84 ++
>  kernel/sys.c   |  8 ++--
>  mm/vmscan.c| 16 +--
>  15 files changed, 105 insertions(+), 94 deletions(-)
> 
> -- 
> 2.27.0
> 



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-20 Thread Mikulas Patocka



On Sat, 20 Jun 2020, Herbert Xu wrote:

> On Fri, Jun 19, 2020 at 02:39:39PM -0400, Mikulas Patocka wrote:
> >
> > I'm looking at this and I'd like to know why does the crypto API fail in 
> > hard-irq context and why does it work in tasklet context. What's the exact 
> > reason behind this?
> 
> You're not supposed to do any real work in IRQ handlers.  All
> the substantial work should be postponed to softirq context.

I see.

BTW - should it also warn if it is running in a process context with 
interrupts disabled?

Mikulas

> Why do you need to do work in hard IRQ context?
> 
> Cheers,
> -- 
> Email: Herbert Xu 
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> 



Re: [RFC PATCH 0/1] dm-crypt excessive overhead

2020-06-19 Thread Mikulas Patocka



On Fri, 19 Jun 2020, Mike Snitzer wrote:

> On Fri, Jun 19 2020 at 12:41pm -0400,
> Ignat Korchagin  wrote:
> 
> > This is a follow up from the long-forgotten [1], but with some more 
> > convincing
> > evidence. Consider the following script:
> > 
> > [1]: https://www.spinics.net/lists/dm-crypt/msg07516.html
> > [2]: https://blog.cloudflare.com/speeding-up-linux-disk-encryption/
> > 
> > Ignat Korchagin (1):
> >   Add DM_CRYPT_FORCE_INLINE flag to dm-crypt target
> > 
> >  drivers/md/dm-crypt.c | 55 +--
> >  1 file changed, 43 insertions(+), 12 deletions(-)
> > 
> > -- 
> > 2.20.1
> > 
> 
> Hi,
> 
> I saw [2] and have been expecting something from cloudflare ever since.
> Nice to see this submission.
> 
> There is useful context in your 0th patch header.  I'll likely merge
> parts of this patch header with the more terse 1/1 header (reality is
> there only needed to be a single patch submission).
> 
> Will review and stage accordingly if all looks fine to me.  Mikulas,
> please have a look too.
> 
> Thanks,
> Mike

+   if (test_bit(DM_CRYPT_FORCE_INLINE, &cc->flags)) {
+   if (in_irq()) {
+   /* Crypto API will fail hard in hard IRQ context */
+   tasklet_init(&io->tasklet, kcryptd_crypt_tasklet,
+(unsigned long)&io->work);
+   tasklet_schedule(&io->tasklet);
+   } else
+   kcryptd_crypt(&io->work);
+   } else {
+   INIT_WORK(&io->work, kcryptd_crypt);
+   queue_work(cc->crypt_queue, &io->work);
+   }

I'm looking at this and I'd like to know why the crypto API fails in 
hard-irq context but works in tasklet context. What's the exact reason 
behind this?

Mikulas



[PATCH 3/3] dm-crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-17 Thread Mikulas Patocka
Don't use crypto drivers that have the flag CRYPTO_ALG_ALLOCATES_MEMORY
set. These drivers allocate memory and thus they are unsuitable for block
I/O processing.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-crypt.c |   17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/md/dm-crypt.c
===
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -419,7 +419,8 @@ static int crypt_iv_lmk_ctr(struct crypt
return -EINVAL;
}
 
-   lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
+   lmk->hash_tfm = crypto_alloc_shash("md5", 0,
+  CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(lmk->hash_tfm)) {
ti->error = "Error initializing LMK hash";
return PTR_ERR(lmk->hash_tfm);
@@ -581,7 +582,8 @@ static int crypt_iv_tcw_ctr(struct crypt
return -EINVAL;
}
 
-   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
+   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0,
+   CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(tcw->crc32_tfm)) {
ti->error = "Error initializing CRC32 in TCW";
return PTR_ERR(tcw->crc32_tfm);
@@ -768,7 +770,8 @@ static int crypt_iv_elephant_ctr(struct
struct iv_elephant_private *elephant = >iv_gen_private.elephant;
int r;
 
-   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0);
+   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0,
+ CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(elephant->tfm)) {
r = PTR_ERR(elephant->tfm);
elephant->tfm = NULL;
@@ -2088,7 +2091,8 @@ static int crypt_alloc_tfms_skcipher(str
return -ENOMEM;
 
for (i = 0; i < cc->tfms_count; i++) {
-   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 
0);
+   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0,
+   CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(cc->cipher_tfm.tfms[i])) {
err = PTR_ERR(cc->cipher_tfm.tfms[i]);
crypt_free_tfms(cc);
@@ -2114,7 +2118,8 @@ static int crypt_alloc_tfms_aead(struct
if (!cc->cipher_tfm.tfms)
return -ENOMEM;
 
-   cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, 0);
+   cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0,
+   CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(cc->cipher_tfm.tfms_aead[0])) {
err = PTR_ERR(cc->cipher_tfm.tfms_aead[0]);
crypt_free_tfms(cc);
@@ -2565,7 +2570,7 @@ static int crypt_ctr_auth_cipher(struct
return -ENOMEM;
strncpy(mac_alg, start, end - start);
 
-   mac = crypto_alloc_ahash(mac_alg, 0, 0);
+   mac = crypto_alloc_ahash(mac_alg, 0, CRYPTO_ALG_ALLOCATES_MEMORY);
kfree(mac_alg);
 
if (IS_ERR(mac))



[PATCH 2/3] crypto: set the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-17 Thread Mikulas Patocka
Set the flag CRYPTO_ALG_ALLOCATES_MEMORY in the crypto drivers that
allocate memory.

drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c: sun8i_ce_cipher
drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c: sun8i_ss_cipher
drivers/crypto/amlogic/amlogic-gxl-core.c: meson_cipher
drivers/crypto/axis/artpec6_crypto.c: artpec6_crypto_common_init
drivers/crypto/bcm/cipher.c: spu_skcipher_rx_sg_create
drivers/crypto/caam/caamalg.c: aead_edesc_alloc
drivers/crypto/caam/caamalg_qi.c: aead_edesc_alloc
drivers/crypto/caam/caamalg_qi2.c: aead_edesc_alloc
drivers/crypto/caam/caamhash.c: hash_digest_key
drivers/crypto/cavium/cpt/cptvf_algs.c: process_request
drivers/crypto/cavium/nitrox/nitrox_aead.c: nitrox_process_se_request
drivers/crypto/cavium/nitrox/nitrox_skcipher.c: nitrox_process_se_request
drivers/crypto/ccp/ccp-crypto-aes-cmac.c: ccp_do_cmac_update
drivers/crypto/ccp/ccp-crypto-aes-galois.c: ccp_crypto_enqueue_request
drivers/crypto/ccp/ccp-crypto-aes-xts.c: ccp_crypto_enqueue_request
drivers/crypto/ccp/ccp-crypto-aes.c: ccp_crypto_enqueue_request
drivers/crypto/ccp/ccp-crypto-des3.c: ccp_crypto_enqueue_request
drivers/crypto/ccp/ccp-crypto-sha.c: ccp_crypto_enqueue_request
drivers/crypto/chelsio/chcr_algo.c: create_cipher_wr
drivers/crypto/hisilicon/sec/sec_algs.c: sec_alloc_and_fill_hw_sgl
drivers/crypto/hisilicon/sec2/sec_crypto.c: sec_alloc_req_id
drivers/crypto/inside-secure/safexcel_cipher.c: safexcel_queue_req
drivers/crypto/inside-secure/safexcel_hash.c: safexcel_ahash_enqueue
drivers/crypto/ixp4xx_crypto.c: ablk_perform
drivers/crypto/marvell/cesa/cipher.c: mv_cesa_skcipher_dma_req_init
drivers/crypto/marvell/cesa/hash.c: mv_cesa_ahash_dma_req_init
drivers/crypto/marvell/octeontx/otx_cptvf_algs.c: create_ctx_hdr
drivers/crypto/n2_core.c: n2_compute_chunks
drivers/crypto/picoxcell_crypto.c: spacc_sg_to_ddt
drivers/crypto/qat/qat_common/qat_algs.c: qat_alg_skcipher_encrypt
drivers/crypto/qce/skcipher.c: qce_skcipher_async_req_handle
drivers/crypto/talitos.c: talitos_edesc_alloc
drivers/crypto/virtio/virtio_crypto_algs.c: __virtio_crypto_skcipher_do_req
drivers/crypto/xilinx/zynqmp-aes-gcm.c: zynqmp_aes_aead_cipher

Signed-off-by: Mikulas Patocka 

---
 drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c |8 +-
 drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c |8 +-
 drivers/crypto/amlogic/amlogic-gxl-core.c |4 -
 drivers/crypto/axis/artpec6_crypto.c  |   20 +++--
 drivers/crypto/bcm/cipher.c   |   38 +-
 drivers/crypto/caam/caamalg.c |4 -
 drivers/crypto/caam/caamalg_qi.c  |4 -
 drivers/crypto/caam/caamalg_qi2.c |6 -
 drivers/crypto/caam/caamhash.c|2 
 drivers/crypto/cavium/cpt/cptvf_algs.c|   12 +--
 drivers/crypto/cavium/nitrox/nitrox_aead.c|4 -
 drivers/crypto/cavium/nitrox/nitrox_skcipher.c|   16 ++--
 drivers/crypto/ccp/ccp-crypto-aes-cmac.c  |1 
 drivers/crypto/ccp/ccp-crypto-aes-galois.c|1 
 drivers/crypto/ccp/ccp-crypto-aes-xts.c   |1 
 drivers/crypto/ccp/ccp-crypto-aes.c   |2 
 drivers/crypto/ccp/ccp-crypto-des3.c  |1 
 drivers/crypto/ccp/ccp-crypto-sha.c   |1 
 drivers/crypto/chelsio/chcr_algo.c|6 -
 drivers/crypto/hisilicon/sec/sec_algs.c   |   16 ++--
 drivers/crypto/hisilicon/sec2/sec_crypto.c|4 -
 drivers/crypto/inside-secure/safexcel_cipher.c|   47 +
 drivers/crypto/inside-secure/safexcel_hash.c  |   18 +
 drivers/crypto/ixp4xx_crypto.c|6 +
 drivers/crypto/marvell/cesa/cipher.c  |   12 +--
 drivers/crypto/marvell/cesa/hash.c|6 +
 drivers/crypto/marvell/octeontx/otx_cptvf_algs.c  |   30 
 drivers/crypto/n2_core.c  |2 
 drivers/crypto/picoxcell_crypto.c |   17 
 drivers/crypto/qat/qat_common/qat_algs.c  |   12 +--
 drivers/crypto/qce/sha.c  |2 
 drivers/crypto/qce/skcipher.c |1 
 drivers/crypto/talitos.c  |   78 +++---
 drivers/crypto/virtio/virtio_crypto_algs.c|2 
 drivers/crypto/xilinx/zynqmp-aes-gcm.c|1 
 35 files changed, 248 insertions(+), 145 deletions(-)

Index: linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c
===
--- linux-2.6.orig/drivers/crypto/qat/qat_common/qat_algs.c
+++ linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c
@@ -1267,7 +1267,7 @@ static struct aead_alg qat_aeads[] = { {
.cra_name = "authenc(hmac(sha1),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha1",
.cra_priority = 4001,
-   .cra_flags = CRYPTO_ALG_ASYNC,
+   .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_

[PATCH 1/3] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-17 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and pass it down the
crypto stack.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |3 ++-
 crypto/authenc.c  |5 +++--
 crypto/authencesn.c   |5 +++--
 crypto/ccm.c  |7 ---
 crypto/chacha20poly1305.c |5 +++--
 crypto/cryptd.c   |9 ++---
 crypto/ctr.c  |3 ++-
 crypto/cts.c  |5 +++--
 crypto/essiv.c|5 +++--
 crypto/gcm.c  |   15 +--
 crypto/geniv.c|3 ++-
 crypto/lrw.c  |5 +++--
 crypto/rsa-pkcs1pad.c |5 +++--
 crypto/xts.c  |2 +-
 include/crypto/algapi.h   |9 +
 include/linux/crypto.h|   11 ++-
 16 files changed, 66 insertions(+), 31 deletions(-)

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h
+++ linux-2.6/include/linux/crypto.h
@@ -97,9 +97,18 @@
 #define CRYPTO_ALG_OPTIONAL_KEY0x4000
 
 /*
+ * The driver may allocate memory during request processing, so it shouldn't be
+ * used in cases where memory allocation failures aren't acceptable, such as
+ * during block device encryption.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000
+
+/*
  * Don't trigger module loading
  */
-#define CRYPTO_NOLOAD  0x8000
+#define CRYPTO_NOLOAD  0x0001
+
+#define CRYPTO_ALG_INHERITED_FLAGS (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY)
 
 /*
  * Transform masks and values (for crt_flags).
Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c
+++ linux-2.6/crypto/authenc.c
@@ -388,7 +388,8 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -424,7 +425,7 @@ static int crypto_authenc_create(struct
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c
+++ linux-2.6/crypto/authencesn.c
@@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & CRYPTO_ALG_INHERITED_FLAGS;
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/ccm.c
===
--- linux-2.6.orig/crypto/ccm.c
+++ linux-2.6/crypto/ccm.c
@@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL);
if (!inst)
@@ -507,7 +508,7 @@ static int crypto_ccm_create_common(stru
 mac->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)

Re: [dm-devel] [PATCH 1/4] crypto: introduce CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-17 Thread Mikulas Patocka
I'm resending the patches with your comments resolved...

Mikulas



On Tue, 16 Jun 2020, Eric Biggers wrote:

> On Tue, Jun 16, 2020 at 11:01:31AM -0400, Mikulas Patocka wrote:
> > Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and modify dm-crypt, so
> > that it uses only drivers without this flag.
> > 
> > If the flag is set, then the crypto driver allocates memory in its request
> > routine. Such drivers are not suitable for disk encryption because
> > GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
> > GFP_KERNEL allocation can recurse into the block layer, causing a
> > deadlock.
> > 
> > Signed-off-by: Mikulas Patocka 
> > 
> > Index: linux-2.6/include/linux/crypto.h
> > ===
> > --- linux-2.6.orig/include/linux/crypto.h
> > +++ linux-2.6/include/linux/crypto.h
> > @@ -97,9 +97,15 @@
> >  #define CRYPTO_ALG_OPTIONAL_KEY0x4000
> >  
> >  /*
> > + * The driver is allocating emmory in its encrypt or decrypt callback,
> > + * so it should not be used to encrypt block devices.
> > + */
> 
> "is allocating emmory" => "may allocate memory"
> 
> "so it should not be used to encrypt block devices" =>
> "so it shouldn't be used in cases where memory allocation failures aren't
>  acceptable, such as during block device encryption".
> 
> Also, which types of algorithms does this flag apply to?  E.g. if it applies 
> to
> hash algorithms too, it's not sufficient to say "encrypt or decrypt callback".
> 
> How about:
> 
>  /*
>   * The driver may allocate memory during request processing, so it shouldn't 
> be
>   * used in cases where memory allocation failures aren't acceptable, such as
>   * during block device encryption.
>   */
> 
> > +#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000
> > +
> > +/*
> >   * Don't trigger module loading
> >   */
> > -#define CRYPTO_NOLOAD  0x8000
> > +#define CRYPTO_NOLOAD  0x0001
> >  
> >  /*
> >   * Transform masks and values (for crt_flags).
> > Index: linux-2.6/drivers/md/dm-crypt.c
> > ===
> 
> This would better belong as two separate patches: one to introduce
> CRYPTO_ALG_ALLOCATES_MEMORY, and one to make dm-crypt use it.
> 
> > --- linux-2.6.orig/drivers/md/dm-crypt.c
> > +++ linux-2.6/drivers/md/dm-crypt.c
> > @@ -419,7 +419,7 @@ static int crypt_iv_lmk_ctr(struct crypt
> > return -EINVAL;
> > }
> >  
> > -   lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
> > +   lmk->hash_tfm = crypto_alloc_shash("md5", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
> > if (IS_ERR(lmk->hash_tfm)) {
> > ti->error = "Error initializing LMK hash";
> > return PTR_ERR(lmk->hash_tfm);
> > @@ -581,7 +581,7 @@ static int crypt_iv_tcw_ctr(struct crypt
> > return -EINVAL;
> > }
> >  
> > -   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
> > +   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
> > if (IS_ERR(tcw->crc32_tfm)) {
> > ti->error = "Error initializing CRC32 in TCW";
> > return PTR_ERR(tcw->crc32_tfm);
> > @@ -768,7 +768,7 @@ static int crypt_iv_elephant_ctr(struct
> > struct iv_elephant_private *elephant = &cc->iv_gen_private.elephant;
> > int r;
> >  
> > -   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0);
> > +   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
> > if (IS_ERR(elephant->tfm)) {
> > r = PTR_ERR(elephant->tfm);
> > elephant->tfm = NULL;
> > @@ -2088,7 +2088,7 @@ static int crypt_alloc_tfms_skcipher(str
> > return -ENOMEM;
> >  
> > for (i = 0; i < cc->tfms_count; i++) {
> > -   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0);
> > +   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, CRYPTO_ALG_ALLOCATES_MEMORY);
> 
> Despite the recent relaxation in rules, the preferred length of a line is 
> still
> 80 columns.
> 
> - Eric
> 



[PATCH 2/2] hisilicon-crypto: don't sleep if CRYPTO_TFM_REQ_MAY_SLEEP was not specified

2020-06-17 Thread Mikulas Patocka
There is this call chain:
sec_alg_skcipher_encrypt -> sec_alg_skcipher_crypto ->
sec_alg_alloc_and_calc_split_sizes -> kcalloc
where we call a sleeping allocator function even if CRYPTO_TFM_REQ_MAY_SLEEP
was not specified.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org  # v4.19+
Fixes: 915e4e8413da ("crypto: hisilicon - SEC security accelerator driver")

---
 drivers/crypto/hisilicon/sec/sec_algs.c |   34 
 1 file changed, 18 insertions(+), 16 deletions(-)

Index: linux-2.6/drivers/crypto/hisilicon/sec/sec_algs.c
===
--- linux-2.6.orig/drivers/crypto/hisilicon/sec/sec_algs.c
+++ linux-2.6/drivers/crypto/hisilicon/sec/sec_algs.c
@@ -175,7 +175,8 @@ static int sec_alloc_and_fill_hw_sgl(str
 dma_addr_t *psec_sgl,
 struct scatterlist *sgl,
 int count,
-struct sec_dev_info *info)
+struct sec_dev_info *info,
+gfp_t gfp)
 {
struct sec_hw_sgl *sgl_current = NULL;
struct sec_hw_sgl *sgl_next;
@@ -190,7 +191,7 @@ static int sec_alloc_and_fill_hw_sgl(str
sge_index = i % SEC_MAX_SGE_NUM;
if (sge_index == 0) {
sgl_next = dma_pool_zalloc(info->hw_sgl_pool,
-  GFP_KERNEL, &sgl_next_dma);
+  gfp, &sgl_next_dma);
if (!sgl_next) {
ret = -ENOMEM;
goto err_free_hw_sgls;
@@ -545,14 +546,14 @@ void sec_alg_callback(struct sec_bd_info
 }
 
 static int sec_alg_alloc_and_calc_split_sizes(int length, size_t **split_sizes,
- int *steps)
+ int *steps, gfp_t gfp)
 {
size_t *sizes;
int i;
 
/* Split into suitable sized blocks */
*steps = roundup(length, SEC_REQ_LIMIT) / SEC_REQ_LIMIT;
-   sizes = kcalloc(*steps, sizeof(*sizes), GFP_KERNEL);
+   sizes = kcalloc(*steps, sizeof(*sizes), gfp);
if (!sizes)
return -ENOMEM;
 
@@ -568,7 +569,7 @@ static int sec_map_and_split_sg(struct s
int steps, struct scatterlist ***splits,
int **splits_nents,
int sgl_len_in,
-   struct device *dev)
+   struct device *dev, gfp_t gfp)
 {
int ret, count;
 
@@ -576,12 +577,12 @@ static int sec_map_and_split_sg(struct s
if (!count)
return -EINVAL;
 
-   *splits = kcalloc(steps, sizeof(struct scatterlist *), GFP_KERNEL);
+   *splits = kcalloc(steps, sizeof(struct scatterlist *), gfp);
if (!*splits) {
ret = -ENOMEM;
goto err_unmap_sg;
}
-   *splits_nents = kcalloc(steps, sizeof(int), GFP_KERNEL);
+   *splits_nents = kcalloc(steps, sizeof(int), gfp);
if (!*splits_nents) {
ret = -ENOMEM;
goto err_free_splits;
@@ -589,7 +590,7 @@ static int sec_map_and_split_sg(struct s
 
/* output the scatter list before and after this */
ret = sg_split(sgl, count, 0, steps, split_sizes,
-  *splits, *splits_nents, GFP_KERNEL);
+  *splits, *splits_nents, gfp);
if (ret) {
ret = -ENOMEM;
goto err_free_splits_nents;
@@ -630,13 +631,13 @@ static struct sec_request_el
   int el_size, bool different_dest,
   struct scatterlist *sgl_in, int n_ents_in,
   struct scatterlist *sgl_out, int n_ents_out,
-  struct sec_dev_info *info)
+  struct sec_dev_info *info, gfp_t gfp)
 {
struct sec_request_el *el;
struct sec_bd_info *req;
int ret;
 
-   el = kzalloc(sizeof(*el), GFP_KERNEL);
+   el = kzalloc(sizeof(*el), gfp);
if (!el)
return ERR_PTR(-ENOMEM);
el->el_length = el_size;
@@ -668,7 +669,7 @@ static struct sec_request_el
el->sgl_in = sgl_in;
 
ret = sec_alloc_and_fill_hw_sgl(&el->in, &el->dma_in, el->sgl_in,
-   n_ents_in, info);
+   n_ents_in, info, gfp);
if (ret)
goto err_free_el;
 
@@ -679,7 +680,7 @@ static struct sec_request_el
el->sgl_out = sgl_out;
ret = sec_alloc_and_fill_hw_sgl(&el->out, &el->dma_out,
el->sgl_out,
-   n_ents_out, info);
+

[PATCH 1/2] cpt-crypto: don't sleep if CRYPTO_TFM_REQ_MAY_SLEEP was not specified

2020-06-17 Thread Mikulas Patocka
There is this call chain:
cvm_encrypt -> cvm_enc_dec -> cptvf_do_request -> process_request -> kzalloc
where we call a sleeping allocator function even if CRYPTO_TFM_REQ_MAY_SLEEP
was not specified.

Signed-off-by: Mikulas Patocka 
Cc: sta...@vger.kernel.org  # v4.11+
Fixes: c694b233295b ("crypto: cavium - Add the Virtual Function driver for CPT")

---
 drivers/crypto/cavium/cpt/cptvf_algs.c   |1 +
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c |   12 ++--
 drivers/crypto/cavium/cpt/request_manager.h  |2 ++
 3 files changed, 9 insertions(+), 6 deletions(-)

Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c
===
--- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_algs.c
+++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c
@@ -200,6 +200,7 @@ static inline int cvm_enc_dec(struct skc
int status;
 
memset(req_info, 0, sizeof(struct cpt_request_info));
+   req_info->may_sleep = (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) != 0;
memset(fctx, 0, sizeof(struct fc_context));
create_input_list(req, enc, enc_iv_len);
create_output_list(req, enc_iv_len);
Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
===
--- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
+++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
@@ -133,7 +133,7 @@ static inline int setup_sgio_list(struct
 
/* Setup gather (input) components */
g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component);
-   info->gather_components = kzalloc(g_sz_bytes, GFP_KERNEL);
+   info->gather_components = kzalloc(g_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->gather_components) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -150,7 +150,7 @@ static inline int setup_sgio_list(struct
 
/* Setup scatter (output) components */
s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component);
-   info->scatter_components = kzalloc(s_sz_bytes, GFP_KERNEL);
+   info->scatter_components = kzalloc(s_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->scatter_components) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -167,7 +167,7 @@ static inline int setup_sgio_list(struct
 
/* Create and initialize DPTR */
info->dlen = g_sz_bytes + s_sz_bytes + SG_LIST_HDR_SIZE;
-   info->in_buffer = kzalloc(info->dlen, GFP_KERNEL);
+   info->in_buffer = kzalloc(info->dlen, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->in_buffer) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -195,7 +195,7 @@ static inline int setup_sgio_list(struct
}
 
/* Create and initialize RPTR */
-   info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, GFP_KERNEL);
+   info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->out_buffer) {
ret = -ENOMEM;
goto scatter_gather_clean;
@@ -421,7 +421,7 @@ int process_request(struct cpt_vf *cptvf
struct cpt_vq_command vq_cmd;
union cpt_inst_s cptinst;
 
-   info = kzalloc(sizeof(*info), GFP_KERNEL);
+   info = kzalloc(sizeof(*info), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (unlikely(!info)) {
dev_err(&pdev->dev, "Unable to allocate memory for info_buffer\n");
return -ENOMEM;
@@ -443,7 +443,7 @@ int process_request(struct cpt_vf *cptvf
 * Get buffer for union cpt_res_s response
 * structure and its physical address
 */
-   info->completion_addr = kzalloc(sizeof(union cpt_res_s), GFP_KERNEL);
+   info->completion_addr = kzalloc(sizeof(union cpt_res_s), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (unlikely(!info->completion_addr)) {
dev_err(&pdev->dev, "Unable to allocate memory for completion_addr\n");
ret = -ENOMEM;
Index: linux-2.6/drivers/crypto/cavium/cpt/request_manager.h
===
--- linux-2.6.orig/drivers/crypto/cavium/cpt/request_manager.h
+++ linux-2.6/drivers/crypto/cavium/cpt/request_manager.h
@@ -62,6 +62,8 @@ struct cpt_request_info {
union ctrl_info ctrl; /* User control information */
struct cptvf_request req; /* Request Information (Core specific) */
 
+   bool may_sleep;
+
struct buf_ptr in[MAX_BUF_CNT];
struct buf_ptr out[MAX_BUF_CNT];
 



Re: [dm-devel] [PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP

2020-06-17 Thread Mikulas Patocka



On Tue, 16 Jun 2020, Eric Biggers wrote:

> On Tue, Jun 16, 2020 at 02:18:17PM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Tue, 16 Jun 2020, Eric Biggers wrote:
> > 
> > > On Tue, Jun 16, 2020 at 11:02:50AM -0400, Mikulas Patocka wrote:
> > > > Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If
> > > > CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation
> > > > that sleeps.
> > > > 
> > > > Signed-off-by: Mikulas Patocka 
> > > 
> > > I think you need to split this up per driver with a proper explanation 
> > > and a
> > > "Fixes:" tag for each driver.
> > > 
> > > Also, these bugs should have been detected by the crypto self-tests 
> > > already,
> > > since they test having preemption disabled and CRYPTO_TFM_REQ_MAY_SLEEP 
> > > cleared.
> > > Can you double check whether these are all valid fixes?  One thing to 
> > > watch out
> > > 
> > > for is that CRYPTO_TFM_REQ_MAY_SLEEP only applies to the function call 
> > > like
> > > crypto_skcipher_encrypt() itself.  If the implementation is asynchronous 
> > > and the
> > > request gets processed in the background (i.e. if 
> > > crypto_skcipher_encrypt()
> > > returns -EINPROGRESS), the background work doesn't have to honor
> > > CRYPTO_TFM_REQ_MAY_SLEEP.
> > > 
> > > - Eric
> > 
> > I can only compile-test this patch. I don't have the hardware.
> > 
> 
> I'm just asking for you to check the code extra carefully.  The fact that the
> self-tests should have been detecting this type of bug implies that these 
> might
> not actually be valid fixes.

I've checked it more thoroughly and found that 3 of the 5 drivers do the
GFP_KERNEL allocation from the crypto-engine callback, where sleeping is
allowed. So those are presumably OK.

> However, we do know that not all crypto drivers are being actively tested with
> the latest self-tests and with kernel debugging options enabled.  So it's
> expected that some are indeed broken.
> 
> - Eric

The broken ones are drivers/crypto/cavium/cpt/ and 
drivers/crypto/hisilicon/sec/

I'm sending patches for them.

Mikulas



Re: [dm-devel] [PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP

2020-06-16 Thread Mikulas Patocka



On Tue, 16 Jun 2020, Eric Biggers wrote:

> On Tue, Jun 16, 2020 at 11:02:50AM -0400, Mikulas Patocka wrote:
> > Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If
> > CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation
> > that sleeps.
> > 
> > Signed-off-by: Mikulas Patocka 
> 
> I think you need to split this up per driver with a proper explanation and a
> "Fixes:" tag for each driver.
> 
> Also, these bugs should have been detected by the crypto self-tests already,
> since they test having preemption disabled and CRYPTO_TFM_REQ_MAY_SLEEP 
> cleared.
> Can you double check whether these are all valid fixes?  One thing to watch 
> out
> for is that CRYPTO_TFM_REQ_MAY_SLEEP only applies to the function call like
> crypto_skcipher_encrypt() itself.  If the implementation is asynchronous and 
> the
> request gets processed in the background (i.e. if crypto_skcipher_encrypt()
> returns -EINPROGRESS), the background work doesn't have to honor
> CRYPTO_TFM_REQ_MAY_SLEEP.
> 
> - Eric

I can only compile-test this patch. I don't have the hardware.

Mikulas



[PATCH 4/4] crypto: fix the drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP

2020-06-16 Thread Mikulas Patocka
Fix the crypto drivers that don't respect CRYPTO_TFM_REQ_MAY_SLEEP. If
CRYPTO_TFM_REQ_MAY_SLEEP is not set, the driver must not do allocation
that sleeps.

Signed-off-by: Mikulas Patocka 

---
 drivers/crypto/amlogic/amlogic-gxl-cipher.c  |5 ++-
 drivers/crypto/cavium/cpt/cptvf_algs.c   |1 
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c |   12 -
 drivers/crypto/cavium/cpt/request_manager.h  |2 +
 drivers/crypto/hisilicon/sec/sec_algs.c  |   34 ++-
 drivers/crypto/virtio/virtio_crypto_algs.c   |4 +--
 drivers/crypto/xilinx/zynqmp-aes-gcm.c   |4 +--
 7 files changed, 34 insertions(+), 28 deletions(-)

Index: linux-2.6/drivers/crypto/virtio/virtio_crypto_algs.c
===
--- linux-2.6.orig/drivers/crypto/virtio/virtio_crypto_algs.c
+++ linux-2.6/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -364,12 +364,12 @@ __virtio_crypto_skcipher_do_req(struct v
 
/* Why 3?  outhdr + iv + inhdr */
sg_total = src_nents + dst_nents + 3;
-   sgs = kcalloc_node(sg_total, sizeof(*sgs), GFP_KERNEL,
+   sgs = kcalloc_node(sg_total, sizeof(*sgs), req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL : GFP_ATOMIC,
dev_to_node(>vdev->dev));
if (!sgs)
return -ENOMEM;
 
-   req_data = kzalloc_node(sizeof(*req_data), GFP_KERNEL,
+   req_data = kzalloc_node(sizeof(*req_data), req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL : GFP_ATOMIC,
dev_to_node(>vdev->dev));
if (!req_data) {
kfree(sgs);
Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c
===
--- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_algs.c
+++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_algs.c
@@ -200,6 +200,7 @@ static inline int cvm_enc_dec(struct skc
int status;
 
memset(req_info, 0, sizeof(struct cpt_request_info));
+   req_info->may_sleep = (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) != 0;
memset(fctx, 0, sizeof(struct fc_context));
create_input_list(req, enc, enc_iv_len);
create_output_list(req, enc_iv_len);
Index: linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
===
--- linux-2.6.orig/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
+++ linux-2.6/drivers/crypto/cavium/cpt/cptvf_reqmanager.c
@@ -133,7 +133,7 @@ static inline int setup_sgio_list(struct
 
/* Setup gather (input) components */
g_sz_bytes = ((req->incnt + 3) / 4) * sizeof(struct sglist_component);
-   info->gather_components = kzalloc(g_sz_bytes, GFP_KERNEL);
+   info->gather_components = kzalloc(g_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->gather_components) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -150,7 +150,7 @@ static inline int setup_sgio_list(struct
 
/* Setup scatter (output) components */
s_sz_bytes = ((req->outcnt + 3) / 4) * sizeof(struct sglist_component);
-   info->scatter_components = kzalloc(s_sz_bytes, GFP_KERNEL);
+   info->scatter_components = kzalloc(s_sz_bytes, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->scatter_components) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -167,7 +167,7 @@ static inline int setup_sgio_list(struct
 
/* Create and initialize DPTR */
info->dlen = g_sz_bytes + s_sz_bytes + SG_LIST_HDR_SIZE;
-   info->in_buffer = kzalloc(info->dlen, GFP_KERNEL);
+   info->in_buffer = kzalloc(info->dlen, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->in_buffer) {
ret = -ENOMEM;
goto  scatter_gather_clean;
@@ -195,7 +195,7 @@ static inline int setup_sgio_list(struct
}
 
/* Create and initialize RPTR */
-   info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, GFP_KERNEL);
+   info->out_buffer = kzalloc(COMPLETION_CODE_SIZE, req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (!info->out_buffer) {
ret = -ENOMEM;
goto scatter_gather_clean;
@@ -421,7 +421,7 @@ int process_request(struct cpt_vf *cptvf
struct cpt_vq_command vq_cmd;
union cpt_inst_s cptinst;
 
-   info = kzalloc(sizeof(*info), GFP_KERNEL);
+   info = kzalloc(sizeof(*info), req->may_sleep ? GFP_KERNEL : GFP_ATOMIC);
if (unlikely(!info)) {
dev_err(&pdev->dev, "Unable to allocate memory for info_buffer\n");
return -ENOMEM;
@@ -443,7 +443,7 @@ int process_request(struct cpt_vf *cptvf
 * Get buffer for union cpt_res_s response
 * structure and its physical address
 */
-   info->completion_addr = kzalloc(sizeof(union cpt_res_s), GFP_KERNEL);

[PATCH 3/4] crypto: set the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-16 Thread Mikulas Patocka
Set the flag CRYPTO_ALG_ALLOCATES_MEMORY in the crypto drivers that
allocate memory.

Signed-off-by: Mikulas Patocka 

---
 drivers/crypto/allwinner/sun8i-ce/sun8i-ce-core.c |8 +-
 drivers/crypto/allwinner/sun8i-ss/sun8i-ss-core.c |8 +-
 drivers/crypto/amlogic/amlogic-gxl-core.c |4 -
 drivers/crypto/axis/artpec6_crypto.c  |   20 +++--
 drivers/crypto/bcm/cipher.c   |   38 +-
 drivers/crypto/caam/caamalg.c |4 -
 drivers/crypto/caam/caamalg_qi.c  |4 -
 drivers/crypto/caam/caamalg_qi2.c |6 -
 drivers/crypto/caam/caamhash.c|2 
 drivers/crypto/cavium/cpt/cptvf_algs.c|   12 +--
 drivers/crypto/cavium/nitrox/nitrox_aead.c|4 -
 drivers/crypto/cavium/nitrox/nitrox_skcipher.c|   16 ++--
 drivers/crypto/ccp/ccp-crypto-aes-cmac.c  |1 
 drivers/crypto/ccp/ccp-crypto-aes-galois.c|1 
 drivers/crypto/ccp/ccp-crypto-aes-xts.c   |1 
 drivers/crypto/ccp/ccp-crypto-aes.c   |2 
 drivers/crypto/ccp/ccp-crypto-des3.c  |1 
 drivers/crypto/ccp/ccp-crypto-sha.c   |1 
 drivers/crypto/chelsio/chcr_algo.c|6 -
 drivers/crypto/hisilicon/sec/sec_algs.c   |   16 ++--
 drivers/crypto/hisilicon/sec2/sec_crypto.c|4 -
 drivers/crypto/inside-secure/safexcel_cipher.c|   47 +
 drivers/crypto/inside-secure/safexcel_hash.c  |   18 +
 drivers/crypto/ixp4xx_crypto.c|6 +
 drivers/crypto/marvell/cesa/cipher.c  |   12 +--
 drivers/crypto/marvell/cesa/hash.c|6 +
 drivers/crypto/marvell/octeontx/otx_cptvf_algs.c  |   30 
 drivers/crypto/n2_core.c  |2 
 drivers/crypto/picoxcell_crypto.c |   17 
 drivers/crypto/qat/qat_common/qat_algs.c  |   12 +--
 drivers/crypto/qce/sha.c  |2 
 drivers/crypto/qce/skcipher.c |1 
 drivers/crypto/talitos.c  |   78 +++---
 drivers/crypto/virtio/virtio_crypto_algs.c|2 
 drivers/crypto/xilinx/zynqmp-aes-gcm.c|1 
 35 files changed, 248 insertions(+), 145 deletions(-)

Index: linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c
===
--- linux-2.6.orig/drivers/crypto/qat/qat_common/qat_algs.c
+++ linux-2.6/drivers/crypto/qat/qat_common/qat_algs.c
@@ -1267,7 +1267,7 @@ static struct aead_alg qat_aeads[] = { {
.cra_name = "authenc(hmac(sha1),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha1",
.cra_priority = 4001,
-   .cra_flags = CRYPTO_ALG_ASYNC,
+   .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
.cra_module = THIS_MODULE,
@@ -1284,7 +1284,7 @@ static struct aead_alg qat_aeads[] = { {
.cra_name = "authenc(hmac(sha256),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha256",
.cra_priority = 4001,
-   .cra_flags = CRYPTO_ALG_ASYNC,
+   .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
.cra_module = THIS_MODULE,
@@ -1301,7 +1301,7 @@ static struct aead_alg qat_aeads[] = { {
.cra_name = "authenc(hmac(sha512),cbc(aes))",
.cra_driver_name = "qat_aes_cbc_hmac_sha512",
.cra_priority = 4001,
-   .cra_flags = CRYPTO_ALG_ASYNC,
+   .cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.cra_blocksize = AES_BLOCK_SIZE,
.cra_ctxsize = sizeof(struct qat_alg_aead_ctx),
.cra_module = THIS_MODULE,
@@ -1319,7 +1319,7 @@ static struct skcipher_alg qat_skciphers
.base.cra_name = "cbc(aes)",
.base.cra_driver_name = "qat_aes_cbc",
.base.cra_priority = 4001,
-   .base.cra_flags = CRYPTO_ALG_ASYNC,
+   .base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.base.cra_blocksize = AES_BLOCK_SIZE,
.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),
.base.cra_alignmask = 0,
@@ -1337,7 +1337,7 @@ static struct skcipher_alg qat_skciphers
.base.cra_name = "ctr(aes)",
.base.cra_driver_name = "qat_aes_ctr",
.base.cra_priority = 4001,
-   .base.cra_flags = CRYPTO_ALG_ASYNC,
+   .base.cra_flags = CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY,
.base.cra_blocksize = 1,
.base.cra_ctxsize = sizeof(struct qat_alg_skcipher_ctx),

[PATCH 1/4] crypto: introduce CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-16 Thread Mikulas Patocka
Introduce a new flag CRYPTO_ALG_ALLOCATES_MEMORY and modify dm-crypt, so
that it uses only drivers without this flag.

If the flag is set, then the crypto driver allocates memory in its request
routine. Such drivers are not suitable for disk encryption because
GFP_ATOMIC allocation can fail anytime (causing random I/O errors) and
GFP_KERNEL allocation can recurse into the block layer, causing a
deadlock.

Signed-off-by: Mikulas Patocka 

Index: linux-2.6/include/linux/crypto.h
===
--- linux-2.6.orig/include/linux/crypto.h
+++ linux-2.6/include/linux/crypto.h
@@ -97,9 +97,15 @@
 #define CRYPTO_ALG_OPTIONAL_KEY0x4000
 
 /*
+ * The driver is allocating emmory in its encrypt or decrypt callback,
+ * so it should not be used to encrypt block devices.
+ */
+#define CRYPTO_ALG_ALLOCATES_MEMORY0x8000
+
+/*
  * Don't trigger module loading
  */
-#define CRYPTO_NOLOAD  0x8000
+#define CRYPTO_NOLOAD  0x0001
 
 /*
  * Transform masks and values (for crt_flags).
Index: linux-2.6/drivers/md/dm-crypt.c
===
--- linux-2.6.orig/drivers/md/dm-crypt.c
+++ linux-2.6/drivers/md/dm-crypt.c
@@ -419,7 +419,7 @@ static int crypt_iv_lmk_ctr(struct crypt
return -EINVAL;
}
 
-   lmk->hash_tfm = crypto_alloc_shash("md5", 0, 0);
+   lmk->hash_tfm = crypto_alloc_shash("md5", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(lmk->hash_tfm)) {
ti->error = "Error initializing LMK hash";
return PTR_ERR(lmk->hash_tfm);
@@ -581,7 +581,7 @@ static int crypt_iv_tcw_ctr(struct crypt
return -EINVAL;
}
 
-   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, 0);
+   tcw->crc32_tfm = crypto_alloc_shash("crc32", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(tcw->crc32_tfm)) {
ti->error = "Error initializing CRC32 in TCW";
return PTR_ERR(tcw->crc32_tfm);
@@ -768,7 +768,7 @@ static int crypt_iv_elephant_ctr(struct
struct iv_elephant_private *elephant = &cc->iv_gen_private.elephant;
int r;
 
-   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, 0);
+   elephant->tfm = crypto_alloc_skcipher("ecb(aes)", 0, CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(elephant->tfm)) {
r = PTR_ERR(elephant->tfm);
elephant->tfm = NULL;
@@ -2088,7 +2088,7 @@ static int crypt_alloc_tfms_skcipher(str
return -ENOMEM;
 
for (i = 0; i < cc->tfms_count; i++) {
-   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, 0);
+   cc->cipher_tfm.tfms[i] = crypto_alloc_skcipher(ciphermode, 0, CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(cc->cipher_tfm.tfms[i])) {
err = PTR_ERR(cc->cipher_tfm.tfms[i]);
crypt_free_tfms(cc);
@@ -2114,7 +2114,7 @@ static int crypt_alloc_tfms_aead(struct
if (!cc->cipher_tfm.tfms)
return -ENOMEM;
 
-   cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, 0);
+   cc->cipher_tfm.tfms_aead[0] = crypto_alloc_aead(ciphermode, 0, CRYPTO_ALG_ALLOCATES_MEMORY);
if (IS_ERR(cc->cipher_tfm.tfms_aead[0])) {
err = PTR_ERR(cc->cipher_tfm.tfms_aead[0]);
crypt_free_tfms(cc);
@@ -2565,7 +2565,7 @@ static int crypt_ctr_auth_cipher(struct
return -ENOMEM;
strncpy(mac_alg, start, end - start);
 
-   mac = crypto_alloc_ahash(mac_alg, 0, 0);
+   mac = crypto_alloc_ahash(mac_alg, 0, CRYPTO_ALG_ALLOCATES_MEMORY);
kfree(mac_alg);
 
if (IS_ERR(mac))



[PATCH 2/4] crypto: pass the flag CRYPTO_ALG_ALLOCATES_MEMORY

2020-06-16 Thread Mikulas Patocka
Pass the flag CRYPTO_ALG_ALLOCATES_MEMORY down through the crypto API.

Signed-off-by: Mikulas Patocka 

---
 crypto/adiantum.c |3 ++-
 crypto/authenc.c  |5 +++--
 crypto/authencesn.c   |5 +++--
 crypto/ccm.c  |7 ---
 crypto/chacha20poly1305.c |5 +++--
 crypto/cryptd.c   |7 +--
 crypto/ctr.c  |3 ++-
 crypto/cts.c  |5 +++--
 crypto/essiv.c|5 +++--
 crypto/gcm.c  |   15 +--
 crypto/geniv.c|3 ++-
 crypto/lrw.c  |5 +++--
 crypto/rsa-pkcs1pad.c |5 +++--
 crypto/xts.c  |2 +-
 include/crypto/algapi.h   |9 +
 15 files changed, 55 insertions(+), 29 deletions(-)

Index: linux-2.6/crypto/authenc.c
===
--- linux-2.6.orig/crypto/authenc.c
+++ linux-2.6/crypto/authenc.c
@@ -388,7 +388,8 @@ static int crypto_authenc_create(struct
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -424,7 +425,7 @@ static int crypto_authenc_create(struct
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY);
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/authencesn.c
===
--- linux-2.6.orig/crypto/authencesn.c
+++ linux-2.6/crypto/authencesn.c
@@ -406,7 +406,8 @@ static int crypto_authenc_esn_create(str
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ctx), GFP_KERNEL);
if (!inst)
@@ -438,7 +439,7 @@ static int crypto_authenc_esn_create(str
goto err_free_inst;
 
inst->alg.base.cra_flags = (auth_base->cra_flags |
-   enc->base.cra_flags) & CRYPTO_ALG_ASYNC;
+   enc->base.cra_flags) & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY);
inst->alg.base.cra_priority = enc->base.cra_priority * 10 +
  auth_base->cra_priority;
inst->alg.base.cra_blocksize = enc->base.cra_blocksize;
Index: linux-2.6/crypto/ccm.c
===
--- linux-2.6.orig/crypto/ccm.c
+++ linux-2.6/crypto/ccm.c
@@ -462,7 +462,8 @@ static int crypto_ccm_create_common(stru
if ((algt->type ^ CRYPTO_ALG_TYPE_AEAD) & algt->mask)
return -EINVAL;
 
-   mask = crypto_requires_sync(algt->type, algt->mask);
+   mask = crypto_requires_sync(algt->type, algt->mask) |
+  crypto_requires_nomem(algt->type, algt->mask);
 
inst = kzalloc(sizeof(*inst) + sizeof(*ictx), GFP_KERNEL);
if (!inst)
@@ -507,7 +508,7 @@ static int crypto_ccm_create_common(stru
 mac->base.cra_driver_name) >= CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = ctr->base.cra_flags & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags = ctr->base.cra_flags & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY);
inst->alg.base.cra_priority = (mac->base.cra_priority +
   ctr->base.cra_priority) / 2;
inst->alg.base.cra_blocksize = 1;
@@ -759,7 +760,7 @@ static int crypto_rfc4309_create(struct
CRYPTO_MAX_ALG_NAME)
goto err_free_inst;
 
-   inst->alg.base.cra_flags = alg->base.cra_flags & CRYPTO_ALG_ASYNC;
+   inst->alg.base.cra_flags = alg->base.cra_flags & (CRYPTO_ALG_ASYNC | CRYPTO_ALG_ALLOCATES_MEMORY);
inst->alg.base.cra_priority = alg->base.cra_priority;
inst->alg.base.cra_blocksize = 1;
inst->alg.base.cra_alignmask = alg->base.cra_alignmask;
Index: linux-2.6/crypto/chacha20poly1305.c
===
---

Re: crypto API and GFP_ATOMIC

2020-06-16 Thread Mikulas Patocka



On Wed, 10 Jun 2020, Herbert Xu wrote:

> On Wed, Jun 10, 2020 at 08:02:23AM -0400, Mikulas Patocka wrote:
> >
> > Yes, fixing the drivers would be the best - but you can hardly find any 
> > person who has all the crypto hardware and who is willing to rewrite all 
> > the drivers for it.
> 
> We don't have to rewrite them straight away.  We could mark the
> known broken ones (or the known working ones) and then dm-crypt
> can allocate only those using the types/mask to crypto_alloc.
> 
> Cheers,

I triaged the drivers in drivers/crypto and unfortunately, most of them 
do memory allocation in the encryption routine. Some of them do GFP_KERNEL 
allocation even in the absence of CRYPTO_TFM_REQ_MAY_SLEEP.

I'm sending the patches:

The first patch adds a new flag CRYPTO_ALG_ALLOCATES_MEMORY.

The second patch passes CRYPTO_ALG_ALLOCATES_MEMORY through the crypto API 
stack (please check it - I am not an expert in this area).

The third patch sets CRYPTO_ALG_ALLOCATES_MEMORY on drivers that allocate 
memory in the encrypt/decrypt routine.

The fourth patch fixes the drivers that use GFP_KERNEL in non-blocking 
context.

Mikulas



Re: [PATCH] dm writecache: skip writecache_wait when using pmem mode

2020-06-13 Thread Mikulas Patocka



On Fri, 12 Jun 2020, Huaisheng Ye wrote:

> From: Huaisheng Ye 
> 
> The array bio_in_progress is only used with ssd mode. So skip
> writecache_wait_for_ios in writecache_discard when pmem mode.
> 
> Signed-off-by: Huaisheng Ye 

Acked-by: Mikulas Patocka 

> ---
>  drivers/md/dm-writecache.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 66f3a3b..4367cc7 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -849,8 +849,10 @@ static void writecache_discard(struct dm_writecache *wc, sector_t start, sector_
>  
>   if (likely(!e->write_in_progress)) {
>   if (!discarded_something) {
> - writecache_wait_for_ios(wc, READ);
> - writecache_wait_for_ios(wc, WRITE);
> + if (!WC_MODE_PMEM(wc)) {
> + writecache_wait_for_ios(wc, READ);
> + writecache_wait_for_ios(wc, WRITE);
> + }
>   discarded_something = true;
>   }
>   writecache_free_entry(wc, e);
> -- 
> 1.8.3.1
> 



Re: [PATCH] dm writecache: correct uncommitted_block when discarding uncommitted entry

2020-06-13 Thread Mikulas Patocka



On Fri, 12 Jun 2020, Huaisheng Ye wrote:

> From: Huaisheng Ye 
> 
> When uncommitted entry has been discarded, correct wc->uncommitted_block
> for getting the exact number.
> 
> Signed-off-by: Huaisheng Ye 

Acked-by: Mikulas Patocka 

Also, add:
Cc: sta...@vger.kernel.org

> ---
>  drivers/md/dm-writecache.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 4367cc7..64b4527 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -855,6 +855,8 @@ static void writecache_discard(struct dm_writecache *wc, sector_t start, sector_
>   }
>   discarded_something = true;
>   }
> + if (!writecache_entry_is_committed(wc, e))
> + wc->uncommitted_blocks--;
>   writecache_free_entry(wc, e);
>   }
>  
> -- 
> 1.8.3.1
> 



Re: crypto API and GFP_ATOMIC

2020-06-10 Thread Mikulas Patocka



On Wed, 10 Jun 2020, Herbert Xu wrote:

> On Tue, Jun 09, 2020 at 01:11:05PM -0400, Mikulas Patocka wrote:
> >
> > Do you have another idea how to solve this problem?
> 
> I think the better approach would be to modify the drivers to not
> allocate any memory.  In general, any memory needed by the driver
> to fulfil a request *should* be allocated within the crypto request
> object.  That's why we have the reqsize field to indicate how much
> memory could be needed per request.
> 
> Thanks,

Yes, fixing the drivers would be the best - but you can hardly find any 
person who has all the crypto hardware and who is willing to rewrite all 
the drivers for it.

Another possibility - I was thinking about setting 
CRYPTO_TFM_REQ_MAY_SLEEP in dm-crypt and calling the crypto function under 
memalloc_noio_save. But there are some drivers that do GFP_ATOMIC 
allocation regardless of CRYPTO_TFM_REQ_MAY_SLEEP.

Mikulas



crypto API and GFP_ATOMIC

2020-06-09 Thread Mikulas Patocka
Hi

I've found out that a lot of hardware crypto drivers use GFP_ATOMIC. Some 
of them switch between GFP_ATOMIC and GFP_KERNEL based on the flag 
CRYPTO_TFM_REQ_MAY_SLEEP.

dm-crypt and dm-integrity don't use CRYPTO_TFM_REQ_MAY_SLEEP (because 
GFP_KERNEL allocation requests can recurse back to the block device 
drivers and cause deadlocks).

So, basically, the crypto requests submitted by dm-crypt and dm-integrity 
can fail anytime. I'd like to ask, how to handle these random -ENOMEM 
return codes. If we pass -ENOMEM back to the block device stack, it could 
cause random I/O errors and data corruption.

The question is - if the crypto request returns -ENOMEM, could we sleep 
and retry it? I thought about it - the problem could be that if the crypto 
request proceeds halfway through and then returns -ENOMEM, and we 
retried it, it would cause data corruption, because part of the data would 
be decrypted twice.

Is it safe to assume that when we get -ENOMEM, the crypto driver didn't 
modify anything?

Do you have another idea how to solve this problem?

Mikulas



Re: [PATCH] dm writecache: reinitialize lru in writeback instead of endio

2020-06-01 Thread Mikulas Patocka



On Sat, 30 May 2020, Huaisheng Ye wrote:

> From: Huaisheng Ye 
> 
> When wc_entry has been removed from wbl->list in writeback, it will
> be not used again except waiting to be set free in writecache_free_entry.
> 
> That is a little of annoying, it has to reinitialize lru of wc_entry
> in endio before calling writecache_free_entry.
> 
> Using list_del_init instead of list_del in writeback for simpler code.
> 
> Signed-off-by: Huaisheng Ye 

This patch doesn't fix anything, so I think we don't need it.

Actually, it's better to keep the list entry uninitialized, because it 
helps us catch bugs where this uninitialized list entry could be used 
improperly.

Mikulas

> ---
>  drivers/md/dm-writecache.c | 10 --
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> index 7bbc21b..66f3a3b 100644
> --- a/drivers/md/dm-writecache.c
> +++ b/drivers/md/dm-writecache.c
> @@ -1519,7 +1519,6 @@ static void __writecache_endio_pmem(struct dm_writecache *wc, struct list_head *
>   e = wb->wc_list[i];
>   BUG_ON(!e->write_in_progress);
>   e->write_in_progress = false;
> - INIT_LIST_HEAD(&e->lru);
>   if (!writecache_has_error(wc))
>   writecache_free_entry(wc, e);
>   BUG_ON(!wc->writeback_size);
> @@ -1555,7 +1554,6 @@ static void __writecache_endio_ssd(struct dm_writecache *wc, struct list_head *l
>   do {
>   BUG_ON(!e->write_in_progress);
>   e->write_in_progress = false;
> - INIT_LIST_HEAD(&e->lru);
>   if (!writecache_has_error(wc))
>   writecache_free_entry(wc, e);
>  
> @@ -1654,7 +1652,7 @@ static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeba
>   while (wbl->size) {
>   wbl->size--;
>   e = container_of(wbl->list.prev, struct wc_entry, lru);
> - list_del(&e->lru);
> + list_del_init(&e->lru);
>  
>   max_pages = e->wc_list_contiguous;
>  
> @@ -1685,7 +1683,7 @@ static void __writecache_writeback_pmem(struct dm_writecache *wc, struct writeba
>   if (!wc_add_block(wb, f, GFP_NOWAIT | __GFP_NOWARN))
>   break;
>   wbl->size--;
> - list_del(&f->lru);
> + list_del_init(&f->lru);
>   wb->wc_list[wb->wc_list_n++] = f;
>   e = f;
>   }
> @@ -1712,7 +1710,7 @@ static void __writecache_writeback_ssd(struct dm_writecache *wc, struct writebac
>  
>   wbl->size--;
>   e = container_of(wbl->list.prev, struct wc_entry, lru);
> - list_del(&e->lru);
> + list_del_init(&e->lru);
>  
>   n_sectors = e->wc_list_contiguous << (wc->block_size_bits - 
> SECTOR_SHIFT);
>  
> @@ -1732,7 +1730,7 @@ static void __writecache_writeback_ssd(struct dm_writecache *wc, struct writebac
>   wbl->size--;
>   f = container_of(wbl->list.prev, struct wc_entry, lru);
>   BUG_ON(f != e + 1);
> - list_del(&f->lru);
> + list_del_init(&f->lru);
>   e = f;
>   }
>  
> -- 
> 1.8.3.1
> 



Re: [PATCH] tty:vt: Add check the return value of kzalloc to avoid oops

2019-09-19 Thread Mikulas Patocka



On Thu, 19 Sep 2019, Greg KH wrote:

> On Thu, Sep 19, 2019 at 05:18:15PM +0800, Xiaoming Ni wrote:
> > Using kzalloc() to allocate memory in function con_init(), but not
> > checking the return value, there is a risk of null pointer references
> > oops.
> > 
> > Signed-off-by: Xiaoming Ni 
> 
> We keep having this be "reported" :(
> 
> > ---
> >  drivers/tty/vt/vt.c | 18 ++
> >  1 file changed, 18 insertions(+)
> > 
> > diff --git a/drivers/tty/vt/vt.c b/drivers/tty/vt/vt.c
> > index 34aa39d..db83e52 100644
> > --- a/drivers/tty/vt/vt.c
> > +++ b/drivers/tty/vt/vt.c
> > @@ -3357,15 +3357,33 @@ static int __init con_init(void)
> >  
> > for (currcons = 0; currcons < MIN_NR_CONSOLES; currcons++) {
> > vc_cons[currcons].d = vc = kzalloc(sizeof(struct vc_data), GFP_NOWAIT);
> > +   if (unlikely(!vc)) {
> > +   pr_warn("%s:failed to allocate memory for the %u vc\n",
> > +   __func__, currcons);
> > +   break;
> > +   }
> 
> At init, this really can not happen.  Have you see it ever happen?
> 
> > INIT_WORK(&vc_cons[currcons].SAK_work, vc_SAK);
> > tty_port_init(&vc->port);
> > visual_init(vc, currcons, 1);
> > vc->vc_screenbuf = kzalloc(vc->vc_screenbuf_size, GFP_NOWAIT);
> > +   if (unlikely(!vc->vc_screenbuf)) {
> 
> Never use likely/unlikely unless you can actually measure the speed
> difference.  For something like this, the compiler will always get it
> right without you having to do anything.
> 
> And again, how can this fail?  Have you seen it fail?
> 
> > +   pr_warn("%s:failed to allocate memory for the %u vc_screenbuf\n",
> > +   __func__, currcons);
> > +   visual_deinit(vc);
> > +   tty_port_destroy(&vc->port);
> > +   kfree(vc);
> > +   vc_cons[currcons].d = NULL;
> > +   break;
> > +   }
> > vc_init(vc, vc->vc_rows, vc->vc_cols,
> > currcons || !vc->vc_sw->con_save_screen);
> > }
> > currcons = fg_console = 0;
> > master_display_fg = vc = vc_cons[currcons].d;
> > +   if (unlikely(!vc)) {
> 
> Again, never use likely/unlikely unless you can measure it.
> 
> thanks,
> 
> greg k-h

Why does it use GFP_NOWAIT and not GFP_KERNEL? Is there some problem with 
GFP_KERNEL during initialization?

Mikulas


RE: [External] Re: [PATCH] dm writecache: skip writecache_wait for pmem mode

2019-09-04 Thread Mikulas Patocka



On Wed, 4 Sep 2019, Huaisheng HS1 Ye wrote:

> > -Original Message-
> > From: Mikulas Patocka 
> > Sent: Wednesday, September 4, 2019 4:49 PM
> > On Mon, 2 Sep 2019, Huaisheng Ye wrote:
> > 
> > > From: Huaisheng Ye 
> > >
> > > The array bio_in_progress[2] only has a chance to be increased and
> > > decreased in ssd mode. For pmem mode, it is not involved at all.
> > > So skip writecache_wait_for_ios in writecache_flush for pmem.
> > >
> > > Suggested-by: Doris Yu 
> > > Signed-off-by: Huaisheng Ye 
> > > ---
> > >  drivers/md/dm-writecache.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/md/dm-writecache.c b/drivers/md/dm-writecache.c
> > > index c481947..d06b8aa 100644
> > > --- a/drivers/md/dm-writecache.c
> > > +++ b/drivers/md/dm-writecache.c
> > > @@ -726,7 +726,8 @@ static void writecache_flush(struct dm_writecache *wc)
> > >   }
> > >   writecache_commit_flushed(wc);
> > >
> > > - writecache_wait_for_ios(wc, WRITE);
> > > + if (!WC_MODE_PMEM(wc))
> > > + writecache_wait_for_ios(wc, WRITE);
> > >
> > >   wc->seq_count++;
> > >   pmem_assign(sb(wc)->seq_count, cpu_to_le64(wc->seq_count));
> > > --
> > > 1.8.3.1
> > 
> > I think this is not needed - wait_event in writecache_wait_for_ios exits
> > immediatelly if the condition is true.
> > 
> > This code path is not so hot that we would need microoptimizations like 
> > this to
> > avoid function calls.
> 
> Hi Mikulas,
> 
> Thanks for your reply, I see what you mean, but I can't agree with you.
> 
> For pmem mode, this code path (writecache_flush) is much hotter than in 
> SSD mode, because AUTOCOMMIT_BLOCKS_PMEM is defined as 64, which means 
> that once more than 64 uncommitted blocks have been inserted into the 
> cache device, writecache_flush is called. Otherwise, a timer callback 
> function is called every 1000 milliseconds.
> 
> #define AUTOCOMMIT_BLOCKS_SSD 65536
> #define AUTOCOMMIT_BLOCKS_PMEM 64
> #define AUTOCOMMIT_MSEC   1000
> 
> So when dm-writecache is running in working mode, with continuous 
> WRITE operations being mapped to writecache_map, writecache_flush 
> will be used much more often than in SSD mode.
> 
> Cheers,
> Huaisheng Ye

So, you save one instruction cache line for every 64*4096 bytes written to 
persistent memory.

If you insist on it, I can acknowledge it, but I think it is really an 
over-optimization.

Acked-By: Mikulas Patocka 

Mikulas


[PATCH 2/2] dm: make dm_table_find_target return NULL

2019-08-23 Thread Mikulas Patocka
Currently, if we pass too high sector number to dm_table_find_target, it
returns zeroed dm_target structure and callers test if the structure is
zeroed with the macro dm_target_is_valid.

However, returning NULL is common practice to indicate errors.

This patch refactors the dm code, so that dm_table_find_target returns
NULL and its callers test the returned value for NULL. The macro
dm_target_is_valid is deleted. In alloc_targets, we no longer allocate an
extra zeroed target.

Signed-off-by: Mikulas Patocka 

---
 drivers/md/dm-ioctl.c |2 +-
 drivers/md/dm-table.c |8 +++-
 drivers/md/dm.c   |8 
 drivers/md/dm.h   |5 -
 4 files changed, 8 insertions(+), 15 deletions(-)

Index: linux-2.6/drivers/md/dm.c
===
--- linux-2.6.orig/drivers/md/dm.c  2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm.c   2019-08-23 15:45:46.0 +0200
@@ -457,7 +457,7 @@ static int dm_blk_report_zones(struct ge
return -EIO;
 
tgt = dm_table_find_target(map, sector);
-   if (!dm_target_is_valid(tgt)) {
+   if (!tgt) {
ret = -EIO;
goto out;
}
@@ -1072,7 +1072,7 @@ static struct dm_target *dm_dax_get_live
return NULL;
 
ti = dm_table_find_target(map, sector);
-   if (!dm_target_is_valid(ti))
+   if (!ti)
return NULL;
 
return ti;
@@ -1572,7 +1572,7 @@ static int __split_and_process_non_flush
int r;
 
ti = dm_table_find_target(ci->map, ci->sector);
-   if (!dm_target_is_valid(ti))
+   if (!ti)
return -EIO;
 
	if (__process_abnormal_io(ci, ti, &r))
@@ -1748,7 +1748,7 @@ static blk_qc_t dm_process_bio(struct ma
 
if (!ti) {
ti = dm_table_find_target(map, bio->bi_iter.bi_sector);
-   if (unlikely(!ti || !dm_target_is_valid(ti))) {
+   if (unlikely(!ti)) {
bio_io_error(bio);
return ret;
}
Index: linux-2.6/drivers/md/dm-ioctl.c
===
--- linux-2.6.orig/drivers/md/dm-ioctl.c	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm-ioctl.c 2019-08-23 15:45:46.0 +0200
@@ -1592,7 +1592,7 @@ static int target_message(struct file *f
}
 
ti = dm_table_find_target(table, tmsg->sector);
-   if (!dm_target_is_valid(ti)) {
+   if (!ti) {
DMWARN("Target message sector outside device.");
r = -EINVAL;
} else if (ti->type->message)
Index: linux-2.6/drivers/md/dm-table.c
===
--- linux-2.6.orig/drivers/md/dm-table.c	2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm-table.c 2019-08-23 15:47:18.0 +0200
@@ -163,10 +163,8 @@ static int alloc_targets(struct dm_table
 
/*
 * Allocate both the target array and offset array at once.
-* Append an empty entry to catch sectors beyond the end of
-* the device.
 */
-   n_highs = (sector_t *) dm_vcalloc(num + 1, sizeof(struct dm_target) +
+   n_highs = (sector_t *) dm_vcalloc(num, sizeof(struct dm_target) +
  sizeof(sector_t));
if (!n_highs)
return -ENOMEM;
@@ -1359,7 +1357,7 @@ struct dm_target *dm_table_get_target(st
 /*
  * Search the btree for the correct target.
  *
- * Caller should check returned pointer with dm_target_is_valid()
+ * Caller should check returned pointer for NULL
  * to trap I/O beyond end of device.
  */
 struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
@@ -1368,7 +1366,7 @@ struct dm_target *dm_table_find_target(s
sector_t *node;
 
if (unlikely(sector >= dm_table_get_size(t)))
-   return &t->targets[t->num_targets];
+   return NULL;
 
for (l = 0; l < t->depth; l++) {
n = get_child(n, k);
Index: linux-2.6/drivers/md/dm.h
===
--- linux-2.6.orig/drivers/md/dm.h  2019-08-23 15:45:46.0 +0200
+++ linux-2.6/drivers/md/dm.h   2019-08-23 15:45:46.0 +0200
@@ -86,11 +86,6 @@ struct target_type *dm_get_immutable_tar
 int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 
 /*
- * To check the return value from dm_table_find_target().
- */
-#define dm_target_is_valid(t) ((t)->table)
-
-/*
  * To check whether the target type is bio-based or not (request-based).
  */
 #define dm_target_bio_based(t) ((t)->type->map != NULL)



[PATCH 1/2] dm table: fix invalid memory accesses with too high sector number

2019-08-23 Thread Mikulas Patocka
If the sector number is too high, dm_table_find_target should return a
pointer to a zeroed dm_target structure (the caller should test it with
dm_target_is_valid).

However, for some table sizes, the code in dm_table_find_target that
performs btree lookup will access out of bound memory structures.

This patch fixes the bug by testing the sector number at the beginning of
dm_table_find_target. We add an "inline" keyword to the function
dm_table_get_size because this is hot path.

Signed-off-by: Mikulas Patocka 
Reported-by: Zhang Tao 
Fixes: 512875bd9661 ("dm: table detect io beyond device")
Cc: sta...@vger.kernel.org

---
 drivers/md/dm-table.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/md/dm-table.c
===
--- linux-2.6.orig/drivers/md/dm-table.c	2019-08-23 13:40:51.0 +0200
+++ linux-2.6/drivers/md/dm-table.c 2019-08-23 15:43:19.0 +0200
@@ -1342,7 +1342,7 @@ void dm_table_event(struct dm_table *t)
 }
 EXPORT_SYMBOL(dm_table_event);
 
-sector_t dm_table_get_size(struct dm_table *t)
+inline sector_t dm_table_get_size(struct dm_table *t)
 {
return t->num_targets ? (t->highs[t->num_targets - 1] + 1) : 0;
 }
@@ -1367,6 +1367,9 @@ struct dm_target *dm_table_find_target(s
unsigned int l, n = 0, k = 0;
sector_t *node;
 
+   if (unlikely(sector >= dm_table_get_size(t)))
+   return &t->targets[t->num_targets];
+
for (l = 0; l < t->depth; l++) {
n = get_child(n, k);
node = get_node(t, l, n);



Re: [dm-devel] [PATCH] dm table: fix a potential array out of bounds

2019-08-23 Thread Mikulas Patocka
Hi

I tested it and the bug is real - with some table sizes, 
dm_table_find_target will access memory out of bounds if the sector 
argument is beyond limit.

Your patch fixes some of these cases, but not all of them.

I used this script to test all possible table sizes:
#!/bin/bash -e
sync
dmsetup remove_all || true
rmmod dm_mod || true
>t.txt
for i in `seq 1 1`; do
echo $i
echo $((i-1)) 1 error >>t.txt
dmsetup create error  From: Zhang Tao 
> 
> allocate num + 1 for target and offset array, n_highs need num + 1
> elements, the last element will be used for node lookup in function
> dm_table_find_target.
> 
> Signed-off-by: Zhang Tao 
> ---
>  drivers/md/dm-table.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 7b6c3ee..fd7f604 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -160,20 +160,22 @@ static int alloc_targets(struct dm_table *t, unsigned int num)
>  {
>   sector_t *n_highs;
>   struct dm_target *n_targets;
> + unsigned int alloc_num;
>  
>   /*
>* Allocate both the target array and offset array at once.
>* Append an empty entry to catch sectors beyond the end of
>* the device.
>*/
> - n_highs = (sector_t *) dm_vcalloc(num + 1, sizeof(struct dm_target) +
> + alloc_num = num + 1;
> + n_highs = (sector_t *) dm_vcalloc(alloc_num, sizeof(struct dm_target) +
> sizeof(sector_t));
>   if (!n_highs)
>   return -ENOMEM;
>  
> - n_targets = (struct dm_target *) (n_highs + num);
> + n_targets = (struct dm_target *) (n_highs + alloc_num);
>  
> - memset(n_highs, -1, sizeof(*n_highs) * num);
> + memset(n_highs, -1, sizeof(*n_highs) * alloc_num);
>   vfree(t->highs);
>  
>   t->num_allocated = num;
> -- 
> 1.8.3.1
> 
> 
> --
> dm-devel mailing list
> dm-de...@redhat.com
> https://www.redhat.com/mailman/listinfo/dm-devel
> 


Re: [PATCH v3 0/5] Optimize writecache when using pmem as cache

2019-02-04 Thread Mikulas Patocka



On Thu, 31 Jan 2019, Huaisheng Ye wrote:

> From: Huaisheng Ye 
> 
> This patch set could be used for dm-writecache when use persistent
> memory as cache data device.
> 
> Patch 1 and 2 go towards removing unused parameter and codes which
> actually doesn't really work.

I agree that there are some unused variables and code, but I would let it 
be as it is. The processors can write data to persistent memory either by 
using non-temporal stores (the movnti instruction) or by normal stores 
followed by the clwb instruction.

Currently, the movnti instruction is faster - however, it may be possible 
that with some newer processors, the clwb instruction could be faster - 
and in that case, we need the code that you have removed.

I would like to keep both flush strategies around (movnti and clwb), so 
that we can test how they perform on various processors. 
Unfortunately, some upstream developers hate code with #ifdefs :-(

Note that compiler optimizations already remove the unused parameter and 
the impossible code behind "if (WC_MODE_PMEM(wc)) if (!WC_MODE_PMEM(wc))".

> Patch 3 and 4 are targeted at solving problem fn ctr failed to work
> due to invalid magic or version, which is caused by the super block
> of pmem has messy data stored.

LVM zeros the beginning of new logical volumes, so there should be no 
problem with it. If the user wants to use the writecache target without 
LVM, he should zero the superblock with dd if=/dev/zero of=/dev/pmem0 
bs=4k count=1

Note that other device mapper targets also follow this policy - for 
example see drivers/md/dm-snap-persistent.c:
if (le32_to_cpu(dh->magic) == 0) {
*new_snapshot = 1;
return 0;
}

if (le32_to_cpu(dh->magic) != SNAP_MAGIC) {
DMWARN("Invalid or corrupt snapshot");
r = -ENXIO;
goto bad;
}

So, I think there is no need for these patches - dm-writecache just does 
what others targets do.

> Patch 5 is used for getting the status of seq_count.

It may be accepted if other LVM team members find some use for this value.

Mikulas

> Changes Since v2: 
> - seq_count is important for flush operations, output it within status
>   for debugging and analyzing code behavior.
>   [1]: https://lkml.org/lkml/2019/1/3/43
> [2]: https://lkml.org/lkml/2019/1/9/6
> 
> Huaisheng Ye (5):
>   dm-writecache: remove unused size to writecache_flush_region
>   dm-writecache: get rid of memory_data flush to writecache_flush_entry
>   dm-writecache: expand pmem_reinit for struct dm_writecache
>   Documentation/device-mapper: add optional parameter reinit
>   dm-writecache: output seq_count within status
> 
>  Documentation/device-mapper/writecache.txt |  4 
>  drivers/md/dm-writecache.c | 23 +--
>  2 files changed, 17 insertions(+), 10 deletions(-)
> 
> -- 
> 1.8.3.1
> 
> 


Re: Sleeping from invalid context in udlfb

2018-12-31 Thread Mikulas Patocka


On Thu, 2 Aug 2018, David Airlie wrote:

> 
> I'm pretty sure udlkms handles this already.
> 
> Dave.

But it crashes on unplug :-)

Mikulas

> On Wed, Aug 1, 2018 at 11:34 PM, Mikulas Patocka  wrote:
> 
> 
>   On Wed, 1 Aug 2018, Geert Uytterhoeven wrote:
> 
>   > Hi Mikulas,
>   >
>   > On Wed, Aug 1, 2018 at 12:59 PM Mikulas Patocka  
> wrote:
>   > > On Wed, 1 Aug 2018, Geert Uytterhoeven wrote:
>   > > > On Tue, Jul 31, 2018 at 5:23 PM Mikulas Patocka 
>  wrote:
>   > > > > BTW when using the udlfb driver as a console, I've got this 
> warning.
>   > > > > vt_console_print takes a spinlock and then calls the 
> framebuffer driver
>   > > > > that sleeps.
>   > > > >
>   > > > > The question is - whose fault is this? Could the console code 
> somehow be
>   > > > > told to print characters without holding a spinlock? Or does it 
> mean that
>   > > > > framebuffer drivers can't sleep?
>   > > > >
>   > > > > udlfb communicates through USB, so the sleeping is inevitable.
>   > > > >
>   > > > > Mikulas
>   > > > >
>   > > > >
>   > > > > BUG: sleeping function called from invalid context at 
> mm/slab.h:421
>   > > > > in_atomic(): 1, irqs_disabled(): 0, pid: 430, name: kworker/2:3
>   > > > > 6 locks held by kworker/2:3/430:
>   > > > >  #0: 1301127e ( (wq_completion)"events"){} , at: 
> process_one_work+0x17c/0x3a8
>   > > > >  #1: beacc951 ( 
> (work_completion)(&(>init_framebuffer_work)->work)){} , at: 
> process_one_work+0x17c/0x3a8
>   > > > >  #2: a402f826 ( registration_lock){} , at: 
> register_framebuffer+0x28/0x2c0 [fb]
>   > > > >  #3: 21cbe902 ( console_lock){} , at: 
> register_framebuffer+0x258/0x2c0 [fb]
>   > > > >  #4: 96d51735 ( console_owner){} , at: 
> console_unlock+0x174/0x500
>   > > > >  #5: faa7f206 ( printing_lock){} , at: 
> vt_console_print+0x60/0x3a0
>   > > > > Preemption disabled at: [] 
> vt_console_print+0x60/0x3a0
>   > > > > CPU: 2 PID: 430 Comm: kworker/2:3 Not tainted 4.17.10-debug #3
>   > > > > Hardware name: Marvell Armada 8040 MacchiatoBin/Armada 8040 
> MacchiatoBin, BIOS EDK II Jul 30 2018
>   > > > > Workqueue: events dlfb_init_framebuffer_work [udlfb]
>   > > > > Call trace:
>   > > > >  dump_backtrace+0x0/0x150
>   > > > >  show_stack+0x14/0x20
>   > > > >  dump_stack+0x8c/0xac
>   > > > >  ___might_sleep+0x140/0x170
>   > > > >  __might_sleep+0x50/0x88
>   > > > >  __kmalloc+0x1b0/0x270
>   > > > >  xhci_urb_enqueue+0xa8/0x460 [xhci_hcd]
>   > > > >  usb_hcd_submit_urb+0xc0/0x998 [usbcore]
>   > > > >  usb_submit_urb+0x1e0/0x518 [usbcore]
>   > > > >  dlfb_submit_urb+0x38/0x98 [udlfb]
>   > > > >  dlfb_handle_damage.isra.4+0x1e0/0x210 [udlfb]
>   > > > >  dlfb_ops_imageblit+0x28/0x38 [udlfb]
>   > > > >  soft_cursor+0x15c/0x1d8 [fb]
>   > > > >  bit_cursor+0x324/0x510 [fb]
>   > > > >  fbcon_cursor+0x144/0x1a0 [fb]
>   > > > >  hide_cursor+0x38/0xa0
>   > > > >  vt_console_print+0x334/0x3a0
>   > > > >  console_unlock+0x274/0x500
>   > > > >  register_framebuffer+0x22c/0x2c0 [fb]
>   > > > >  dlfb_init_framebuffer_work+0x1ec/0x2fc [udlfb]
>   > > > >  process_one_work+0x1e8/0x3a8
>   > > > >  worker_thread+0x44/0x418
>   > > > >  kthread+0x11c/0x120
>   > > > >  ret_from_fork+0x10/0x18
>   > > >
>   > > > This is sort of expected: you cannot do USB transfers from 
> printk().
>   > > >
>   > > > Gr{oetje,eeting}s,
>   > > >
>   > > >                         Geert
>   > >
>   > > So, should there be a framebuffer flag that prevents the console 
> from
>   > > binding to it?
>   > >
>   > > If I start the kernel with "console=ttyS0,115200", it doesn't try 
> to bind
>   > > to the udlfb driver, but if I start it without this flag, it does 
> and
>   > > crashes :-(
>   >
>   > Your frame buffer driver should offload tasks that may sleep to e.g. a
>   > workqueue.
>   >
>   > Gr{oetje,eeting}s,
>   >
>   >                         Geert
> 
> I can try to do this - but - taking a spinlock and copying 8MB framebuffer
> would damage scheduling latency even for PCI framebuffer drivers.
> 
> Mikulas
> 
> 
> 
> 

Re: [LKP] d50d82faa0 [ 33.671845] WARNING: possible circular locking dependency detected

2018-11-12 Thread Mikulas Patocka



On Wed, 7 Nov 2018, Andrew Morton wrote:

> On Wed, 7 Nov 2018 15:43:36 -0800 Andrew Morton  
> wrote:
> 
> > On Tue, 23 Oct 2018 08:30:04 +0800 kernel test robot 
> >  wrote:
> > 
> > > Greetings,
> > > 
> > > 0day kernel testing robot got the below dmesg and the first bad commit is
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > > 
> > > commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0
> > > Author: Mikulas Patocka 
> > > AuthorDate: Wed Jun 27 23:26:09 2018 -0700
> > > Commit: Linus Torvalds 
> > > CommitDate: Thu Jun 28 11:16:44 2018 -0700
> > > 
> > > slub: fix failure when we delete and create a slab cache
> > 
> > This is ugly.  Is there an alternative way of fixing the race which
> > Mikulas attempted to address?  Possibly cancel the work and reuse the
> > existing sysfs file, or is that too stupid to live?
> > 
> > 3b7b314053d021 ("slub: make sysfs file removal asynchronous") was
> > pretty lame, really.  As mentioned,
> > 
> > : It'd be the cleanest to deal with the issue by removing sysfs files
> > : without holding slab_mutex before the rest of shutdown; however, given
> > : the current code structure, it is pretty difficult to do so.
> > 
> > Would be a preferable approach.
> > 
> > > 
> > > This uncovered a bug in the slub subsystem - if we delete a cache and
> > > immediatelly create another cache with the same attributes, it fails
> > > because of duplicate filename in /sys/kernel/slab/.  The slub 
> > > subsystem
> > > offloads freeing the cache to a workqueue - and if we create the new
> > > cache before the workqueue runs, it complains because of duplicate
> > > filename in sysfs.
> 
> Alternatively, could we flush the workqueue before attempting to
> (re)create the sysfs file?

What if someone creates the slab cache from the workqueue?

> Extra points for only doing this if the
> first (re)creation attempt returned -EEXIST?

If it returns -EEXIST, it has already written the warning to the log.

Mikulas


Re: [LKP] d50d82faa0 [ 33.671845] WARNING: possible circular locking dependency detected

2018-11-12 Thread Mikulas Patocka



On Wed, 7 Nov 2018, Andrew Morton wrote:

> On Wed, 7 Nov 2018 15:43:36 -0800 Andrew Morton  
> wrote:
> 
> > On Tue, 23 Oct 2018 08:30:04 +0800 kernel test robot 
> >  wrote:
> > 
> > > Greetings,
> > > 
> > > 0day kernel testing robot got the below dmesg and the first bad commit is
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> > > 
> > > commit d50d82faa0c964e31f7a946ba8aba7c715ca7ab0
> > > Author: Mikulas Patocka 
> > > AuthorDate: Wed Jun 27 23:26:09 2018 -0700
> > > Commit: Linus Torvalds 
> > > CommitDate: Thu Jun 28 11:16:44 2018 -0700
> > > 
> > > slub: fix failure when we delete and create a slab cache
> > 
> > This is ugly.  Is there an alternative way of fixing the race which
> > Mikulas attempted to address?  Possibly cancel the work and reuse the
> > existing sysfs file, or is that too stupid to live?
> > 
> > 3b7b314053d021 ("slub: make sysfs file removal asynchronous") was
> > pretty lame, really.  As mentioned,
> > 
> > : It'd be the cleanest to deal with the issue by removing sysfs files
> > : without holding slab_mutex before the rest of shutdown; however, given
> > : the current code structure, it is pretty difficult to do so.
> > 
> > Would be a preferable approach.
> > 
> > > 
> > > This uncovered a bug in the slub subsystem - if we delete a cache and
> > > immediately create another cache with the same attributes, it fails
> > > because of duplicate filename in /sys/kernel/slab/.  The slub 
> > > subsystem
> > > offloads freeing the cache to a workqueue - and if we create the new
> > > cache before the workqueue runs, it complains because of duplicate
> > > filename in sysfs.
> 
> Alternatively, could we flush the workqueue before attempting to
> (re)create the sysfs file?

What if someone creates the slab cache from the workqueue?

> Extra points for only doing this if the
> first (re)creation attempt returned -EEXIST?

If it returns -EEXIST, it has already written the warning to the log.

Mikulas
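The race under discussion is easy to model in userspace. The sketch below is hypothetical (the names registry_add, removal_work and demo_recreate are illustrative, not the actual slub code): cache removal is deferred to a worker, so immediately re-creating a cache with the same name collides with the not-yet-removed sysfs entry, and Andrew's suggestion maps to flushing the pending work and retrying once the first creation returns -EEXIST:

```c
#include <pthread.h>
#include <string.h>
#include <unistd.h>

#define EEXIST_ERR (-17)
#define MAX_NAMES 8

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static char registry[MAX_NAMES][32];    /* stands in for /sys/kernel/slab/ */

static int registry_add(const char *name)
{
	int i, ret = -1;

	pthread_mutex_lock(&reg_lock);
	for (i = 0; i < MAX_NAMES; i++)
		if (!strcmp(registry[i], name)) {
			pthread_mutex_unlock(&reg_lock);
			return EEXIST_ERR;      /* duplicate filename in sysfs */
		}
	for (i = 0; i < MAX_NAMES; i++)
		if (!registry[i][0]) {
			strncpy(registry[i], name, sizeof(registry[i]) - 1);
			ret = 0;
			break;
		}
	pthread_mutex_unlock(&reg_lock);
	return ret;
}

static void *removal_work(void *name)   /* the deferred sysfs removal */
{
	int i;

	usleep(20000);                  /* the queued work runs "later" */
	pthread_mutex_lock(&reg_lock);
	for (i = 0; i < MAX_NAMES; i++)
		if (!strcmp(registry[i], (const char *)name))
			registry[i][0] = '\0';
	pthread_mutex_unlock(&reg_lock);
	return NULL;
}

/* Destroy a cache, then immediately re-create one with the same name. */
static int demo_recreate(void)
{
	pthread_t worker;
	int ret;

	if (registry_add("kmalloc-96"))         /* original cache */
		return -1;
	/* kmem_cache_destroy(): the sysfs removal is only queued here */
	pthread_create(&worker, NULL, removal_work, "kmalloc-96");

	ret = registry_add("kmalloc-96");       /* immediate re-creation */
	if (ret == EEXIST_ERR) {
		pthread_join(worker, NULL);     /* "flush the workqueue" */
		ret = registry_add("kmalloc-96");
	} else {
		pthread_join(worker, NULL);
	}
	return ret;
}
```

Note that Mikulas's objection still applies to the real kernel: if a cache is ever created from the workqueue itself, flushing that workqueue from the creation path can deadlock, which is one reason the merged fix (d50d82faa0) took a different route.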


[PATCH] vt: fix broken display when running aptitude

2018-10-23 Thread Mikulas Patocka
If you run aptitude on the framebuffer console, the display is corrupted. 
The corruption is caused by commit d8ae7242, which adds "offset" to 
"start" when calling scr_memsetw but forgets to do the same addition on 
the subsequent call to do_update_region.

Signed-off-by: Mikulas Patocka 
Fixes: d8ae72427187 ("vt: preserve unicode values corresponding to screen 
characters")
Cc: sta...@vger.kernel.org  # 4.19

---
 drivers/tty/vt/vt.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/drivers/tty/vt/vt.c
===
--- linux-2.6.orig/drivers/tty/vt/vt.c  2018-10-23 16:15:08.0 +0200
+++ linux-2.6/drivers/tty/vt/vt.c   2018-10-23 16:29:20.0 +0200
@@ -1551,7 +1551,7 @@ static void csi_K(struct vc_data *vc, in
scr_memsetw(start + offset, vc->vc_video_erase_char, 2 * count);
vc->vc_need_wrap = 0;
if (con_should_update(vc))
-   do_update_region(vc, (unsigned long) start, count);
+   do_update_region(vc, (unsigned long)(start + offset), count);
 }
 
 static void csi_X(struct vc_data *vc, int vpar) /* erase the following vpar 
positions */
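The one-line fix above can be modeled in userspace. The sketch below is hypothetical (the vidmem/display split and the helpers update_region and erase_and_update are illustrative stand-ins, not the vt driver code): since the erase writes at start + offset, redrawing from start repaints the wrong span and leaves the erased cells stale on screen:

```c
#include <string.h>

#define COLS 16

static unsigned short vidmem[COLS];	/* backing screen buffer */
static unsigned short display[COLS];	/* what the user actually sees */

static void reset_screen(void)		/* fill both with visible text */
{
	int i;

	for (i = 0; i < COLS; i++)
		vidmem[i] = display[i] = 'X';
}

/* stand-in for do_update_region(): redraw count cells from the buffer */
static void update_region(const unsigned short *start, int count)
{
	memcpy(display + (start - vidmem), start, count * sizeof(*start));
}

/* stand-in for csi_K(): erase count cells beginning at column offset */
static void erase_and_update(int offset, int count, int buggy)
{
	unsigned short *start = vidmem;
	int i;

	for (i = 0; i < count; i++)	/* scr_memsetw() stand-in */
		start[offset + i] = ' ';

	if (buggy)
		update_region(start, count);		/* pre-fix: wrong span */
	else
		update_region(start + offset, count);	/* the fix above */
}
```

With the buggy variant, the cells at [offset, offset + count) are erased in the buffer but never pushed to the display, which is exactly the kind of stale-character corruption aptitude exposed.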


Re: [PATCHv5 0/7] tty: Hold write ldisc sem in tty_reopen()

2018-09-20 Thread Mikulas Patocka



On Wed, 19 Sep 2018, Dmitry Safonov wrote:

> On Wed, 2018-09-19 at 16:03 -0400, Mikulas Patocka wrote:
> > 
> > On Wed, 19 Sep 2018, Dmitry Safonov wrote:
> > > Thanks much for the testing, Mikulas.
> > > Could you try to bisect which of the patches causes it?
> > > The most important are 1,2,3 - probably, one of them caused it..
> > > I'll stare a bit into the code.
> > 
> > The patch 3 causes it.
> > 
> > The hangs during reboot take either 80 seconds, 3 minutes or 3:25.
> 
> Thanks much for that again.
> Any chance you have sysrq enabled and could print: locks held in system
> and kernel backtraces for applications?

Stacktrace doesn't work on parisc. It used to work on 4.18, but the kernel
4.19 reportedly requires patched binutils.

Mikulas

> I suppose, userspace hangs in n_tty_receive_buf_common(), which is
> better, of course, than a null-ptr dereference, but I believe we can
> improve it by stopping a reader earlier if there is any writer pending.
> 
> -- 
> Thanks,
>  Dmitry
> 

