Re: [patch 2/4] fs/buffer: Move BH_Uptodate_Lock locking into wrapper functions

2019-07-31 Thread Jan Kara
On Tue 30-07-19 13:24:54, Thomas Gleixner wrote:
> Bit spinlocks are problematic if PREEMPT_RT is enabled, because they
> disable preemption, which is undesired for latency reasons and breaks when
> regular spinlocks are taken within the bit_spinlock locked region because
> regular spinlocks are converted to 'sleeping spinlocks' on RT. So RT
> replaces the bit spinlocks with regular spinlocks to avoid this problem.
> 
> To avoid ifdeffery at the source level, wrap all BH_Uptodate_Lock bitlock
> operations with inline functions, so the spinlock substitution can be done
> at one place.
> 
> Using regular spinlocks can also be enabled for lock debugging purposes so
> the lock operations become visible to lockdep.
> 
> Signed-off-by: Thomas Gleixner 
> Cc: "Theodore Ts'o" 
> Cc: Matthew Wilcox 
> Cc: Alexander Viro 
> Cc: linux-fsde...@vger.kernel.org

Looks good to me. You can add:

Reviewed-by: Jan Kara 

BTW, it should be possible to get rid of BH_Uptodate_Lock altogether using
bio chaining (which was non-existent when this bh code was written) to make
sure the IO completion function gets called only once all bios used to fill in
/ write out the page are done. It would also be more efficient. But I guess
that's an interesting cleanup project for some other time...

Honza

> ---
>  fs/buffer.c |   20 ++--
>  fs/ext4/page-io.c   |6 ++
>  fs/ntfs/aops.c  |   10 +++---
>  include/linux/buffer_head.h |   16 
>  4 files changed, 27 insertions(+), 25 deletions(-)
> 
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -275,8 +275,7 @@ static void end_buffer_async_read(struct
>* decide that the page is now completely done.
>*/
>   first = page_buffers(page);
> - local_irq_save(flags);
> - bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
> + flags = bh_uptodate_lock_irqsave(first);
>   clear_buffer_async_read(bh);
>   unlock_buffer(bh);
>   tmp = bh;
> @@ -289,8 +288,7 @@ static void end_buffer_async_read(struct
>   }
>   tmp = tmp->b_this_page;
>   } while (tmp != bh);
> - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
> - local_irq_restore(flags);
> + bh_uptodate_unlock_irqrestore(first, flags);
>  
>   /*
>* If none of the buffers had errors and they are all
> @@ -302,9 +300,7 @@ static void end_buffer_async_read(struct
>   return;
>  
>  still_busy:
> - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
> - local_irq_restore(flags);
> - return;
> + bh_uptodate_unlock_irqrestore(first, flags);
>  }
>  
>  /*
> @@ -331,8 +327,7 @@ void end_buffer_async_write(struct buffe
>   }
>  
>   first = page_buffers(page);
> - local_irq_save(flags);
> - bit_spin_lock(BH_Uptodate_Lock, &first->b_state);
> + flags = bh_uptodate_lock_irqsave(first);
>  
>   clear_buffer_async_write(bh);
>   unlock_buffer(bh);
> @@ -344,15 +339,12 @@ void end_buffer_async_write(struct buffe
>   }
>   tmp = tmp->b_this_page;
>   }
> - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
> - local_irq_restore(flags);
> + bh_uptodate_unlock_irqrestore(first, flags);
>   end_page_writeback(page);
>   return;
>  
>  still_busy:
> - bit_spin_unlock(BH_Uptodate_Lock, &first->b_state);
> - local_irq_restore(flags);
> - return;
> + bh_uptodate_unlock_irqrestore(first, flags);
>  }
>  EXPORT_SYMBOL(end_buffer_async_write);
>  
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -90,8 +90,7 @@ static void ext4_finish_bio(struct bio *
>* We check all buffers in the page under BH_Uptodate_Lock
>* to avoid races with other end io clearing async_write flags
>*/
> - local_irq_save(flags);
> - bit_spin_lock(BH_Uptodate_Lock, &head->b_state);
> + flags = bh_uptodate_lock_irqsave(head);
>   do {
>   if (bh_offset(bh) < bio_start ||
>   bh_offset(bh) + bh->b_size > bio_end) {
> @@ -103,8 +102,7 @@ static void ext4_finish_bio(struct bio *
>   if (bio->bi_status)
>   buffer_io_error(bh);
>   } while ((bh = bh->b_this_page) != head);
> - bit_spin_unlock(BH_Uptodate_Lock, &head->b_state);
> - local_irq_restore(flags);
> + bh_uptodate_unlock_irqrestore(head, flags);
>

Re: [PATCH] udf: prevent allocation beyond UDF partition

2019-07-31 Thread Jan Kara
On Sun 28-07-19 14:19:12, Steve Magnani wrote:
> The UDF bitmap allocation code assumes that a recorded 
> Unallocated Space Bitmap is compliant with ECMA-167 4/13,
> which requires that pad bytes between the end of the bitmap 
> and the end of a logical block are all zero.
> 
> When a recorded bitmap does not comply with this requirement,
> for example one padded with FF to the block boundary instead
> of 00, the allocator may "allocate" blocks that are outside
> the UDF partition extent. This can result in UDF volume descriptors
> being overwritten by file data or by partition-level descriptors,
> and in extreme cases, even in scribbling on a subsequent disk partition.
> 
> Add a check that the block selected by the allocator actually
> resides within the UDF partition extent.
> 
> Signed-off-by: Steven J. Magnani 

Thanks for the patch! Added to my tree. I've just slightly modified the
patch to also output an error message about the filesystem corruption.

Honza

> 
> --- a/fs/udf/balloc.c 2019-07-26 11:35:28.249563705 -0500
> +++ b/fs/udf/balloc.c 2019-07-28 13:11:25.061431597 -0500
> @@ -325,6 +325,13 @@ got_block:
>   newblock = bit + (block_group << (sb->s_blocksize_bits + 3)) -
>   (sizeof(struct spaceBitmapDesc) << 3);
>  
> + if (newblock >= sbi->s_partmaps[partition].s_partition_len) {
> + /* Ran off the end of the bitmap,
> +  * and bits following are non-compliant (not all zero)
> +  */
> + goto error_return;
> + }
> +
>   if (!udf_clear_bit(bit, bh->b_data)) {
>   udf_debug("bit already cleared for block %d\n", bit);
>   goto repeat;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs: reiserfs: Remove unnecessary check of bh in remove_from_transaction()

2019-07-31 Thread Jan Kara
On Sat 27-07-19 16:40:19, Jia-Ju Bai wrote:
> On lines 3430-3434, bh has been assured to be non-null:
> cn = get_journal_hash_dev(sb, journal->j_hash_table, blocknr);
> if (!cn || !cn->bh) {
> return ret;
> }
> bh = cn->bh;
> 
> Thus, the check of bh on line 3447 is unnecessary and can be removed.
> Thanks to Andrew Morton for the good advice.
> 
> Signed-off-by: Jia-Ju Bai 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/reiserfs/journal.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 4517a1394c6f..11155b8513db 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -3444,9 +3444,8 @@ static int remove_from_transaction(struct super_block 
> *sb,
>   if (cn == journal->j_last) {
>   journal->j_last = cn->prev;
>   }
> - if (bh)
> - remove_journal_hash(sb, journal->j_hash_table, NULL,
> - bh->b_blocknr, 0);
> + remove_journal_hash(sb, journal->j_hash_table, NULL,
> + bh->b_blocknr, 0);
>       clear_buffer_journaled(bh); /* don't log this one */
>  
>   if (!already_cleaned) {
> -- 
> 2.17.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3] writeback: fix -Wstringop-truncation warnings

2019-07-30 Thread Jan Kara
On Thu 25-07-19 13:18:19, Qian Cai wrote:
> There are many of those warnings.
> 
> In file included from ./arch/powerpc/include/asm/paca.h:15,
>  from ./arch/powerpc/include/asm/current.h:13,
>  from ./include/linux/thread_info.h:21,
>  from ./include/asm-generic/preempt.h:5,
>  from ./arch/powerpc/include/generated/asm/preempt.h:1,
>  from ./include/linux/preempt.h:78,
>  from ./include/linux/spinlock.h:51,
>  from fs/fs-writeback.c:19:
> In function 'strncpy',
> inlined from 'perf_trace_writeback_page_template' at
> ./include/trace/events/writeback.h:56:1:
> ./include/linux/string.h:260:9: warning: '__builtin_strncpy' specified
> bound 32 equals destination size [-Wstringop-truncation]
>   return __builtin_strncpy(p, q, size);
>  ^
> 
> Fix it by using strscpy_pad(), introduced in commit 458a3bf82df4
> ("lib/string: Add strscpy_pad() function"), which always NUL-terminates,
> instead of strncpy(). Also, change strlcpy() to strscpy_pad() in this
> file for consistency.
> 
> Fixes: 455b2864686d ("writeback: Initial tracing support")
> Fixes: 028c2dd184c0 ("writeback: Add tracing to balance_dirty_pages")
> Fixes: e84d0a4f8e39 ("writeback: trace event writeback_queue_io")
> Fixes: b48c104d2211 ("writeback: trace event bdi_dirty_ratelimit")
> Fixes: cc1676d917f3 ("writeback: Move requeueing when I_SYNC set to 
> writeback_sb_inodes()")
> Fixes: 9fb0a7da0c52 ("writeback: add more tracepoints")
> Signed-off-by: Qian Cai 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> 
> v3: Rearrange a long-line a bit to make the code more readable.
> v2: Use strscpy_pad() to address the possible data leaking concern from Steve 
> [1].
> Replace strlcpy() as well for consistency.
> 
> [1] https://lore.kernel.org/lkml/20190716170339.1c447...@gandalf.local.home/
> 
>  include/trace/events/writeback.h | 38 --
>  1 file changed, 20 insertions(+), 18 deletions(-)
> 
> diff --git a/include/trace/events/writeback.h 
> b/include/trace/events/writeback.h
> index aa7f3aeac740..79095434c1be 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -66,8 +66,9 @@
>   ),
>  
>   TP_fast_assign(
> - strncpy(__entry->name,
> - mapping ? dev_name(inode_to_bdi(mapping->host)->dev) : 
> "(unknown)", 32);
> + strscpy_pad(__entry->name,
> + mapping ? 
> dev_name(inode_to_bdi(mapping->host)->dev) : "(unknown)",
> + 32);
>   __entry->ino = mapping ? mapping->host->i_ino : 0;
>   __entry->index = page->index;
>   ),
> @@ -110,8 +111,8 @@
>   struct backing_dev_info *bdi = inode_to_bdi(inode);
>  
>   /* may be called for files on pseudo FSes w/ unregistered bdi */
> - strncpy(__entry->name,
> - bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
> + strscpy_pad(__entry->name,
> + bdi->dev ? dev_name(bdi->dev) : "(unknown)", 32);
>   __entry->ino= inode->i_ino;
>   __entry->state  = inode->i_state;
>   __entry->flags  = flags;
> @@ -190,8 +191,8 @@ static inline unsigned int 
> __trace_wbc_assign_cgroup(struct writeback_control *w
>   ),
>  
>   TP_fast_assign(
> - strncpy(__entry->name,
> - dev_name(inode_to_bdi(inode)->dev), 32);
> + strscpy_pad(__entry->name,
> + dev_name(inode_to_bdi(inode)->dev), 32);
>   __entry->ino= inode->i_ino;
>   __entry->sync_mode  = wbc->sync_mode;
>   __entry->cgroup_ino = __trace_wbc_assign_cgroup(wbc);
> @@ -234,8 +235,9 @@ static inline unsigned int 
> __trace_wbc_assign_cgroup(struct writeback_control *w
>   __field(unsigned int, cgroup_ino)
>   ),
>   TP_fast_assign(
> - strncpy(__entry->name,
> - wb->bdi->dev ? dev_name(wb->bdi->dev) : "(unknown)", 
> 32);
> + strscpy_pad(__entry->name,
> + wb->bdi->dev ? dev_name(wb->bdi

Re: [PATCH] quota: fix condition for resetting time limit in do_set_dqblk()

2019-07-30 Thread Jan Kara
On Wed 24-07-19 13:32:16, Chengguang Xu wrote:
> We reset time limit when current usage is smaller
> or equal to soft limit in other place, so follow
> this rule in do_set_dqblk().
> 
> Signed-off-by: Chengguang Xu 

Thanks! Applied.

Honza

> ---
>  fs/quota/dquot.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index be9c471cdbc8..6e826b454082 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -2731,7 +2731,7 @@ static int do_set_dqblk(struct dquot *dquot, struct 
> qc_dqblk *di)
>  
>   if (check_blim) {
>   if (!dm->dqb_bsoftlimit ||
> - dm->dqb_curspace + dm->dqb_rsvspace < dm->dqb_bsoftlimit) {
> + dm->dqb_curspace + dm->dqb_rsvspace <= dm->dqb_bsoftlimit) {
>   dm->dqb_btime = 0;
>   clear_bit(DQ_BLKS_B, &dquot->dq_flags);
>   } else if (!(di->d_fieldmask & QC_SPC_TIMER))
> @@ -2740,7 +2740,7 @@ static int do_set_dqblk(struct dquot *dquot, struct 
> qc_dqblk *di)
>   }
>   if (check_ilim) {
>   if (!dm->dqb_isoftlimit ||
> - dm->dqb_curinodes < dm->dqb_isoftlimit) {
> + dm->dqb_curinodes <= dm->dqb_isoftlimit) {
>   dm->dqb_itime = 0;
>   clear_bit(DQ_INODES_B, &dquot->dq_flags);
>   } else if (!(di->d_fieldmask & QC_INO_TIMER))
> -- 
> 2.20.1
> 
> 
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] mm/filemap: don't initiate writeback if mapping has no dirty pages

2019-07-30 Thread Jan Kara
On Tue 30-07-19 17:57:18, Konstantin Khlebnikov wrote:
> On 30.07.2019 17:14, Jan Kara wrote:
> > On Tue 23-07-19 11:16:51, Konstantin Khlebnikov wrote:
> > > On 23.07.2019 3:52, Andrew Morton wrote:
> > > > 
> > > > (cc linux-fsdevel and Jan)
> > 
> > Thanks for the CC, Andrew.
> > 
> > > > On Mon, 22 Jul 2019 12:36:08 +0300 Konstantin Khlebnikov 
> > > >  wrote:
> > > > 
> > > > > Functions like filemap_write_and_wait_range() should do nothing if 
> > > > > inode
> > > > > has no dirty pages or pages currently under writeback. But they anyway
> > > > > construct struct writeback_control and this does some atomic 
> > > > > operations
> > > > > if CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
> > > > > updates state of writeback ownership, on slow path might be more work.
> > > > > Currently this path is safely avoided only when the inode mapping has no 
> > > > > pages.
> > > > > 
> > > > > For example generic_file_read_iter() calls 
> > > > > filemap_write_and_wait_range()
> > > > > at each O_DIRECT read - pretty hot path.
> > 
> > Yes, but in the common case mapping_needs_writeback() is false for files you do
> > direct IO to (exactly the case with no pages in the mapping). So you
> > shouldn't see the overhead at all. So which case do you really care about?
> > 
> > > > > This patch skips starting new writeback if mapping has no dirty tags 
> > > > > set.
> > > > > If writeback is already in progress filemap_write_and_wait_range() 
> > > > > will
> > > > > wait for it.
> > > > > 
> > > > > ...
> > > > > 
> > > > > --- a/mm/filemap.c
> > > > > +++ b/mm/filemap.c
> > > > > @@ -408,7 +408,8 @@ int __filemap_fdatawrite_range(struct 
> > > > > address_space *mapping, loff_t start,
> > > > >   .range_end = end,
> > > > >   };
> > > > > - if (!mapping_cap_writeback_dirty(mapping))
> > > > > + if (!mapping_cap_writeback_dirty(mapping) ||
> > > > > + !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > > > >   return 0;
> > > > >   wbc_attach_fdatawrite_inode(&wbc, mapping->host);
> > > > 
> > > > How does this play with tagged_writepages?  We assume that no tagging
> > > > has been performed by any __filemap_fdatawrite_range() caller?
> > > > 
> > > 
> > > Checking also PAGECACHE_TAG_TOWRITE is cheap but seems redundant.
> > > 
> > > To-write tags are supposed to be a subset of dirty tags:
> > > to-write is set only when dirty is set and cleared after starting 
> > > writeback.
> > > 
> > > The special case set_page_writeback_keepwrite(), which does not clear to-write,
> > > should only apply to a dirty page, so the dirty tag is not going to be cleared either.
> > > Ext4 calls it after redirty_page_for_writepage()
> > > XFS even without clear_page_dirty_for_io()
> > > 
> > > Anyway, a to-write tag without a dirty tag, or on a clean page, is confusing.
> > 
> > Yeah, TOWRITE tag is intended to be internal to writepages logic so your
> > patch is fine in that regard. Overall the patch looks good to me so I'm
> > just wondering a bit about the motivation...
> 
> In our case the file mixes cached pages and O_DIRECT reads. It's a kind of database
> where the index header is memory-mapped while the rest of the data is read via O_DIRECT.
> I suppose for sharing index between multiple instances.

OK, that has always been a bit problematic but you're not the first one to
have such a design ;). So feel free to add:

Reviewed-by: Jan Kara 

to your patch.

> On this path we also hit this bug:
> https://lore.kernel.org/lkml/156355839560.2063.5265687291430814589.stgit@buzz/
> so that's why I've started looking into this code.

I see. OK.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] mm/filemap: don't initiate writeback if mapping has no dirty pages

2019-07-30 Thread Jan Kara
On Tue 23-07-19 11:16:51, Konstantin Khlebnikov wrote:
> On 23.07.2019 3:52, Andrew Morton wrote:
> > 
> > (cc linux-fsdevel and Jan)

Thanks for the CC, Andrew.

> > On Mon, 22 Jul 2019 12:36:08 +0300 Konstantin Khlebnikov 
> >  wrote:
> > 
> > > Functions like filemap_write_and_wait_range() should do nothing if inode
> > > has no dirty pages or pages currently under writeback. But they anyway
> > > construct struct writeback_control and this does some atomic operations
> > > if CONFIG_CGROUP_WRITEBACK=y - on fast path it locks inode->i_lock and
> > > updates state of writeback ownership, on slow path might be more work.
> > > Currently this path is safely avoided only when the inode mapping has no pages.
> > > 
> > > For example generic_file_read_iter() calls filemap_write_and_wait_range()
> > > at each O_DIRECT read - pretty hot path.

Yes, but in the common case mapping_needs_writeback() is false for files you do
direct IO to (exactly the case with no pages in the mapping). So you
shouldn't see the overhead at all. So which case do you really care about?

> > > This patch skips starting new writeback if mapping has no dirty tags set.
> > > If writeback is already in progress filemap_write_and_wait_range() will
> > > wait for it.
> > > 
> > > ...
> > > 
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -408,7 +408,8 @@ int __filemap_fdatawrite_range(struct address_space 
> > > *mapping, loff_t start,
> > >   .range_end = end,
> > >   };
> > > - if (!mapping_cap_writeback_dirty(mapping))
> > > + if (!mapping_cap_writeback_dirty(mapping) ||
> > > + !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> > >   return 0;
> > >   wbc_attach_fdatawrite_inode(&wbc, mapping->host);
> > 
> > How does this play with tagged_writepages?  We assume that no tagging
> > has been performed by any __filemap_fdatawrite_range() caller?
> > 
> 
> Checking also PAGECACHE_TAG_TOWRITE is cheap but seems redundant.
> 
> To-write tags are supposed to be a subset of dirty tags:
> to-write is set only when dirty is set and cleared after starting writeback.
> 
> The special case set_page_writeback_keepwrite(), which does not clear to-write,
> should only apply to a dirty page, so the dirty tag is not going to be cleared either.
> Ext4 calls it after redirty_page_for_writepage()
> XFS even without clear_page_dirty_for_io()
> 
> Anyway, a to-write tag without a dirty tag, or on a clean page, is confusing.

Yeah, TOWRITE tag is intended to be internal to writepages logic so your
patch is fine in that regard. Overall the patch looks good to me so I'm
just wondering a bit about the motivation...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-29 Thread Jan Kara
On Tue 16-07-19 20:39:46, Dan Williams wrote:
> On Fri, Jul 12, 2019 at 2:14 AM Jan Kara  wrote:
> >
> > On Thu 11-07-19 08:25:50, Matthew Wilcox wrote:
> > > On Thu, Jul 11, 2019 at 07:13:50AM -0700, Matthew Wilcox wrote:
> > > > However, the XA_RETRY_ENTRY might be a good choice.  It doesn't normally
> > > > appear in an XArray (it may appear if you're looking at a deleted node,
> > > > but since we're holding the lock, we can't see deleted nodes).
> > >
> > ...
> >
> > > @@ -254,7 +267,7 @@ static void wait_entry_unlocked(struct xa_state *xas, 
> > > void *entry)
> > >  static void put_unlocked_entry(struct xa_state *xas, void *entry)
> > >  {
> > >   /* If we were the only waiter woken, wake the next one */
> > > - if (entry)
> > > + if (entry && dax_is_conflict(entry))
> >
> > This should be !dax_is_conflict(entry)...
> >
> > >   dax_wake_entry(xas, entry, false);
> > >  }
> >
> > Otherwise the patch looks good to me so feel free to add:
> >
> > Reviewed-by: Jan Kara 
> 
> Looks good, and passes the test case. Now pushed out to
> libnvdimm-for-next for v5.3 inclusion:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm.git/commit/?h=libnvdimm-for-next&id=23c84eb7837514e16d79ed6d849b13745e0ce688

Thanks for picking up the patch but you didn't apply the fix I've mentioned
above. So put_unlocked_entry() is not waking up anybody anymore... Since
this already got to Linus' tree, I guess a separate fixup patch is needed
(attached).

Honza

-- 
Jan Kara 
SUSE Labs, CR
From 950204f7dfdb06198f40820be4d33ce824508f11 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Mon, 29 Jul 2019 13:57:49 +0200
Subject: [PATCH] dax: Fix missed wakeup in put_unlocked_entry()

The condition checking whether put_unlocked_entry() needs to wake up
the following waiter got broken by commit 23c84eb78375 ("dax: Fix missed
wakeup with PMD faults"); as a result, processes waiting for an entry
lock could end up never being woken. We need to wake the waiter whenever
the passed entry is valid (i.e., non-NULL and not the special conflict
entry). Fix the condition.

CC: sta...@vger.kernel.org
Fixes: 23c84eb78375 ("dax: Fix missed wakeup with PMD faults")
Signed-off-by: Jan Kara 
---
 fs/dax.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index a237141d8787..b64964ef44f6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -266,7 +266,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
-	if (entry && dax_is_conflict(entry))
+	if (entry && !dax_is_conflict(entry))
 		dax_wake_entry(xas, entry, false);
 }
 
-- 
2.16.4



Re: [PATCH] trace/writeback: fix Wstringop-truncation warnings

2019-07-29 Thread Jan Kara
> > TP_fast_assign(
> > -   strncpy(__entry->name,
> > +   strlcpy(__entry->name,
> > wb->bdi->dev ? dev_name(wb->bdi->dev) : "(unknown)", 
> > 32);
> > __entry->nr_pages = work->nr_pages;
> > __entry->sb_dev = work->sb ? work->sb->s_dev : 0;
> > @@ -288,7 +288,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > __field(unsigned int, cgroup_ino)
> > ),
> > TP_fast_assign(
> > -   strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
> > +   strlcpy(__entry->name, dev_name(wb->bdi->dev), 32);
> > __entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
> > ),
> > TP_printk("bdi %s: cgroup_ino=%u",
> > @@ -310,7 +310,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > __array(char, name, 32)
> > ),
> > TP_fast_assign(
> > -   strncpy(__entry->name, dev_name(bdi->dev), 32);
> > +   strlcpy(__entry->name, dev_name(bdi->dev), 32);
> > ),
> > TP_printk("bdi %s",
> > __entry->name
> > @@ -335,7 +335,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > ),
> >  
> > TP_fast_assign(
> > -   strncpy(__entry->name, dev_name(bdi->dev), 32);
> > +   strlcpy(__entry->name, dev_name(bdi->dev), 32);
> > __entry->nr_to_write= wbc->nr_to_write;
> > __entry->pages_skipped  = wbc->pages_skipped;
> > __entry->sync_mode  = wbc->sync_mode;
> > @@ -386,7 +386,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > ),
> > TP_fast_assign(
> > unsigned long *older_than_this = work->older_than_this;
> > -   strncpy(__entry->name, dev_name(wb->bdi->dev), 32);
> > +   strlcpy(__entry->name, dev_name(wb->bdi->dev), 32);
> > __entry->older  = older_than_this ?  *older_than_this : 0;
> > __entry->age= older_than_this ?
> >   (jiffies - *older_than_this) * 1000 / HZ : -1;
> > @@ -597,7 +597,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > ),
> >  
> > TP_fast_assign(
> > -   strncpy(__entry->name,
> > +   strlcpy(__entry->name,
> > dev_name(inode_to_bdi(inode)->dev), 32);
> > __entry->ino= inode->i_ino;
> > __entry->state  = inode->i_state;
> > @@ -671,7 +671,7 @@ static inline unsigned int 
> > __trace_wbc_assign_cgroup(struct writeback_control *w
> > ),
> >  
> > TP_fast_assign(
> > -   strncpy(__entry->name,
> > +   strlcpy(__entry->name,
> > dev_name(inode_to_bdi(inode)->dev), 32);
> > __entry->ino= inode->i_ino;
> > __entry->state  = inode->i_state;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 1/2] udf: refactor VRS descriptor identification

2019-07-12 Thread Jan Kara
On Thu 11-07-19 20:15:21, Pali Rohár  wrote:
> On Thursday 11 July 2019 08:38:51 Steven J. Magnani wrote:
> > --- a/fs/udf/super.c2019-07-10 18:57:41.192852154 -0500
> > +++ b/fs/udf/super.c2019-07-10 20:47:50.438352500 -0500
> > @@ -685,16 +685,62 @@ out_unlock:
> > return error;
> >  }
> >  
> > -/* Check Volume Structure Descriptors (ECMA 167 2/9.1) */
> > -/* We also check any "CD-ROM Volume Descriptor Set" (ECMA 167 2/8.3.1) */
> > -static loff_t udf_check_vsd(struct super_block *sb)
> > +static int identify_vsd(const struct volStructDesc *vsd)
> > +{
> > +   int vsd_id = 0;
> > +
> > +   if (!strncmp(vsd->stdIdent, VSD_STD_ID_CD001, VSD_STD_ID_LEN)) {
> 
> Hi! You probably want to use memcmp() instead of strncmp().

There's no difference in functionality but I agree it makes more sense.
I'll modify the patch. Thanks for the review!

Honza

> 
> > +   switch (vsd->structType) {
> > +   case 0:
> > +   udf_debug("ISO9660 Boot Record found\n");
> > +   break;
> > +   case 1:
> > +   udf_debug("ISO9660 Primary Volume Descriptor found\n");
> > +   break;
> > +   case 2:
> > +   udf_debug("ISO9660 Supplementary Volume Descriptor 
> > found\n");
> > +   break;
> > +   case 3:
> > +   udf_debug("ISO9660 Volume Partition Descriptor 
> > found\n");
> > +   break;
> > +   case 255:
> > +   udf_debug("ISO9660 Volume Descriptor Set Terminator 
> > found\n");
> > +   break;
> > +   default:
> > +   udf_debug("ISO9660 VRS (%u) found\n", vsd->structType);
> > +   break;
> > +   }
> > +   } else if (!strncmp(vsd->stdIdent, VSD_STD_ID_BEA01, VSD_STD_ID_LEN))
> > +   vsd_id = 1;
> > +   else if (!strncmp(vsd->stdIdent, VSD_STD_ID_NSR02, VSD_STD_ID_LEN))
> > +   vsd_id = 2;
> > +   else if (!strncmp(vsd->stdIdent, VSD_STD_ID_NSR03, VSD_STD_ID_LEN))
> > +   vsd_id = 3;
> > +   else if (!strncmp(vsd->stdIdent, VSD_STD_ID_BOOT2, VSD_STD_ID_LEN))
> > +   ; /* vsd_id = 0 */
> > +   else if (!strncmp(vsd->stdIdent, VSD_STD_ID_CDW02, VSD_STD_ID_LEN))
> > +   ; /* vsd_id = 0 */
> > +   else {
> > +   /* TEA01 or invalid id : end of volume recognition area */
> > +   vsd_id = 255;
> > +   }
> > +
> > +   return vsd_id;
> > +}
> 
> -- 
> Pali Rohár
> pali.ro...@gmail.com


-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 2/2] udf: support 2048-byte spacing of VRS descriptors on 4K media

2019-07-12 Thread Jan Kara
On Thu 11-07-19 10:56:52, Steve Magnani wrote:
> On 7/11/19 10:04 AM, Jan Kara wrote:
> > Thanks for the patches! I've added them to my tree and somewhat simplified
> > the logic since we don't really care about nsr 2 vs 3 or whether we
> > actually saw BEA or not. Everything seems to work fine for me but I'd
> > appreciate if you could doublecheck - the result is pushed out to
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_next
> > 
> Tested-by: Steven J. Magnani 
> 
> The rework is more permissive than what you had suggested initially
> (conditioning acceptance of a noncompliant NSR on a preceding BEA).
> I had also tried to code the original so that a malformed 2048-byte
> interval VRS would not be accepted. But the simplifications do make
> the code easier to follow...

Yeah, it's simpler to follow and we do this check just to see whether this
may be a valid UDF media before doing more expensive probing. So I don't
think that the code being more permissive matters. Thanks for testing!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-12 Thread Jan Kara
On Thu 11-07-19 08:25:50, Matthew Wilcox wrote:
> On Thu, Jul 11, 2019 at 07:13:50AM -0700, Matthew Wilcox wrote:
> > However, the XA_RETRY_ENTRY might be a good choice.  It doesn't normally
> > appear in an XArray (it may appear if you're looking at a deleted node,
> > but since we're holding the lock, we can't see deleted nodes).
> 
...

> @@ -254,7 +267,7 @@ static void wait_entry_unlocked(struct xa_state *xas, 
> void *entry)
>  static void put_unlocked_entry(struct xa_state *xas, void *entry)
>  {
>   /* If we were the only waiter woken, wake the next one */
> - if (entry)
> + if (entry && dax_is_conflict(entry))

This should be !dax_is_conflict(entry)...

>   dax_wake_entry(xas, entry, false);
>  }

Otherwise the patch looks good to me so feel free to add:

Reviewed-by: Jan Kara 

once you fix this.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 2/2] udf: support 2048-byte spacing of VRS descriptors on 4K media

2019-07-11 Thread Jan Kara
On Thu 11-07-19 08:38:52,  Steven J. Magnani  wrote:
> Some UDF creators (specifically Microsoft, but perhaps others) mishandle
> the ECMA-167 corner case that requires descriptors within a Volume
> Recognition Sequence to be placed at 4096-byte intervals on media where
> the block size is 4K. Instead, the descriptors are placed at the 2048-
> byte interval mandated for media with smaller blocks. This nonconformity
> currently prevents Linux from recognizing the filesystem as UDF.
> 
> Modify the driver to tolerate a misformatted VRS on 4K media.
> 
> Signed-off-by: Steven J. Magnani 

Thanks for the patches! I've added them to my tree and somewhat simplified
the logic since we don't really care about nsr 2 vs 3 or whether we
actually saw BEA or not. Everything seems to work fine for me but I'd
appreciate if you could doublecheck - the result is pushed out to

git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git for_next

Thanks!

Honza
> 
> --- a/fs/udf/super.c  2019-07-10 20:55:33.334359446 -0500
> +++ b/fs/udf/super.c  2019-07-10 21:20:58.138382326 -0500
> @@ -741,6 +741,7 @@ static int udf_check_vsd(struct super_bl
>   int sectorsize;
>   struct buffer_head *bh = NULL;
>   int nsr = 0;
> + int quirk_nsr = 0;
>   struct udf_sb_info *sbi;
>  
>   sbi = UDF_SB(sb);
> @@ -780,11 +781,27 @@ static int udf_check_vsd(struct super_bl
>   if (vsd_id > nsr)
>   nsr = vsd_id;
>  
> + /* Special handling for improperly formatted VRS (e.g., Win10)
> +  * where components are separated by 2048 bytes
> +  * even though sectors are 4K
> +  */
> + if ((sb->s_blocksize == 4096) && (quirk_nsr < 2)) {
> + vsd_id = identify_vsd(vsd + 1);
> + if ((nsr == 1) || (quirk_nsr == 1)) {
> + /* BEA01 has been seen, allow quirk NSR */
> + if (vsd_id > quirk_nsr)
> + quirk_nsr = vsd_id;
> + } else if (vsd_id > 3)
> + quirk_nsr = vsd_id;  /* 0 -> 255 */
> + }
> +
>   brelse(bh);
>   }
>  
>   if ((nsr >= 2) && (nsr <= 3))
>   return nsr;
> + else if ((quirk_nsr >= 2) && (quirk_nsr <= 3))
> + return quirk_nsr;
>   else if (!bh && sector - (sbi->s_session << sb->s_blocksize_bits) ==
>   VSD_FIRST_SECTOR_OFFSET)
>   return -1;
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 13:15:39, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > +#define DAX_ENTRY_CONFLICT dax_make_entry(pfn_to_pfn_t(1), DAX_EMPTY)
> 
> I was hoping to get rid of DAX_EMPTY ... it's almost unused now.  Once
> we switch to having a single DAX_LOCK value instead of a single bit,
> I think it can go away, freeing up two bits.
> 
> If you really want a special DAX_ENTRY_CONFLICT, I think we can make
> one in the 2..4094 range.
> 
> That aside, this looks pretty similar to the previous patch I sent, so
> if you're now happy with this, let's add
> 
> #define XA_DAX_CONFLICT_ENTRY xa_mk_internal(258)
> 
> to xarray.h and do it that way?

Yeah, that would work for me as well. The chosen value for DAX_ENTRY_CONFLICT
was pretty arbitrary. Or we could possibly use:

#define DAX_ENTRY_CONFLICT XA_ZERO_ENTRY

so that we don't leak a DAX-specific internal definition into xarray.h?

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 20:35:55, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > So how about the attached patch? That keeps the interface sane and passes a
> > smoketest for me (full fstest run running). Obviously it also needs a
> > proper changelog...
> 
> Changelog and slightly massaged version along the lines of my two comments
> attached.
> 

> From 57b63fdd38e7bea7eb8d6332f0163fb028570def Mon Sep 17 00:00:00 2001
> From: "Matthew Wilcox (Oracle)" 
> Date: Wed, 3 Jul 2019 23:21:25 -0400
> Subject: [PATCH] dax: Fix missed wakeup with PMD faults
> 
> RocksDB can hang indefinitely when using a DAX file.  This is due to
> a bug in the XArray conversion when handling a PMD fault and finding a
> PTE entry.  We use the wrong index in the hash and end up waiting on
> the wrong waitqueue.
> 
> There's actually no need to wait; if we find a PTE entry while looking
> for a PMD entry, we can return immediately as we know we should fall
> back to a PTE fault (which may not conflict with the lock held).
> 
> Cc: sta...@vger.kernel.org
> Fixes: b15cd800682f ("dax: Convert page fault handlers to XArray")
> Signed-off-by: Matthew Wilcox (Oracle) 

Just one nit below. Otherwise feel free to add:

Reviewed-by: Jan Kara 

> diff --git a/include/linux/xarray.h b/include/linux/xarray.h
> index 052e06ff4c36..fb25452bcfa4 100644
> --- a/include/linux/xarray.h
> +++ b/include/linux/xarray.h
> @@ -169,7 +169,9 @@ static inline bool xa_is_internal(const void *entry)
>   return ((unsigned long)entry & 3) == 2;
>  }
>  
> +#define XA_RETRY_ENTRY		xa_mk_internal(256)
>  #define XA_ZERO_ENTRY		xa_mk_internal(257)
> +#define XA_DAX_CONFLICT_ENTRY	xa_mk_internal(258)
>  
>  /**
>   * xa_is_zero() - Is the entry a zero entry?

As I wrote in my previous email, wouldn't it be nicer if we just defined
DAX_CONFLICT_ENTRY (or however we name it) inside dax code say to
XA_ZERO_ENTRY?  Generic xarray code just doesn't care about that value and
I can imagine in the future there'll be other xarray users who'd like to have
some special value(s) for use similarly to DAX and we don't want to clutter
xarray definitions with those as well. If you don't like XA_ZERO_ENTRY, I
could also imagine having XA_USER_RESERVED value that's guaranteed to be
unused by xarray and we'd define DAX_CONFLICT_ENTRY to it. Overall I don't
care too much so I can live even with what you have now but it would seem
cleaner that way to me.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-11 Thread Jan Kara
On Wed 10-07-19 20:08:55, Matthew Wilcox wrote:
> On Wed, Jul 10, 2019 at 09:02:04PM +0200, Jan Kara wrote:
> > @@ -848,7 +853,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
> > if (unlikely(dax_is_locked(entry))) {
> > void *old_entry = entry;
> >  
> > -   entry = get_unlocked_entry(xas);
> > +   entry = get_unlocked_entry(xas, 0);
> >  
> > /* Entry got punched out / reallocated? */
> > if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
> 
> I'm not sure about this one.  Are we sure there will never be a dirty
> PMD entry?  Even if we can't create one today, it feels like a bit of
> a landmine to leave for someone who creates one in the future.

I was thinking about this but dax_writeback_one() doesn't really care what
entry it gets. Yes, in theory it could get a PMD when previously there was
a PTE or vice versa, but we check that the PFNs match and if they really do
match, there's no harm in doing the flushing whatever entry we got back...
And the checks are simpler this way.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-10 Thread Jan Kara
On Fri 05-07-19 13:47:02, Dan Williams wrote:
> On Fri, Jul 5, 2019 at 12:10 PM Matthew Wilcox  wrote:
> >
> > On Thu, Jul 04, 2019 at 04:27:14PM -0700, Dan Williams wrote:
> > > On Thu, Jul 4, 2019 at 12:14 PM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Thu, Jul 04, 2019 at 06:54:50PM +0200, Jan Kara wrote:
> > > > > On Wed 03-07-19 20:27:28, Matthew Wilcox wrote:
> > > > > > So I think we're good for all current users.
> > > > >
> > > > > Agreed but it is an ugly trap. As I already said, I'd rather pay the
> > > > > unnecessary cost of waiting for pte entry and have an easy to 
> > > > > understand
> > > > > interface. If we ever have a real world use case that would care for 
> > > > > this
> > > > > optimization, we will need to refactor functions to make this 
> > > > > possible and
> > > > > still keep the interfaces sane. For example get_unlocked_entry() could
> > > > > return special "error code" indicating that there's no entry with 
> > > > > matching
> > > > > order in xarray but there's a conflict with it. That would be much 
> > > > > less
> > > > > error-prone interface.
> > > >
> > > > This is an internal interface.  I think it's already a pretty gnarly
> > > > interface to use by definition -- it's going to sleep and might return
> > > > almost anything.  There's not much scope for returning an error 
> > > > indicator
> > > > either; value entries occupy half of the range (all odd numbers between 
> > > > 1
> > > > and ULONG_MAX inclusive), plus NULL.  We could use an internal entry, 
> > > > but
> > > > I don't think that makes the interface any easier to use than returning
> > > > a locked entry.
> > > >
> > > > I think this iteration of the patch makes it a little clearer.  What do 
> > > > you
> > > > think?
> > > >
> > >
> > > Not much clearer to me. get_unlocked_entry() is now misnamed and this
> >
> > misnamed?  You'd rather it was called "try_get_unlocked_entry()"?
> 
> I was thinking more along the lines of
> get_unlocked_but_sometimes_locked_entry(), i.e. per Jan's feedback to
> keep the interface simple.

So how about the attached patch? That keeps the interface sane and passes a
smoketest for me (full fstest run running). Obviously it also needs a
proper changelog...

Honza

-- 
Jan Kara 
SUSE Labs, CR
From 1aeaba0e061b2bf38143f21d054e66853543a680 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Wed, 10 Jul 2019 20:28:37 +0200
Subject: [PATCH] dax: Fix missed PMD wakeups

Signed-off-by: Jan Kara 
---
 fs/dax.c | 46 +-
 1 file changed, 25 insertions(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index fe5e33810cd4..3fe655d38c7a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -191,15 +191,18 @@ static void dax_wake_entry(struct xa_state *xas, void *entry, bool wake_all)
 		__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
 }
 
+#define DAX_ENTRY_CONFLICT dax_make_entry(pfn_to_pfn_t(1), DAX_EMPTY)
 /*
  * Look up entry in page cache, wait for it to become unlocked if it
  * is a DAX entry and return it.  The caller must subsequently call
  * put_unlocked_entry() if it did not lock the entry or dax_unlock_entry()
- * if it did.
+ * if it did. 'order' is the minimum order of entry to return. If there's no
+ * entry of sufficiently large order but there are some entries of lower order
+ * in the range described by xas, return special DAX_ENTRY_CONFLICT value.
  *
  * Must be called with the i_pages lock held.
  */
-static void *get_unlocked_entry(struct xa_state *xas)
+static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
 {
 	void *entry;
 	struct wait_exceptional_entry_queue ewait;
@@ -210,6 +213,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
 
 	for (;;) {
 		entry = xas_find_conflict(xas);
+		if (dax_entry_order(entry) < order)
+			return DAX_ENTRY_CONFLICT;
 		if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
				!dax_is_locked(entry))
 			return entry;
@@ -254,7 +259,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
 static void put_unlocked_entry(struct xa_state *xas, void *entry)
 {
 	/* If we were the only waiter woken, wake the next one */
-	if (entry)
+	if (entry && entry != DAX_ENTRY_CONFLICT)
 		dax_wake_entry(xas, entry, false);
 }
 
@@ -461,7 +466,7 @@ void dax_unlock_page(str

Re: pagecache locking

2019-07-10 Thread Jan Kara
On Wed 10-07-19 09:47:12, Dave Chinner wrote:
> On Mon, Jul 08, 2019 at 03:31:14PM +0200, Jan Kara wrote:
> > I'd be really careful with nesting range locks. You can have nasty
> > situations like:
> > 
> > Thread 1Thread 2
> > read_lock(0,1000)   
> > write_lock(500,1500) -> blocks due to read lock
> > read_lock(1001,1500) -> blocks due to write lock (or you have to break
> >   fairness and then deal with starvation issues).
> >
> > So once you allow nesting of range locks, you have to very carefully define
> > what is and what is not allowed.
> 
> Yes. I do understand the problem with rwsem read nesting and how
> that can translate to reange locks.
> 
> That's why my range locks don't even try to block on other pending
> waiters. The case where read nesting vs write might occur like above
> is something like copy_file_range() vs truncate, but otherwise for
> IO locks we simply don't have arbitrarily deep nesting of range
> locks.
> 
> i.e. for your example my range lock would result in:
> 
> Thread 1  Thread 2
> read_lock(0,1000) 
>   write_lock(500,1500)
>   
>   
>   
> <...>
> read_lock_nested(1001,1500)
> 
> 
> 
> <>
> read_unlock(1001,1500)   <no waiters on (1001,1500) so no wakeup>
> <>
> read_unlock(0,1000)
> 
>   
>   
>   
> 
> IOWs, I'm not trying to complicate the range lock implementation
> with complex stuff like waiter fairness or anti-starvation semantics
> at this point in time. Waiters simply don't impact whether a new lock
> can be gained or not, and hence the limitations of rwsem semantics
> don't apply.
> 
> If such functionality is necessary (and I suspect it will be to
> prevent AIO from delaying truncate and fallocate-like operations
> indefinitely) then I'll add a "barrier" lock type (e.g.
> range_lock_write_barrier()) that will block new range lock attempts
> across it's span.
> 
> However, because this can cause deadlocks like the above, a barrier
> lock will not block new *_lock_nested() or *_trylock() calls, hence
> allowing runs of nested range locking to complete and drain without
> deadlocking on a conflicting barrier range. And that still can't be
> modelled by existing rwsem semantics

Clever :). Thanks for explanation.

> > That's why in my range lock implementation
> > ages back I've decided to treat range lock as a rwsem for deadlock
> > verification purposes.
> 
> IMO, treating a range lock as a rwsem for deadlock purposes defeats
> the purpose of adding range locks in the first place. The
> concurrency models are completely different, and some of the
> limitations on rwsems are a result of implementation choices rather
> than limitations of a rwsem construct.

Well, even a range lock that cannot nest allows concurrent non-overlapping
read and write to the same file which rwsem doesn't allow. But I agree that
your nesting variant offers more (but also at the cost of higher complexity
of the lock semantics).

> In reality I couldn't care less about what lockdep can or can't
> verify. I've always said lockdep is a crutch for people who don't
> understand locks and the concurrency model of the code they
> maintain. That obviously extends to the fact that lockdep
> verification limitations should not limit what we allow new locking
> primitives to do.

I didn't say we have to constrain new locking primitives to what lockdep
can support. It is just a locking correctness verification tool so naturally
lockdep should be taught what it needs to know about any locking scheme
we come up with. And sometimes it is just too much effort which is why e.g.
page lock still doesn't have lockdep coverage.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] udf: Fix incorrect final NOT_ALLOCATED (hole) extent length

2019-07-09 Thread Jan Kara
r = udf_do_extend_final_block(inode, &epos, &extent,
> + partial_final_block);
> + } else {
> + loff_t add = (offset << sb->s_blocksize_bits) |

Please cast 'offset' to loff_t before shifting. Otherwise the shift could
overflow on systems with a 32-bit sector_t.

> +  partial_final_block;
> + err = udf_do_extend_file(inode, &epos, &extent, add);
> + }
> +
>   if (err < 0)
>   goto out;
>   err = 0;
...
> @@ -760,7 +806,8 @@ static sector_t inode_getblk(struct inod
>       startnum = (offset > 0);
>   }
>   /* Create extents for the hole between EOF and offset */
> - ret = udf_do_extend_file(inode, &prev_epos, laarr, offset);
> + hole_len = offset << inode->i_sb->s_blocksize_bits;

The same as above.

> + ret = udf_do_extend_file(inode, &prev_epos, laarr, hole_len);
>   if (ret < 0) {
>   *err = ret;
>   newblock = 0;

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC] udf: 2.01 interoperability issues with Windows 10

2019-07-09 Thread Jan Kara
Hi!

On Tue 09-07-19 13:27:58, Steve Magnani wrote:
> Recently I have been exploring Advanced Format (4K sector size)
> and high capacity aspects of UDF 2.01 support in Linux and
> Windows 10. I thought it might be helpful to summarize my findings.
> 
> The good news is that I did not see any bugs in the Linux
> ecosystem (kernel driver + mkudffs).
> 
> The not-so-good news is that Windows has some issues that affect
> interoperability. One of my goals in posting this is to open a
> discussion on whether changes should be made in the Linux UDF
> ecosystem to accommodate these quirks.
> 
> My test setup includes the following software components:
> 
> * mkudffs 1.3 and 2.0
> * kernel 4.15.0-43 and 4.15.0-52
> * Windows 10 1803 17134.829
> * chkdsk 10.0.17134.1
> * udfs.sys 10.0.17134.648
> 
> 
> ISSUE 1: Inability of the Linux UDF driver to mount 4K-sector
>  media formatted by Windows.
> 
> This is because the Windows ecosystem mishandles the ECMA-167
> corner case that requires Volume Recognition Sequence components
> to be placed at 4K intervals on 4K-sector media, instead of the
> 2K intervals required with smaller sectors. The Linux UDF driver
> emits the following when presented with Windows-formatted media:
> 
>   UDF-fs: warning (device sdc1): udf_load_vrs: No VRS found
>   UDF-fs: Scanning with blocksize 4096 failed
> 
> A hex dump of the VRS written by the Windows 10 'format' utility
> yields this:
> 
>   0000: 00 42 45 41 30 31 01 00 00 00 00 00 00 00 00 00  .BEA01..
>   0800: 00 4e 53 52 30 33 01 00 00 00 00 00 00 00 00 00  .NSR03..
>   1000: 00 54 45 41 30 31 01 00 00 00 00 00 00 00 00 00  .TEA01..
> 
> We may want to consider tweaking the kernel UDF driver to
> tolerate this quirk; if so a question is whether that should be
> done automatically, only in response to a mount option or
> module parameter, or only with some subsequent confirmation
> that the medium was formatted by Windows.

Yeah, I think we could do that although it will be a bit painful. I would
do it by default if we found a BEA descriptor but no NSR descriptor. We do
already have handling for various quirks of other udf creators so this is
nothing new... I think Palo replied about the rest of the issues you've
found.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: pagecache locking

2019-07-08 Thread Jan Kara
On Sat 06-07-19 09:31:57, Dave Chinner wrote:
> On Wed, Jul 03, 2019 at 03:04:45AM +0300, Boaz Harrosh wrote:
> > On 20/06/2019 01:37, Dave Chinner wrote:
> > <>
> > > 
> > > I'd prefer it doesn't get lifted to the VFS because I'm planning on
> > > getting rid of it in XFS with range locks. i.e. the XFS_MMAPLOCK is
> > > likely to go away in the near term because a range lock can be
> > > taken on either side of the mmap_sem in the page fault path.
> > > 
> > <>
> > Sir Dave
> > 
> > Sorry if this was answered before. I am please very curious. In the zufs
> > project I have an equivalent rw_MMAPLOCK that I _read_lock on page_faults.
> > (Read & writes all take read-locks ...)
> > The only reason I have it is because of lockdep actually.
> > 
> > Specifically for those xfstests that mmap a buffer then direct_IO in/out
> > of that buffer from/to another file in the same FS or the same file.
> > (For lockdep its the same case).
> 
> Which can deadlock if the same inode rwsem is taken on both sides of
> the mmap_sem, as lockdep tells you...
> 
> > I would be perfectly happy to recursively _read_lock both from the top
> > of the page_fault at the DIO path, and under in the page_fault. I'm
> > _read_locking after all. But lockdep is hard to convince. So I stole the
> > xfs idea of having an rw_MMAPLOCK. And grab yet another _write_lock at
> > truncate/punch/clone time when all mapping traversal needs to stop for
> > the destructive change to take place. (Allocations are done another way
> > and are race safe with traversal)
> > 
> > How do you intend to address this problem with range-locks? ie recursively
> > taking the same "lock"? because if not for the recursive-ity and lockdep I 
> > would
> > not need the extra lock-object per inode.
> 
> As long as the IO ranges to the same file *don't overlap*, it should
> be perfectly safe to take separate range locks (in read or write
> mode) on either side of the mmap_sem as non-overlapping range locks
> can be nested and will not self-deadlock.

I'd be really careful with nesting range locks. You can have nasty
situations like:

Thread 1Thread 2
read_lock(0,1000)   
write_lock(500,1500) -> blocks due to read lock
read_lock(1001,1500) -> blocks due to write lock (or you have to break
  fairness and then deal with starvation issues).

So once you allow nesting of range locks, you have to very carefully define
what is and what is not allowed. That's why in my range lock implementation
ages back I've decided to treat range lock as a rwsem for deadlock
verification purposes.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: kernel BUG at mm/swap_state.c:170!

2019-07-05 Thread Jan Kara
On Fri 05-07-19 20:19:48, Mikhail Gavrilov wrote:
> Hey folks.
> Excuse me, has anybody read my previous message?
> 5.2-rc7 is still affected by this issue [the logs in file
> dmesg-5.2rc7-0.1.tar.xz] and I worry that stable 5.2 would be released
> with this bug because there is almost no time left and I didn't see
> the attention to this problem.
> I confirm that reverting commit 5fd4ca2d84b2 on top of the rc7 tag
> helps fix it [the logs in file dmesg-5.2rc7-0.2.tar.xz].
> I am still awaiting any feedback here.

Yeah, I guess revert of 5fd4ca2d84b2 at this point is probably the best we
can do. Let's CC Linus, Andrew, and Greg (Linus is travelling AFAIK so I'm
not sure whether Greg won't do the release for him).

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] dax: Fix missed PMD wakeups

2019-07-04 Thread Jan Kara
On Wed 03-07-19 20:27:28, Matthew Wilcox wrote:
> On Wed, Jul 03, 2019 at 02:28:41PM -0700, Dan Williams wrote:
> > On Wed, Jul 3, 2019 at 12:53 PM Matthew Wilcox  wrote:
> > > @@ -211,7 +215,8 @@ static void *get_unlocked_entry(struct xa_state *xas)
> > > for (;;) {
> > > entry = xas_find_conflict(xas);
> > > if (!entry || WARN_ON_ONCE(!xa_is_value(entry)) ||
> > > -   !dax_is_locked(entry))
> > > +   !dax_is_locked(entry) ||
> > > +   dax_entry_order(entry) < xas_get_order(xas))
> > 
> > Doesn't this potentially allow a locked entry to be returned for a
> > caller that expects all value entries are unlocked?
> 
> It only allows locked entries to be returned for callers which pass in
> an xas which refers to a PMD entry.  This is fine for grab_mapping_entry()
> because it checks size_flag & is_pte_entry.
> 
> dax_layout_busy_page() only uses 0-order.
> __dax_invalidate_entry() only uses 0-order.
> dax_writeback_one() needs an extra fix:
> 
> /* Did a PMD entry get split? */
> if (dax_is_locked(entry))
> goto put_unlocked;
> 
> dax_insert_pfn_mkwrite() checks for a mismatch of pte vs pmd.
> 
> So I think we're good for all current users.

Agreed but it is an ugly trap. As I already said, I'd rather pay the
unnecessary cost of waiting for pte entry and have an easy to understand
interface. If we ever have a real world use case that would care for this
optimization, we will need to refactor functions to make this possible and
still keep the interfaces sane. For example get_unlocked_entry() could
return special "error code" indicating that there's no entry with matching
order in xarray but there's a conflict with it. That would be much less
error-prone interface.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-04 Thread Jan Kara
On Wed 03-07-19 08:47:00, Matthew Wilcox wrote:
> On Mon, Jul 01, 2019 at 02:11:19PM +0200, Jan Kara wrote:
> > BTW, looking into the xarray code, I think I found another difference
> > between the old radix tree code and the new xarray code that could cause
> > issues. In the old radix tree code if we tried to insert PMD entry but
> > there was some PTE entry in the covered range, we'd get EEXIST error back
> > and the DAX fault code relies on this. I don't see how similar behavior is
> > achieved by xas_store()...
> 
> Are you referring to this?
> 
> -   entry = dax_make_locked(0, size_flag | DAX_EMPTY);
> -
> -   err = __radix_tree_insert(&mapping->i_pages, index,
> -   dax_entry_order(entry), entry);
> -   radix_tree_preload_end();
> -   if (err) {
> -   xa_unlock_irq(&mapping->i_pages);
> -   /*
> -* Our insertion of a DAX entry failed, most likely
> -* because we were inserting a PMD entry and it
> -* collided with a PTE sized entry at a different
> -* index in the PMD range.  We haven't inserted
> -* anything into the radix tree and have no waiters to
> -* wake.
> -*/
> -   return ERR_PTR(err);
> -   }

Mostly yes.

> If so, that can't happen any more because we no longer drop the i_pages
> lock while the entry is NULL, so the entry is always locked while the
> i_pages lock is dropped.

Ah, I have misinterpreted what xas_find_conflict() does. I'm sorry for the
noise. I find it somewhat unfortunate that xas_find_conflict() will not
return in any way the index where it has found the conflicting entry. We
could then use it for the wait logic as well and won't have to resort to
some masking tricks...

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] mm: Support madvise_willneed override by Filesystems

2019-07-03 Thread Jan Kara
On Wed 03-07-19 04:04:57, Boaz Harrosh wrote:
> On 19/06/2019 11:21, Jan Kara wrote:
> <>
> > Yes, I have patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
> > well. I'll post it soon since the rest of the series isn't really dependent
> > on it.
> > 
> > Honza
> > 
> 
> Hi Jan
> 
> Funny I'm sitting on the same patch since LSF last. I need it too for other
> reasons. I have not seen, have you pushed your patch yet?
> (Is based on old v4.20)

Your patch is wrong due to lock ordering. You should not call vfs_fadvise()
under mmap_sem. So we need to do a similar dance like madvise_remove(). I
have to get to writing at least the XFS fix so that the madvise change gets
used and post the madvise patch with it... Sorry it takes me so long.

Honza
> 
> ~
> From fddb38169e33d23060ddd444ba6f2319f76edc89 Mon Sep 17 00:00:00 2001
> From: Boaz Harrosh 
> Date: Thu, 16 May 2019 20:02:14 +0300
> Subject: [PATCH] mm: Support madvise_willneed override by Filesystems
> 
> In the patchset:
>   [b833a3660394] ovl: add ovl_fadvise()
>   [3d8f7615319b] vfs: implement readahead(2) using POSIX_FADV_WILLNEED
>   [45cd0faae371] vfs: add the fadvise() file operation
> 
> Amir Goldstein introduced a way for filesystems to override fadvise.
> Well, madvise_willneed is exactly like fadvise_willneed, except that it
> always returns 0.
> 
> In this patch we call the FS vector if it exists.
> 
> NOTE: I called vfs_fadvise(..,POSIX_FADV_WILLNEED);
>   (Which is my artistic preference)
> 
> I could also selectively call
>   if (file->f_op->fadvise)
>   return file->f_op->fadvise(..., POSIX_FADV_WILLNEED);
> If we fear theoretical side effects. I don't mind either way.
> 
> CC: Amir Goldstein 
> CC: Miklos Szeredi 
> Signed-off-by: Boaz Harrosh 
> ---
>  mm/madvise.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6cb1ca93e290..6b84ddcaaaf2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -303,7 +304,8 @@ static long madvise_willneed(struct vm_area_struct *vma,
>   end = vma->vm_end;
>   end = ((end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
>  
> - force_page_cache_readahead(file->f_mapping, file, start, end - start);
> + vfs_fadvise(file, start << PAGE_SHIFT, (end - start) << PAGE_SHIFT,
> + POSIX_FADV_WILLNEED);
>   return 0;
>  }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 23/30] ext2: Use kmemdup rather than duplicating its implementation

2019-07-03 Thread Jan Kara
On Wed 03-07-19 21:17:27, Fuqian Huang wrote:
> kmemdup is introduced to duplicate a region of memory in a neat way.
> Rather than kmalloc/kzalloc + memcpy, where the programmer needs to
> write the size twice (which sometimes leads to mistakes), kmemdup improves
> readability, leads to smaller code and also reduces the chances of mistakes.
> Suggestion: use kmemdup rather than kmalloc/kzalloc + memcpy.
> 
> Signed-off-by: Fuqian Huang 

Thanks. Added to my tree.

Honza

> ---
>  fs/ext2/xattr.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
> index 1e33e0ac8cf1..a9c641cd5484 100644
> --- a/fs/ext2/xattr.c
> +++ b/fs/ext2/xattr.c
> @@ -506,11 +506,10 @@ bad_block:  ext2_error(sb, "ext2_xattr_set",
>  
>   unlock_buffer(bh);
>   ea_bdebug(bh, "cloning");
> - header = kmalloc(bh->b_size, GFP_KERNEL);
> + header = kmemdup(HDR(bh), bh->b_size, GFP_KERNEL);
>   error = -ENOMEM;
>   if (header == NULL)
>   goto cleanup;
> - memcpy(header, HDR(bh), bh->b_size);
>   header->h_refcount = cpu_to_le32(1);
>  
>   offset = (char *)here - bh->b_data;
> -- 
> 2.11.0
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs: reiserfs: journal: Change return type of dirty_one_transaction

2019-07-03 Thread Jan Kara
On Tue 02-07-19 23:24:30, Hariprasad Kelam wrote:
> Change the return type of dirty_one_transaction from int to void, as this
> function always returns success.
> 
> Fixes below issue reported by coccicheck
> fs/reiserfs/journal.c:1690:5-8: Unneeded variable: "ret". Return "0" on
> line 1719
> 
> Signed-off-by: Hariprasad Kelam 

I can see Andrew already picked up the cleanup. The patch looks good to me.
Feel free to add:

Reviewed-by: Jan Kara 

Honza


> ---
>  fs/reiserfs/journal.c | 6 ++
>  1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 36346dc..4517a13 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -94,7 +94,7 @@ static int journal_join(struct reiserfs_transaction_handle *th,
>   struct super_block *sb);
>  static void release_journal_dev(struct super_block *super,
>  struct reiserfs_journal *journal);
> -static int dirty_one_transaction(struct super_block *s,
> +static void dirty_one_transaction(struct super_block *s,
>struct reiserfs_journal_list *jl);
>  static void flush_async_commits(struct work_struct *work);
>  static void queue_log_writer(struct super_block *s);
> @@ -1682,12 +1682,11 @@ static int write_one_transaction(struct super_block *s,
>  }
>  
>  /* used by flush_commit_list */
> -static int dirty_one_transaction(struct super_block *s,
> +static void dirty_one_transaction(struct super_block *s,
>struct reiserfs_journal_list *jl)
>  {
>   struct reiserfs_journal_cnode *cn;
>   struct reiserfs_journal_list *pjl;
> - int ret = 0;
>  
>   jl->j_state |= LIST_DIRTY;
>   cn = jl->j_realblock;
> @@ -1716,7 +1715,6 @@ static int dirty_one_transaction(struct super_block *s,
>       }
>   cn = cn->next;
>   }
> - return ret;
>  }
>  
>  static int kupdate_transactions(struct super_block *s,
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] filesystem-dax: Disable PMD support

2019-07-03 Thread Jan Kara
On Sun 30-06-19 08:23:24, Matthew Wilcox wrote:
> On Sun, Jun 30, 2019 at 01:01:04AM -0700, Dan Williams wrote:
> > @@ -215,7 +216,7 @@ static wait_queue_head_t *dax_entry_waitqueue(struct xa_state *xas,
> >  * queue to the start of that PMD.  This ensures that all offsets in
> >  * the range covered by the PMD map to the same bit lock.
> >  */
> > -   if (dax_is_pmd_entry(entry))
> > +   //if (dax_is_pmd_entry(entry))
> > index &= ~PG_PMD_COLOUR;
> > key->xa = xas->xa;
> > key->entry_start = index;
> 
> Hah, that's a great naive fix!  Thanks for trying that out.
> 
> I think my theory was slightly mistaken, but your fix has the effect of
> fixing the actual problem too.
> 
> The xas->xa_index for a PMD is going to be PMD-aligned (ie a multiple of
> 512), but xas_find_conflict() does _not_ adjust xa_index (... which I
> really should have mentioned in the documentation).  So we go to sleep
> on the PMD-aligned index instead of the index of the PTE.  Your patch
> fixes this by using the PMD-aligned index for PTEs too.
> 
> I'm trying to come up with a clean fix for this.  Clearly we
> shouldn't wait for a PTE entry if we're looking for a PMD entry.
> But what should get_unlocked_entry() return if it detects that case?
> We could have it return an error code encoded as an internal entry,
> like grab_mapping_entry() does.  Or we could have it return the _locked_
> PTE entry, and have callers interpret that.
> 
> At least get_unlocked_entry() is static, but it's got quite a few callers.
> Trying to discern which ones might ask for a PMD entry is a bit tricky.
> So this seems like a large patch which might have bugs.

Yeah. So get_unlocked_entry() is used in several cases:

1) Case where we already have entry at given index but it is locked and we
need it unlocked so that we can do our thing `(dax_writeback_one(),
dax_layout_busy_page()).

2) Case where we want any entry covering given index (in
__dax_invalidate_entry()). This is essentially the same as case 1) since we
have already looked up the entry (just didn't propagate that information
from mm/truncate.c) - we want any unlocked entry covering given index.

3) Cases where we really want entry at given index and we have some entry
order constraints (dax_insert_pfn_mkwrite(), grab_mapping_entry()).

Honestly, I'd make the rule that get_unlocked_entry() returns an entry of
any order that covers the given index. I agree it may unnecessarily wait on
a PTE entry lock when, in case 3), we are really looking only for a PMD
entry, but that seems like a relatively small cost for the simplicity of
the interface.

BTW, looking into the xarray code, I think I found another difference
between the old radix tree code and the new xarray code that could cause
issues. In the old radix tree code if we tried to insert PMD entry but
there was some PTE entry in the covered range, we'd get EEXIST error back
and the DAX fault code relies on this. I don't see how similar behavior is
achieved by xas_store()...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/1] udf: Fix incorrect final NOT_ALLOCATED (hole) extent length

2019-06-25 Thread Jan Kara
On Tue 04-06-19 07:31:58, Steve Magnani wrote:
> In some cases, using the 'truncate' command to extend a UDF file results
> in a mismatch between the length of the file's extents (specifically, due
> to incorrect length of the final NOT_ALLOCATED extent) and the information
> (file) length. The discrepancy can prevent other operating systems
> (e.g., Windows 10) from opening the file.
> 
> Two particular errors have been observed when extending a file:
> 
> 1. The final extent is larger than it should be, having been rounded up
>to a multiple of the block size.
> 
> 2. The final extent is shorter than it should be, due to not having
>been updated when the file's information length was increased.
> 
> The first case could represent a design error, if coded intentionally
> due to a misinterpretation of scantily-documented ECMA-167 "file tail"
> rules. The standard specifies that the tail, if present, consists of
> a sequence of "unrecorded and allocated" extents (only).
> 
> Signed-off-by: Steven J. Magnani 

Thanks for the testcase and the patch! I finally got to reading through
this in detail. In udf driver in Linux we are generally fine with the last
extent being rounded up to the block size. udf_truncate_tail_extent() is
generally responsible for truncating the last extent to appropriate size
once we are done with the inode. However there are two problems with this:

1) We used to do this inside udf_clear_inode() back in the old days but
then switched to a different scheme in commit 2c948b3f86e5f "udf: Avoid IO
in udf_clear_inode". So this actually breaks workloads where user calls
truncate(2) directly and there's no place where udf_truncate_tail_extent()
gets called.

2) udf_extend_file() sets i_lenExtents == i_size although the last extent it
created is actually rounded up to the block size, so even if
udf_truncate_tail_extent() gets called (which is actually the case for
truncate(1), which does open, ftruncate, close), it will think it has nothing
to do and exit.

Now 2) is easily fixed by setting i_lenExtents to the real length of extents
we have created. However that still leaves problem 1) which isn't easy to deal
with. After some thought I think that your solution of making
udf_do_extend_file() always create appropriately sized extents makes
sense. However I dislike the calling convention you've chosen. When
udf_do_extend_file() needs to know the byte length, why not pass it to it
directly, instead of the somewhat cumbersome "sector length + byte offset"
pair?

Will you update the patch please? Thanks!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/9] blkcg, writeback: Add wbc->no_wbc_acct

2019-06-24 Thread Jan Kara
Hello Tejun!

On Thu 20-06-19 10:02:50, Tejun Heo wrote:
> On Thu, Jun 20, 2019 at 05:21:45PM +0200, Jan Kara wrote:
> > I'm completely ignorant of how btrfs compressed writeback works so don't
> > quite understand implications of this. So does this mean that writeback to
> > btrfs compressed files won't be able to transition inodes from one memcg to
> > another? Or are you trying to say the 'wbc' used from async worker thread
> > is actually a dummy one and we would double-account the writeback?
> 
> So, before, only the async compression workers would run through the
> wbc accounting code regardless of who originated the dirty pages,
> which is obviously wrong.  After the patch, the code accounts when the
> dirty pages are being handed off to the compression workers and
> no_wbc_acct is used to suppress spurious accounting from the workers.

OK, now I understand. Just one more question: So effectively, you are using
wbc->no_wbc_acct to pass information from btrfs code to btrfs code telling
it whether IO should or should not be accounted with wbc_account_io().
Wouldn't it make more sense to just pass this information internally
within btrfs? Granted, if this mechanism gets more widespread use by other
filesystems, then using a wbc flag probably makes more sense. But I'm not
sure this isn't a premature generalization...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-20 Thread Jan Kara
On Thu 13-06-19 08:27:55, Matthew Wilcox wrote:
> On Thu, Jun 13, 2019 at 10:25:55AM +1000, Dave Chinner wrote:
> > e.g. Process A has an exclusive layout lease on file F. It does an
> > IO to file F. The filesystem IO path checks that Process A owns the
> > lease on the file and so skips straight through layout breaking
> > because it owns the lease and is allowed to modify the layout. It
> > then takes the inode metadata locks to allocate new space and write
> > new data.
> > 
> > Process B now tries to write to file F. The FS checks whether
> > Process B owns a layout lease on file F. It doesn't, so then it
> > tries to break the layout lease so the IO can proceed. The layout
> > breaking code sees that process A has an exclusive layout lease
> > granted, and so returns -ETXTBSY to process B - it is not allowed to
> > break the lease and so the IO fails with -ETXTBSY.
> 
> This description doesn't match the behaviour that RDMA wants either.
> Even if Process A has a lease on the file, an IO from Process A which
> results in blocks being freed from the file is going to result in the
> RDMA device being able to write to blocks which are now freed (and
> potentially reallocated to another file).

I think you're partially wrong here. You are correct that the lease won't
stop process A from doing truncate on the file. *But* there are still page
pins in existence so truncate will block on waiting for these pins to go
away (after all this is a protection that guards all short-term page pin
users). So there is no problem with blocks being freed under the RDMA app.
Yes, the app will effectively deadlock and sysadmin has to kill it. IMO an
acceptable answer for doing something stupid and unsupportable...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 3/3] ext4: use jbd2_inode dirty range scoping

2019-06-20 Thread Jan Kara
On Wed 19-06-19 11:21:56, Ross Zwisler wrote:
> Use the newly introduced jbd2_inode dirty range scoping to prevent us
> from waiting forever when trying to complete a journal transaction.
> 
> Signed-off-by: Ross Zwisler 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/ext4_jbd2.h   | 12 ++--
>  fs/ext4/inode.c   | 13 ++---
>  fs/ext4/move_extent.c |  3 ++-
>  3 files changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
> index 75a5309f22315..ef8fcf7d0d3b3 100644
> --- a/fs/ext4/ext4_jbd2.h
> +++ b/fs/ext4/ext4_jbd2.h
> @@ -361,20 +361,20 @@ static inline int ext4_journal_force_commit(journal_t 
> *journal)
>  }
>  
>  static inline int ext4_jbd2_inode_add_write(handle_t *handle,
> - struct inode *inode)
> + struct inode *inode, loff_t start_byte, loff_t length)
>  {
>   if (ext4_handle_valid(handle))
> - return jbd2_journal_inode_add_write(handle,
> - EXT4_I(inode)->jinode);
> + return jbd2_journal_inode_ranged_write(handle,
> + EXT4_I(inode)->jinode, start_byte, length);
>   return 0;
>  }
>  
>  static inline int ext4_jbd2_inode_add_wait(handle_t *handle,
> -struct inode *inode)
> + struct inode *inode, loff_t start_byte, loff_t length)
>  {
>   if (ext4_handle_valid(handle))
> - return jbd2_journal_inode_add_wait(handle,
> -EXT4_I(inode)->jinode);
> + return jbd2_journal_inode_ranged_wait(handle,
> + EXT4_I(inode)->jinode, start_byte, length);
>   return 0;
>  }
>  
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c6430085..27fec5c594459 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -731,10 +731,16 @@ int ext4_map_blocks(handle_t *handle, struct inode 
> *inode,
>   !(flags & EXT4_GET_BLOCKS_ZERO) &&
>   !ext4_is_quota_file(inode) &&
>   ext4_should_order_data(inode)) {
> + loff_t start_byte =
> + (loff_t)map->m_lblk << inode->i_blkbits;
> + loff_t length = (loff_t)map->m_len << inode->i_blkbits;
> +
>   if (flags & EXT4_GET_BLOCKS_IO_SUBMIT)
> - ret = ext4_jbd2_inode_add_wait(handle, inode);
> + ret = ext4_jbd2_inode_add_wait(handle, inode,
> + start_byte, length);
>   else
> - ret = ext4_jbd2_inode_add_write(handle, inode);
> + ret = ext4_jbd2_inode_add_write(handle, inode,
> + start_byte, length);
>   if (ret)
>   return ret;
>   }
> @@ -4085,7 +4091,8 @@ static int __ext4_block_zero_page_range(handle_t 
> *handle,
>   err = 0;
>   mark_buffer_dirty(bh);
>   if (ext4_should_order_data(inode))
> - err = ext4_jbd2_inode_add_write(handle, inode);
> + err = ext4_jbd2_inode_add_write(handle, inode, from,
> + length);
>   }
>  
>  unlock:
> diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
> index 1083a9f3f16a1..c7ded4e2adff5 100644
> --- a/fs/ext4/move_extent.c
> +++ b/fs/ext4/move_extent.c
> @@ -390,7 +390,8 @@ move_extent_per_page(struct file *o_filp, struct inode 
> *donor_inode,
>  
>   /* Even in case of data=writeback it is reasonable to pin
>* inode to transaction, to prevent unexpected data loss */
> - *err = ext4_jbd2_inode_add_write(handle, orig_inode);
> + *err = ext4_jbd2_inode_add_write(handle, orig_inode,
> + (loff_t)orig_page_offset << PAGE_SHIFT, replaced_size);
>  
>  unlock_pages:
>   unlock_page(pagep[0]);
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/3] jbd2: introduce jbd2_inode dirty range scoping

2019-06-20 Thread Jan Kara
On Wed 19-06-19 11:21:55, Ross Zwisler wrote:
> Currently both journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() operate on the entire address space
> of each of the inodes associated with a given journal entry.  The
> consequence of this is that if we have an inode where we are constantly
> appending dirty pages we can end up waiting for an indefinite amount of
> time in journal_finish_inode_data_buffers() while we wait for all the
> pages under writeback to be written out.
> 
> The easiest way to cause this type of workload is do just dd from
> /dev/zero to a file until it fills the entire filesystem.  This can
> cause journal_finish_inode_data_buffers() to wait for the duration of
> the entire dd operation.
> 
> We can improve this situation by scoping each of the inode dirty ranges
> associated with a given transaction.  We do this via the jbd2_inode
> structure so that the scoping is contained within jbd2 and so that it
> follows the lifetime and locking rules for that structure.
> 
> This allows us to limit the writeback & wait in
> journal_submit_inode_data_buffers() and
> journal_finish_inode_data_buffers() respectively to the dirty range for
> a given struct jdb2_inode, keeping us from waiting forever if the inode
> in question is still being appended to.
> 
> Signed-off-by: Ross Zwisler 

The patch looks good to me. I was wondering whether we shouldn't have
separate ranges for the current and the next transaction, but I guess it is
not worth it, at least for now. So just one nit below. With that applied feel free
to add:

Reviewed-by: Jan Kara 

> @@ -257,15 +262,24 @@ static int journal_finish_inode_data_buffers(journal_t 
> *journal,
>   /* For locking, see the comment in journal_submit_data_buffers() */
>   spin_lock(&journal->j_list_lock);
>   list_for_each_entry(jinode, &commit_transaction->t_inode_list, i_list) {
> + loff_t dirty_start = jinode->i_dirty_start;
> + loff_t dirty_end = jinode->i_dirty_end;
> +
>   if (!(jinode->i_flags & JI_WAIT_DATA))
>   continue;
>   jinode->i_flags |= JI_COMMIT_RUNNING;
>   spin_unlock(&journal->j_list_lock);
> - err = filemap_fdatawait_keep_errors(
> - jinode->i_vfs_inode->i_mapping);
> + err = filemap_fdatawait_range_keep_errors(
> + jinode->i_vfs_inode->i_mapping, dirty_start,
> + dirty_end);
>   if (!ret)
>   ret = err;
>   spin_lock(&journal->j_list_lock);
> +
> + if (!jinode->i_next_transaction) {
> + jinode->i_dirty_start = 0;
> + jinode->i_dirty_end = 0;
> + }

This would be more logical in the next loop that moves jinode into the next
transaction.

>       jinode->i_flags &= ~JI_COMMIT_RUNNING;
>   smp_mb();
>   wake_up_bit(&jinode->i_flags, __JI_COMMIT_RUNNING);

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/3] mm: add filemap_fdatawait_range_keep_errors()

2019-06-20 Thread Jan Kara
On Wed 19-06-19 11:21:54, Ross Zwisler wrote:
> In the spirit of filemap_fdatawait_range() and
> filemap_fdatawait_keep_errors(), introduce
> filemap_fdatawait_range_keep_errors() which both takes a range upon
> which to wait and does not clear errors from the address space.
> 
> Signed-off-by: Ross Zwisler 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h |  2 ++
>  mm/filemap.c   | 22 ++
>  2 files changed, 24 insertions(+)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index f7fdfe93e25d3..79fec8a8413f4 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2712,6 +2712,8 @@ extern int filemap_flush(struct address_space *);
>  extern int filemap_fdatawait_keep_errors(struct address_space *mapping);
>  extern int filemap_fdatawait_range(struct address_space *, loff_t lstart,
>  loff_t lend);
> +extern int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
> + loff_t start_byte, loff_t end_byte);
>  
>  static inline int filemap_fdatawait(struct address_space *mapping)
>  {
> diff --git a/mm/filemap.c b/mm/filemap.c
> index df2006ba0cfa5..e87252ca0835a 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -553,6 +553,28 @@ int filemap_fdatawait_range(struct address_space 
> *mapping, loff_t start_byte,
>  }
>  EXPORT_SYMBOL(filemap_fdatawait_range);
>  
> +/**
> + * filemap_fdatawait_range_keep_errors - wait for writeback to complete
> + * @mapping: address space structure to wait for
> + * @start_byte:  offset in bytes where the range starts
> + * @end_byte:offset in bytes where the range ends (inclusive)
> + *
> + * Walk the list of under-writeback pages of the given address space in the
> + * given range and wait for all of them.  Unlike filemap_fdatawait_range(),
> + * this function does not clear error status of the address space.
> + *
> + * Use this function if callers don't handle errors themselves.  Expected
> + * call sites are system-wide / filesystem-wide data flushers: e.g. sync(2),
> + * fsfreeze(8)
> + */
> +int filemap_fdatawait_range_keep_errors(struct address_space *mapping,
> + loff_t start_byte, loff_t end_byte)
> +{
> + __filemap_fdatawait_range(mapping, start_byte, end_byte);
> + return filemap_check_and_keep_errors(mapping);
> +}
> +EXPORT_SYMBOL(filemap_fdatawait_range_keep_errors);
> +
>  /**
>   * file_fdatawait_range - wait for writeback to complete
>   * @file:file pointing to address space structure to wait for
> -- 
> 2.22.0.410.gd8fdbe21b5-goog
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH AUTOSEL 5.1 35/70] loop: Don't change loop device under exclusive opener

2019-06-20 Thread Jan Kara
On Wed 19-06-19 16:11:36, Sasha Levin wrote:
> On Mon, Jun 10, 2019 at 11:00:13AM +0200, Jan Kara wrote:
> > On Sat 08-06-19 07:39:14, Sasha Levin wrote:
> > > From: Jan Kara 
> > > 
> > > [ Upstream commit 33ec3e53e7b1869d7851e59e126bdb0fe0bd1982 ]
> > 
> > Please don't push this to stable kernels because...
> 
> I've dropped this, but...

OK, thanks.

> > > [Deliberately chosen not to CC stable as a user with priviledges to
> > > trigger this race has other means of taking the system down and this
> > > has a potential of breaking some weird userspace setup]
> > 
> > ... of this. Thanks!
> 
> Can't this be triggered by an "innocent" user, rather than as part of an
> attack? Why can't this race happen during regular usage?

Well, the problem happens when a mount happens on a loop device while someone
modifies the backing file of the loop device. So for this to be
triggerable, you have to have control over assignment of backing files to
loop devices (you have to be owner of these loop devices to be able to do
this - pretty much means root in most setups) and be able to trigger mount
on this device. If you have these capabilities, there are much more
efficient ways to gain full administrator privileges on the system,
deadlocking the kernel is thus the least of your worries.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-19 Thread Jan Kara
On Wed 19-06-19 19:34:00, Zhangjs Jinshui wrote:
> You can see it with blktrace:
> 
>   8,80  31   11 0.296373038 2885275  Q  RA 8279571464 + 8 []
>   8,80  31   12 0.296374017 2885275  G  RA 8279571464 + 8 []
>   8,80  31   13 0.296375468 2885275  I  RA 8279571464 + 8 []
>   8,80  31   14 0.296382099  3886  D  RA 8279571464 + 8 
> [kworker/31:1H]
>   8,80  31   15 0.296391907 2885275  Q  RA 8279571472 + 8 []
>   8,80  31   16 0.296392275 2885275  G  RA 8279571472 + 8 []
>   8,80  31   17 0.296393305 2885275  I  RA 8279571472 + 8 []
>   8,80  31   18 0.296395844  3886  D  RA 8279571472 + 8 
> [kworker/31:1H]
>   8,80  31   19 0.296399685 2885275  Q  RA 8279571480 + 8 []
>   8,80  31   20 0.296400025 2885275  G  RA 8279571480 + 8 []
>   8,80  31   21 0.296401232 2885275  I  RA 8279571480 + 8 []
>   8,80  31   22 0.296403422  3886  D  RA 8279571480 + 8 
> [kworker/31:1H]
>   8,80  31   23 0.296407375 2885275  Q  RA 8279571488 + 8 []
>   8,80  31   24 0.296407721 2885275  G  RA 8279571488 + 8 []
>   8,80  31   25 0.296408904 2885275  I  RA 8279571488 + 8 []
>   8,80  31   26 0.296411127  3886  D  RA 8279571488 + 8 
> [kworker/31:1H]
>   8,80  31   27 0.296414779 2885275  Q  RA 8279571496 + 8 []
>   8,80  31   28 0.296415119 2885275  G  RA 8279571496 + 8 []
>   8,80  31   29 0.296415744 2885275  I  RA 8279571496 + 8 []
>   8,80  31   30 0.296417779  3886  D  RA 8279571496 + 8 
> [kworker/31:1H]
> 
> These RA IOs were caused by ext4_inode_readahead_blks; they are all not
> merged because of the unplugged state.
> The backtrace shown below was traced by systemtap ioblock.request filtered
> by "opf & 1 << 19".
> 
>  0x8136fb20 : generic_make_request+0x0/0x2f0 [kernel]
>  0x8136fe7e : submit_bio+0x6e/0x130 [kernel]
>  0x812971e6 : submit_bh_wbc+0x156/0x190 [kernel]
>  0x81297bca : ll_rw_block+0x6a/0xb0 [kernel]
>  0x81297cc0 : __breadahead+0x40/0x70 [kernel]
>  0xa0392c9a : __ext4_get_inode_loc+0x37a/0x3d0 [ext4]
>  0xa0396a6c : ext4_iget+0x8c/0xc00 [ext4]
>  0xa03ad98a : ext4_lookup+0xca/0x1d0 [ext4]
>  0x8126b814 : path_openat+0xcb4/0x1250 [kernel]
>  0x8126dc41 : do_filp_open+0x91/0x100 [kernel]
>  0x8125ad86 : do_sys_open+0x126/0x210 [kernel]
>  0x81003864 : do_syscall_64+0x74/0x1a0 [kernel]
>  0x81800081 : entry_SYSCALL_64_after_hwframe+0x3d/0xa2 [kernel]
> 
> I have patched it on online servers, It can improved the performance.

Ah, OK, directory lookup code... Makes sense. Thanks for sharing!

Honza

> 
> > On 19 Jun 2019, at 19:08, Jan Kara wrote:
> > 
> > On Mon 17-06-19 23:57:12, jinshui zhang wrote:
> >> From: zhangjs <zach...@baishancloud.com>
> >> 
> >> If the task is unplugged when called, the inode_readahead_blks IOs may
> >> not be merged; these will cause small pieces of IO. It should be plugged.
> >> 
> >> Signed-off-by: zhangjs <zach...@baishancloud.com>
> > 
> > Out of curiosity, on which path do you see __ext4_get_inode_loc() being
> > called without IO already plugged?
> > 
> > Otherwise the patch looks good to me. You can add:
> > 
> > Reviewed-by: Jan Kara <j...@suse.cz>
> > 
> > Honza
> > 
> >> ---
> >> fs/ext4/inode.c | 6 ++
> >> 1 file changed, 6 insertions(+)
> >> 
> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> >> index c7f77c6..8fe046b 100644
> >> --- a/fs/ext4/inode.c
> >> +++ b/fs/ext4/inode.c
> >> @@ -4570,6 +4570,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
> >>struct buffer_head  *bh;
> >>struct super_block  *sb = inode->i_sb;
> >>ext4_fsblk_tblock;
> >> +  struct blk_plug plug;
> >>int inodes_per_block, inode_offset;
> >> 
> >>iloc->bh = NULL;
> >> @@ -4654,6 +4655,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
> >>}
> >> 
> >> make_io:
> >> +  blk_start_plug(&plug);
> >> +
> >>/*
> >> * If we need to do any I/O, try to pre-readahead extra
> >> * blocks from the inode table.
> >> @@ -4688,6 +4691,9 @@ static int __ext4_get_inode_loc(struct inode *inode,
> >>get_bh(bh);
> >>bh->b_end_io = end_buffer_read_sync;
> >>submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
> >> +
> >> +  blk_finish_plug(&plug);
> >> +
> >>wait_on_buffer(bh);
> >>if (!buffer_uptodate(bh)) {
> >>EXT4_ERROR_INODE_BLOCK(inode, block,
> >> -- 
> >> 1.8.3.1
> >> 
> > -- 
> > Jan Kara <j...@suse.com>
> > SUSE Labs, CR
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: remove redundant assignment to node

2019-06-19 Thread Jan Kara
On Wed 19-06-19 10:00:06, Colin King wrote:
> From: Colin Ian King 
> 
> Pointer 'node' is assigned a value that is never read, node is
> later overwritten when it re-assigned a different value inside
> the while-loop.  The assignment is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/extents_status.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index 023a3eb3afa3..7521de2dcf3a 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -1317,7 +1317,6 @@ static int es_do_reclaim_extents(struct ext4_inode_info 
> *ei, ext4_lblk_t end,
>   es = __es_tree_search(&tree->root, ei->i_es_shrink_lblk);
>   if (!es)
>   goto out_wrap;
> - node = &es->rb_node;
>   while (*nr_to_scan > 0) {
>       if (es->es_lblk > end) {
>   ei->i_es_shrink_lblk = end + 1;
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: make __ext4_get_inode_loc plug

2019-06-19 Thread Jan Kara
On Mon 17-06-19 23:57:12, jinshui zhang wrote:
> From: zhangjs 
> 
> If the task is unplugged when called, the inode_readahead_blks IOs may not
> be merged; these will cause small pieces of IO. It should be plugged.
> 
> Signed-off-by: zhangjs 

Out of curiosity, on which path do you see __ext4_get_inode_loc() being
called without IO already plugged?

Otherwise the patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/inode.c | 6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c7f77c6..8fe046b 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4570,6 +4570,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   struct buffer_head  *bh;
>   struct super_block  *sb = inode->i_sb;
>   ext4_fsblk_tblock;
> + struct blk_plug plug;
>   int inodes_per_block, inode_offset;
>  
>   iloc->bh = NULL;
> @@ -4654,6 +4655,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   }
>  
>  make_io:
> + blk_start_plug(&plug);
> +
>   /*
>* If we need to do any I/O, try to pre-readahead extra
>* blocks from the inode table.
> @@ -4688,6 +4691,9 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   get_bh(bh);
>   bh->b_end_io = end_buffer_read_sync;
>   submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
> +
> + blk_finish_plug(&plug);
> +
>   wait_on_buffer(bh);
>   if (!buffer_uptodate(bh)) {
>   EXT4_ERROR_INODE_BLOCK(inode, block,
> -- 
> 1.8.3.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: pagecache locking (was: bcachefs status update) merged)

2019-06-19 Thread Jan Kara
On Tue 18-06-19 07:21:56, Amir Goldstein wrote:
> > > Right, but regardless of the spec we have to consider that the
> > > behaviour of XFS comes from it's Irix heritage (actually from EFS,
> > > the predecessor of XFS from the late 1980s)
> >
> > Sure. And as I mentioned, I think it's technically the nicer guarantee.
> >
> > That said, it's a pretty *expensive* guarantee. It's one that you
> > yourself are not willing to give for O_DIRECT IO.
> >
> > And it's not a guarantee that Linux has ever had. In fact, it's not
> > even something I've ever seen anybody ever depend on.
> >
> > I agree that it's possible that some app out there might depend on
> > that kind of guarantee, but I also suspect it's much much more likely
> > that it's the other way around: XFS is being unnecessarily strict,
> > because everybody is testing against filesystems that don't actually
> > give the total atomicity guarantees.
> >
> > Nobody develops for other unixes any more (and nobody really ever did
> > it by reading standards papers - even if they had been very explicit).
> >
> > And honestly, the only people who really do threaded accesses to the same 
> > file
> >
> >  (a) don't want that guarantee in the first place
> >
> >  (b) are likely to use direct-io that apparently doesn't give that
> > atomicity guarantee even on xfs
> >
> > so I do think it's moot.
> >
> > End result: if we had a really cheap range lock, I think it would be a
> > good idea to use it (for the whole QoI implementation), but for
> > practical reasons it's likely better to just stick to the current lack
> > of serialization because it performs better and nobody really seems to
> > want anything else anyway.
> >
> 
> This is the point in the conversation where somebody usually steps in
> and says "let the user/distro decide". Distro maintainers are in a much
> better position to take the risk of breaking hypothetical applications.
> 
> I should point out that even if "strict atomic rw" behavior is desired, then
> page cache warmup [1] significantly improves performance.
> Having mentioned that, the discussion can now return to what is the
> preferred way to solve the punch hole vs. page cache add race.
> 
> XFS may end up with specially tailored range locks, which brings some
> other benefits to XFS, but all filesystems need the solution for the punch
> hole vs. page cache add race.
> Jan recently took a stab at it for ext4 [2], but that didn't work out.

Yes, but I have an idea how to fix it. I just need to push acquiring ext4's
i_mmap_sem down a bit further so that only page cache filling is protected
by it but not copying of data out to userspace. But I didn't get to coding
it last week due to other stuff.

> So I wonder what everyone thinks about Kent's page add lock as the
> solution to the problem.
> Allegedly, all filesystems (XFS included) are potentially exposed to
> stale data exposure/data corruption.

When we first realized that hole-punching vs page fault races are an issue, I
was looking into a generic solution, but at that time there was a sentiment
against adding another rwsem to struct address_space (or inode) so we ended
up with a private lock in ext4 (i_mmap_rwsem), XFS (I_MMAPLOCK), and other
filesystems these days. If people think that growing struct inode for
everybody is OK, we can think about lifting private filesystem solutions
into a generic one. I'm fine with that.

That being said, as Dave noted, we use those fs-private locks also for
serializing against equivalent issues arising for DAX. So the problem is
not only about page cache but generally about doing IO and caching
block mapping information for a file range. So the solution should not be
too tied to page cache.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: bcachefs status update (it's done cooking; let's get this sucker merged)

2019-06-19 Thread Jan Kara
On Thu 13-06-19 09:02:24, Dave Chinner wrote:
> On Wed, Jun 12, 2019 at 12:21:44PM -0400, Kent Overstreet wrote:
> > This would simplify things a lot and eliminate a really nasty corner case - 
> > page
> > faults trigger readahead. Even if the buffer and the direct IO don't 
> > overlap,
> > readahead can pull in pages that do overlap with the dio.
> 
> Page cache readahead needs to be moved under the filesystem IO
> locks. There was a recent thread about how readahead can race with
> hole punching and other fallocate() operations because page cache
> readahead bypasses the filesystem IO locks used to serialise page
> cache invalidation.
> 
> e.g. Readahead can be directed by userspace via fadvise, so we now
> have file->f_op->fadvise() so that filesystems can lock the inode
> before calling generic_fadvise() such that page cache instantiation
> and readahead dispatch can be serialised against page cache
> invalidation. I have a patch for XFS sitting around somewhere that
> implements the ->fadvise method.
> 
> I think there are some other patches floating around to address the
> other readahead mechanisms to only be done under filesytem IO locks,
> but I haven't had time to dig into it any further. Readahead from
> page faults most definitely needs to be under the MMAPLOCK at
> least so it serialises against fallocate()...

Yes, I have patch to make madvise(MADV_WILLNEED) go through ->fadvise() as
well. I'll post it soon since the rest of the series isn't really dependent
on it.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/1] udf: Fix incorrect final NOT_ALLOCATED (hole) extent length

2019-06-18 Thread Jan Kara
Hi Steve!

On Sun 16-06-19 11:28:46, Steve Magnani wrote:
> On 6/4/19 7:31 AM, Steve Magnani wrote:
> 
> > In some cases, using the 'truncate' command to extend a UDF file results
> > in a mismatch between the length of the file's extents (specifically, due
> > to incorrect length of the final NOT_ALLOCATED extent) and the information
> > (file) length. The discrepancy can prevent other operating systems
> > (e.g., Windows 10) from opening the file.
> > 
> > Two particular errors have been observed when extending a file:
> > 
> > 1. The final extent is larger than it should be, having been rounded up
> > to a multiple of the block size.
> > 
> > 2. The final extent is shorter than it should be, due to not having
> > been updated when the file's information length was increased.
> 
> Wondering if you've seen this, or if something got lost in a spam folder.

Sorry for not getting back to you earlier. I've seen the patches and they look
reasonable to me. I just wanted to have one more closer look but the last few
weeks were rather busy so I didn't get to it. I'll look into it this week.
Thanks a lot for debugging the problem and sending the fixes!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 11:41:53, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 5:09 AM Jan Kara  wrote:
> >
> > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > >
> > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > little HW can actually implement it, having the alternative still
> > > > > > require HW support doesn't seem like progress.
> > > > > >
> > > > > > I think we will eventually start seein some HW be able to do this
> > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > optional, for recovery from truely catastrophic errors (ie my DAX is
> > > > > > on fire, I need to unplug it).
> > > > >
> > > > > Agreed.  I think software wise there is not much some of the devices 
> > > > > can do
> > > > > with such an "invalidate".
> > > >
> > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > the file pointing to RDMA object? It has to handle that somehow by 
> > > > aborting
> > > > everything that's going on... And I wanted similar behavior here.
> > >
> > > It aborts *everything* connected to that file descriptor. Destroying
> > > everything avoids creating inconsistencies that destroying a subset
> > > would create.
> > >
> > > What has been talked about for lease break is not destroying anything
> > > but very selectively saying that one memory region linked to the GUP
> > > is no longer functional.
> >
> > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > and closes the file with existing pins (and thus layout lease) we would
> > force it to abort everything. Yes, it is disruptive but then the app didn't
> > obey the rule that it has to maintain file lease while holding pins. Thus
> > such situation should never happen unless the app is malicious / buggy.
> 
> When you say 'close' do you mean the final release of the fd? The vma
> keeps a reference to a 'struct file' live even after the fd is closed.

When I say 'close', I mean a call to ->release file operation which happens
when the last reference to struct file is dropped. I.e., when all file
descriptors and vmas (and possibly other places holding struct file
reference) are gone.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 11:49:52, Dan Williams wrote:
> On Wed, Jun 12, 2019 at 3:29 AM Jan Kara  wrote:
> >
> > On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> > > On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > > >
> > > > > Because the pins would be invisible to sysadmin from that point on.
> > > >
> > > > It is not invisible, it just shows up in a rdma specific kernel
> > > > interface. You have to use rdma netlink to see the kernel object
> > > > holding this pin.
> > > >
> > > > If this visibility is the main sticking point I suggest just enhancing
> > > > the existing MR reporting to include the file info for current GUP
> > > > pins and teaching lsof to collect information from there as well so it
> > > > is easy to use.
> > > >
> > > > If the ownership of the lease transfers to the MR, and we report that
> > > > ownership to userspace in a way lsof can find, then I think all the
> > > > concerns that have been raised are met, right?
> > >
> > > I was contemplating some new lsof feature yesterday.  But what I don't
> > > think we want is sysadmins to have multiple tools for multiple
> > > subsystems.  Or even have to teach lsof something new for every potential
> > > new subsystem user of GUP pins.
> >
> > Agreed.
> >
> > > I was thinking more along the lines of reporting files which have GUP
> > > pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> > > report that information.  That would cover any subsystem which does a
> > > longterm pin.
> >
> > So lsof already parses /proc/<pid>/maps to learn about files held open by
> > memory mappings. It could parse some other file as well I guess. The good
> > thing about that would be that the "longterm pin" structure would then just
> > hold a struct file reference. That would avoid any need for special behavior
> > on file close (the file reference in the "longterm pin" structure would make
> > sure the struct file and thus the lease stays around; we'd just need to make
> > an explicit lease unlock block until the "longterm pin" structure is freed).
> > The bad thing is that it requires us to come up with a sane new proc
> > interface for reporting "longterm pins" and associated struct file. Also we
> > need to define what this interface shows if the pinned pages are in DRAM
> > (either page cache or anon) and not on NVDIMM.
> 
> The anon vs shared detection case is important because a longterm pin
> might be blocking a memory-hot-unplug operation if it is pinning
> ZONE_MOVABLE memory, but I don't think we want DRAM vs NVDIMM to be an
> explicit concern of the interface. For the anon / cached case I expect
> it might be useful to put that communication under the memory-blocks
> sysfs interface. I.e. a list of pids that are pinning that
> memory-block from being hot-unplugged.

Yes, I was thinking of memory hotplug as well. But I don't think the
distinction is really shared vs anon - a pinned page cache page can be
blocking your memory unplug / migration the same way as a pinned anon page.
So the information for a pin we need to convey is the "location of
resources" being pinned - that is pfn (both for DRAM and NVDIMM) - but then
also additional mapping information (which is the filename for a DAX page, not
sure about DRAM). Also a separate question is how to expose this
information so that it is efficiently usable by userspace. For lsof, a file
in /proc/<pid>/xxx with the information would probably be the easiest to use,
plus all the issues with file access permissions and visibility among
different user namespaces are solved out of the box. And I believe it would
be reasonably usable for memory hotplug usecase as well. A file in sysfs
would be OK for memory hotplug I guess, but not really usable for lsof and
so I'm not sure we really need it when we are going to have one in procfs.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-13 Thread Jan Kara
On Wed 12-06-19 15:13:36, Ira Weiny wrote:
> On Wed, Jun 12, 2019 at 04:14:21PM -0300, Jason Gunthorpe wrote:
> > On Wed, Jun 12, 2019 at 02:09:07PM +0200, Jan Kara wrote:
> > > On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> > > > On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> > > > 
> > > > > > > The main objection to the current ODP & DAX solution is that very
> > > > > > > little HW can actually implement it, having the alternative still
> > > > > > > require HW support doesn't seem like progress.
> > > > > > > 
> > > > > > > I think we will eventually start seeing some HW be able to do this
> > > > > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > > > > on fire, I need to unplug it).
> > > > > > 
> > > > > > Agreed.  I think software wise there is not much some of the devices
> > > > > > can do with such an "invalidate".
> > > > > 
> > > > > So out of curiosity: What does RDMA driver do when userspace just closes
> > > > > the file pointing to RDMA object? It has to handle that somehow by aborting
> > > > > everything that's going on... And I wanted similar behavior here.
> > > > 
> > > > It aborts *everything* connected to that file descriptor. Destroying
> > > > everything avoids creating inconsistencies that destroying a subset
> > > > would create.
> > > > 
> > > > What has been talked about for lease break is not destroying anything
> > > > but very selectively saying that one memory region linked to the GUP
> > > > is no longer functional.
> > > 
> > > OK, so what I had in mind was that if RDMA app doesn't play by the rules
> > > and closes the file with existing pins (and thus layout lease) we would
> > > force it to abort everything. Yes, it is disruptive but then the app didn't
> > > obey the rule that it has to maintain file lease while holding pins. Thus
> > > such situation should never happen unless the app is malicious / buggy.
> > 
> > We do have the infrastructure to completely revoke the entire
> > *content* of a FD (this is called device disassociate). It is
> > basically close without the app doing close. But again it only works
> > with some drivers. However, this is more likely something a driver
> > could support without a HW change though.
> > 
> > It is quite destructive as it forcibly kills everything RDMA related
> > the process(es) are doing, but it is less violent than SIGKILL, and
> > there is perhaps a way for the app to recover from this, if it is
> > coded for it.
> 
> I don't think many are...  I think most would effectively be "killed" if this
> happened to them.

Yes, I repeat: we are in a situation where the application has a bug and
didn't properly manage its long term pins, which are fully under its control.
So in my mind it is a situation similar to an application using memory it has
already freed. The kernel has to manage that, but we don't really care
what's left of the application when this happens.

That being said, I'm not insisting this has to happen - tracking the
associated "RDMA file" with a layout lease and somehow invalidating it on
close of a leased file is somewhat ugly anyway. But it is still an option if
exposing pins to userspace for lsof to consume proves even worse...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-06-13 Thread Jan Kara
On Wed 12-06-19 12:36:53, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
> > On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
> >> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> >>> On 5/30/19 3:45 AM, Paolo Valente wrote:
> >>>>
> >> [...]
> >>>> At any rate, since you pointed out that you are interested in
> >>>> out-of-the-box performance, let me complete the context: in case
> >>>> low_latency is left set, one gets, in return for this 12% loss,
> >>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> >>>> times of applications under load [1];
> >>>> b) 500-1000% higher throughput in multi-client server workloads, as I
> >>>> already pointed out [2].
> >>>>
> >>>
> >>> I'm very happy that you could solve the problem without having to
> >>> compromise on any of the performance characteristics/features of BFQ!
> >>>
> >>>
> >>>> I'm going to prepare complete patches.  In addition, if ok for you,
> >>>> I'll report these results on the bug you created.  Then I guess we can
> >>>> close it.
> >>>>
> >>>
> >>> Sounds great!
> >>>
> >>
> >> Hi Paolo,
> >>
> >> Hope you are doing great!
> >>
> >> I was wondering if you got a chance to post these patches to LKML for
> >> review and inclusion... (No hurry, of course!)
> >>
> >> Also, since your fixes address the performance issues in BFQ, do you
> >> have any thoughts on whether they can be adapted to CFQ as well, to
> >> benefit the older stable kernels that still support CFQ?
> > 
> > Since CFQ doesn't exist in current upstream kernel anymore, I seriously
> > doubt you'll be able to get any performance improvements for it in the
> > stable kernels...
> > 
> 
> I suspected as much, but that seems unfortunate though. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process to address significant issues on older
> kernels too.

Well, you could still tune the performance difference by changing
slice_idle and group_idle tunables for CFQ (in
/sys/block/<device>/queue/iosched/).  Changing these to lower values will
reduce the throughput loss when switching between cgroups at the cost of
lower accuracy of enforcing configured IO proportions among cgroups.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Jan Kara
On Wed 12-06-19 08:47:21, Jason Gunthorpe wrote:
> On Wed, Jun 12, 2019 at 12:29:17PM +0200, Jan Kara wrote:
> 
> > > > The main objection to the current ODP & DAX solution is that very
> > > > little HW can actually implement it, having the alternative still
> > > > require HW support doesn't seem like progress.
> > > > 
> > > > I think we will eventually start seeing some HW be able to do this
> > > > invalidation, but it won't be universal, and I'd rather leave it
> > > > optional, for recovery from truly catastrophic errors (ie my DAX is
> > > > on fire, I need to unplug it).
> > > 
> > > Agreed.  I think software wise there is not much some of the devices can do
> > > with such an "invalidate".
> > 
> > So out of curiosity: What does RDMA driver do when userspace just closes
> > the file pointing to RDMA object? It has to handle that somehow by aborting
> > everything that's going on... And I wanted similar behavior here.
> 
> It aborts *everything* connected to that file descriptor. Destroying
> everything avoids creating inconsistencies that destroying a subset
> would create.
> 
> What has been talked about for lease break is not destroying anything
> but very selectively saying that one memory region linked to the GUP
> is no longer functional.

OK, so what I had in mind was that if an RDMA app doesn't play by the rules
and closes the file with existing pins (and thus the layout lease) we would
force it to abort everything. Yes, it is disruptive, but then the app didn't
obey the rule that it has to maintain the file lease while holding pins. Thus
such a situation should never happen unless the app is malicious / buggy.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-12 Thread Jan Kara
On Fri 07-06-19 07:52:13, Ira Weiny wrote:
> On Fri, Jun 07, 2019 at 09:17:29AM -0300, Jason Gunthorpe wrote:
> > On Fri, Jun 07, 2019 at 12:36:36PM +0200, Jan Kara wrote:
> > 
> > > Because the pins would be invisible to sysadmin from that point on. 
> > 
> > It is not invisible, it just shows up in a rdma specific kernel
> > interface. You have to use rdma netlink to see the kernel object
> > holding this pin.
> > 
> > If this visibility is the main sticking point I suggest just enhancing
> > the existing MR reporting to include the file info for current GUP
> > pins and teaching lsof to collect information from there as well so it
> > is easy to use.
> > 
> > If the ownership of the lease transfers to the MR, and we report that
> > ownership to userspace in a way lsof can find, then I think all the
> > concerns that have been raised are met, right?
> 
> I was contemplating some new lsof feature yesterday.  But what I don't
> think we want is sysadmins to have multiple tools for multiple
> subsystems.  Or even have to teach lsof something new for every potential
> new subsystem user of GUP pins.

Agreed.

> I was thinking more along the lines of reporting files which have GUP
> pins on them directly somewhere (dare I say procfs?) and teaching lsof to
> report that information.  That would cover any subsystem which does a
> longterm pin.

So lsof already parses /proc/<pid>/maps to learn about files held open by
memory mappings. It could parse some other file as well I guess. The good
thing about that would be that the "longterm pin" structure would then just
hold a struct file reference. That would avoid any need for special behavior
on file close (the file reference in the "longterm pin" structure would make
sure the struct file and thus the lease stays around; we'd just need to make
an explicit lease unlock block until the "longterm pin" structure is freed).
The bad thing is that it requires us to come up with a sane new proc
interface for reporting "longterm pins" and associated struct file. Also we
need to define what this interface shows if the pinned pages are in DRAM
(either page cache or anon) and not on NVDIMM.

> > > ugly to live so we have to come up with something better. The best I can
> > > currently come up with is to have a method associated with the lease that
> > > would invalidate the RDMA context that holds the pins in the same way that
> > > a file close would do it.
> > 
> > This is back to requiring all RDMA HW to have some new behavior they
> > currently don't have..
> > 
> > The main objection to the current ODP & DAX solution is that very
> > little HW can actually implement it, having the alternative still
> > require HW support doesn't seem like progress.
> > 
> > I think we will eventually start seeing some HW be able to do this
> > invalidation, but it won't be universal, and I'd rather leave it
> > optional, for recovery from truly catastrophic errors (ie my DAX is
> > on fire, I need to unplug it).
> 
> Agreed.  I think software wise there is not much some of the devices can do
> with such an "invalidate".

So out of curiosity: What does RDMA driver do when userspace just closes
the file pointing to RDMA object? It has to handle that somehow by aborting
everything that's going on... And I wanted similar behavior here.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 02/10] fs/locks: Export F_LAYOUT lease to user space

2019-06-12 Thread Jan Kara
On Tue 11-06-19 14:38:13, Ira Weiny wrote:
> On Sun, Jun 09, 2019 at 09:00:24AM -0400, Jeff Layton wrote:
> > On Wed, 2019-06-05 at 18:45 -0700, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > > 
> > > GUP longterm pins of non-pagecache file system pages (eg FS DAX) are
> > > currently disallowed because they are unsafe.
> > > 
> > > The danger for pinning these pages comes from the fact that hole punch
> > > and/or truncate of those files results in the pages being mapped and
> > > pinned by a user space process while DAX has potentially allocated those
> > > pages to other processes.
> > > 
> > > Most (All) users who are mapping FS DAX pages for long term pin purposes
> > > (such as RDMA) are not going to want to deallocate these pages while
> > > those pages are in use.  To do so would mean the application would lose
> > > data.  So the use case for allowing truncate operations of such pages
> > > is limited.
> > > 
> > > However, the kernel must protect itself and users from potential
> > > mistakes and/or malicious user space code.  Rather than disabling long
> > > term pins as is done now, allow users who know they are going to
> > > be pinning this memory to alert the file system of this intention.
> > > Furthermore, allow users to be alerted such that they can react if a
> > > truncate operation occurs for some reason.
> > > 
> > > Example user space pseudocode for a user using RDMA and wanting to allow
> > > a truncate would look like this:
> > > 
> > > lease_break_sigio_handler() {
> > > ...
> > >   if (sigio.fd == rdma_fd) {
> > >   complete_rdma_operations(...);
> > >   ibv_dereg_mr(mr);
> > >   close(rdma_fd);
> > >   fcntl(rdma_fd, F_SETLEASE, F_UNLCK);
> > >   }
> > > }
> > > 
> > > setup_rdma_to_dax_file() {
> > > ...
> > >   rdma_fd = open(...)
> > >   fcntl(rdma_fd, F_SETLEASE, F_LAYOUT);
> > 
> > I'm not crazy about this interface. F_LAYOUT doesn't seem to be in the
> > same category as F_RDLCK/F_WRLCK/F_UNLCK.
> > 
> > Maybe instead of F_SETLEASE, this should use new
> > F_SETLAYOUT/F_GETLAYOUT cmd values? There is nothing that would prevent
> > you from setting both a lease and a layout on a file, and indeed knfsd
> > can set both.
> > 
> > This interface seems to conflate the two.
> 
> I've been feeling the same way.  This is why I was leaning toward a new lease
> type.  I called it "F_LONGTERM" but the name is not important.
> 
> I think the concept of adding "exclusive" to the layout lease can fix this
> because the NFS lease is non-exclusive where the user space one (for the
> purpose of GUP pinning) would need to be.
> 
> FWIW I have not worked out exactly what this new "exclusive" code will look
> like.  Jan said:
> 
>   "There actually is support for locks that are not broken after a given
>   timeout so there shouldn't be too many changes needed."
> 
> But I'm not seeing that for Lease code.  So I'm working on something for the
> lease code now.

Yeah, sorry for misleading you. Somehow I thought that if lease_break_time
== 0 we would wait indefinitely, but when checking the code again that
doesn't seem to be the case.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-07 Thread Jan Kara
On Thu 06-06-19 15:03:30, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > On Wed 05-06-19 18:45:33, ira.we...@intel.com wrote:
> > > From: Ira Weiny 
> > 
> > So I'd like to actually mandate that you *must* hold the file lease until
> > you unpin all pages in the given range (not just that you have an option to
> > hold a lease). And I believe the kernel should actually enforce this. That
> > way we maintain a sane state that if someone uses a physical location of
> > logical file offset on disk, he has a layout lease. Also once this is done,
> > sysadmin has a reasonably easy way to discover run-away RDMA application
> > and kill it if he wishes so.
> 
> Fair enough.
> 
> I was kind of heading that direction but had not thought this far forward.  I
> was exploring how to have a lease remain on the file even after a "lease
> break".  But that is incompatible with the current semantics of a "layout"
> lease (as currently defined in the kernel).  [In the end I wanted to get an 
> RFC
> out to see what people think of this idea so I did not look at keeping the
> lease.]
> 
> Also hitch is that currently a lease is forcefully broken after
> /lease-break-time.  To do what you suggest I think we would need a new
> lease type with the semantics you describe.

I'd do what Dave suggested - add a flag to mark the lease as unbreakable by
truncate and teach the file locking core to handle that. There actually is
support for locks that are not broken after a given timeout so there
shouldn't be too many changes needed.
 
> Previously I had thought this would be a good idea (for other reasons).  But
> what does everyone think about using a "longterm lease" similar to [1] which
> has the semantics you proppose?  In [1] I was not sure "longterm" was a good
> name but with your proposal I think it makes more sense.

As I wrote elsewhere in this thread I think the FL_LAYOUT name still makes
sense and I'd add FL_UNBREAKABLE there to mark the unusual behavior with
truncate.

> > - probably I'd just transition all gup_longterm()
> > users to a saner API similar to the one we have in mm/frame_vector.c where
> > we don't hand out page pointers but an encapsulating structure that does
> > all the necessary tracking.
> 
> I'll take a look at that code.  But that seems like a pretty big change.

I was looking into that yesterday before proposing this and there aren't
that many gup_longterm() users and most of them just stick the pages
array into their tracking structure and then release them once done. So it
shouldn't be that complex to convert to a new convention (and you have to
touch all gup_longterm() users anyway to teach them to track leases etc.).

> > Removing a lease would need to block until all
> > pins are released - this is probably the most hairy part since we need to
> > handle a case if application just closes the file descriptor which would
> > release the lease but OTOH we need to make sure task exit does not deadlock.
> > Maybe we could block only on explicit lease unlock and just drop the layout
> > lease on file close and if there are still pinned pages, send SIGKILL to an
> > application as a reminder it did something stupid...
> 
> As presented at LSFmm I'm not opposed to killing a process which does not
> "follow the rules".  But I'm concerned about how to handle this across a fork.
> 
> Limiting the open()/LEASE/GUP/close()/SIGKILL to a specific pid "leak"'s pins
> to a child through the RDMA context.  This was the major issue Jason had with
> the SIGBUS proposal.
> 
> Always sending a SIGKILL would prevent an RDMA process from doing something
> like system("ls") (would kill the child unnecessarily).  Are we ok with that?

I answered this in another email but system("ls") won't kill anybody.
fork(2) just creates a new file descriptor for the same file and possibly
then closes it, but since there is still another file descriptor for the
same struct file, the "close" code won't trigger.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-07 Thread Jan Kara
On Thu 06-06-19 15:22:28, Ira Weiny wrote:
> On Thu, Jun 06, 2019 at 04:51:15PM -0300, Jason Gunthorpe wrote:
> > On Thu, Jun 06, 2019 at 12:42:03PM +0200, Jan Kara wrote:
> > 
> > > So I'd like to actually mandate that you *must* hold the file lease until
> > > you unpin all pages in the given range (not just that you have an option 
> > > to
> > > hold a lease). And I believe the kernel should actually enforce this. That
> > > way we maintain a sane state that if someone uses a physical location of
> > > logical file offset on disk, he has a layout lease. Also once this is 
> > > done,
> > > sysadmin has a reasonably easy way to discover run-away RDMA application
> > > and kill it if he wishes so.
> > > 
> > > The question is on how to exactly enforce that lease is taken until all
> > > pages are unpinned. I belive it could be done by tracking number of
> > > long-term pinned pages within a lease. Gup_longterm could easily increment
> > > the count when verifying the lease exists, gup_longterm users will somehow
> > > need to propagate corresponding 'filp' (struct file pointer) to
> > > put_user_pages_longterm() callsites so that they can look up appropriate
> > > lease to drop reference - probably I'd just transition all gup_longterm()
> > > users to a saner API similar to the one we have in mm/frame_vector.c where
> > > we don't hand out page pointers but an encapsulating structure that does
> > > all the necessary tracking. Removing a lease would need to block until all
> > > pins are released - this is probably the most hairy part since we need to
> > > handle a case if application just closes the file descriptor which
> > > would
> > 
> > I think if you are going to do this then the 'struct filp' that
> > represents the lease should be held in the kernel (ie inside the RDMA
> > umem) until the kernel is done with it.
> 
> Yea there seems merit to this.  I'm still not resolving how this helps track
> who has the pin across a fork.

Yes, my thought was that gup_longterm() would return a structure that would
be tracking filp (or whatever is needed) and that would be embedded inside
RDMA umem.

> > Actually does someone have a pointer to this userspace lease API, I'm
> > not at all familiar with it, thanks
> 
> man fcntl
>   search for SETLEASE
> 
> But I had to add the F_LAYOUT lease type.  (Personally I'm for calling it
> F_LONGTERM at this point.  I don't think LAYOUT is compatible with what we are
> proposing here.)

I think F_LAYOUT still expresses it pretty well. The lease is pinning
logical->physical file offset mapping, i.e. the file layout.

> > 
> > And yes, a better output format from GUP would be great..
> > 
> > > Maybe we could block only on explicit lease unlock and just drop the layout
> > > lease on file close and if there are still pinned pages, send SIGKILL to an
> > > application as a reminder it did something stupid...
> > 
> > Which process would you SIGKILL? At least for the rdma case a FD is
> > holding the GUP, so to do the put_user_pages() the kernel needs to
> > close the FD. I guess it would have to kill every process that has the
> > FD open? Seems complicated...
> 
> Tending to agree...  But I'm still not opposed to killing bad actors...  ;-)
> 
> NOTE: Jason I think you need to be more clear about the FD you are speaking of.
> I believe you mean the FD which refers to the RMDA context.  That is what I
> called it in my other email.

I keep forgetting that the file with RDMA context may be held by multiple
processes so thanks for correcting me. My proposal with SIGKILL was jumping
to conclusions too quickly :) We have two struct files here: a file with RDMA
context that effectively is the owner of the page pins (let's call it
"context file") and a file which is mapped and on which we hold the lease and
whose blocks (pages) we are pinning (let's call it "buffer file"). Now once
buffer file is closed (and this means that all file descriptors pointing to
this struct file are closed - so just one child closing the file descriptor
won't trigger this) we need to release the lease and I want to have a way
of safely releasing remaining pins associated with this lease as well.
Because the pins would be invisible to sysadmin from that point on. Now if
the context file would be open only by the process closing the buffer file,
SIGKILL would work as that would close the buffer file as a side effect.
But as you properly pointed out, that's not necessarily the case. Walking
processes that have the context file open is technically complex and too
ugly to live so we have to come up with something better. The best I can
currently come up with is to have a method associated with the lease that
would invalidate the RDMA context that holds the pins in the same way that
a file close would do it.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 07/10] fs/ext4: Fail truncate if pages are GUP pinned

2019-06-06 Thread Jan Kara
On Wed 05-06-19 18:45:40, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> If pages are actively gup pinned fail the truncate operation.
> 
> Signed-off-by: Ira Weiny 
> ---
>  fs/ext4/inode.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 75f543f384e4..1ded83ec08c0 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4250,6 +4250,9 @@ int ext4_break_layouts(struct inode *inode, loff_t offset, loff_t len)
>   if (!page)
>   return 0;
>  
> + if (page_gup_pinned(page))
> + return -ETXTBSY;
> +
>   error = ___wait_var_event(&page->_refcount,
>   atomic_read(&page->_refcount) == 1,
>   TASK_INTERRUPTIBLE, 0, 0,

This caught my eye. Does this mean that truncate for a file which has
temporary gup users (such as buffers for DIO) can now fail with ETXTBSY? That
doesn't look desirable. If we would mandate layout lease while pages are
pinned as I suggested, this could be dealt with by checking for leases with
pins (breaking such lease would return error and not break it) and if
breaking leases succeeds (i.e., there are no long-term pinned pages), we'd
just wait for the remaining references as we do now.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH RFC 00/10] RDMA/FS DAX truncate proposal

2019-06-06 Thread Jan Kara
ODP
>   2) ODP may be detrimental to the over all network (cluster or cloud)
>  performance
> 
> Therefore, in order to support RDMA to File system pages without On Demand
> Paging (ODP) a number of things need to be done.
> 
> 1) GUP "longterm" users need to inform the other subsystems that they have
>taken a pin on a page which may remain pinned for a very "long time".[3]
> 
> 2) Any page which is "controlled" by a file system needs to have special
>handling.  The details of the handling depends on if the page is page cache
>fronted or not.
> 
>2a) A page cache fronted page which has been pinned by GUP long term can use a
>bounce buffer to allow the file system to write back snapshots of the page.
>This is handled by the FS recognizing the GUP long term pin and making a copy
>of the page to be written back.
>   NOTE: this patch set does not address this path.
> 
>2b) A FS "controlled" page which is not page cache fronted is either easier
>to deal with or harder depending on the operation the filesystem is trying
>to do.
> 
>   2ba) [Hard case] If the FS operation _is_ a truncate or hole punch the
>   FS can no longer use the pages in question until the pin has been
>   removed.  This patch set presents a solution to this by introducing
>   some reasonable restrictions on user space applications.
> 
>   2bb) [Easy case] If the FS operation is _not_ a truncate or hole punch
>   then there is nothing which need be done.  Data is Read or Written
>   directly to the page.  This is an easy case which would currently work
>   if not for GUP long term pins being disabled.  Therefore this patch set
>   need not change access to the file data but does allow for GUP pins
>   after 2ba above is dealt with.
> 
> 
> This patch series presents a solution for problem 2ba)
> 
> [1] https://github.com/johnhubbard/linux/tree/gup_dma_core
> 
> [2] ext4/dev branch:
> 
> - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/log/?h=dev
> 
>   Specific patches:
> 
>   [2a] ext4: wait for outstanding dio during truncate in nojournal mode
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=82a25b027ca48d7ef197295846b352345853dfa8
> 
>   [2b] ext4: do not delete unlinked inode from orphan list on failed truncate
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=ee0ed02ca93ef1ecf8963ad96638795d55af2c14
> 
>   [2c] ext4: gracefully handle ext4_break_layouts() failure during truncate
> 
>   - https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=b9c1c26739ec2d4b4fb70207a0a9ad6747e43f4c
> 
> [3] The definition of "long time" is debatable but it has been established
> that RDMA's use of pages minutes or hours after the pin is the extreme case
> which makes this problem most severe.
> 
> 
> Ira Weiny (10):
>   fs/locks: Add trace_leases_conflict
>   fs/locks: Export F_LAYOUT lease to user space
>   mm/gup: Pass flags down to __gup_device_huge* calls
>   mm/gup: Ensure F_LAYOUT lease is held prior to GUP'ing pages
>   fs/ext4: Teach ext4 to break layout leases
>   fs/ext4: Teach dax_layout_busy_page() to operate on a sub-range
>   fs/ext4: Fail truncate if pages are GUP pinned
>   fs/xfs: Teach xfs to use new dax_layout_busy_page()
>   fs/xfs: Fail truncate if pages are GUP pinned
>   mm/gup: Remove FOLL_LONGTERM DAX exclusion
> 
>  fs/Kconfig   |   1 +
>  fs/dax.c |  38 ++---
>  fs/ext4/ext4.h   |   2 +-
>  fs/ext4/extents.c|   6 +-
>  fs/ext4/inode.c  |  26 +--
>  fs/locks.c   |  97 ---
>  fs/xfs/xfs_file.c|  24 --
>  fs/xfs/xfs_inode.h   |   5 +-
>  fs/xfs/xfs_ioctl.c   |  15 +++-
>  fs/xfs/xfs_iops.c|  14 +++-
>  fs/xfs/xfs_pnfs.c|  14 ++--
>  include/linux/dax.h  |   9 ++-
>  include/linux/fs.h   |   2 +-
>  include/linux/mm.h   |   2 +
>  include/trace/events/filelock.h  |  35 +
>  include/uapi/asm-generic/fcntl.h |   3 +
>  mm/gup.c | 129 ---
>  mm/huge_memory.c |  12 +++
>  18 files changed, 299 insertions(+), 135 deletions(-)
> 
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Move MAP_SYNC to asm-generic/mman-common.h

2019-05-28 Thread Jan Kara
On Tue 28-05-19 14:41:20, Aneesh Kumar K.V wrote:
> This enables support for synchronous DAX fault on powerpc
> 
> The generic changes are added as part of
> commit b6fb293f2497 ("mm: Define MAP_SYNC and VM_SYNC flags")
> 
> Without this, mmap returns EOPNOTSUPP for MAP_SYNC with MAP_SHARED_VALIDATE
> 
> Instead of adding MAP_SYNC with same value to
> arch/powerpc/include/uapi/asm/mman.h, I am moving the #define to
> asm-generic/mman-common.h. Two architectures using mman-common.h directly are
> sparc and powerpc. We should be able to consolidate more #defines to
> mman-common.h. That can be done as a separate patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Looks good to me FWIW (I don't have much experience with mmap flags and
their peculiarities). So feel free to add:

Reviewed-by: Jan Kara 

Honza
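For readers unfamiliar with the flag, the userspace side of MAP_SYNC looks
roughly as follows. This is a minimal sketch, not part of the patch: the
helper name and the temp-file usage are illustrative, and on a non-DAX
filesystem (or on powerpc without this patch) the mmap() fails with
EOPNOTSUPP exactly as the commit message describes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* MAP_SYNC / MAP_SHARED_VALIDATE may be missing from older libc headers;
 * the fallback values match include/uapi/asm-generic/mman-common.h. */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/* Try to establish a synchronous-page-fault mapping of 'path'.
 * Returns 0 on success, or the errno from mmap() on failure:
 * EOPNOTSUPP on filesystems without DAX/MAP_SYNC support, EINVAL on
 * kernels that don't know MAP_SHARED_VALIDATE at all. */
int try_map_sync(const char *path, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return errno;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	int err = (p == MAP_FAILED) ? errno : 0;
	if (p != MAP_FAILED)
		munmap(p, len);
	close(fd);
	return err;
}
```

MAP_SYNC must be combined with MAP_SHARED_VALIDATE so that old kernels,
which silently ignore unknown mmap flags with plain MAP_SHARED, instead
fail the call and the application can tell the flag was not honored.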

> ---
> Changes from V1:
> * Move #define to mman-common.h instead of powerpc specific mman.h change
> 
> 
>  include/uapi/asm-generic/mman-common.h | 3 ++-
>  include/uapi/asm-generic/mman.h| 1 -
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h 
> b/include/uapi/asm-generic/mman-common.h
> index abd238d0f7a4..bea0278f65ab 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -25,7 +25,8 @@
>  # define MAP_UNINITIALIZED 0x0   /* Don't support this flag */
>  #endif
>  
> -/* 0x0100 - 0x8 flags are defined in asm-generic/mman.h */
> +/* 0x0100 - 0x4 flags are defined in asm-generic/mman.h */
> +#define MAP_SYNC 0x08 /* perform synchronous page faults for 
> the mapping */
>  #define MAP_FIXED_NOREPLACE  0x10/* MAP_FIXED which doesn't 
> unmap underlying mapping */
>  
>  /*
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 653687d9771b..2dffcbf705b3 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -13,7 +13,6 @@
>  #define MAP_NONBLOCK 0x1 /* do not block on IO */
>  #define MAP_STACK0x2 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x4 /* create a huge page mapping */
> -#define MAP_SYNC 0x80000 /* perform synchronous page faults for 
> the mapping */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> -- 
> 2.21.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] jbd2: fix typo in comment of journal_submit_inode_data_buffers

2019-05-27 Thread Jan Kara
On Sat 25-05-19 17:12:51, Liu Song wrote:
> From: Liu Song 
> 
> delayed/dealyed
> 
> Signed-off-by: Liu Song 

Thanks. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/jbd2/commit.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 150cc030b4d7..0395ca016235 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -184,7 +184,7 @@ static int journal_wait_on_commit_record(journal_t 
> *journal,
>  /*
>   * write the filemap data using writepage() address_space_operations.
>   * We don't do block allocation here even for delalloc. We don't
> - * use writepages() because with dealyed allocation we may be doing
> + * use writepages() because with delayed allocation we may be doing
>   * block allocation in writepages().
>   */
>  static int journal_submit_inode_data_buffers(struct address_space *mapping)
> -- 
> 2.20.1
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fanotify: remove redundant capable(CAP_SYS_ADMIN)s

2019-05-23 Thread Jan Kara
On Thu 23-05-19 15:35:18, Christian Brauner wrote:
> So let's say the user tells me:
> - When the "/A/B/C/target" file appears on the host filesystem,
>   please give me access to "target" in the container at a path I tell
>   you.
> What I do right now is listen for the creation of the "target" file.
> But at the time the user gives me instructions to listen for
> "/A/B/C/target" only /A might exist and so I currently add a watch on A/
> and then wait for the creation of B/, then wait for the creation of C/
> and finally for the creation of "target" (Of course, I also need to
handle B/ and C/ being removed again and recreated and so on.). It would
> be helpful, if I could specify, give me notifications, recursively for
> e.g. A/ without me having to place extra watches on B/ and C/ when they
> appear. Maybe that's out of scope...

I see. But this is going to be painful whatever you do. Consider for
example situation like:

mkdir -p BAR/B/C/
touch BAR/B/C/target
mv BAR A

Or even a situation where several renames race so that the end result creates
the name (or does not create it, depending on how the renames race). And by
the time you decide A/B/C/target exists, it need not exist anymore.
Honestly, I don't see how you could implement *any* solution in a sane
way. About the most reliable and simple option would be to stat "A/B/C/target"
once per second, as dumb as it is.
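The dumb fallback really is this small. A hedged userspace sketch (not
kernel code, and the function name is made up): it just tries to resolve
the full path once per second, which sidesteps every rename/recreate race
at the cost of up to a second of latency.

```c
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Poll for 'path' to appear, checking once per second.  Returns 0 as
 * soon as the whole path (e.g. A/B/C/target) resolves, -1 if it still
 * doesn't exist after 'max_secs' seconds.  No watches to place on
 * intermediate directories, nothing to re-arm when B/ or C/ go away. */
int wait_for_path(const char *path, unsigned int max_secs)
{
	struct stat st;
	unsigned int waited = 0;

	for (;;) {
		if (stat(path, &st) == 0)
			return 0;	/* full chain resolves right now */
		if (waited >= max_secs)
			return -1;
		sleep(1);
		waited++;
	}
}
```

Note that even this only tells you the path resolved at the moment of the
stat() call; as discussed above, nothing prevents it from disappearing
again before the caller acts on the result.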

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller

2019-05-20 Thread Jan Kara
On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> > I've addressed these issues in my last batch of improvements for
> > BFQ, which landed in the upcoming 5.2. If you give it a try, and
> > still see the problem, then I'll be glad to reproduce it, and
> > hopefully fix it for you.
> 
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
> 
> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), to attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes the process's CSS.  If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write.  OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and especially, the fact that work from other
> cgroups is going to be minimal, especially if hadn't issued an
> fsync().

But this makes priority-inversion problems with ext4 journal worse, doesn't
it? If we submit journal commit in blkio cgroup of some random process, it
may get throttled which then effectively blocks the whole filesystem. Or do
you want to implement a more complex back-pressure mechanism where you'd
just account to a different blkio cgroup during journal commit and then
throttle at a different point where you are not blocking other tasks from
making progress?

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups.  But it's probably not worth doing that...
> 
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charge the data=ordered writeback to each block cgroup.  Worse,
> the commit can't complete until the all of the data integrity
> writebacks have completed.  And if there are N cgroups with dirty
> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.

Yeah. At least in some cases, we know there won't be any more IO from a
particular cgroup in the near future (e.g. transaction commit completing,
or when the layers above IO scheduler already know which IO they are going
to submit next) and in that case idling is just a waste of time. But so far
I haven't decided what a reasonably clean interface for this should look like
that isn't specific to a particular IO scheduler implementation.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-20 Thread Jan Kara
On Sat 18-05-19 21:46:03, Dan Williams wrote:
> On Fri, May 17, 2019 at 12:25 PM Kees Cook  wrote:
> > On Fri, May 17, 2019 at 10:28:48AM -0700, Dan Williams wrote:
> > > It seems dax_iomap_actor() is not a path where we'd be worried about
> > > needing hardened user copy checks.
> >
> > I would agree: I think the proposed patch makes sense. :)
> 
> Sounds like an acked-by to me.

Yeah, if Kees agrees, I'm fine with skipping the checks as well. I just
wanted that to be clarified. Also it helped me that you wrote:

That routine (dax_iomap_actor()) validates that the logical file offset is
within bounds of the file, then it does a sector-to-pfn translation which
validates that the physical mapping is within bounds of the block device.

That is more specific than "dax_iomap_actor() takes care of necessary
checks" which was in the changelog. And the above paragraph helped me
clarify which checks in dax_iomap_actor() you think replace those usercopy
checks. So I think it would be good to add that paragraph to those
copy_from_pmem() functions as a comment just in case we are wondering in
the future why we are skipping the checks... Also feel free to add:

Acked-by: Jan Kara 

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead

2019-05-17 Thread Jan Kara
Let's add Kees to CC for usercopy expertise...

On Thu 16-05-19 17:33:38, Dan Williams wrote:
> Jeff discovered that performance improves from ~375K iops to ~519K iops
> on a simple psync-write fio workload when moving the location of 'struct
> page' from the default PMEM location to DRAM. This result is surprising
> because the expectation is that 'struct page' for dax is only needed for
> third party references to dax mappings. For example, a dax-mapped buffer
> passed to another system call for direct-I/O requires 'struct page' for
> sending the request down the driver stack and pinning the page. There is
> no usage of 'struct page' for first party access to a file via
> read(2)/write(2) and friends.
> 
> However, this "no page needed" expectation is violated by
> CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
> copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
> check_heap_object() helper routine assumes the buffer is backed by a
> page-allocator DRAM page and applies some checks.  Those checks are
> invalid, dax pages are not from the heap, and redundant,
> dax_iomap_actor() has already validated that the I/O is within bounds.

So this last paragraph is not obvious to me as check_copy_size() does a lot
of various checks in CONFIG_HARDENED_USERCOPY case. I agree that some of
those checks don't make sense for PMEM pages but I'd rather handle that by
refining check_copy_size() and check_object_size() functions to detect and
appropriately handle pmem pages rather that generally skip all the checks
in pmem_copy_from/to_iter(). And yes, every check in such hot path is going
to cost performance but that's what user asked for with
CONFIG_HARDENED_USERCOPY... Kees?

Honza

> 
> Bypass this overhead and call the 'no check' versions of the
> copy_{to,from}_iter operations directly.
> 
> Fixes: 0aed55af8834 ("x86, uaccess: introduce copy_from_iter_flushcache...")
> Cc: Jan Kara 
> Cc: 
> Cc: Jeff Moyer 
> Cc: Ingo Molnar 
> Cc: Christoph Hellwig 
> Cc: Al Viro 
> Cc: Thomas Gleixner 
> Cc: Matthew Wilcox 
> Reported-and-tested-by: Jeff Smits 
> Signed-off-by: Dan Williams 
> ---
>  drivers/nvdimm/pmem.c |9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index 845c5b430cdd..c894f45e5077 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -281,16 +281,21 @@ static long pmem_dax_direct_access(struct dax_device 
> *dax_dev,
>   return __pmem_direct_access(pmem, pgoff, nr_pages, kaddr, pfn);
>  }
>  
> +/*
> + * Use the 'no check' versions of copy_from_iter_flushcache() and
> + * copy_to_iter_mcsafe() to bypass HARDENED_USERCOPY overhead. Bounds
> + * checking is handled by dax_iomap_actor()
> + */
>  static size_t pmem_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_from_iter_flushcache(addr, bytes, i);
> + return _copy_from_iter_flushcache(addr, bytes, i);
>  }
>  
>  static size_t pmem_copy_to_iter(struct dax_device *dax_dev, pgoff_t pgoff,
>   void *addr, size_t bytes, struct iov_iter *i)
>  {
> - return copy_to_iter_mcsafe(addr, bytes, i);
> + return _copy_to_iter_mcsafe(addr, bytes, i);
>  }
>  
>  static const struct dax_operations pmem_dax_ops = {
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] jbd2: fix some print format mistakes

2019-05-17 Thread Jan Kara
On Fri 17-05-19 14:19:51, Gaowei Pu wrote:
> There are some print format mistakes in debug messages. Fix them.
> 
> Signed-off-by: Gaowei Pu 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/jbd2/journal.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 37e16d969925..565e99b67b30 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -203,7 +203,7 @@ static int kjournald2(void *arg)
>   if (journal->j_flags & JBD2_UNMOUNT)
>   goto end_loop;
>  
> - jbd_debug(1, "commit_sequence=%d, commit_request=%d\n",
> + jbd_debug(1, "commit_sequence=%u, commit_request=%u\n",
>   journal->j_commit_sequence, journal->j_commit_request);
>  
>   if (journal->j_commit_sequence != journal->j_commit_request) {
> @@ -324,7 +324,7 @@ static void journal_kill_thread(journal_t *journal)
>   * IO is in progress. do_get_write_access() handles this.
>   *
>   * The function returns a pointer to the buffer_head to be used for IO.
> - * 
> + *
>   *
>   * Return value:
>   *  <0: Error
> @@ -500,7 +500,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t 
> target)
>*/
>  
>   journal->j_commit_request = target;
> - jbd_debug(1, "JBD2: requesting commit %d/%d\n",
> + jbd_debug(1, "JBD2: requesting commit %u/%u\n",
> journal->j_commit_request,
> journal->j_commit_sequence);
>   journal->j_running_transaction->t_requested = jiffies;
> @@ -513,7 +513,7 @@ int __jbd2_log_start_commit(journal_t *journal, tid_t 
> target)
>   WARN_ONCE(1, "JBD2: bad log_start_commit: %u %u %u %u\n",
> journal->j_commit_request,
> journal->j_commit_sequence,
> -   target, journal->j_running_transaction ? 
> +   target, journal->j_running_transaction ?
> journal->j_running_transaction->t_tid : 0);
>   return 0;
>  }
> @@ -698,12 +698,12 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  #ifdef CONFIG_JBD2_DEBUG
>   if (!tid_geq(journal->j_commit_request, tid)) {
>   printk(KERN_ERR
> -"%s: error: j_commit_request=%d, tid=%d\n",
> +"%s: error: j_commit_request=%u, tid=%u\n",
>  __func__, journal->j_commit_request, tid);
>   }
>  #endif
>   while (tid_gt(tid, journal->j_commit_sequence)) {
> - jbd_debug(1, "JBD2: want %d, j_commit_sequence=%d\n",
> + jbd_debug(1, "JBD2: want %u, j_commit_sequence=%u\n",
> tid, journal->j_commit_sequence);
>   read_unlock(&journal->j_state_lock);
>   wake_up(&journal->j_wait_commit);
> @@ -944,7 +944,7 @@ int __jbd2_update_log_tail(journal_t *journal, tid_t tid, 
> unsigned long block)
>  
>   trace_jbd2_update_log_tail(journal, tid, block, freed);
>   jbd_debug(1,
> -   "Cleaning journal tail from %d to %d (offset %lu), "
> +   "Cleaning journal tail from %u to %u (offset %lu), "
> "freeing %lu\n",
> journal->j_tail_sequence, tid, block, freed);
>  
> @@ -1318,7 +1318,7 @@ static int journal_reset(journal_t *journal)
>*/
>   if (sb->s_start == 0) {
>   jbd_debug(1, "JBD2: Skipping superblock update on recovered sb "
> - "(start %ld, seq %d, errno %d)\n",
> + "(start %ld, seq %u, errno %d)\n",
>   journal->j_tail, journal->j_tail_sequence,
>       journal->j_errno);
>   journal->j_flags |= JBD2_FLUSHED;
> @@ -1453,7 +1453,7 @@ static void jbd2_mark_journal_empty(journal_t *journal, 
> int write_op)
>   return;
>   }
>  
> - jbd_debug(1, "JBD2: Marking journal as empty (seq %d)\n",
> + jbd_debug(1, "JBD2: Marking journal as empty (seq %u)\n",
> journal->j_tail_sequence);
>  
>   sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
> -- 
> 2.21.0
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: INFO: task hung in __get_super

2019-05-16 Thread Jan Kara
On Thu 16-05-19 21:17:14, Tetsuo Handa wrote:
> On 2019/05/16 20:48, Jan Kara wrote:
> > OK, so non-racy fix was a bit more involved and I've ended up just
> > upgrading the file reference to an exclusive one in loop_set_fd() instead
> > of trying to hand-craft some locking solution. The result is attached and
> > it passes blktests.
> 
> blkdev_get() has corresponding blkdev_put().
> bdgrab() does not have corresponding bdput() ?

Yes, and that's hidden inside blkdev_put() (or failing blkdev_get()). Don't
get me started on calling conventions of these functions... I've wasted half
an hour trying to figure out where I'm leaking inode references in my patch
;).

        Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] mm/huge_memory: Fix vmf_insert_pfn_{pmd, pud}() crash, handle unaligned addresses

2019-05-09 Thread Jan Kara
On Thu 09-05-19 09:31:41, Dan Williams wrote:
> Starting with commit c6f3c5ee40c1 "mm/huge_memory.c: fix modifying of
> page protection by insert_pfn_pmd()" vmf_insert_pfn_pmd() internally
> calls pmdp_set_access_flags(). That helper enforces a pmd aligned
> @address argument via VM_BUG_ON() assertion.
> 
> Update the implementation to take a 'struct vm_fault' argument directly
> and apply the address alignment fixup internally to fix crash signatures
> like:
> 
> kernel BUG at arch/x86/mm/pgtable.c:515!
> invalid opcode:  [#1] SMP NOPTI
> CPU: 51 PID: 43713 Comm: java Tainted: G   OE 4.19.35 #1
> [..]
> RIP: 0010:pmdp_set_access_flags+0x48/0x50
> [..]
> Call Trace:
>  vmf_insert_pfn_pmd+0x198/0x350
>  dax_iomap_fault+0xe82/0x1190
>  ext4_dax_huge_fault+0x103/0x1f0
>  ? __switch_to_asm+0x40/0x70
>  __handle_mm_fault+0x3f6/0x1370
>  ? __switch_to_asm+0x34/0x70
>  ? __switch_to_asm+0x40/0x70
>  handle_mm_fault+0xda/0x200
>  __do_page_fault+0x249/0x4f0
>  do_page_fault+0x32/0x110
>  ? page_fault+0x8/0x30
>  page_fault+0x1e/0x30
> 
> Cc: 
> Fixes: c6f3c5ee40c1 ("mm/huge_memory.c: fix modifying of page protection by 
> insert_pfn_pmd()")
> Reported-by: Piotr Balcer 
> Tested-by: Yan Ma 
> Cc: Aneesh Kumar K.V 
> Cc: Chandan Rajendra 
> Cc: Jan Kara 
> Cc: Andrew Morton 
> Cc: Matthew Wilcox 
> Cc: Souptick Joarder 
> Signed-off-by: Dan Williams 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> 
>  drivers/dax/device.c|6 ++
>  fs/dax.c|6 ++
>  include/linux/huge_mm.h |6 ++
>  mm/huge_memory.c|   16 ++--
>  4 files changed, 16 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index e428468ab661..996d68ff992a 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -184,8 +184,7 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax 
> *dev_dax,
>  
>   *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>  
> - return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd, *pfn,
> - vmf->flags & FAULT_FLAG_WRITE);
> + return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
>  }
>  
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> @@ -235,8 +234,7 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax 
> *dev_dax,
>  
>   *pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
>  
> - return vmf_insert_pfn_pud(vmf->vma, vmf->address, vmf->pud, *pfn,
> - vmf->flags & FAULT_FLAG_WRITE);
> + return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
>  }
>  #else
>  static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
> diff --git a/fs/dax.c b/fs/dax.c
> index e5e54da1715f..83009875308c 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -1575,8 +1575,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault 
> *vmf, pfn_t *pfnp,
>   }
>  
>   trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
> - result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
> - write);
> + result = vmf_insert_pfn_pmd(vmf, pfn, write);
>   break;
>   case IOMAP_UNWRITTEN:
>   case IOMAP_HOLE:
> @@ -1686,8 +1685,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, 
> unsigned int order)
>   ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
>  #ifdef CONFIG_FS_DAX_PMD
>   else if (order == PMD_ORDER)
> - ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
> - pfn, true);
> + ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
>  #endif
>   else
>   ret = VM_FAULT_FALLBACK;
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..7cd5c150c21d 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -47,10 +47,8 @@ extern bool move_huge_pmd(struct vm_area_struct *vma, 
> unsigned long old_addr,
>  extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>   unsigned long addr, pgprot_t newprot,
>   int prot_numa);
> -vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> - pmd_t *pmd, pfn_t pfn, bool write);
> -vm_fault_t vmf_insert_pfn_pud(struct vm_

Re: [PATCH] quota: add dqi_dirty_list description to comment of Dquot List Management

2019-05-09 Thread Jan Kara
On Mon 06-05-19 09:39:03, Chengguang Xu wrote:
> Actually there are four lists for dquot management, so add
> the description of dqi_dirty_list to the comment.
> 
> Signed-off-by: Chengguang Xu 

Thanks applied with small addition:

Note that some filesystems do dirty dquot tracking on their own (e.g. in a
journal) and thus don't use dqi_dirty_list.

Honza

> ---
>  fs/quota/dquot.c | 10 +++---
>  1 file changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index fc20e06c56ba..6a236bdaef89 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -223,9 +223,9 @@ static void put_quota_format(struct quota_format_type 
> *fmt)
>  
>  /*
>   * Dquot List Management:
> - * The quota code uses three lists for dquot management: the inuse_list,
> - * free_dquots, and dquot_hash[] array. A single dquot structure may be
> - * on all three lists, depending on its current state.
> + * The quota code uses four lists for dquot management: the inuse_list,
> + * free_dquots, dqi_dirty_list, and dquot_hash[] array. A single dquot
> + * structure may be on some of those lists, depending on its current state.
>   *
>   * All dquots are placed to the end of inuse_list when first created, and 
> this
>   * list is used for invalidate operation, which must look at every dquot.
> @@ -236,6 +236,10 @@ static void put_quota_format(struct quota_format_type 
> *fmt)
>   * dqstats.free_dquots gives the number of dquots on the list. When
>   * dquot is invalidated it's completely released from memory.
>   *
> + * Dirty dquots are added to the dqi_dirty_list of quota_info when marked
> + * dirty, and this list is searched when writing back dirty dquots to the
> + * quota file.
> + *
>   * Dquots with a specific identity (device, type and id) are placed on
>   * one of the dquot_hash[] hash chains. The provides an efficient search
>   * mechanism to locate a specific dquot.
> -- 
> 2.17.2
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/3] jbd2: fix potential double free

2019-05-09 Thread Jan Kara
On Wed 08-05-19 14:38:22, sunny.s.zhang wrote:
> Hi Chengguang,
> 
> 在 2019年05月05日 19:01, Chengguang Xu 写道:
> > When we fail to create the jbd2_inode_cache cache, we will
> > destroy the previously created jbd2_handle_cache cache twice.
> > This patch fixes it by removing first destroy in error path.
> > 
> > Signed-off-by: Chengguang Xu 
> > ---
> >   fs/jbd2/journal.c | 1 -
> >   1 file changed, 1 deletion(-)
> > 
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 382c030cc78b..49797854ccb8 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -2642,7 +2642,6 @@ static int __init jbd2_journal_init_handle_cache(void)
> > jbd2_inode_cache = KMEM_CACHE(jbd2_inode, 0);
> > if (jbd2_inode_cache == NULL) {
> > printk(KERN_EMERG "JBD2: failed to create inode cache\n");
> > -   kmem_cache_destroy(jbd2_handle_cache);
> Maybe we should keep it, and set the jbd2_handle_cache to NULL.
> If there are some changes in the future,  we may forget to change the
> function
> of jbd2_journal_destroy_handle_cache.

So what I'd do is that I'd split initialization of jbd2_inode_cache into a
separate function (and the same for destruction). That more aligns with how
things are currently done in jbd2 and also fixes the problem with double
destruction.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH AUTOSEL 4.14 50/95] fsnotify: generalize handling of extra event flags

2019-05-07 Thread Jan Kara
On Tue 07-05-19 01:37:39, Sasha Levin wrote:
> From: Amir Goldstein 
> 
> [ Upstream commit 007d1e8395eaa59b0e7ad9eb2b53a40859446a88 ]
> 
> FS_EVENT_ON_CHILD gets a special treatment in fsnotify() because it is
> not a flag specifying an event type, but rather an extra flags that may
> be reported along with another event and control the handling of the
> event by the backend.
> 
> FS_ISDIR is also an "extra flag" and not an "event type" and therefore
> deserves the same treatment. With inotify/dnotify backends it was never
> possible to set FS_ISDIR in mark masks, so it did not matter.
> With fanotify backend, mark adding code jumps through hoops to avoid
> setting the FS_ISDIR in the cumulative object mask.
> 
> Separate the constant ALL_FSNOTIFY_EVENTS to ALL_FSNOTIFY_FLAGS and
> ALL_FSNOTIFY_EVENTS, so the latter can be used to test for specific
> event types.
> 
> Signed-off-by: Amir Goldstein 
> Signed-off-by: Jan Kara 
> Signed-off-by: Sasha Levin 

Sasha, why did you select this patch? It is just a cleanup with no user
visible effect and was done mostly to simplify implementing following
features...

Honza

> ---
>  fs/notify/fsnotify.c | 7 +++
>  include/linux/fsnotify_backend.h | 9 +++--
>  2 files changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 506da82ff3f1..dc080c642dd0 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -192,7 +192,7 @@ static int send_to_group(struct inode *to_tell,
>struct fsnotify_iter_info *iter_info)
>  {
>   struct fsnotify_group *group = NULL;
> - __u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);
> + __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
>   __u32 marks_mask = 0;
>   __u32 marks_ignored_mask = 0;
>  
> @@ -256,8 +256,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, const 
> void *data, int data_is,
>   struct fsnotify_iter_info iter_info;
>   struct mount *mnt;
>   int ret = 0;
> - /* global tests shouldn't care about events on child only the specific 
> event */
> - __u32 test_mask = (mask & ~FS_EVENT_ON_CHILD);
> + __u32 test_mask = (mask & ALL_FSNOTIFY_EVENTS);
>  
>   if (data_is == FSNOTIFY_EVENT_PATH)
>   mnt = real_mount(((const struct path *)data)->mnt);
> @@ -380,7 +379,7 @@ static __init int fsnotify_init(void)
>  {
>   int ret;
>  
> - BUG_ON(hweight32(ALL_FSNOTIFY_EVENTS) != 23);
> + BUG_ON(hweight32(ALL_FSNOTIFY_BITS) != 23);
>  
>   ret = init_srcu_struct(&fsnotify_mark_srcu);
>   if (ret)
> diff --git a/include/linux/fsnotify_backend.h 
> b/include/linux/fsnotify_backend.h
> index ce74278a454a..81052313adeb 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -67,15 +67,20 @@
>  
>  #define ALL_FSNOTIFY_PERM_EVENTS (FS_OPEN_PERM | FS_ACCESS_PERM)
>  
> +/* Events that can be reported to backends */
>  #define ALL_FSNOTIFY_EVENTS (FS_ACCESS | FS_MODIFY | FS_ATTRIB | \
>FS_CLOSE_WRITE | FS_CLOSE_NOWRITE | FS_OPEN | \
>FS_MOVED_FROM | FS_MOVED_TO | FS_CREATE | \
>FS_DELETE | FS_DELETE_SELF | FS_MOVE_SELF | \
>FS_UNMOUNT | FS_Q_OVERFLOW | FS_IN_IGNORED | \
> -  FS_OPEN_PERM | FS_ACCESS_PERM | FS_EXCL_UNLINK | \
> -  FS_ISDIR | FS_IN_ONESHOT | FS_DN_RENAME | \
> +  FS_OPEN_PERM | FS_ACCESS_PERM | FS_DN_RENAME)
> +
> +/* Extra flags that may be reported with event or control handling of events 
> */
> +#define ALL_FSNOTIFY_FLAGS  (FS_EXCL_UNLINK | FS_ISDIR | FS_IN_ONESHOT | \
>FS_DN_MULTISHOT | FS_EVENT_ON_CHILD)
>  
> +#define ALL_FSNOTIFY_BITS   (ALL_FSNOTIFY_EVENTS | ALL_FSNOTIFY_FLAGS)
> +
>  struct fsnotify_group;
>  struct fsnotify_event;
>  struct fsnotify_mark;
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] quota: check time limit when back out space/inode change

2019-04-30 Thread Jan Kara
On Tue 30-04-19 14:40:10, Chengguang Xu wrote:
> When we fail to allocate inode/space, we back out
> the change we already did. In the special case where
> the change has exceeded the soft limit, we should also
> check the time limit and reset it properly.
> 
> Signed-off-by: Chengguang Xu 

Good catch. Thanks for fixing this. I've added the patch to my tree.

Honza

> ---
>  fs/quota/dquot.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index 9d7dfc47c854..58f15a083dd1 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1681,13 +1681,11 @@ int __dquot_alloc_space(struct inode *inode, qsize_t 
> number, int flags)
>   if (!dquots[cnt])
>   continue;
>   spin_lock(&dquots[cnt]->dq_dqb_lock);
> - if (reserve) {
> - dquots[cnt]->dq_dqb.dqb_rsvspace -=
> - number;
> - } else {
> - dquots[cnt]->dq_dqb.dqb_curspace -=
> - number;
> - }
> + if (reserve)
> + dquot_free_reserved_space(dquots[cnt],
> +   number);
> + else
> + dquot_decr_space(dquots[cnt], number);
>   spin_unlock(&dquots[cnt]->dq_dqb_lock);
>   }
>   spin_unlock(&inode->i_lock);
> @@ -1738,7 +1736,7 @@ int dquot_alloc_inode(struct inode *inode)
>   continue;
>   /* Back out changes we already did */
>   spin_lock(&dquots[cnt]->dq_dqb_lock);
> - dquots[cnt]->dq_dqb.dqb_curinodes--;
> + dquot_decr_inodes(dquots[cnt], 1);
>   spin_unlock(&dquots[cnt]->dq_dqb_lock);
>   }
>   goto warn_put_all;
> --
> 2.20.1
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: INFO: task hung in __get_super

2019-04-30 Thread Jan Kara
On Tue 30-04-19 14:18:21, Al Viro wrote:
> On Tue, Apr 30, 2019 at 03:07:39PM +0200, Jan Kara wrote:
> > On Tue 30-04-19 04:11:44, Al Viro wrote:
> > > On Tue, Apr 30, 2019 at 04:55:01AM +0200, Jan Kara wrote:
> > > 
> > > > Yeah, you're right. And if we push the patch a bit further to not take
> > > > loop_ctl_mutex for invalid ioctl number, that would fix the problem. I
> > > > can send a fix.
> > > 
> > > Huh?  We don't take it until in lo_simple_ioctl(), and that patch doesn't
> > > get to its call on invalid ioctl numbers.  What am I missing here?
> > 
> > Doesn't it? blkdev_ioctl() calls into __blkdev_driver_ioctl() for
> > unrecognized ioctl numbers. That calls into lo_ioctl() in case of a loop
> > device. lo_ioctl() calls into lo_simple_ioctl() for ioctl numbers it
> > doesn't recognize and lo_simple_ioctl() will lock loop_ctl_mutex as you
> > say.
> 
> Not with the patch upthread.  lo_ioctl() part was
> 
> @@ -1567,10 +1564,9 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
> mode,
>   case LOOP_SET_BLOCK_SIZE:
>   if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
>   return -EPERM;
> - /* Fall through */
> + return lo_simple_ioctl(lo, cmd, arg);
>   default:
> - err = lo_simple_ioctl(lo, cmd, arg);
> - break;
> + return -EINVAL;
>   }
>  
>   return err;
> 
> so anything unrecognized doesn't make it to lo_simple_ioctl() at all.

Ah, right. I've missed that in your patch. So your patch should be really
fixing the problem. Will you post it officially? Thanks!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: INFO: task hung in __get_super

2019-04-30 Thread Jan Kara
On Tue 30-04-19 04:11:44, Al Viro wrote:
> On Tue, Apr 30, 2019 at 04:55:01AM +0200, Jan Kara wrote:
> 
> > Yeah, you're right. And if we push the patch a bit further to not take
> > loop_ctl_mutex for invalid ioctl number, that would fix the problem. I
> > can send a fix.
> 
> Huh?  We don't take it until in lo_simple_ioctl(), and that patch doesn't
> get to its call on invalid ioctl numbers.  What am I missing here?

Doesn't it? blkdev_ioctl() calls into __blkdev_driver_ioctl() for
unrecognized ioctl numbers. That calls into lo_ioctl() in case of a loop
device. lo_ioctl() calls into lo_simple_ioctl() for ioctl numbers it
doesn't recognize and lo_simple_ioctl() will lock loop_ctl_mutex as you
say.

    Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: INFO: task hung in __get_super

2019-04-29 Thread Jan Kara
955,6 @@ static int loop_set_fd(struct loop_device *lo, fmode_t 
> mode,
>   lo->lo_flags = lo_flags;
>   lo->lo_backing_file = file;
>   lo->transfer = NULL;
> - lo->ioctl = NULL;
>   lo->lo_sizelimit = 0;
>   lo->old_gfp_mask = mapping_gfp_mask(mapping);
>   mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));
> @@ -1064,7 +1063,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool 
> release)
>  
>   loop_release_xfer(lo);
>   lo->transfer = NULL;
> - lo->ioctl = NULL;
>   lo->lo_device = NULL;
>   lo->lo_encryption = NULL;
>   lo->lo_offset = 0;
> @@ -1262,7 +1260,6 @@ loop_set_status(struct loop_device *lo, const struct 
> loop_info64 *info)
>   if (!xfer)
>   xfer = &none_funcs;
>   lo->transfer = xfer->transfer;
> - lo->ioctl = xfer->ioctl;
>  
>   if ((lo->lo_flags & LO_FLAGS_AUTOCLEAR) !=
>(info->lo_flags & LO_FLAGS_AUTOCLEAR))
> @@ -1525,7 +1522,7 @@ static int lo_simple_ioctl(struct loop_device *lo, 
> unsigned int cmd,
>   err = loop_set_block_size(lo, arg);
>   break;
>   default:
> - err = lo->ioctl ? lo->ioctl(lo, cmd, arg) : -EINVAL;
> + err = -EINVAL;
>   }
>   mutex_unlock(&loop_ctl_mutex);
>   return err;
> @@ -1567,10 +1564,9 @@ static int lo_ioctl(struct block_device *bdev, fmode_t 
> mode,
>   case LOOP_SET_BLOCK_SIZE:
>   if (!(mode & FMODE_WRITE) && !capable(CAP_SYS_ADMIN))
>   return -EPERM;
> - /* Fall through */
> + return lo_simple_ioctl(lo, cmd, arg);
>   default:
> - err = lo_simple_ioctl(lo, cmd, arg);
> - break;
> + return -EINVAL;
>   }
>  
>   return err;
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index af75a5ee4094..56a9a0c161d7 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -84,7 +84,6 @@ struct loop_func_table {
>   int (*init)(struct loop_device *, const struct loop_info64 *); 
>   /* release is called from loop_unregister_transfer or clr_fd */
>   int (*release)(struct loop_device *); 
> - int (*ioctl)(struct loop_device *, int cmd, unsigned long arg);
>   struct module *owner;
>  }; 
>  
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] quota: set init_needed flag only when successfully getting dquot

2019-04-29 Thread Jan Kara
On Sun 28-04-19 13:39:21, Chengguang Xu wrote:
> Set init_needed flag only when successfully getting dquot,
> so that we can skip unnecessary subsequent operation.
> 
> Signed-off-by: Chengguang Xu 

Thanks for the patch, but I don't think it's really useful. It will be very
rare that we race with quotaoff or that dqget() fails due to an error. So the
additional overhead of iterating over the dquots doesn't really matter in
that case.

Honza

> ---
>  fs/quota/dquot.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index fc20e06c56ba..8d4ce2a2b5c8 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1449,8 +1449,6 @@ static int __dquot_initialize(struct inode *inode, int 
> type)
>   if (!sb_has_quota_active(sb, cnt))
>   continue;
> 
> - init_needed = 1;
> -
>   switch (cnt) {
>   case USRQUOTA:
>   qid = make_kqid_uid(inode->i_uid);
> @@ -1475,6 +1473,9 @@ static int __dquot_initialize(struct inode *inode, int 
> type)
>   dquot = NULL;
>   }
>   got[cnt] = dquot;
> +
> + if (got[cnt])
> + init_needed = 1;
>   }
> 
>   /* All required i_dquot has been initialized */
> --
> 2.20.1
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] ext4: bad mount opts in no journal mode

2019-04-29 Thread Jan Kara
On Mon 29-04-19 13:31:58, Debabrata Banerjee wrote:
> Fixes:
> commit 1e381f60dad9 ("ext4: do not allow journal_opts for fs w/o journal")
> 
> Instead of removing EXT4_MOUNT_JOURNAL_CHECKSUM from s_def_mount_opt as
> I assume was intended, all other options were blown away leading to
> _ext4_show_options() output being incorrect.
> 
> Signed-off-by: Debabrata Banerjee 

Thanks! You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/super.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6ed4eb81e674..5cdf1d88b5c3 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4238,7 +4238,7 @@ static int ext4_fill_super(struct super_block *sb, void 
> *data, int silent)
>"data=, fs mounted w/o journal");
>   goto failed_mount_wq;
>   }
> - sbi->s_def_mount_opt &= EXT4_MOUNT_JOURNAL_CHECKSUM;
> + sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
>   clear_opt(sb, JOURNAL_CHECKSUM);
>   clear_opt(sb, DATA_FLAGS);
>   sbi->s_journal = NULL;
> -- 
> 2.21.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: bad mount opts in no journal mode

2019-04-29 Thread Jan Kara
On Thu 11-04-19 17:49:17, Debabrata Banerjee wrote:
> Fixes:
> commit 1e381f60dad9 ("ext4: do not allow journal_opts for fs w/o journal")
> 
> Instead of removing EXT4_MOUNT_JOURNAL_CHECKSUM from s_def_mount_opt as
> I assume was intended, all other options were blown away leading to
> _ext4_show_options() output being incorrect. I don't see why this or
> other journal related flags should be removed from s_def_mount_opt
> regardless, it is only used for comparison to display opts, and we
> already made sure they couldn't be set.
> 
> Signed-off-by: Debabrata Banerjee 

So I agree that the clearing is wrong. But I don't agree with just deleting
the line. We could have JOURNAL_CHECKSUM in s_def_mount_opt in nojournal
mode, and in such a case we don't want to show nojournal_checksum as a mount
option in /proc/mounts. So the line should really be fixed to:

sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;

Honza

> ---
>  fs/ext4/super.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6ed4eb81e674..63eef29666e0 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4238,7 +4238,6 @@ static int ext4_fill_super(struct super_block *sb, void 
> *data, int silent)
>"data=, fs mounted w/o journal");
>   goto failed_mount_wq;
>   }
> - sbi->s_def_mount_opt &= EXT4_MOUNT_JOURNAL_CHECKSUM;
>   clear_opt(sb, JOURNAL_CHECKSUM);
>   clear_opt(sb, DATA_FLAGS);
>   sbi->s_journal = NULL;
> -- 
> 2.21.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in fanotify_handle_event

2019-04-28 Thread Jan Kara
On Fri 26-04-19 04:13:00, syzbot wrote:
> Hello,
> 
> syzbot has tested the proposed patch but the reproducer still triggered
> crash:
> general protection fault in fanotify_handle_event
> 
> kasan: CONFIG_KASAN_INLINE enabled
> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault:  [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 13565 Comm: syz-executor.2 Not tainted 5.1.0-rc6+ #1
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:fanotify_get_fsid fs/notify/fanotify/fanotify.c:352 [inline]
> RIP: 0010:fanotify_handle_event+0x7d0/0xc40
> fs/notify/fanotify/fanotify.c:412
> Code: ff ff 48 8b 18 48 8d 7b 68 48 89 f8 48 c1 e8 03 42 80 3c 38 00 0f 85
> 47 04 00 00 48 8b 5b 68 48 8d 7b 3c 48 89 fa 48 c1 ea 03 <42> 0f b6 0c 3a 48
> 89 fa 83 e2 07 83 c2 03 38 ca 7c 08 84 c9 0f 85
> RSP: 0018:888085eb7b78 EFLAGS: 00010203
> RAX: 111013db05ab RBX:  RCX: 81c427ae
> RDX: 0007 RSI: 81c427bb RDI: 003c
> RBP: 888085eb7cc0 R08: 88808b0926c0 R09: 
> R10: 88808b092f90 R11: 88808b0926c0 R12: 0002
> R13:  R14: 0001 R15: dc00
> FS:  7ff801b58700() GS:8880ae80() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f500e416000 CR3: a3f17000 CR4: 001406f0
> Call Trace:
>  send_to_group fs/notify/fsnotify.c:243 [inline]
>  fsnotify+0x725/0xbc0 fs/notify/fsnotify.c:381
>  fsnotify_path include/linux/fsnotify.h:54 [inline]
>  fsnotify_path include/linux/fsnotify.h:47 [inline]
>  fsnotify_modify include/linux/fsnotify.h:263 [inline]
>  vfs_write+0x4dc/0x580 fs/read_write.c:551
>  ksys_write+0x14f/0x2d0 fs/read_write.c:599
>  __do_sys_write fs/read_write.c:611 [inline]
>  __se_sys_write fs/read_write.c:608 [inline]
>  __x64_sys_write+0x73/0xb0 fs/read_write.c:608
>  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x458c29
> Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff
> 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7ff801b57c78 EFLAGS: 0246 ORIG_RAX: 0001
> RAX: ffda RBX: 0003 RCX: 00458c29
> RDX: 0007 RSI: 2080 RDI: 0005
> RBP: 0073bf00 R08:  R09: 
> R10:  R11: 0246 R12: 7ff801b586d4
> R13: 004c8386 R14: 004de8b8 R15: 
> Modules linked in:
> ---[ end trace 1767f2144c7cb47e ]---
> RIP: 0010:fanotify_get_fsid fs/notify/fanotify/fanotify.c:352 [inline]
> RIP: 0010:fanotify_handle_event+0x7d0/0xc40
> fs/notify/fanotify/fanotify.c:412
> Code: ff ff 48 8b 18 48 8d 7b 68 48 89 f8 48 c1 e8 03 42 80 3c 38 00 0f 85
> 47 04 00 00 48 8b 5b 68 48 8d 7b 3c 48 89 fa 48 c1 ea 03 <42> 0f b6 0c 3a 48
> 89 fa 83 e2 07 83 c2 03 38 ca 7c 08 84 c9 0f 85
> RSP: 0018:888085eb7b78 EFLAGS: 00010203
> RAX: 111013db05ab RBX:  RCX: 81c427ae
> RDX: 0007 RSI: 81c427bb RDI: 003c
> RBP: 888085eb7cc0 R08: 88808b0926c0 R09: 
> R10: 88808b092f90 R11: 88808b0926c0 R12: 0002
> R13:  R14: 0001 R15: dc00
> FS:  7ff801b58700() GS:8880ae80() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 7f500e416000 CR3: a3f17000 CR4: 001406f0
> 
> 
> Tested on:
> 
> commit: 8c7008a0 fsnotify: Fix NULL ptr deref in fanotify_get_fsid()
> git tree:
> git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git fsnotify
> console output: https://syzkaller.appspot.com/x/log.txt?x=13020ef4a0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=a42d110b47dd6b36
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)

Let's try new version of the patch:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git 
fsnotify

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in fanotify_handle_event

2019-04-26 Thread Jan Kara
On Thu 18-04-19 10:14:00, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 77115225acc67d9ac4b15f04dd138006b9cd1ef2
> Author: Amir Goldstein 
> Date:   Thu Jan 10 17:04:37 2019 +
> 
> fanotify: cache fsid in fsnotify_mark_connector
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1627632d20
> start commit:   3f018f4a Add linux-next specific files for 20190418
> git tree:   linux-next
> final crash:https://syzkaller.appspot.com/x/report.txt?x=1527632d20
> console output: https://syzkaller.appspot.com/x/log.txt?x=1127632d20
> kernel config:  https://syzkaller.appspot.com/x/.config?x=faa7bdc352fc157e
> dashboard link: https://syzkaller.appspot.com/bug?extid=15927486a4f1bfcbaf91
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=155543d320
> 
> Reported-by: syzbot+15927486a4f1bfcba...@syzkaller.appspotmail.com
> Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector")
> 
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection

Let syzbot test the fix:

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs.git 
fsnotify

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/quota: erase unused but set variable warning

2019-04-25 Thread Jan Kara
On Wed 24-04-19 08:58:57, Jiang Biao wrote:
> The local variable *reserved* of remove_dquot_ref() is only used when
> CONFIG_QUOTA_DEBUG is defined, but it is not enclosed in a
> CONFIG_QUOTA_DEBUG block, which leads to an unused-but-set-variable
> warning when compiling.
> 
> This patch encloses it in a CONFIG_QUOTA_DEBUG block, like what is done
> in add_dquot_ref().
> 
> Signed-off-by: Jiang Biao 

Thanks. I've added the patch to my tree.

Honza

> ---
>  fs/quota/dquot.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index fc20e06c56ba..14ee4c6deba1 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -1049,7 +1049,9 @@ static void remove_dquot_ref(struct super_block *sb, 
> int type,
>   struct list_head *tofree_head)
>  {
>   struct inode *inode;
> +#ifdef CONFIG_QUOTA_DEBUG
>   int reserved = 0;
> +#endif
>  
>   spin_lock(&sb->s_inode_list_lock);
>   list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> @@ -1061,8 +1063,10 @@ static void remove_dquot_ref(struct super_block *sb, 
> int type,
>*/
>   spin_lock(&dq_data_lock);
>   if (!IS_NOQUOTA(inode)) {
> +#ifdef CONFIG_QUOTA_DEBUG
>   if (unlikely(inode_get_rsv_space(inode) > 0))
>   reserved = 1;
> +#endif
>   remove_inode_dquot_ref(inode, type, tofree_head);
>   }
>   spin_unlock(&dq_data_lock);
> -- 
> 2.17.2 (Apple Git-113)
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] quota: fix wrong indentation

2019-04-25 Thread Jan Kara
On Fri 19-04-19 13:46:01, Chengguang Xu wrote:
> We need to check return code only when calling ->read_dqblk(),
> so fix it properly.
> 
> Signed-off-by: Chengguang Xu 

Thanks. Added to my tree.

Honza

> ---
>  fs/quota/dquot.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
> index fc20e06c56ba..934b8f3f9e57 100644
> --- a/fs/quota/dquot.c
> +++ b/fs/quota/dquot.c
> @@ -424,10 +424,11 @@ int dquot_acquire(struct dquot *dquot)
>   struct quota_info *dqopt = sb_dqopt(dquot->dq_sb);
> 
>   mutex_lock(&dquot->dq_lock);
> - if (!test_bit(DQ_READ_B, &dquot->dq_flags))
> + if (!test_bit(DQ_READ_B, &dquot->dq_flags)) {
>   ret = dqopt->ops[dquot->dq_id.type]->read_dqblk(dquot);
> - if (ret < 0)
> - goto out_iolock;
> + if (ret < 0)
> + goto out_iolock;
> + }
>   /* Make sure flags update is visible after dquot has been filled */
>   smp_mb__before_atomic();
>   set_bit(DQ_READ_B, &dquot->dq_flags);
> --
> 2.20.1
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in fanotify_handle_event

2019-04-24 Thread Jan Kara
On Tue 23-04-19 21:05:22, Amir Goldstein wrote:
> On Tue, Apr 23, 2019 at 7:27 PM Jan Kara  wrote:
> >
> > On Fri 19-04-19 12:33:02, Amir Goldstein wrote:
> > > On Thu, Apr 18, 2019 at 8:14 PM syzbot
> > >  wrote:
> > > >
> > > > syzbot has bisected this bug to:
> > > >
> > > > commit 77115225acc67d9ac4b15f04dd138006b9cd1ef2
> > > > Author: Amir Goldstein 
> > > > Date:   Thu Jan 10 17:04:37 2019 +
> > > >
> > > >  fanotify: cache fsid in fsnotify_mark_connector
> > > >
> > > > bisection log:  
> > > > https://syzkaller.appspot.com/x/bisect.txt?x=1627632d20
> > > > start commit:   3f018f4a Add linux-next specific files for 20190418
> > > > git tree:   linux-next
> > > > final crash:
> > > > https://syzkaller.appspot.com/x/report.txt?x=1527632d20
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=1127632d20
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=faa7bdc352fc157e
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=15927486a4f1bfcbaf91
> > > > syz repro:  
> > > > https://syzkaller.appspot.com/x/repro.syz?x=155543d320
> > > >
> > > > Reported-by: syzbot+15927486a4f1bfcba...@syzkaller.appspotmail.com
> > > > Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector")
> > > >
> >
> 
> [...]
> 
> >
> > I'd prefer if we could make only fully initialized marks visible on
> > connector's list. Just to make things simple in the fast path of handling
> > events. So I'd just set mark->connector before adding mark to connector's
> > list and having smp_wmb() there to force ordering in
> > fsnotify_add_mark_list(). And we should use READ_ONCE() as a counter-part
> > to this in fanotify_get_fsid(). It is somewhat unfortunate that there are
> > three different ways how fsnotify_add_mark_list() can add mark to a list so
> > we may need to either repeat this in three different places or just use
> > some macro magic like:
> >
> > #define fanotify_attach_obj_list(where, mark, conn, head) \
> > do { \
> > mark->connector = conn; \
> > smp_wmb(); \
> > hlist_add_##where##_rcu(&mark->obj_list, head); \
> > } while (0)
> >
> > And I would not really worry about fsnotify_put_mark() - if it ever sees
> > mark->connector set, mark really is on the connector's list and
> > fsnotify_put_mark() does the right thing (i.e. locks connector->lock if
> > needed).
> >
> 
> Sounds good to me.
> 
> I was worried about fsnotify_put_mark() NOT seeing mark->connector set
> even though it is on the list (add mark; get lockless ref; remove
> mark; put lockless ref).
> Its subtle, not sure if possible.

Not possible for the cases where it matters. fsnotify_put_mark() just
decrements the refcount. If someone is adding a mark to the list, he is by
definition holding one reference, so a racing fsnotify_put_mark() cannot be
dropping the last one, and so the result of the check doesn't really matter.

> I am assuming you will be sending the above patch to syzbot for testing.
> If you want me do help with anything please let me know.

OK, I'll write it and send it.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: general protection fault in fanotify_handle_event

2019-04-23 Thread Jan Kara
On Fri 19-04-19 12:33:02, Amir Goldstein wrote:
> On Thu, Apr 18, 2019 at 8:14 PM syzbot
>  wrote:
> >
> > syzbot has bisected this bug to:
> >
> > commit 77115225acc67d9ac4b15f04dd138006b9cd1ef2
> > Author: Amir Goldstein 
> > Date:   Thu Jan 10 17:04:37 2019 +
> >
> >  fanotify: cache fsid in fsnotify_mark_connector
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1627632d20
> > start commit:   3f018f4a Add linux-next specific files for 20190418
> > git tree:   linux-next
> > final crash:https://syzkaller.appspot.com/x/report.txt?x=1527632d20
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1127632d20
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=faa7bdc352fc157e
> > dashboard link: https://syzkaller.appspot.com/bug?extid=15927486a4f1bfcbaf91
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=155543d320
> >
> > Reported-by: syzbot+15927486a4f1bfcba...@syzkaller.appspotmail.com
> > Fixes: 77115225acc6 ("fanotify: cache fsid in fsnotify_mark_connector")
> >

Hi Amir!

> It looks like lockless access to mark->connector is not safe as there is
> nothing preventing a reader from seeing a mark on object list without
> seeing the mark->connector assignment.

Yes, that seems to be possible.

> It made me wonder if (!mark->connector) check in fsnotify_put_mark() is safe.
> I couldn't find any call site where that would be a problem, but perhaps
> we should be more careful?

Why? That check is there really only to catch destruction of a mark that
was never attached (i.e., allocated but later never used due to some
error). Otherwise we should always have mark->connector set.

> Anyway, it seems that fsnotify_put_mark() uses the non NULL mark->connector
> as the indication that mark is on object list, so just assigning 
> mark->connector
> before adding to object list won't do.
> 
> Since a reference of mark is our guaranty that mark->connector is not
> going away, I guess we could do opportunistic test for non NULL
> mark->connector from lockless path, if that fails, we check again
> with mark->lock held and if that fails something went wrong.
> 
> Another option is to teach fsnotify_first_mark() and fsnotify_next_mark()
> to skip over marks with NULL mark->connector.
> 
> What do you think? Did I over complicate this?

I'd prefer if we could make only fully initialized marks visible on
connector's list. Just to make things simple in the fast path of handling
events. So I'd just set mark->connector before adding mark to connector's
list and having smp_wmb() there to force ordering in
fsnotify_add_mark_list(). And we should use READ_ONCE() as a counter-part
to this in fanotify_get_fsid(). It is somewhat unfortunate that there are
three different ways how fsnotify_add_mark_list() can add mark to a list so
we may need to either repeat this in three different places or just use
some macro magic like:

#define fanotify_attach_obj_list(where, mark, conn, head) \
do { \
mark->connector = conn; \
smp_wmb(); \
hlist_add_##where##_rcu(&mark->obj_list, head); \
} while (0)

And I would not really worry about fsnotify_put_mark() - if it ever sees
mark->connector set, mark really is on the connector's list and
fsnotify_put_mark() does the right thing (i.e. locks connector->lock if
needed).

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 4/7] block: introduce write-hint to stream-id conversion

2019-04-18 Thread Jan Kara
On Wed 17-04-19 23:20:03, Kanchan Joshi wrote:
> This patch moves write-hint-to-stream-id conversion in block-layer.
> Earlier this was done by driver (nvme). Current conversion is of the
> form "streamid = write-hint - 1", for both user and kernel streams.
> Conversion takes stream limit (maintained in request queue) into
> account. Write-hints beyond the exposed limit turn to 0.
> A new field 'streamid' has been added in request. While 'write-hint'
> field continues to exist. It keeps original value passed from upper
> layer, and used during merging checks.
> 
> Signed-off-by: Kanchan Joshi 
> ---
>  block/blk-core.c   | 20 
>  include/linux/blkdev.h |  1 +
>  2 files changed, 21 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index a55389b..712e6b7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -730,6 +730,25 @@ bool blk_attempt_plug_merge(struct request_queue *q, 
> struct bio *bio,
>   return false;
>  }
>  
> +enum rw_hint blk_write_hint_to_streamid(struct request *req,
> + struct bio *bio)
> +{
> + enum rw_hint streamid, nr_streams;
> + struct request_queue *q = req->q;
> + nr_streams = q->limits.nr_streams;
> +
> + streamid = bio->bi_write_hint;
> + if (!nr_streams || streamid == WRITE_LIFE_NOT_SET ||
> + streamid == WRITE_LIFE_NONE)
> + streamid = 0;
> + else {
> + --streamid;
> + if(streamid > nr_streams)
> + streamid = 0;
> + }
> + return streamid;
> +}
> +

Someone told me that stream ids are potentially persistent on the storage,
so it isn't great to change the id for the same thing, e.g. if we add
another user hint. So maybe we should allocate kernel hints from the top, as
Dave Chinner suggested? I.e., something like the following mapping function:

	if (streamid <= BLK_MAX_USER_HINTS) {
		streamid--;
		if (streamid > nr_streams)
			streamid = 0;
	} else {
		/* Kernel hints get allocated from top */
		streamid -= WRITE_LIFE_KERN_MIN;
		if (streamid > nr_streams - BLK_MAX_USER_HINTS)
			streamid = 0;
		else
			streamid = nr_streams - streamid - 1;
	}

what do you think?

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: ext4: avoid drop reference to iloc.bh twice

2019-04-18 Thread Jan Kara
On Thu 18-04-19 16:31:18, Pan Bian wrote:
> The reference to iloc.bh has been dropped in ext4_mark_iloc_dirty.
> However, the reference is dropped again if error occurs during
> ext4_handle_dirty_metadata, which may result in use-after-free bugs.
> 
> Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM
> cases")
> 
> Signed-off-by: Pan Bian 

Thanks for the patch! Good spotting. You can add:

Reviewed-by: Jan Kara 

I'm just wondering: Ted, shouldn't we make ext4_mark_iloc_dirty() clear the
iloc.bh pointer on its own? I believe this is not the first time we had a
bug like this, and it would also catch possible use-after-free issues in
case someone tried to use iloc.bh after the reference has been dropped.

Generally the scheme around ext4_get_inode_loc() and
ext4_mark_iloc_dirty() seems to be somewhat error prone. E.g. a quick audit
shows that there's a bh leak in ext4_orphan_del() in one of the error paths.
I'll send patches.

Honza

> ---
>  fs/ext4/resize.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index e7ae26e..4d5c0fc 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -874,6 +874,7 @@ static int add_new_gdb(handle_t *handle, struct inode 
> *inode,
>   err = ext4_handle_dirty_metadata(handle, NULL, gdb_bh);
>   if (unlikely(err)) {
>   ext4_std_error(sb, err);
> + iloc.bh = NULL;
>   goto errout;
>   }
>   brelse(dind);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 49/62] dax: make use of ->free_inode()

2019-04-18 Thread Jan Kara
On Tue 16-04-19 18:53:27, Al Viro wrote:
> From: Al Viro 
> 
> we might want to drop ->destroy_inode() there - it's used only for
> WARN_ON() now, and AFAICS that could be moved to ->evict_inode()
> if we had one...
> 
> Signed-off-by: Al Viro 
> ---
>  drivers/dax/super.c | 7 ++-
>  1 file changed, 2 insertions(+), 5 deletions(-)

Added Dan to CC since I'm not sure how closely he follows fsdevel. The
patch looks good to me FWIW so feel free to add:

Reviewed-by: Jan Kara 

Honza


> 
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 0a339b85133e..bbd57ca0634a 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -412,11 +412,9 @@ static struct dax_device *to_dax_dev(struct inode *inode)
>   return container_of(inode, struct dax_device, inode);
>  }
>  
> -static void dax_i_callback(struct rcu_head *head)
> +static void dax_free_inode(struct inode *inode)
>  {
> - struct inode *inode = container_of(head, struct inode, i_rcu);
>   struct dax_device *dax_dev = to_dax_dev(inode);
> -
>   kfree(dax_dev->host);
>   dax_dev->host = NULL;
>   if (inode->i_rdev)
> @@ -427,16 +425,15 @@ static void dax_i_callback(struct rcu_head *head)
>  static void dax_destroy_inode(struct inode *inode)
>  {
>   struct dax_device *dax_dev = to_dax_dev(inode);
> -
>   WARN_ONCE(test_bit(DAXDEV_ALIVE, &dax_dev->flags),
>   "kill_dax() must be called before final iput()\n");
> - call_rcu(&inode->i_rcu, dax_i_callback);
>  }
>  
>  static const struct super_operations dax_sops = {
>   .statfs = simple_statfs,
>   .alloc_inode = dax_alloc_inode,
>   .destroy_inode = dax_destroy_inode,
> + .free_inode = dax_free_inode,
>   .drop_inode = generic_delete_inode,
>  };
>  
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 54/62] ext4: make use of ->free_inode()

2019-04-18 Thread Jan Kara
On Tue 16-04-19 18:53:32, Al Viro wrote:
> From: Al Viro 
> 
> the rest of this ->destroy_inode() instance could probably be folded
> into ext4_evict_inode()
> 
> Signed-off-by: Al Viro 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

You're right about the possibility of moving the check to
ext4_evict_inode() (probably ext4_clear_inode() would be the best). But we
can leave that for later.

Honza

> ---
>  fs/ext4/super.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6ed4eb81e674..981f702848e7 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1107,9 +1107,8 @@ static int ext4_drop_inode(struct inode *inode)
>   return drop;
>  }
>  
> -static void ext4_i_callback(struct rcu_head *head)
> +static void ext4_free_in_core_inode(struct inode *inode)
>  {
> - struct inode *inode = container_of(head, struct inode, i_rcu);
>   kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
>  }
>  
> @@ -1124,7 +1123,6 @@ static void ext4_destroy_inode(struct inode *inode)
>   true);
>   dump_stack();
>   }
> - call_rcu(&inode->i_rcu, ext4_i_callback);
>  }
>  
>  static void init_once(void *foo)
> @@ -1402,6 +1400,7 @@ static const struct quotactl_ops ext4_qctl_operations = 
> {
>  
>  static const struct super_operations ext4_sops = {
>   .alloc_inode= ext4_alloc_inode,
> + .free_inode = ext4_free_in_core_inode,
>   .destroy_inode  = ext4_destroy_inode,
>   .write_inode= ext4_write_inode,
>   .dirty_inode= ext4_dirty_inode,
> -- 
> 2.11.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] ext4: fix use-after-free race with debug_want_extra_isize

2019-04-18 Thread Jan Kara
On Mon 15-04-19 17:19:45, Barret Rhoden wrote:
> When remounting with debug_want_extra_isize, we were not performing the
> same checks that we do during a normal mount.  That allowed us to set a
> value for s_want_extra_isize that reached outside the s_inode_size.
> 
> Reported-by: syzbot+f584efa0ac7213c22...@syzkaller.appspotmail.com
> Signed-off-by: Barret Rhoden 
> Cc: sta...@vger.kernel.org

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> 
> - In the current code, it looks like someone could mount with want_extra_isize
>   with some value > 0 but less than the minimums in the s_es.  If that's a 
> bug,
>   I can submit a follow-on patch.
> 
> - Similarly, on a failed remount, sbi->s_want_extra_isize is changed to the
>   remounted value.  I can fix that too if it's a problem.
> 
> - Is it OK to remount with a smaller s_want_extra_isize than the previous 
> mount?
>   I thought it was, but figured I'd ask while I'm looking at it.
> 
> 
>  fs/ext4/super.c | 58 +
>  1 file changed, 34 insertions(+), 24 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 6ed4eb81e674..184944d4d8d1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3513,6 +3513,37 @@ int ext4_calculate_overhead(struct super_block *sb)
>   return 0;
>  }
>  
> +static void ext4_clamp_want_extra_isize(struct super_block *sb)
> +{
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> + struct ext4_super_block *es = sbi->s_es;
> +
> + /* determine the minimum size of new large inodes, if present */
> + if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
> + sbi->s_want_extra_isize == 0) {
> + sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> +  EXT4_GOOD_OLD_INODE_SIZE;
> + if (ext4_has_feature_extra_isize(sb)) {
> + if (sbi->s_want_extra_isize <
> + le16_to_cpu(es->s_want_extra_isize))
> + sbi->s_want_extra_isize =
> + le16_to_cpu(es->s_want_extra_isize);
> + if (sbi->s_want_extra_isize <
> + le16_to_cpu(es->s_min_extra_isize))
> + sbi->s_want_extra_isize =
> + le16_to_cpu(es->s_min_extra_isize);
> + }
> + }
> + /* Check if enough inode space is available */
> + if (EXT4_GOOD_OLD_INODE_SIZE + sbi->s_want_extra_isize >
> + sbi->s_inode_size) {
> + sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> +EXT4_GOOD_OLD_INODE_SIZE;
> + ext4_msg(sb, KERN_INFO,
> +  "required extra inode space not available");
> + }
> +}
> +
>  static void ext4_set_resv_clusters(struct super_block *sb)
>  {
>   ext4_fsblk_t resv_clusters;
> @@ -4387,30 +4418,7 @@ static int ext4_fill_super(struct super_block *sb, 
> void *data, int silent)
>   } else if (ret)
>   goto failed_mount4a;
>  
> - /* determine the minimum size of new large inodes, if present */
> - if (sbi->s_inode_size > EXT4_GOOD_OLD_INODE_SIZE &&
> - sbi->s_want_extra_isize == 0) {
> - sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> -  EXT4_GOOD_OLD_INODE_SIZE;
> - if (ext4_has_feature_extra_isize(sb)) {
> - if (sbi->s_want_extra_isize <
> - le16_to_cpu(es->s_want_extra_isize))
> - sbi->s_want_extra_isize =
> - le16_to_cpu(es->s_want_extra_isize);
> - if (sbi->s_want_extra_isize <
> - le16_to_cpu(es->s_min_extra_isize))
> - sbi->s_want_extra_isize =
> - le16_to_cpu(es->s_min_extra_isize);
> - }
> - }
> - /* Check if enough inode space is available */
> - if (EXT4_GOOD_OLD_INODE_SIZE + sbi->s_want_extra_isize >
> - sbi->s_inode_size) {
> - sbi->s_want_extra_isize = sizeof(struct ext4_inode) -
> -EXT4_GOOD_OLD_INODE_SIZE;
> - ex

Re: [PATCH] ext4: add cond_resched() to ext4_mb_init_backend()

2019-04-18 Thread Jan Kara
On Mon 15-04-19 19:59:34, Khazhismel Kumykov wrote:
> On non-preempt kernels, for filesystems with a large number of groups, we
> may take a long time (>50 ticks) initializing all the groups.
> 
> Signed-off-by: Khazhismel Kumykov 

Thanks for the patch. I'm not opposed to this but I'm just wondering: Has
this caused any real issues for you? Or how did you come across this
particular loop? Because I'd think we have more similar loops in the ext4
codebase...

Honza

> ---
>  fs/ext4/mballoc.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8ef5f12bbee2..c89f497ccf50 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -2490,6 +2490,7 @@ static int ext4_mb_init_backend(struct super_block *sb)
>   sbi->s_buddy_cache->i_ino = EXT4_BAD_INO;
>   EXT4_I(sbi->s_buddy_cache)->i_disksize = 0;
>   for (i = 0; i < ngroups; i++) {
> + cond_resched();
>   desc = ext4_get_group_desc(sb, i, NULL);
>   if (desc == NULL) {
>   ext4_msg(sb, KERN_ERR, "can't read descriptor %u", i);
> -- 
> 2.21.0.392.gf8f6787159e-goog
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: WARNING in notify_change

2019-04-18 Thread Jan Kara
On Tue 16-04-19 00:54:28, Al Viro wrote:
> On Mon, Apr 15, 2019 at 04:20:17PM -0700, Khazhismel Kumykov wrote:
> > I was able to reproduce this by setting security.capability xattr on a
> > blockdev file, then writing to it - when writing to the blockdev we
> > never lock the inode, so when we clear the capability we hit this
> > lockdep warning.
> > 
> > Is the issue here that we can set this xattr in the first place so we
> > have to clear it at all? Or should we really be locking the inode for
> > blockdevs after all? I'm not too familiar, but my gut says former
> 
> More interesting question is, WTF do we even touch that thing for
> bdev?  The thing is, mknod will cheerfully create any number of
> different filesystem objects, all giving access to the same block
> device.  Which of them should have that xattr removed?  It makes
> no sense whatsoever; moreover, who *cares* about caps for block
> device in the first place?
> 
> And if we did, what of another way to modify the block device?
> You know, mount it read-write...

Yes, Alexander Lochman has sent a patch to silence this warning back in
February [1] by just bailing out from file_remove_privs() for non-regular
files.  But so far you've ignored that patch... Will you pick it up please?

Honza

[1] 
https://lore.kernel.org/lkml/cbdc8071-de76-bb0a-6890-15ef21023...@tu-dortmund.de

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] udf: fix an uninitialized read bug and remove dead code

2019-04-17 Thread Jan Kara
On Mon 15-04-19 12:58:06, Wenwen Wang wrote:
> In udf_lookup(), the pointer 'fi' is a local variable initialized by the
> return value of the function call udf_find_entry(). However, if the macro
> 'UDF_RECOVERY' is defined, this variable will become uninitialized if the
> else branch is not taken, which can potentially cause incorrect results in
> the following execution.
> 
> To fix this issue, this patch drops the whole code in the ifdef
> 'UDF_RECOVERY' region, as it is dead code.
> 
> Signed-off-by: Wenwen Wang 

Thanks for the patch. I've added it to my tree.

Honza

> ---
>  fs/udf/namei.c | 15 ---
>  1 file changed, 15 deletions(-)
> 
> diff --git a/fs/udf/namei.c b/fs/udf/namei.c
> index 58cc241..77b6d89 100644
> --- a/fs/udf/namei.c
> +++ b/fs/udf/namei.c
> @@ -304,21 +304,6 @@ static struct dentry *udf_lookup(struct inode *dir, struct dentry *dentry,
>   if (dentry->d_name.len > UDF_NAME_LEN)
>   return ERR_PTR(-ENAMETOOLONG);
>  
> -#ifdef UDF_RECOVERY
> - /* temporary shorthand for specifying files by inode number */
> - if (!strncmp(dentry->d_name.name, ".B=", 3)) {
> - struct kernel_lb_addr lb = {
> - .logicalBlockNum = 0,
> - .partitionReferenceNum =
> - simple_strtoul(dentry->d_name.name + 3,
> - NULL, 0),
> - };
> - inode = udf_iget(dir->i_sb, lb);
> - if (IS_ERR(inode))
> - return inode;
> - } else
> -#endif /* UDF_RECOVERY */
> -
>   fi = udf_find_entry(dir, &dentry->d_name, &fibh, &cfi);
>   if (IS_ERR(fi))
>   return ERR_CAST(fi);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] udf: fix an uninitialized read bug

2019-04-15 Thread Jan Kara
On Mon 15-04-19 10:26:24, Wenwen Wang wrote:
> In udf_lookup(), the pointer 'fi' is a local variable initialized by the
> return value of the function call udf_find_entry(). However, if the macro
> 'UDF_RECOVERY' is defined, this variable will become uninitialized if the
> else branch is not taken, which can potentially cause incorrect results in
> the following execution.
> 
> This patch simply initializes this local pointer to NULL.
> 
> Signed-off-by: Wenwen Wang 

Thanks for the patch! A better fix is to drop the whole UDF_RECOVERY ifdef
and what's in it. It is just dead code anyway.

Honza

> ---
>  fs/udf/namei.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/udf/namei.c b/fs/udf/namei.c
> index 58cc241..9d499e1 100644
> --- a/fs/udf/namei.c
> +++ b/fs/udf/namei.c
> @@ -299,7 +299,7 @@ static struct dentry *udf_lookup(struct inode *dir, struct dentry *dentry,
>   struct inode *inode = NULL;
>   struct fileIdentDesc cfi;
>   struct udf_fileident_bh fibh;
> - struct fileIdentDesc *fi;
> + struct fileIdentDesc *fi = NULL;
>  
>   if (dentry->d_name.len > UDF_NAME_LEN)
>   return ERR_PTR(-ENAMETOOLONG);
> -- 
> 2.7.4
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fanotify: Make wait for permission events interruptible

2019-04-15 Thread Jan Kara
On Thu 21-03-19 16:11:42, Jan Kara wrote:
> Switch waiting for response to fanotify permission events interruptible.
> This allows e.g. the system to be suspended while there are some
> fanotify permission events pending (which is reportedly pretty common
> when for example AV solution is in use). However just making the wait
> interruptible can result in e.g. open(2) returning -EINTR where
> previously such error code never happened in practice. To avoid
> confusion of userspace due to this error code, return -ERESTARTNOINTR
> instead.
> 
> Signed-off-by: Jan Kara 
> ---
>  fs/notify/fanotify/fanotify.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> Orion, can you give this patch some testing with your usecase? Also if anybody
> sees any issue with returning -ERESTARTNOINTR I have missed, please speak up.

Ping Orion? Did you have any chance to give this patch a try? Does it fix
hibernation issues you observe without causing issues with bash and other
programs? I'd like to queue this patch for the coming merge window but
I'd like to see some testing results showing that it actually helps
anything... Thanks!

Honza

> 
> diff --git a/fs/notify/fanotify/fanotify.c b/fs/notify/fanotify/fanotify.c
> index 6b9c27548997..eb790853844b 100644
> --- a/fs/notify/fanotify/fanotify.c
> +++ b/fs/notify/fanotify/fanotify.c
> @@ -92,10 +92,17 @@ static int fanotify_get_response(struct fsnotify_group *group,
>  
>   pr_debug("%s: group=%p event=%p\n", __func__, group, event);
>  
> - ret = wait_event_killable(group->fanotify_data.access_waitq,
> -   event->state == FAN_EVENT_ANSWERED);
> + ret = wait_event_interruptible(group->fanotify_data.access_waitq,
> +event->state == FAN_EVENT_ANSWERED);
>   /* Signal pending? */
>   if (ret < 0) {
> + /*
> +  * Force restarting a syscall so that this is mostly invisible
> +  * for userspace which is not prepared for handling EINTR e.g.
> +  * from open(2).
> +  */
> + if (ret == -ERESTARTSYS)
> + ret = -ERESTARTNOINTR;
>   spin_lock(&group->notification_lock);
>   /* Event reported to userspace and no answer yet? */
>   if (event->state == FAN_EVENT_REPORTED) {
> -- 
> 2.16.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] quota: remove redundant variable assignment

2019-04-15 Thread Jan Kara
On Sun 14-04-19 21:58:55, Chengguang Xu wrote:
> The assignment of the variable ret is redundant because the
> value of ret is 0 after calling v2_read_header() in the normal
> case.
> 
> Signed-off-by: Chengguang Xu 

Thanks for the patch. Yes, you are correct that the assignment is redundant,
but this way it is more obvious what we're going to return, and there is also
less chance we'll break the logic when adding some code using 'ret' in the
middle of v2_read_file_info().

Honza

> ---
>  fs/quota/quota_v2.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/fs/quota/quota_v2.c b/fs/quota/quota_v2.c
> index a73e5b34db41..25d36607be6e 100644
> --- a/fs/quota/quota_v2.c
> +++ b/fs/quota/quota_v2.c
> @@ -158,7 +158,6 @@ static int v2_read_file_info(struct super_block *sb, int type)
>   qinfo->dqi_entry_size = sizeof(struct v2r1_disk_dqblk);
>   qinfo->dqi_ops = &v2r1_qtree_ops;
>   }
> - ret = 0;
>  out:
>   up_read(&dqopt->dqio_sem);
>   return ret;
> --
> 2.17.2
> 
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/reiserfs/journal.c: Make remove_journal_hash static

2019-04-15 Thread Jan Kara
On Sun 14-04-19 16:54:38, Bharath Vedartham wrote:
> This fixes the -Wdecl sparse warning in journal.c. The function was
> forward-declared as static void but its definition omitted the static
> qualifier.
> 
> Signed-off-by: Bharath Vedartham 

Thanks! I've added your patch into my tree.

Honza

> ---
>  fs/reiserfs/journal.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/reiserfs/journal.c b/fs/reiserfs/journal.c
> index 8a76f9d..36346dc 100644
> --- a/fs/reiserfs/journal.c
> +++ b/fs/reiserfs/journal.c
> @@ -1844,7 +1844,7 @@ static int flush_used_journal_lists(struct super_block *s,
>   * removes any nodes in table with name block and dev as bh.
>   * only touchs the hnext and hprev pointers.
>   */
> -void remove_journal_hash(struct super_block *sb,
> +static void remove_journal_hash(struct super_block *sb,
>struct reiserfs_journal_cnode **table,
>struct reiserfs_journal_list *jl,
>        unsigned long block, int remove_freed)
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 1/6] libnvdimm: nd_region flush callback support

2019-04-12 Thread Jan Kara
On Thu 11-04-19 07:51:48, Dan Williams wrote:
> On Tue, Apr 9, 2019 at 9:09 PM Pankaj Gupta  wrote:
> > +   } else {
> > +   if (nd_region->flush(nd_region))
> > +   rc = -EIO;
> 
> Given the common case wants to be fast and synchronous I think we
> should try to avoid retpoline overhead by default. So something like
> this:
> 
> if (nd_region->flush == generic_nvdimm_flush)
> rc = generic_nvdimm_flush(...);

I'd either add a comment about avoiding retpoline overhead here or just
make ->flush == NULL mean generic_nvdimm_flush(). Just so that people don't
get confused by the code.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

2019-04-11 Thread Jan Kara
On Wed 10-04-19 13:42:23, Amir Goldstein wrote:
> On Wed, Apr 10, 2019 at 12:10 PM Jan Kara  wrote:
> >
> > On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> > > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara  wrote:
> > > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use 
> > > > > WB_SYNC_NONE
> > > > > writeback") claims that sync_file_range(2) syscall was "created for
> > > > > userspace to be able to issue background writeout and so waiting for
> > > > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > > > WB_SYNC_NONE.
> > > > >
> > > > > This claim is only partially true. It is true for users that use the
> > > > > flag
> > > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > > > the reason for changing to WB_SYNC_NONE writeback.
> > > > >
> > > > > However, that claim is not true for users that use that flag 
> > > > > combination
> > > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users 
> > > > > explicitly
> > > > > requested to wait for in-flight IO as well as for writeback of dirty
> > > > > pages.  The sync_file_range(2) man page describes this flag combination as
> > > > > "write-for-data-integrity operation", although some may argue against
> > > > > this definition.
> > > > >
> > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and 
> > > > > use
> > > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > > > writeback, to perform the range sync request.
> > > > >
> > > > > One intended use case for this API is creating a dependency between
> > > > > a new file's content and its link into the namespace without requiring
> > > > > journal commit and flushing of disk volatile caches:
> > > > >
> > > > >   fd = open(".", O_TMPFILE | O_RDWR);
> > > > >   write(fd, newdata, count);
> > > > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > > > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > > > >
> > > > > For many local filesystems, ext4 and xfs included, the sequence above
> > > > > will guarantee that after a crash, either 'newfile' exists with 'newdata'
> > > > > content or 'newfile' does not exist.  For some applications, this
> > > > > guarantee is strong enough and the effects of sync_file_range(2), even
> > > > > with WB_SYNC_ALL, are far less intrusive to other writers in the 
> > > > > system
> > > > > than the effects of fdatasync(2).
> > > >
> > > > I agree that this paragraph is true but I don't want any userspace 
> > > > program
> > > > rely on this. We've been through this when ext3 got converted to ext4 
> > > > and
> > > > it has caused a lot of complaints. Ext3 had provided rather strong data 
> > > > vs
> > > > metadata ordering due to the way journalling was implemented. Then ext4
> > > > came, implemented delayed allocation and somewhat changed how 
> > > > journalling
> > > > works and suddenly userspace programmers were caught by surprise when their
> > > > code
> > > > working by luck stopped working. And they were complaining that when 
> > > > their
> > > > code worked without fsync(2) before, it should work after it as well. 
> > > > So it
> > > > took a lot of explaining that their applications are broken and Ted has
> > > > also implemented some workarounds to make at least the most common 
> > > > mistakes
> > > > silently fixed by the kernel most of the time.
> > > >
> > > > So I'm against providing 90% data integrity guarantees because there'll 
> > > > be
> > > > many people who'll think they can get away with it and then complain 
> > > > when
> > > > they won't once our implementation changes.
> > > >
> > > > Rather I'd modify the manpage to not say that 
> > > > SYNC_FILE_RANGE_WAIT_BEFORE
> > > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
&

Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

2019-04-10 Thread Jan Kara
On Tue 09-04-19 21:01:54, Amir Goldstein wrote:
> On Tue, Apr 9, 2019 at 6:43 PM Jan Kara  wrote:
> > On Tue 09-04-19 14:49:22, Amir Goldstein wrote:
> > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE
> > > writeback") claims that sync_file_range(2) syscall was "created for
> > > userspace to be able to issue background writeout and so waiting for
> > > in-flight IO is undesirable there" and changes the writeback (back) to
> > > WB_SYNC_NONE.
> > >
> > > This claim is only partially true. It is true for users that use the flag
> > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was
> > > the reason for changing to WB_SYNC_NONE writeback.
> > >
> > > However, that claim is not true for users that use that flag combination
> > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}.  Those users explicitly
> > > requested to wait for in-flight IO as well as for writeback of dirty
> > > pages.  The sync_file_range(2) man page describes this flag combination as
> > > "write-for-data-integrity operation", although some may argue against
> > > this definition.
> > >
> > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use
> > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL
> > > writeback, to perform the range sync request.
> > >
> > > One intended use case for this API is creating a dependency between
> > > a new file's content and its link into the namespace without requiring
> > > journal commit and flushing of disk volatile caches:
> > >
> > >   fd = open(".", O_TMPFILE | O_RDWR);
> > >   write(fd, newdata, count);
> > >   sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0);
> > >   linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile);
> > >
> > > For many local filesystems, ext4 and xfs included, the sequence above
> > > will guarantee that after a crash, either 'newfile' exists with 'newdata'
> > > content or 'newfile' does not exist.  For some applications, this
> > > guarantee is strong enough and the effects of sync_file_range(2), even
> > > with WB_SYNC_ALL, are far less intrusive to other writers in the system
> > > than the effects of fdatasync(2).
> >
> > I agree that this paragraph is true but I don't want any userspace program
> > rely on this. We've been through this when ext3 got converted to ext4 and
> > it has caused a lot of complaints. Ext3 had provided rather strong data vs
> > metadata ordering due to the way journalling was implemented. Then ext4
> > came, implemented delayed allocation and somewhat changed how journalling
> > works and suddenly userspace programmers were caught by surprise when their code
> > working by luck stopped working. And they were complaining that when their
> > code worked without fsync(2) before, it should work after it as well. So it
> > took a lot of explaining that their applications are broken and Ted has
> > also implemented some workarounds to make at least the most common mistakes
> > silently fixed by the kernel most of the time.
> >
> > So I'm against providing 90% data integrity guarantees because there'll be
> > many people who'll think they can get away with it and then complain when
> > they won't once our implementation changes.
> >
> > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE
> > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a
> > write-for-data-integrity.
> 
> OK. I do agree that manpage is misleading and that the language describing
> this flag combination should be toned down. I do not object to adding more
> and more disclaimers about this API not being a data integrity API.

I don't think we need more disclaimers, I'd just reword that
"write-for-data-integrity" bit.

> But the requirement I have is a real-life workload, and fdatasync(2) is not at
> all a viable option when the application is creating many files per second.
> 
> I need to export the functionality of data writeback to userspace.
> Objecting to exposing this functionality via the interface that has been
> documented to expose it for years, and which did expose it in the
> past (until the Fixes commit), is a bit weird IMO, even though I do
> completely relate to your concern.
> 
> I don't mind removing the "intended use case" part of the commit message
> if that helps reduce the chance that people misunderstand
> the guarantees of the API.

Re: [PATCH] fs/buffer.c: Fix data corruption when buffer write with IO error

2019-04-09 Thread Jan Kara
On Tue 09-04-19 21:11:22, zhangxiaoxu (A) wrote:
> 
> 
> On 4/8/2019 7:11 PM, Jan Kara wrote:
> > On Sat 06-04-19 15:13:13, ZhangXiaoxu wrote:
> > > When a buffer write fails, 'end_buffer_write_sync' and
> > > 'end_buffer_async_write' will clear the uptodate flag. But the
> > > data in the buffer may be newer than what is on disk. In some
> > > cases, this can lead to data corruption.
> > > 
> > > For example: if ext4 fails to flush metadata to disk, it will clear
> > > the uptodate flag. When a later caller wants the buffer, it will
> > > read it from disk (because the buffer has no uptodate flag). But
> > > the journal has not been checkpointed yet, so it will read old
> > > data from disk. If the read succeeds, ext4 will write the old data
> > > to the new journal, corrupting the data.
> > > 
> > > So, don't clear the uptodate flag when a buffer write fails.
> > > 
> > > Signed-off-by: ZhangXiaoxu
> > Thanks for the patch. But what are the chances that after the write has
> > failed the read will succeed? Also there were places that were using
> > buffer_uptodate() to detect IO errors. Did you check all those got
> > converted to using buffer_write_io_error() instead?
> > 
> > Honza
> > 
> I encountered this situation when using an iSCSI target as the ext4 volume:
> if the network goes down for a while when ext4 writes metadata to disk, a
> read may succeed after the network has recovered.

OK, but then you are running in errors=continue mode, aren't you? Because
otherwise the filesystem would get remounted read-only or panic the system
on the first metadata IO error.

> In my testcase, which just creates and deletes files, the directory entry
> may get corrupted when the network is disconnected. I added some logs when
> reading the entry buffer from disk; the buffer state may be
> buffer_write_io_error() and !buffer_uptodate(). In this case ext4 gets
> corrupted with high probability, because the buffer is read from disk
> while some newer information exists only in the journal. In this case, we
> should not read the buffer from disk.

Well, this never worked reliably (e.g. the buffer may get evicted due to
memory pressure after IO error) and never was promised to work. Once you
hit the first metadata error, all bets are off, you have to unmount the
filesystem and run fsck. Sure the damage would be likely smaller if we kept
the buffer uptodate in your case but it isn't guaranteed the filesystem
will not get corrupted. But as I said, changing this is not just a matter
of deleting those two clear_buffer_uptodate() calls...

> I don't know any other filesystem how to use the uptodate flag. But I think
> uptodate means the buffer is valid and newer than disk, am I right?

Yes, uptodate means the buffer is valid and newer than disk. However after
IO error, it isn't clear what the actual disk state is. But generally I'm
not opposed to the change you suggest, but it means you first have to go
and check all filesystems using buffer heads and verify that none of them
uses the buffer_uptodate() check to detect buffer write failure.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback

2019-04-09 Thread Jan Kara
---
>  include/uapi/linux/fs.h |  3 +++
>  2 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/sync.c b/fs/sync.c
> index b54e0541ad89..5cf6fdbae4de 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -18,8 +18,8 @@
>  #include 
>  #include "internal.h"
>  
> -#define VALID_FLAGS (SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE| \
> - SYNC_FILE_RANGE_WAIT_AFTER)
> +#define VALID_FLAGS (SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WRITE_AND_WAIT | \
> +  SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WAIT_AFTER)
>  
>  /*
>   * Do the filesystem syncing work. For simple filesystems
> @@ -235,9 +235,9 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>  }
>  
>  /*
> - * sys_sync_file_range() permits finely controlled syncing over a segment of
> + * ksys_sync_file_range() permits finely controlled syncing over a segment of
>   * a file in the range offset .. (offset+nbytes-1) inclusive.  If nbytes is
> - * zero then sys_sync_file_range() will operate from offset out to EOF.
> + * zero then ksys_sync_file_range() will operate from offset out to EOF.
>   *
>   * The flag bits are:
>   *
> @@ -254,7 +254,7 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * Useful combinations of the flag bits are:
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE: ensures that all pages
> - * in the range which were dirty on entry to sys_sync_file_range() are placed
> + * in the range which were dirty on entry to ksys_sync_file_range() are placed
>   * under writeout.  This is a start-write-for-data-integrity operation.
>   *
>   * SYNC_FILE_RANGE_WRITE: start writeout of all dirty pages in the range which
> @@ -266,10 +266,13 @@ SYSCALL_DEFINE1(fdatasync, unsigned int, fd)
>   * earlier SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE operation to wait
>   * for that operation to complete and to return the result.
>   *
> - * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER:
> + * SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER
> + * (a.k.a. SYNC_FILE_RANGE_WRITE_AND_WAIT):
>   * a traditional sync() operation.  This is a write-for-data-integrity operation
>   * which will ensure that all pages in the range which were dirty on entry to
> - * sys_sync_file_range() are committed to disk.
> + * ksys_sync_file_range() are written to disk.  It should be noted that disk
> + * caches are not flushed by this call, so there are no guarantees here that 
> the
> + * data will be available on disk after a crash.
>   *
>   *
>   * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
> @@ -338,6 +341,14 @@ int ksys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
>  
>   mapping = f.file->f_mapping;
>   ret = 0;
> + if ((flags & SYNC_FILE_RANGE_WRITE_AND_WAIT) ==
> +  SYNC_FILE_RANGE_WRITE_AND_WAIT) {
> + /* Unlike SYNC_FILE_RANGE_WRITE alone uses WB_SYNC_ALL */
> + ret = filemap_write_and_wait_range(mapping, offset, endbyte);
> + if (ret < 0)
> + goto out_put;
> + }
> +
>   if (flags & SYNC_FILE_RANGE_WAIT_BEFORE) {
>   ret = file_fdatawait_range(f.file, offset, endbyte);
>   if (ret < 0)
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index 121e82ce296b..59c71fa8c553 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -320,6 +320,9 @@ struct fscrypt_key {
>  #define SYNC_FILE_RANGE_WAIT_BEFORE  1
>  #define SYNC_FILE_RANGE_WRITE2
>  #define SYNC_FILE_RANGE_WAIT_AFTER   4
> +#define SYNC_FILE_RANGE_WRITE_AND_WAIT   (SYNC_FILE_RANGE_WRITE | \
> +  SYNC_FILE_RANGE_WAIT_BEFORE | \
> +  SYNC_FILE_RANGE_WAIT_AFTER)
>  
>  /*
>   * Flags for preadv2/pwritev2:
> -- 
> 2.17.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] fs/buffer.c: Fix data corruption when buffer write with IO error

2019-04-08 Thread Jan Kara
On Sat 06-04-19 15:13:13, ZhangXiaoxu wrote:
> When a buffer write fails, 'end_buffer_write_sync' and
> 'end_buffer_async_write' will clear the uptodate flag. But the
> data in the buffer may be newer than what is on disk. In some
> cases, this can lead to data corruption.
> 
> For example: if ext4 fails to flush metadata to disk, it will clear
> the uptodate flag. When a later caller wants the buffer, it will
> read it from disk (because the buffer has no uptodate flag). But
> the journal has not been checkpointed yet, so it will read old
> data from disk. If the read succeeds, ext4 will write the old data
> to the new journal, corrupting the data.
> 
> So, don't clear the uptodate flag when a buffer write fails.
> 
> Signed-off-by: ZhangXiaoxu 

Thanks for the patch. But what are the chances that after the write has
failed the read will succeed? Also there were places that were using
buffer_uptodate() to detect IO errors. Did you check all those got
converted to using buffer_write_io_error() instead?

Honza

> ---
>  fs/buffer.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ce35760..9fe1827 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -172,7 +172,6 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
>   } else {
>   buffer_io_error(bh, ", lost sync page write");
>   mark_buffer_write_io_error(bh);
> - clear_buffer_uptodate(bh);
>   }
>   unlock_buffer(bh);
>   put_bh(bh);
> @@ -325,7 +324,6 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
>   } else {
>   buffer_io_error(bh, ", lost async page write");
>       mark_buffer_write_io_error(bh);
> - clear_buffer_uptodate(bh);
>   SetPageError(page);
>   }
>  
> -- 
> 2.7.4
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 4/5] ext4: disable map_sync for async flush

2019-04-03 Thread Jan Kara
On Wed 03-04-19 16:10:17, Pankaj Gupta wrote:
> Virtio pmem provides an asynchronous host page cache flush
> mechanism. We don't support 'MAP_SYNC' with virtio pmem
> and ext4.
> 
> Signed-off-by: Pankaj Gupta 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/ext4/file.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 69d65d49837b..86e4bf464320 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -360,8 +360,10 @@ static const struct vm_operations_struct ext4_file_vm_ops = {
>  static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  {
>   struct inode *inode = file->f_mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + struct dax_device *dax_dev = sbi->s_daxdev;
>  
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb
> + if (unlikely(ext4_forced_shutdown(sbi)))
>   return -EIO;
>  
>   /*
> @@ -371,6 +373,13 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>   if (!IS_DAX(file_inode(file)) && (vma->vm_flags & VM_SYNC))
>   return -EOPNOTSUPP;
>  
> + /* We don't support synchronous mappings with DAX files if
> +  * dax_device is not synchronous.
> +  */
> + if (IS_DAX(file_inode(file)) && !dax_synchronous(dax_dev)
> + && (vma->vm_flags & VM_SYNC))
> + return -EOPNOTSUPP;
> +
>   file_accessed(file);
>   if (IS_DAX(file_inode(file))) {
>   vma->vm_ops = &ext4_dax_vm_ops;
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: Possible UDF locking error?

2019-04-03 Thread Jan Kara
Hi,

On Sat 30-03-19 14:49:46, Steve Magnani wrote:
> On 3/25/19 11:42 AM, Jan Kara wrote:
> > Hi!
> > 
> > On Sat 23-03-19 15:14:05, Steve Magnani wrote:
> > > I have been hunting a UDF bug that occasionally results in generation
> > > of an Allocation Extent Descriptor with an incorrect tagLocation. So
> > > far I haven't been able to see a path through the code that could
> > > cause that. But, I noticed some inconsistency in locking during
> > > AED generation and wonder if it could result in random corruption.
> > > 
> > > The function udf_update_inode() has this general pattern:
> > > 
> > >bh = udf_tgetblk(...);   // calls sb_getblk()
> > >lock_buffer(bh);
> > >memset(bh->b_data, 0, inode->i_sb->s_blocksize);
> > >// other code to populate FE/EFE data in the block
> > >set_buffer_uptodate(bh);
> > >unlock_buffer(bh);
> > >mark_buffer_dirty(bh);
> > > 
> > > This I can understand - the lock is held for as long as the buffer
> > > contents are being assembled.
> > > 
> > > In contrast, udf_setup_indirect_aext(), which constructs an AED,
> > > has this sequence:
> > > 
> > >bh = udf_tgetblk(...);   // calls sb_getblk()
> > >lock_buffer(bh);
> > >memset(bh->b_data, 0, inode->i_sb->s_blocksize);
> > > 
> > >set_buffer_uptodate(bh);
> > >unlock_buffer(bh);
> > >mark_buffer_dirty_inode(bh);
> > > 
> > >// other code to populate AED data in the block
> > > 
> > > In this case the population of the block occurs without
> > > the protection of the lock.
> > > 
> > > Because the block has been marked dirty, does this mean that
> > > writeback could occur at any point during population?
> > Yes. Thanks for noticing this!
> > 
> > > There is one path through udf_setup_indirect_aext() where
> > > mark_buffer_dirty_inode() gets called again after population is
> > > complete, which I suppose could heal a partial writeout, but there is
> > > also another path in which the buffer does not get marked dirty again.
> > Generally, we add new extents to the created indirect extent which dirties
> > the buffer and that should fix the problem. But you are definitely right
> > that the code is suspicious and should be fixed. Will you send a patch?
> 
> I did a little archaeology to see how the code evolved to this point. It's
> been like this a long time.
> 
> I also did some research to understand why filesystems use lock_buffer()
> sometimes but not others. For example, the FAT driver never calls it. I ran
> across this thread from 2011:
> 
>https://lkml.org/lkml/2011/5/16/402
> 
> ...from which I conclude that while it is correct in a strict sense to hold
> a lock on a buffer any time its contents are being modified, performance
> considerations make it preferable (or at least reasonable) to make some
> modifications without a lock provided it's known that a subsequent write-out
> will "fix" any potential partial write out before anyone else tries to read
> the block.

Understood, but neither UDF nor FAT is really that performance critical.
If you look for performance, you'd certainly pick a different filesystem.
UDF is mainly for data interchange, so it should work reasonably for copy-in
copy-out style workloads; the rest isn't that important. So there,
correctness and simplicity are preferred over performance.

> I doubt that UDF sees common use with DIF/DIX block devices,
> which might make a decision in favor of performance a little easier. Since
> the FAT driver doesn't contain Darrick's proposed changes I assume a
> decision was made that performance was more important there.
> 
> Certainly the call to udf_setup_indirect_aext() from udf_add_aext() meets
> those criteria. But udf_table_free_blocks() may not dirty the AED block.
> 
> So if this looks reasonable I will resend as a formal patch:
> 
> --- a/fs/udf/inode.c  2019-03-30 11:28:38.637759458 -0500
> +++ b/fs/udf/inode.c  2019-03-30 11:33:00.357761250 -0500
> @@ -1873,9 +1873,6 @@ int udf_setup_indirect_aext(struct inode
>   return -EIO;
>   lock_buffer(bh);
>   memset(bh->b_data, 0x00, sb->s_blocksize);
> - set_buffer_uptodate(bh);
> - unlock_buffer(bh);
> - mark_buffer_dirty_inode(bh, inode);
>   aed = (struct allocExtDesc *)(bh->b_data);
>   if (!UDF_QUERY_FLAG(sb, UDF_FLAG_STRICT)) {
> @@ -1890,6 +1887,9 @@ int udf_setup_indirect_aext(struct inode
>

Re: ext3 file system livelock and file system corruption, 4.9.166 stable kernel

2019-04-02 Thread Jan Kara
On Tue 02-04-19 12:35:07, Greg Kroah-Hartman wrote:
> On Tue, Apr 02, 2019 at 01:08:45PM +0300, Jari Ruusu wrote:
> > To trigger this ext4 file system bug, you need a sparse file with
> > correct sparse pattern on old-school ext3 file system. I tried
> > more simpler ways to trigger this but those attempts did not
> > trigger the bug. I have provided compressed sparse file that
> > reliably triggers the bug. Size of compressed sparse file 1667256
> > bytes. Size of uncompressed sparse file 7369850880 bytes.
> > Following commands will demo the problem.
> > 
> >   wget http://www.elisanet.fi/jariruusu/123/sparse-demo.data.xz
> >   xz -d sparse-demo.data.xz
> >   mkfs -t ext3 -b 4096 -e remount-ro -O "^dir_index" /dev/sdc1
> >   mount -t ext3 /dev/sdc1 /mnt
> >   cp -v --sparse=always sparse-demo.data /mnt/aa
> >   cp -v --sparse=always sparse-demo.data /mnt/bb
> >   umount /mnt
> >   mount -t ext3 /dev/sdc1 /mnt
> >   cp -v --sparse=always /mnt/bb /mnt/aa
> > 
> > That last cp command reliably triggers the bug: it livelocks, and
> > after a reset you have file system corruption to deal with. Deeply
> > unfunny.
> > 
> > The bug is caused by
> > "ext4: brelse all indirect buffer in ext4_ind_remove_space()"
> > upstream commit 674a2b27234d1b7afcb0a9162e81b2e53aeef217, whose author
> > provided a follow-up patch
> > "ext4: cleanup bh release code in ext4_ind_remove_space()"
> > upstream commit 5e86bdda41534e17621d5a071b294943cae4376e. The
> > problem with that follow-up patch is that it is almost criminally
> > mislabeled. It should have said "fixes ext3 livelock and file
> > system corrupting bug" or something like that, so that Greg KH &
> > Co would have understood that it must be backported to stable
> > kernels too. Now the bug appears to be in all/most stable kernels
> > already.
> > 
> > Below is the buggy patch that causes the problem. Look at those new
> > while loops: nothing in the loop bodies changes the condition, so
> > once it is true it stays true forever, and the kernel livelocks.
> > 
> > > --- a/fs/ext4/indirect.c
> > > +++ b/fs/ext4/indirect.c
> > > @@ -1385,10 +1385,14 @@ end_range:
> > >  partial->p + 1,
> > >  partial2->p,
> > >  (chain+n-1) - partial);
> > > - BUFFER_TRACE(partial->bh, "call brelse");
> > > - brelse(partial->bh);
> > > - BUFFER_TRACE(partial2->bh, "call brelse");
> > > - brelse(partial2->bh);
> > > + while (partial > chain) {
> > > + BUFFER_TRACE(partial->bh, "call brelse");
> > > + brelse(partial->bh);
> > > + }
> > > + while (partial2 > chain2) {
> > > + BUFFER_TRACE(partial2->bh, "call brelse");
> > > + brelse(partial2->bh);
> > > + }
> > >   return 0;
> > >   }
> > >
> > 
> > Greg & Co,
> > Please revert that above patch from stable kernels or backport the
> > follow-up patch that fixes the problem.
> 
> So you need 5e86bdda4153 ("ext4: cleanup bh release code in
> ext4_ind_remove_space()") applied to all of the stable and LTS kernels
> at the moment (as that patch only showed up in 5.1-rc1)?
> 
> If so, I need an ack from the ext4 developers/maintainer to do so.

Ack from me, and sorry for missing this brown paper bag bug during
review...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: linux-next: Fixes tag needs some work in the ext3 tree

2019-04-01 Thread Jan Kara
Hi Stephen

On Sat 30-03-19 16:19:39, Stephen Rothwell wrote:
> In commit
> 
>   47d92aa5d33a ("quota: fix a problem about transfer quota")
> 
> Fixes tag
> 
>   Fixes: 7b9ca4c61("quota: Reduce contention on dq_data_lock")
> 
> has these problem(s):
> 
>   - missing space between the SHA1 and the subject
>   - SHA1 should be at least 12 digits long
> Can be fixed by setting core.abbrev to 12 (or more) or (for git v2.11
> or later) just making sure it is not set (or set to "auto").

Thanks for the report. I didn't notice these problems when merging the
patch; I've fixed them up now.

    Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] jbd2: do not start commit when t_updates does not back to zero

2019-03-29 Thread Jan Kara
On Sun 24-03-19 11:38:35, Liu Song wrote:
> When t_updates drops back to zero, processes waiting on
> j_wait_updates are guaranteed to be woken up. If we trigger a commit
> start without checking t_updates, the commit thread wakes up, finds
> that t_updates is not zero, and has to wait on it once again. So add a
> check to avoid this.
> 
> Signed-off-by: Liu Song 

Do I understand correctly that this is a performance improvement? If yes,
did you measure any benefit from the patch? I have some doubts that the
t_updates == 0 case is very common.

Honza

> ---
>  fs/jbd2/transaction.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index 79a028a7a579..e0499fd73b1e 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -144,12 +144,13 @@ static void wait_transaction_locked(journal_t *journal)
>   __releases(journal->j_state_lock)
>  {
>   DEFINE_WAIT(wait);
> - int need_to_start;
> + int need_to_start = 0;
>   tid_t tid = journal->j_running_transaction->t_tid;
>  
>   prepare_to_wait(&journal->j_wait_transaction_locked, &wait,
>   TASK_UNINTERRUPTIBLE);
> - need_to_start = !tid_geq(journal->j_commit_request, tid);
> + if (!atomic_read(&journal->j_running_transaction->t_updates))
> + need_to_start = !tid_geq(journal->j_commit_request, tid);
>   read_unlock(&journal->j_state_lock);
>   if (need_to_start)
>   jbd2_log_start_commit(journal, tid);
> -- 
> 2.19.1
> 
> 
-- 
Jan Kara 
SUSE Labs, CR

