Re: [PATCH v3 2/2] fs/xattr: add *at family syscalls

2024-04-30 Thread Jan Kara
On Fri 26-04-24 18:20:14, Christian Göttsche wrote:
> From: Christian Göttsche 
> 
> Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
> removexattrat().  Those can be used to operate on extended attributes,
> especially security related ones, either relative to a pinned directory
> or on a file descriptor without read access, avoiding a
> /proc/<pid>/fd/ detour, which requires a mounted procfs.
> 
> One use case will be setfiles(8) setting SELinux file contexts
> ("security.selinux") without race conditions and without a file
> descriptor opened with read access requiring SELinux read permission.
> 
> Use the do_{name}at() pattern from fs/open.c.
> 
> Pass the value of the extended attribute, its length, and for
> setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added
> struct xattr_args, to avoid exceeding six syscall arguments and to avoid
> merging the AT_* and XATTR_* flags.
> 
> Signed-off-by: Christian Göttsche 

The patch looks good to me. Just a few nits below:
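
For reference, the calling convention described above would look roughly like
this from userspace (a sketch only; the struct layout and the __NR_setxattrat
number are assumptions based on the patch description, not necessarily the
final uapi):

	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/syscall.h>
	#include <linux/types.h>
	#include <linux/xattr.h>	/* XATTR_CREATE / XATTR_REPLACE */

	/*
	 * Hypothetical mirror of the proposed uapi struct; may conflict
	 * with <linux/xattr.h> on kernels that already define it.
	 */
	struct xattr_args {
		__u64 value;		/* pointer to the value buffer */
		__u32 size;		/* length of the value buffer */
		__u32 flags;		/* XATTR_CREATE / XATTR_REPLACE / 0 */
	};

	/* Set an SELinux label relative to a pinned directory fd. */
	static long set_selinux_context(int dirfd, const char *path,
					const char *context)
	{
		struct xattr_args args = {
			.value = (__u64)(uintptr_t)context,
			.size  = strlen(context),
			.flags = 0,
		};

		/* The AT_* flags (third argument) stay separate from args.flags. */
		return syscall(__NR_setxattrat, dirfd, path, 0,
			       "security.selinux", &args, sizeof(args));
	}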

> -static int path_setxattr(const char __user *pathname,
> +static int do_setxattrat(int dfd, const char __user *pathname, unsigned int at_flags,

Can we please stay within 80 columns (happens in multiple places in the
patch)? I don't insist but it makes things easier to read in some setups so
I prefer it.

> @@ -852,13 +908,21 @@ listxattr(struct dentry *d, char __user *list, size_t size)
>   return error;
>  }
>  
> -static ssize_t path_listxattr(const char __user *pathname, char __user *list,
> -   size_t size, unsigned int lookup_flags)
> +static ssize_t do_listxattrat(int dfd, const char __user *pathname, char __user *list,
> +   size_t size, int flags)

So I like how in previous syscalls you have 'at_flags', 'lookup_flags', and
'xattr_flags'. That makes things much easier to digest. Can you please stay
with that convention here as well and call this argument 'at_flags'? Also I
think the argument ordering like "dfd, pathname, at_flags, list, size" is
more consistent with other syscalls you define.
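
Concretely, something like (just a sketch of the suggested shape):

	static ssize_t do_listxattrat(int dfd, const char __user *pathname,
				      unsigned int at_flags,
				      char __user *list, size_t size);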

> @@ -870,16 +934,22 @@ static ssize_t path_listxattr(const char __user *pathname, char __user *list,
>   return error;
>  }
>  
> +SYSCALL_DEFINE5(listxattrat, int, dfd, const char __user *, pathname, char __user *, list,
> + size_t, size, int, flags)
> +{
> + return do_listxattrat(dfd, pathname, list, size, flags);
> +}
> +

Same comment as above - "flags" -> "at_flags" and reorder args please.

> @@ -917,13 +987,21 @@ removexattr(struct mnt_idmap *idmap, struct dentry *d,
>   return vfs_removexattr(idmap, d, kname);
>  }
>  
> -static int path_removexattr(const char __user *pathname,
> - const char __user *name, unsigned int lookup_flags)
> +static int do_removexattrat(int dfd, const char __user *pathname,
> + const char __user *name, int flags)
>  {

Same comment as above - "flags" -> "at_flags" and reorder args please.

> @@ -939,16 +1017,22 @@ static int path_removexattr(const char __user *pathname,
>   return error;
>  }
>  
> +SYSCALL_DEFINE4(removexattrat, int, dfd, const char __user *, pathname,
> + const char __user *, name, int, flags)
> +{

Same comment as above - "flags" -> "at_flags" and reorder args please.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations

2023-12-07 Thread Jan Kara
On Tue 05-12-23 21:22:59, Yury Norov wrote:
> On Mon, Dec 04, 2023 at 07:51:01PM +0100, Jan Kara wrote:
> > > This series is a result of discussion [1]. All find_bit() functions imply
> > > exclusive access to the bitmaps. However, KCSAN reports quite a number
> > > of warnings related to find_bit() API. Some of them are not pointing
> > > to real bugs because in many situations people intentionally allow
> > > concurrent bitmap operations.
> > > 
> > > If so, find_bit() can be annotated such that KCSAN will ignore it:
> > > 
> > > bit = data_race(find_first_bit(bitmap, nbits));
> > 
> > No, this is not a correct thing to do. If concurrent bitmap changes can
> > happen, find_first_bit() as it is currently implemented isn't ever a safe
> > choice because it can call __ffs(0), which is dangerous as you properly note
> > above. I proposed adding READ_ONCE() into the find_first_bit() / find_next_bit()
> > implementation to fix this issue but you disliked that. So the other option we
> > have is adding find_first_bit() and find_next_bit() variants that take a
> > volatile 'addr', and we have to use these in code like xas_find_chunk()
> > which cannot be converted to your new helpers.
> 
> Here are some examples where concurrent operations with plain find_bit()
> are acceptable:
> 
>  - two threads running find_*_bit(): safe wrt ffs(0) and returns correct
>value, because underlying bitmap is unchanged;
>  - find_next_bit() in parallel with set or clear_bit(), when modifying
>a bit prior to the start bit to search: safe and correct;
>  - find_first_bit() in parallel with set_bit(): safe, but may return wrong
>bit number;
>  - find_first_zero_bit() in parallel with clear_bit(): same as above.
> 
> In the last 2 cases find_bit() may not return a correct bit number, but
> it may be OK if the caller requires any (not exactly the first) set or clear
> bit, respectively.
> 
> In such cases, KCSAN may be safely silenced.

True - but these are special cases. In particular the case in xas_find_chunk()
is not any of these special cases. It is using find_next_bit(), which can
be racing with clear_bit(). So what are your plans for such a use case?
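
To spell out the hazard for that case: the generic find_first_bit() /
find_next_bit() read each bitmap word with a plain load, so the compiler is
free to re-read the word between the non-zero test and the __ffs() call.
A sketch of the problematic shape (illustrative, not the exact
lib/find_bit.c code):

	for (idx = 0; idx * BITS_PER_LONG < nbits; idx++) {
		if (addr[idx])				/* load #1: sees a set bit */
			return min(idx * BITS_PER_LONG +
				   __ffs(addr[idx]),	/* load #2: may reload 0 */
				   nbits);
	}

If a concurrent clear_bit() clears the last set bit between the two loads,
__ffs() gets called with 0, which is undefined.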

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 00/35] bitops: add atomic find_bit() operations

2023-12-04 Thread Jan Kara
Hello Yury!

On Sun 03-12-23 11:23:47, Yury Norov wrote:
> Add helpers around test_and_{set,clear}_bit() that allow to search for
> clear or set bits and flip them atomically.
> 
> The target patterns may look like this:
> 
>   for (idx = 0; idx < nbits; idx++)
>   if (test_and_clear_bit(idx, bitmap))
>   do_something(idx);
> 
> Or like this:
> 
>   do {
>   bit = find_first_bit(bitmap, nbits);
>   if (bit >= nbits)
>   return nbits;
>   } while (!test_and_clear_bit(bit, bitmap));
>   return bit;
> 
> In both cases, the opencoded loop may be converted to a single function
> or iterator call. Correspondingly:
> 
>   for_each_test_and_clear_bit(idx, bitmap, nbits)
>   do_something(idx);
> 
> Or:
>   return find_and_clear_bit(bitmap, nbits);

These are fine cleanups but they actually don't address the case that has
triggered all these changes - namely the xarray use of find_next_bit() in
xas_find_chunk().

...
> This series is a result of discussion [1]. All find_bit() functions imply
> exclusive access to the bitmaps. However, KCSAN reports quite a number
> of warnings related to find_bit() API. Some of them are not pointing
> to real bugs because in many situations people intentionally allow
> concurrent bitmap operations.
> 
> If so, find_bit() can be annotated such that KCSAN will ignore it:
> 
> bit = data_race(find_first_bit(bitmap, nbits));

No, this is not a correct thing to do. If concurrent bitmap changes can
happen, find_first_bit() as it is currently implemented isn't ever a safe
choice because it can call __ffs(0), which is dangerous as you properly note
above. I proposed adding READ_ONCE() into the find_first_bit() / find_next_bit()
implementation to fix this issue but you disliked that. So the other option we
have is adding find_first_bit() and find_next_bit() variants that take a
volatile 'addr', and we have to use these in code like xas_find_chunk()
which cannot be converted to your new helpers.
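
For concreteness, the READ_ONCE() variant referred to above could look
roughly like this (a sketch, not the actual lib/find_bit.c implementation):

	/*
	 * Each word is loaded exactly once, so __ffs() can never see a
	 * different value than the one that passed the non-zero test.
	 */
	static unsigned long find_next_bit_once(const unsigned long *addr,
						unsigned long nbits,
						unsigned long start)
	{
		unsigned long idx = start / BITS_PER_LONG;
		unsigned long val;

		if (start >= nbits)
			return nbits;

		/* Mask off bits below 'start' in the first word. */
		val = READ_ONCE(addr[idx]) & (~0UL << (start % BITS_PER_LONG));
		while (!val) {
			if (++idx * BITS_PER_LONG >= nbits)
				return nbits;
			val = READ_ONCE(addr[idx]);
		}
		return min(idx * BITS_PER_LONG + __ffs(val), nbits);
	}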

> This series addresses the other important case where people really need
> atomic find ops. As the following patches show, the resulting code
> looks safer and more verbose comparing to opencoded loops followed by
> atomic bit flips.
> 
> In [1] Mirsad reported 2% slowdown in a single-thread search test when
> switching find_bit() function to treat bitmaps as volatile arrays. On
> the other hand, kernel robot in the same thread reported +3.7% to the
> performance of will-it-scale.per_thread_ops test.

It was actually me who reported the regression here [2] but whatever :)

[2] https://lore.kernel.org/all/20231011150252.32737-1-j...@suse.cz

> Assuming that our compilers are sane and generate better code against
> properly annotated data, the above discrepancy doesn't look weird. When
> running on non-volatile bitmaps, plain find_bit() outperforms atomic
> find_and_bit(), and vice-versa.
> 
> So, all users of find_bit() API, where heavy concurrency is expected,
> are encouraged to switch to atomic find_and_bit() as appropriate.

Well, all users where any concurrency can happen should switch. Otherwise
they are prone to the (admittedly mostly theoretical) data race issue.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 4/4] eventfd: make eventfd_signal{_mask}() void

2023-11-22 Thread Jan Kara
On Wed 22-11-23 13:48:25, Christian Brauner wrote:
> No caller cares about the return value.
> 
> Signed-off-by: Christian Brauner 

Yup. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/eventfd.c| 40 +++-
>  include/linux/eventfd.h | 16 +++-
>  2 files changed, 22 insertions(+), 34 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index a9a6de920fb4..13be2fb7fc96 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -43,10 +43,19 @@ struct eventfd_ctx {
>   int id;
>  };
>  
> -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
> +/**
> + * eventfd_signal - Adds @n to the eventfd counter.
> + * @ctx: [in] Pointer to the eventfd context.
> + * @mask: [in] poll mask
> + *
> + * This function is supposed to be called by the kernel in paths that do not
> + * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
> + * value, and we signal this as overflow condition by returning a EPOLLERR
> + * to poll(2).
> + */
> +void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
>  {
>   unsigned long flags;
> - __u64 n = 1;
>  
>   /*
>* Deadlock or stack overflow issues can happen if we recurse here
> @@ -57,37 +66,18 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
>* safe context.
>*/
>   if (WARN_ON_ONCE(current->in_eventfd))
> - return 0;
> + return;
>  
>   spin_lock_irqsave(&ctx->wqh.lock, flags);
>   current->in_eventfd = 1;
> - if (ULLONG_MAX - ctx->count < n)
> - n = ULLONG_MAX - ctx->count;
> - ctx->count += n;
> + if (ctx->count < ULLONG_MAX)
> + ctx->count++;
>   if (waitqueue_active(&ctx->wqh))
>   wake_up_locked_poll(&ctx->wqh, EPOLLIN | mask);
>   current->in_eventfd = 0;
>   spin_unlock_irqrestore(&ctx->wqh.lock, flags);
> -
> - return n == 1;
> -}
> -
> -/**
> - * eventfd_signal - Adds @n to the eventfd counter.
> - * @ctx: [in] Pointer to the eventfd context.
> - *
> - * This function is supposed to be called by the kernel in paths that do not
> - * allow sleeping. In this function we allow the counter to reach the ULLONG_MAX
> - * value, and we signal this as overflow condition by returning a EPOLLERR
> - * to poll(2).
> - *
> - * Returns the amount by which the counter was incremented.
> - */
> -__u64 eventfd_signal(struct eventfd_ctx *ctx)
> -{
> - return eventfd_signal_mask(ctx, 0);
>  }
> -EXPORT_SYMBOL_GPL(eventfd_signal);
> +EXPORT_SYMBOL_GPL(eventfd_signal_mask);
>  
>  static void eventfd_free_ctx(struct eventfd_ctx *ctx)
>  {
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 4f8aac7eb62a..fea7c4eb01d6 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -35,8 +35,7 @@ void eventfd_ctx_put(struct eventfd_ctx *ctx);
>  struct file *eventfd_fget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fdget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
> -__u64 eventfd_signal(struct eventfd_ctx *ctx);
> -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
> +void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
>  int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
> __u64 *cnt);
>  void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
> @@ -58,14 +57,8 @@ static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
>   return ERR_PTR(-ENOSYS);
>  }
>  
> -static inline int eventfd_signal(struct eventfd_ctx *ctx)
> +static inline void eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
>  {
> - return -ENOSYS;
> -}
> -
> -static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
> -{
> - return -ENOSYS;
>  }
>  
>  static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
> @@ -91,5 +84,10 @@ static inline void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt)
>  
>  #endif
>  
> +static inline void eventfd_signal(struct eventfd_ctx *ctx)
> +{
> + eventfd_signal_mask(ctx, 0);
> +}
> +
>  #endif /* _LINUX_EVENTFD_H */
>  
> 
> -- 
> 2.42.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 3/4] eventfd: simplify eventfd_signal_mask()

2023-11-22 Thread Jan Kara
On Wed 22-11-23 13:48:24, Christian Brauner wrote:
> The eventfd_signal_mask() helper was introduced for io_uring and similar
> to eventfd_signal() it always passed 1 for @n. So don't bother with that
> argument at all.
> 
> Signed-off-by: Christian Brauner 

Nice. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/eventfd.c| 7 ---
>  include/linux/eventfd.h | 5 ++---
>  io_uring/io_uring.c | 4 ++--
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index dc9e01053235..a9a6de920fb4 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -43,9 +43,10 @@ struct eventfd_ctx {
>   int id;
>  };
>  
> -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
> +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
>  {
>   unsigned long flags;
> + __u64 n = 1;
>  
>   /*
>* Deadlock or stack overflow issues can happen if we recurse here
> @@ -68,7 +69,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
>   current->in_eventfd = 0;
>   spin_unlock_irqrestore(&ctx->wqh.lock, flags);
>  
> - return n;
> + return n == 1;
>  }
>  
>  /**
> @@ -84,7 +85,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
>   */
>  __u64 eventfd_signal(struct eventfd_ctx *ctx)
>  {
> - return eventfd_signal_mask(ctx, 1, 0);
> + return eventfd_signal_mask(ctx, 0);
>  }
>  EXPORT_SYMBOL_GPL(eventfd_signal);
>  
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index 562089431551..4f8aac7eb62a 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -36,7 +36,7 @@ struct file *eventfd_fget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fdget(int fd);
>  struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
>  __u64 eventfd_signal(struct eventfd_ctx *ctx);
> -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask);
> +__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
>  int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t *wait,
> __u64 *cnt);
>  void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
> @@ -63,8 +63,7 @@ static inline int eventfd_signal(struct eventfd_ctx *ctx)
>   return -ENOSYS;
>  }
>  
> -static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n,
> -   unsigned mask)
> +static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
>  {
>   return -ENOSYS;
>  }
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index ed254076c723..70170a41eac4 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -558,7 +558,7 @@ static void io_eventfd_ops(struct rcu_head *rcu)
>   int ops = atomic_xchg(&ev_fd->ops, 0);
>  
>   if (ops & BIT(IO_EVENTFD_OP_SIGNAL_BIT))
> - eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
> + eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
>  
>   /* IO_EVENTFD_OP_FREE_BIT may not be set here depending on callback
>* ordering in a race but if references are 0 we know we have to free
> @@ -594,7 +594,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
>   goto out;
>  
>   if (likely(eventfd_signal_allowed())) {
> - eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
> + eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
>   } else {
>   atomic_inc(&ev_fd->refs);
>   if (!atomic_fetch_or(BIT(IO_EVENTFD_OP_SIGNAL_BIT), &ev_fd->ops))
> 
> -- 
> 2.42.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()

2023-11-22 Thread Jan Kara
On Wed 22-11-23 13:48:23, Christian Brauner wrote:
> Ever since the eventfd type was introduced back in 2007 in commit
> e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
> function only ever passed 1 as a value for @n. There's no point in
> keeping that additional argument.
> 
> Signed-off-by: Christian Brauner 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 
Honza

> ---
>  arch/x86/kvm/hyperv.c |  2 +-
>  arch/x86/kvm/xen.c|  2 +-
>  drivers/accel/habanalabs/common/device.c  |  2 +-
>  drivers/fpga/dfl.c|  2 +-
>  drivers/gpu/drm/drm_syncobj.c |  6 +++---
>  drivers/gpu/drm/i915/gvt/interrupt.c  |  2 +-
>  drivers/infiniband/hw/mlx5/devx.c |  2 +-
>  drivers/misc/ocxl/file.c  |  2 +-
>  drivers/s390/cio/vfio_ccw_chp.c   |  2 +-
>  drivers/s390/cio/vfio_ccw_drv.c   |  4 ++--
>  drivers/s390/cio/vfio_ccw_ops.c   |  6 +++---
>  drivers/s390/crypto/vfio_ap_ops.c |  2 +-
>  drivers/usb/gadget/function/f_fs.c|  4 ++--
>  drivers/vdpa/vdpa_user/vduse_dev.c|  6 +++---
>  drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c|  2 +-
>  drivers/vfio/pci/vfio_pci_core.c  |  6 +++---
>  drivers/vfio/pci/vfio_pci_intrs.c | 12 ++--
>  drivers/vfio/platform/vfio_platform_irq.c |  4 ++--
>  drivers/vhost/vdpa.c  |  4 ++--
>  drivers/vhost/vhost.c | 10 +-
>  drivers/vhost/vhost.h |  2 +-
>  drivers/virt/acrn/ioeventfd.c |  2 +-
>  drivers/xen/privcmd.c |  2 +-
>  fs/aio.c  |  2 +-
>  fs/eventfd.c  |  9 +++--
>  include/linux/eventfd.h   |  4 ++--
>  mm/memcontrol.c   | 10 +-
>  mm/vmpressure.c   |  2 +-
>  samples/vfio-mdev/mtty.c  |  4 ++--
>  virt/kvm/eventfd.c|  4 ++--
>  30 files changed, 60 insertions(+), 63 deletions(-)
> 
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index 238afd7335e4..4943f6b2bbee 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -2388,7 +2388,7 @@ static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, struct kvm_hv_hcall *h
>   if (!eventfd)
>   return HV_STATUS_INVALID_PORT_ID;
>  
> - eventfd_signal(eventfd, 1);
> + eventfd_signal(eventfd);
>   return HV_STATUS_SUCCESS;
>  }
>  
> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index e53fad915a62..523bb6df5ac9 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -2088,7 +2088,7 @@ static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu *vcpu, u64 param, u64 *r)
>   if (ret < 0 && ret != -ENOTCONN)
>   return false;
>   } else {
> - eventfd_signal(evtchnfd->deliver.eventfd.ctx, 1);
> + eventfd_signal(evtchnfd->deliver.eventfd.ctx);
>   }
>  
>   *r = 0;
> diff --git a/drivers/accel/habanalabs/common/device.c b/drivers/accel/habanalabs/common/device.c
> index 9711e8fc979d..3a89644f087c 100644
> --- a/drivers/accel/habanalabs/common/device.c
> +++ b/drivers/accel/habanalabs/common/device.c
> @@ -2044,7 +2044,7 @@ static void hl_notifier_event_send(struct hl_notifier_event *notifier_event, u64
>   notifier_event->events_mask |= event_mask;
>  
>   if (notifier_event->eventfd)
> - eventfd_signal(notifier_event->eventfd, 1);
> + eventfd_signal(notifier_event->eventfd);
>  
>   mutex_unlock(&notifier_event->lock);
>  }
> diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
> index dd7a783d53b5..e73f88050f08 100644
> --- a/drivers/fpga/dfl.c
> +++ b/drivers/fpga/dfl.c
> @@ -1872,7 +1872,7 @@ static irqreturn_t dfl_irq_handler(int irq, void *arg)
>  {
>   struct eventfd_ctx *trigger = arg;
>  
> - eventfd_signal(trigger, 1);
> + eventfd_signal(trigger);
>   return IRQ_HANDLED;
>  }
>  
> diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
> index 01da6789d044..b9cc62982196 100644
> --- a/drivers/gpu/drm/drm_syncobj.c
> +++ b/drivers/gpu/drm/drm_syncobj.c
> @@ -1365,7 +1365,7 @@ static void syncobj_eventfd_entry_fence_func(struct dma_fence *fence,
>   struct syncobj_eventfd_entry *entry =
>   container_of(cb, struct syncobj_eventfd_entry, fence_cb);
>  
> - eventfd_signal(entry->ev_fd_ctx, 1);
> + eventfd_signal(entry->ev_fd_ctx);

Re: [PATCH v2 1/4] i915: make inject_virtual_interrupt() void

2023-11-22 Thread Jan Kara
On Wed 22-11-23 13:48:22, Christian Brauner wrote:
> The single caller of inject_virtual_interrupt() ignores the return value
> anyway. This allows us to simplify eventfd_signal() in follow-up
> patches.
> 
> Signed-off-by: Christian Brauner 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/gpu/drm/i915/gvt/interrupt.c | 14 +++---
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gvt/interrupt.c b/drivers/gpu/drm/i915/gvt/interrupt.c
> index de3f5903d1a7..9665876b4b13 100644
> --- a/drivers/gpu/drm/i915/gvt/interrupt.c
> +++ b/drivers/gpu/drm/i915/gvt/interrupt.c
> @@ -422,7 +422,7 @@ static void init_irq_map(struct intel_gvt_irq *irq)
>  #define MSI_CAP_DATA(offset) (offset + 8)
>  #define MSI_CAP_EN 0x1
>  
> -static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
> +static void inject_virtual_interrupt(struct intel_vgpu *vgpu)
>  {
>   unsigned long offset = vgpu->gvt->device_info.msi_cap_offset;
>   u16 control, data;
> @@ -434,10 +434,10 @@ static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
>  
>   /* Do not generate MSI if MSIEN is disabled */
>   if (!(control & MSI_CAP_EN))
> - return 0;
> + return;
>  
>   if (WARN(control & GENMASK(15, 1), "only support one MSI format\n"))
> - return -EINVAL;
> + return;
>  
>   trace_inject_msi(vgpu->id, addr, data);
>  
> @@ -451,10 +451,10 @@ static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
>* returned and don't inject interrupt into guest.
>*/
>   if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
> - return -ESRCH;
> - if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger, 1) != 1)
> - return -EFAULT;
> - return 0;
> + return;
> + if (!vgpu->msi_trigger)
> + return;
> + eventfd_signal(vgpu->msi_trigger, 1);
>  }
>  
>  static void propagate_event(struct intel_gvt_irq *irq,
> 
> -- 
> 2.42.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 0/7] arch/*: config: Remove ReiserFS from defconfig

2023-09-20 Thread Jan Kara
On Tue 19-09-23 18:02:39, Geert Uytterhoeven wrote:
> Hi Peter,
> 
> On Tue, Sep 19, 2023 at 5:58 PM Peter Lafreniere  wrote:
> >  2) Stops building an obsolete and largely-unused filesystem unnecessarily.
> > Some hobbyist targets like m68k and alpha may prefer to keep all 
> > filesystems
> > available until total removal, but others like arm and UML have no need 
> > for
> > ReiserFS to be built unless specifically configured.
> 
> As UML is used a lot for testing, isn't it actually counter-productive
> to remove ReiserFS from the UML defconfig?  The less testing it
> receives, the higher the chance of introducing regressions.

The only testing I know about for reiserfs (besides build testing) is
syzbot. And regarding the people / bots doing filesystem testing I know
none of them uses UML. Rather it is x86 VMs these days where reiserfs is
disabled in the defconfig for a *long* time (many years). Also when you do
filesystem testing, you usually just test the few filesystems you care
about and for which you have all the tools installed. So frankly I don't
see a good reason to leave reiserfs enabled in defconfigs. But sure if
m68k or another arch wants to keep reiserfs in its defconfig for
consistency reasons, I'm fine with it. I just suspect that for most archs
this is just a historical reason.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 92/92] fs: rename i_ctime field to __i_ctime

2023-07-06 Thread Jan Kara
On Wed 05-07-23 14:58:12, Jeff Layton wrote:
> Now that everything in-tree is converted to use the accessor functions,
> rename the i_ctime field in the inode to discourage direct access.
> 
> Signed-off-by: Jeff Layton 

Looks good. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/fs.h | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 14e38bd900f1..b66442f91835 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -642,7 +642,7 @@ struct inode {
>   loff_t  i_size;
>   struct timespec64   i_atime;
>   struct timespec64   i_mtime;
> - struct timespec64   i_ctime;
> + struct timespec64   __i_ctime; /* use inode_*_ctime accessors! */
>   spinlock_t  i_lock; /* i_blocks, i_bytes, maybe i_size */
>   unsigned short  i_bytes;
>   u8  i_blkbits;
> @@ -1485,7 +1485,7 @@ struct timespec64 inode_set_ctime_current(struct inode *inode);
>   */
>  static inline struct timespec64 inode_get_ctime(const struct inode *inode)
>  {
> - return inode->i_ctime;
> + return inode->__i_ctime;
>  }
>  
>  /**
> @@ -1498,7 +1498,7 @@ static inline struct timespec64 inode_get_ctime(const struct inode *inode)
>  static inline struct timespec64 inode_set_ctime_to_ts(struct inode *inode,
> struct timespec64 ts)
>  {
> - inode->i_ctime = ts;
> + inode->__i_ctime = ts;
>   return ts;
>  }
>  
> -- 
> 2.41.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 08/92] fs: new helper: simple_rename_timestamp

2023-07-06 Thread Jan Kara
On Wed 05-07-23 14:58:11, Jeff Layton wrote:
> A rename potentially involves updating 4 different inode timestamps. Add
> a function that handles the details sanely, and convert the libfs.c
> callers to use it.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/libfs.c | 36 +++-
>  include/linux/fs.h |  2 ++
>  2 files changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a7e56baf8bbd..9ee79668c909 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -692,6 +692,31 @@ int simple_rmdir(struct inode *dir, struct dentry *dentry)
>  }
>  EXPORT_SYMBOL(simple_rmdir);
>  
> +/**
> + * simple_rename_timestamp - update the various inode timestamps for rename
> + * @old_dir: old parent directory
> + * @old_dentry: dentry that is being renamed
> + * @new_dir: new parent directory
> + * @new_dentry: target for rename
> + *
> + * POSIX mandates that the old and new parent directories have their ctime and
> + * mtime updated, and that inodes of @old_dentry and @new_dentry (if any), have
> + * their ctime updated.
> + */
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +  struct inode *new_dir, struct dentry *new_dentry)
> +{
> + struct inode *newino = d_inode(new_dentry);
> +
> + old_dir->i_mtime = inode_set_ctime_current(old_dir);
> + if (new_dir != old_dir)
> + new_dir->i_mtime = inode_set_ctime_current(new_dir);
> + inode_set_ctime_current(d_inode(old_dentry));
> + if (newino)
> + inode_set_ctime_current(newino);
> +}
> +EXPORT_SYMBOL_GPL(simple_rename_timestamp);
> +
>  int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>  struct inode *new_dir, struct dentry *new_dentry)
>  {
> @@ -707,11 +732,7 @@ int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
>   inc_nlink(old_dir);
>   }
>   }
> - old_dir->i_ctime = old_dir->i_mtime =
> - new_dir->i_ctime = new_dir->i_mtime =
> - d_inode(old_dentry)->i_ctime =
> - d_inode(new_dentry)->i_ctime = current_time(old_dir);
> -
> + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(simple_rename_exchange);
> @@ -720,7 +741,6 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
> struct dentry *old_dentry, struct inode *new_dir,
> struct dentry *new_dentry, unsigned int flags)
>  {
> - struct inode *inode = d_inode(old_dentry);
>   int they_are_dirs = d_is_dir(old_dentry);
>  
>   if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
> @@ -743,9 +763,7 @@ int simple_rename(struct mnt_idmap *idmap, struct inode *old_dir,
>   inc_nlink(new_dir);
>   }
>  
> - old_dir->i_ctime = old_dir->i_mtime = new_dir->i_ctime =
> - new_dir->i_mtime = inode->i_ctime = current_time(old_dir);
> -
> + simple_rename_timestamp(old_dir, old_dentry, new_dir, new_dentry);
>   return 0;
>  }
>  EXPORT_SYMBOL(simple_rename);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index bdfbd11a5811..14e38bd900f1 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2979,6 +2979,8 @@ extern int simple_open(struct inode *inode, struct file *file);
>  extern int simple_link(struct dentry *, struct inode *, struct dentry *);
>  extern int simple_unlink(struct inode *, struct dentry *);
>  extern int simple_rmdir(struct inode *, struct dentry *);
> +void simple_rename_timestamp(struct inode *old_dir, struct dentry *old_dentry,
> +          struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename_exchange(struct inode *old_dir, struct dentry *old_dentry,
> struct inode *new_dir, struct dentry *new_dentry);
>  extern int simple_rename(struct mnt_idmap *, struct inode *,
> -- 
> 2.41.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 01/79] fs: add ctime accessors infrastructure

2023-06-21 Thread Jan Kara
On Wed 21-06-23 10:45:06, Jeff Layton wrote:
> struct timespec64 has unused bits in the tv_nsec field that can be used
> for other purposes. In future patches, we're going to change how the
> inode->i_ctime is accessed in certain inodes in order to make use of
> them. In order to do that safely though, we'll need to eradicate raw
> accesses of the inode->i_ctime field from the kernel.
> 
> Add new accessor functions for the ctime that we can use to replace them.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/inode.c | 16 ++
>  include/linux/fs.h | 53 +-
>  2 files changed, 68 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/inode.c b/fs/inode.c
> index d37fad91c8da..c005e7328fbb 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -2499,6 +2499,22 @@ struct timespec64 current_time(struct inode *inode)
>  }
>  EXPORT_SYMBOL(current_time);
>  
> +/**
> + * inode_ctime_set_current - set the ctime to current_time
> + * @inode: inode
> + *
> + * Set the inode->i_ctime to the current value for the inode. Returns
> + * the current value that was assigned to i_ctime.
> + */
> +struct timespec64 inode_ctime_set_current(struct inode *inode)
> +{
> + struct timespec64 now = current_time(inode);
> +
> + inode_set_ctime(inode, now);
> + return now;
> +}
> +EXPORT_SYMBOL(inode_ctime_set_current);
> +
>  /**
>   * in_group_or_capable - check whether caller is CAP_FSETID privileged
>   * @idmap:   idmap of the mount @inode was found from
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 6867512907d6..9afb30606373 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1474,7 +1474,58 @@ static inline bool fsuidgid_has_mapping(struct super_block *sb,
>  kgid_has_mapping(fs_userns, kgid);
>  }
>  
> -extern struct timespec64 current_time(struct inode *inode);
> +struct timespec64 current_time(struct inode *inode);
> +struct timespec64 inode_ctime_set_current(struct inode *inode);
> +
> +/**
> + * inode_ctime_peek - fetch the current ctime from the inode
> + * @inode: inode from which to fetch ctime
> + *
> + * Grab the current ctime from the inode and return it.
> + */
> +static inline struct timespec64 inode_ctime_peek(const struct inode *inode)
> +{
> + return inode->i_ctime;
> +}
> +
> +/**
> + * inode_ctime_set - set the ctime in the inode to the given value
> + * @inode: inode in which to set the ctime
> + * @ts: timespec value to set the ctime
> + *
> + * Set the ctime in @inode to @ts.
> + */
> +static inline struct timespec64 inode_ctime_set(struct inode *inode, struct timespec64 ts)
> +{
> + inode->i_ctime = ts;
> + return ts;
> +}
> +
> +/**
> + * inode_ctime_set_sec - set only the tv_sec field in the inode ctime
> + * @inode: inode in which to set the ctime
> + * @sec:  value to set the tv_sec field
> + *
> + * Set the sec field in the ctime. Returns @sec.
> + */
> +static inline time64_t inode_ctime_set_sec(struct inode *inode, time64_t sec)
> +{
> + inode->i_ctime.tv_sec = sec;
> + return sec;
> +}
> +
> +/**
> + * inode_ctime_set_nsec - set only the tv_nsec field in the inode ctime
> + * @inode: inode in which to set the ctime
> + * @nsec:  value to set the tv_nsec field
> + *
> + * Set the nsec field in the ctime. Returns @nsec.
> + */
> +static inline long inode_ctime_set_nsec(struct inode *inode, long nsec)
> +{
> + inode->i_ctime.tv_nsec = nsec;
> + return nsec;
> +}
>  
>  /*
>   * Snapshotting support.
> -- 
> 2.41.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 02/79] spufs: switch to new ctime accessors

2023-06-21 Thread Jan Kara
On Wed 21-06-23 10:45:15, Jeff Layton wrote:
> In later patches, we're going to change how the ctime.tv_nsec field is
> utilized. Switch to using accessor functions instead of raw accesses of
> inode->i_ctime.
> 
> Signed-off-by: Jeff Layton 

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  arch/powerpc/platforms/cell/spufs/inode.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
> index ea807aa0c31a..55418395bd9a 100644
> --- a/arch/powerpc/platforms/cell/spufs/inode.c
> +++ b/arch/powerpc/platforms/cell/spufs/inode.c
> @@ -86,7 +86,7 @@ spufs_new_inode(struct super_block *sb, umode_t mode)
>   inode->i_mode = mode;
>   inode->i_uid = current_fsuid();
>   inode->i_gid = current_fsgid();
> - inode->i_atime = inode->i_mtime = inode->i_ctime = current_time(inode);
> + inode->i_atime = inode->i_mtime = inode_ctime_set_current(inode);
>  out:
>   return inode;
>  }
> -- 
> 2.41.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v1 3/5] treewide: use get_random_u32() when possible

2022-10-06 Thread Jan Kara
On Wed 05-10-22 23:48:42, Jason A. Donenfeld wrote:
> The prandom_u32() function has been a deprecated inline wrapper around
> get_random_u32() for several releases now, and compiles down to the
> exact same code. Replace the deprecated wrapper with a direct call to
> the real function.
> 
> Signed-off-by: Jason A. Donenfeld 

...

> diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
> index 998dd2ac8008..e439a872c398 100644
> --- a/fs/ext2/ialloc.c
> +++ b/fs/ext2/ialloc.c
> @@ -277,7 +277,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent)
>   int best_ndir = inodes_per_group;
>   int best_group = -1;
>  
> - group = prandom_u32();
> + group = get_random_u32();
>   parent_group = (unsigned)group % ngroups;
>   for (i = 0; i < ngroups; i++) {
>   group = (parent_group + i) % ngroups;

The code here is effectively doing:

	parent_group = prandom_u32_max(ngroups);

so prandom_u32_max() could be used directly here as well.

> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index f73e5eb43eae..954ec9736a8d 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -465,7 +465,7 @@ static int find_group_orlov(struct super_block *sb, struct inode *parent,
>   ext4fs_dirhash(parent, qstr->name, qstr->len, &hinfo);
>   grp = hinfo.hash;
>   } else
> - grp = prandom_u32();
> + grp = get_random_u32();

Similarly here we can use prandom_u32_max(ngroups) like:

if (qstr) {
...
parent_group = hinfo.hash % ngroups;
} else
parent_group = prandom_u32_max(ngroups);

> diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
> index 9af68a7ecdcf..588cb09c5291 100644
> --- a/fs/ext4/mmp.c
> +++ b/fs/ext4/mmp.c
> @@ -265,7 +265,7 @@ static unsigned int mmp_new_seq(void)
>   u32 new_seq;
>  
>   do {
> - new_seq = prandom_u32();
> + new_seq = get_random_u32();
>   } while (new_seq > EXT4_MMP_SEQ_MAX);

OK, here we again effectively implement prandom_u32_max(EXT4_MMP_SEQ_MAX + 1).
Just presumably we didn't want to use modulo here because EXT4_MMP_SEQ_MAX
is rather big and so the resulting 'new_seq' would be seriously
non-uniform.
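
To illustrate the concern (a sketch with an illustrative bound, not kernel
code): with EXT4_MMP_SEQ_MAX a large fraction of 2^32, a plain
'get_random_u32() % (EXT4_MMP_SEQ_MAX + 1)' would fold the values above the
maximum back onto the low end of the range, making small sequence numbers
noticeably more likely than large ones. The rejection loop quoted above
stays uniform by simply retrying:

	static u32 uniform_up_to(u32 ceil)	/* uniform in [0, ceil] */
	{
		u32 v;

		do {
			v = get_random_u32();
		} while (v > ceil);		/* reject out-of-range draws */
		return v;
	}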

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 4/8] ocfs2: simplify subdirectory registration with register_sysctl()

2021-11-24 Thread Jan Kara
On Tue 23-11-21 12:24:18, Luis Chamberlain wrote:
> There is no need to user boiler plate code to specify a set of base
> directories we're going to stuff sysctls under. Simplify this by using
> register_sysctl() and specifying the directory path directly.
> 
> // pycocci sysctl-subdir-register-sysctl-simplify.cocci PATH

Heh, nice example of using Coccinelle. The result looks good. Feel free to
add:

Reviewed-by: Jan Kara 

Honza


> 
> @c1@
> expression E1;
> identifier subdir, sysctls;
> @@
> 
> static struct ctl_table subdir[] = {
>   {
>   .procname = E1,
>   .maxlen = 0,
>   .mode = 0555,
>   .child = sysctls,
>   },
>   { }
> };
> 
> @c2@
> identifier c1.subdir;
> 
> expression E2;
> identifier base;
> @@
> 
> static struct ctl_table base[] = {
>   {
>   .procname = E2,
>   .maxlen = 0,
>   .mode = 0555,
>   .child = subdir,
>   },
>   { }
> };
> 
> @c3@
> identifier c2.base;
> identifier header;
> @@
> 
> header = register_sysctl_table(base);
> 
> @r1 depends on c1 && c2 && c3@
> expression c1.E1;
> identifier c1.subdir, c1.sysctls;
> @@
> 
> -static struct ctl_table subdir[] = {
> - {
> - .procname = E1,
> - .maxlen = 0,
> - .mode = 0555,
> - .child = sysctls,
> - },
> - { }
> -};
> 
> @r2 depends on c1 && c2 && c3@
> identifier c1.subdir;
> 
> expression c2.E2;
> identifier c2.base;
> @@
> -static struct ctl_table base[] = {
> - {
> - .procname = E2,
> - .maxlen = 0,
> - .mode = 0555,
> - .child = subdir,
> - },
> - { }
> -};
> 
> @initialize:python@
> @@
> 
> def make_my_fresh_expression(s1, s2):
>   return '"' + s1.strip('"') + "/" + s2.strip('"') + '"'
> 
> @r3 depends on c1 && c2 && c3@
> expression c1.E1;
> identifier c1.sysctls;
> expression c2.E2;
> identifier c2.base;
> identifier c3.header;
> fresh identifier E3 = script:python(E2, E1) { make_my_fresh_expression(E2, E1) };
> @@
> 
> header =
> -register_sysctl_table(base);
> +register_sysctl(E3, sysctls);
> 
> Generated-by: Coccinelle SmPL
> Signed-off-by: Luis Chamberlain 
> ---
>  fs/ocfs2/stackglue.c | 25 +
>  1 file changed, 1 insertion(+), 24 deletions(-)
> 
> diff --git a/fs/ocfs2/stackglue.c b/fs/ocfs2/stackglue.c
> index 16f1bfc407f2..731558a6f27d 100644
> --- a/fs/ocfs2/stackglue.c
> +++ b/fs/ocfs2/stackglue.c
> @@ -672,31 +672,8 @@ static struct ctl_table ocfs2_mod_table[] = {
>   { }
>  };
>  
> -static struct ctl_table ocfs2_kern_table[] = {
> - {
> - .procname   = "ocfs2",
> - .data   = NULL,
> - .maxlen = 0,
> - .mode   = 0555,
> - .child  = ocfs2_mod_table
> - },
> - { }
> -};
> -
> -static struct ctl_table ocfs2_root_table[] = {
> - {
> - .procname   = "fs",
> - .data   = NULL,
> - .maxlen = 0,
> - .mode   = 0555,
> - .child  = ocfs2_kern_table
> - },
> - { }
> -};
> -
>  static struct ctl_table_header *ocfs2_table_header;
>  
> -
>  /*
>   * Initialization
>   */
> @@ -705,7 +682,7 @@ static int __init ocfs2_stack_glue_init(void)
>  {
>   strcpy(cluster_stack_name, OCFS2_STACK_PLUGIN_O2CB);
>  
> - ocfs2_table_header = register_sysctl_table(ocfs2_root_table);
> + ocfs2_table_header = register_sysctl("fs/ocfs2", ocfs2_mod_table);
>   if (!ocfs2_table_header) {
>   printk(KERN_ERR
>  "ocfs2 stack glue: unable to register sysctl\n");
> -- 
> 2.33.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 6/8] inotify: simplify subdirectory registration with register_sysctl()

2021-11-24 Thread Jan Kara
On Tue 23-11-21 12:24:20, Luis Chamberlain wrote:
> From: Xiaoming Ni 
> 
> There is no need to user boiler plate code to specify a set of base
> directories we're going to stuff sysctls under. Simplify this by using
> register_sysctl() and specifying the directory path directly.
> 
> Move inotify_user sysctl to inotify_user.c while at it to remove clutter
> from kernel/sysctl.c.
> 
> Signed-off-by: Xiaoming Ni 
> [mcgrof: update commit log to reflect new path we decided to take]
> Signed-off-by: Luis Chamberlain 

This looks fishy. You register inotify_table but not fanotify_table and
remove both...
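
I.e. I'd expect a matching hookup on the fanotify side before the
kernel/sysctl.c entries can go away - something like (a sketch, assuming a
fanotify_table analogous to the inotify_table above):

	static void __init fanotify_sysctls_init(void)
	{
		register_sysctl("fs/fanotify", fanotify_table);
	}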

Honza

> ---
>  fs/notify/inotify/inotify_user.c | 11 ++-
>  include/linux/inotify.h  |  3 ---
>  kernel/sysctl.c  | 21 -
>  3 files changed, 10 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index 29fca3284bb5..54583f62dc44 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -58,7 +58,7 @@ struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
>  static long it_zero = 0;
>  static long it_int_max = INT_MAX;
>  
> -struct ctl_table inotify_table[] = {
> +static struct ctl_table inotify_table[] = {
>   {
>   .procname   = "max_user_instances",
>   .data   = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
> @@ -87,6 +87,14 @@ struct ctl_table inotify_table[] = {
>   },
>   { }
>  };
> +
> +static void __init inotify_sysctls_init(void)
> +{
> + register_sysctl("fs/inotify", inotify_table);
> +}
> +
> +#else
> +#define inotify_sysctls_init() do { } while (0)
>  #endif /* CONFIG_SYSCTL */
>  
>  static inline __u32 inotify_arg_to_mask(struct inode *inode, u32 arg)
> @@ -849,6 +857,7 @@ static int __init inotify_user_setup(void)
>   inotify_max_queued_events = 16384;
>   init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES] = 128;
>   init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES] = watches_max;
> + inotify_sysctls_init();
>  
>   return 0;
>  }
> diff --git a/include/linux/inotify.h b/include/linux/inotify.h
> index 6a24905f6e1e..8d20caa1b268 100644
> --- a/include/linux/inotify.h
> +++ b/include/linux/inotify.h
> @@ -7,11 +7,8 @@
>  #ifndef _LINUX_INOTIFY_H
>  #define _LINUX_INOTIFY_H
>  
> -#include 
>  #include 
>  
> -extern struct ctl_table inotify_table[]; /* for sysctl */
> -
>  #define ALL_INOTIFY_BITS (IN_ACCESS | IN_MODIFY | IN_ATTRIB | IN_CLOSE_WRITE | \
> IN_CLOSE_NOWRITE | IN_OPEN | IN_MOVED_FROM | \
> IN_MOVED_TO | IN_CREATE | IN_DELETE | \
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 7a90a12b9ea4..6aa67c737e4e 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -125,13 +125,6 @@ static const int maxolduid = 65535;
>  static const int ngroups_max = NGROUPS_MAX;
>  static const int cap_last_cap = CAP_LAST_CAP;
>  
> -#ifdef CONFIG_INOTIFY_USER
> -#include 
> -#endif
> -#ifdef CONFIG_FANOTIFY
> -#include 
> -#endif
> -
>  #ifdef CONFIG_PROC_SYSCTL
>  
>  /**
> @@ -3099,20 +3092,6 @@ static struct ctl_table fs_table[] = {
>   .proc_handler   = proc_dointvec,
>   },
>  #endif
> -#ifdef CONFIG_INOTIFY_USER
> - {
> - .procname   = "inotify",
> - .mode   = 0555,
> - .child  = inotify_table,
> - },
> -#endif
> -#ifdef CONFIG_FANOTIFY
> - {
> - .procname   = "fanotify",
> - .mode   = 0555,
> - .child  = fanotify_table,
> - },
> -#endif
>  #ifdef CONFIG_EPOLL
>   {
>   .procname   = "epoll",
> -- 
> 2.33.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [powerpc][5.13.0-next-20210701] Kernel crash while running ltp(chdir01) tests

2021-07-05 Thread Jan Kara
On Sun 04-07-21 10:04:21, Theodore Ts'o wrote:
> On Sat, Jul 03, 2021 at 12:55:09PM +0800, Zhang Yi wrote:
> > Yeah, it sounds good to me. Do you want me to send the fix patch, or you
> > modify your commit 8f9e16badb8fd in another email directly?
> 
> I've gone ahead and made the changes; what do you think?
> 
> I like how it also removes 40 lines of code.  :-)
> 
>  - Ted
> 
> From ef3130d1b0b8ca769252d6a722a2e59a00141383 Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o 
> Date: Fri, 2 Jul 2021 18:05:03 -0400
> Subject: [PATCH] ext4: inline jbd2_journal_[un]register_shrinker()
> 
> The function jbd2_journal_unregister_shrinker() was getting called
> twice when the file system was getting unmounted.  On Power and ARM
> platforms this was causing kernel crash when unmounting the file
> system, when a percpu_counter was destroyed twice.
> 
> Fix this by removing jbd2_journal_[un]register_shrinker() functions,
> and inlining the shrinker setup and teardown into
> journal_init_common() and jbd2_journal_destroy().  This means that
> ext4 and ocfs2 now no longer need to know about registering and
> unregistering jbd2's shrinker.
> 
> Also, while we're at it, rename the percpu counter from
> j_jh_shrink_count to j_checkpoint_jh_count, since this makes it
> clearer what this counter is intended to track.
> 
> Fixes: 4ba3fcdde7e3 ("jbd2,ext4: add a shrinker to release checkpointed buffers")
> Reported-by: Sachin Sant 
> Reported-by: Jon Hunter 
> Signed-off-by: Theodore Ts'o 

Except for the bug Zhang Yi noticed, the patch looks good to me. Feel free
to add:

Reviewed-by: Jan Kara 

after fixing that.

Honza


> ---
>  fs/ext4/super.c  |   8 ---
>  fs/jbd2/checkpoint.c |   4 +-
>  fs/jbd2/journal.c| 148 +--
>  include/linux/jbd2.h |   6 +-
>  4 files changed, 63 insertions(+), 103 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b8ff0399e171..dfa09a277b56 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1184,7 +1184,6 @@ static void ext4_put_super(struct super_block *sb)
>   ext4_unregister_sysfs(sb);
>  
>   if (sbi->s_journal) {
> - jbd2_journal_unregister_shrinker(sbi->s_journal);
>   aborted = is_journal_aborted(sbi->s_journal);
>   err = jbd2_journal_destroy(sbi->s_journal);
>   sbi->s_journal = NULL;
> @@ -5176,7 +5175,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>   sbi->s_ea_block_cache = NULL;
>  
>   if (sbi->s_journal) {
> - jbd2_journal_unregister_shrinker(sbi->s_journal);
>   jbd2_journal_destroy(sbi->s_journal);
>   sbi->s_journal = NULL;
>   }
> @@ -5502,12 +5500,6 @@ static int ext4_load_journal(struct super_block *sb,
>   ext4_commit_super(sb);
>   }
>  
> - err = jbd2_journal_register_shrinker(journal);
> - if (err) {
> - EXT4_SB(sb)->s_journal = NULL;
> - goto err_out;
> - }
> -
>   return 0;
>  
>  err_out:
> diff --git a/fs/jbd2/checkpoint.c b/fs/jbd2/checkpoint.c
> index 51d1eb2ffeb9..746132998c57 100644
> --- a/fs/jbd2/checkpoint.c
> +++ b/fs/jbd2/checkpoint.c
> @@ -701,7 +701,7 @@ int __jbd2_journal_remove_checkpoint(struct journal_head *jh)
>  
>   __buffer_unlink(jh);
>   jh->b_cp_transaction = NULL;
> - percpu_counter_dec(&journal->j_jh_shrink_count);
> + percpu_counter_dec(&journal->j_checkpoint_jh_count);
>   jbd2_journal_put_journal_head(jh);
>  
>   /* Is this transaction empty? */
> @@ -764,7 +764,7 @@ void __jbd2_journal_insert_checkpoint(struct journal_head *jh,
>   jh->b_cpnext->b_cpprev = jh;
>   }
>   transaction->t_checkpoint_list = jh;
> - percpu_counter_inc(&transaction->t_journal->j_jh_shrink_count);
> + percpu_counter_inc(&transaction->t_journal->j_checkpoint_jh_count);
>  }
>  
>  /*
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 152880c298ca..8a9c94dd3599 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1283,6 +1283,48 @@ static int jbd2_min_tag_size(void)
>   return sizeof(journal_block_tag_t) - 4;
>  }
>  
> +/**
> + * jbd2_journal_shrink_scan()
> + *
> + * Scan the checkpointed buffer on the checkpoint list and release the
> + * journal_head.
> + */
> +static unsigned long jbd2_journal_shrink_scan(struct shrinker *shrink,
> +   struct shrink_control *sc)
> +

Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-03 Thread Jan Kara
On Tue 02-06-20 17:59:08, Williams, Dan J wrote:
> [ forgive formatting, a series of unfortunate events has me using Outlook for 
> the moment ]
> 
> > From: Jan Kara 
> > > > > These flags are device properties that affect the kernel and
> > > > > userspace's handling of persistence.
> > > > >
> > > >
> > > > That will not handle the scenario with multiple applications using
> > > > the same fsdax mount point where one is updated to use the new
> > > > instruction and the other is not.
> > >
> > > Right, it needs to be a global setting / flag day to switch from one
> > > regime to another. Per-process control is a recipe for disaster.
> > 
> > First I'd like to mention that hopefully the concern is mostly theoretical 
> > since
> > as Aneesh wrote above, real persistent memory never shipped for PPC and
> > so there are very few apps (if any) using the old way to ensure cache
> > flushing.
> > 
> > But I'd like to understand why do you think per-process control is a recipe 
> > for
> > disaster? Because from my POV the sysfs interface you propose is actually
> > difficult to use in practice. As a distributor, you have hard time picking 
> > the
> > default because you have a choice between picking safe option which is
> > going to confuse users because of failing MAP_SYNC and unsafe option
> > where everyone will be happy until someone looses data because of some
> > ancient application using wrong instructions to persist data. Poor 
> > experience
> > for users in either way. And when distro defaults to "safe option", then the
> > burden is on the sysadmin to toggle the switch but how is he supposed to
> > decide when that is safe? First he has to understand what the problem
> > actually is, then he has to audit all the applications using pmem whether 
> > they
> > use the new instruction - which is IMO a lot of effort if you have a couple 
> > of
> > applications and practically infeasible if you have more of them.
> > So IMO the burden should be *on the application* to declare that it is aware
> > of the new instructions to flush pmem on the platform and only to such
> > application the kernel should give the trust to use MAP_SYNC mappings.
> 
> The "disaster" in my mind is this need to globally change the ABI for
> persistence semantics for all of Linux because one CPU wants a do over.
> What does a generic "MAP_SYNC_ENABLE" knob even mean to the existing
> deployed base of persistent memory applications? Yes, sysfs is awkward,
> but it's trying to provide some relief without imposing unexplainable
> semantics on everyone else. I think a comprehensive (overengineered)
> solution would involve not introducing another "I know what I'm doing"
> flag to the interface, but maybe requiring applications to call a pmem
> sync API in something like a vsyscall. Or, also overengineered, some
> binary translation / interpretation to actively detect and kill
> applications that deploy the old instructions. Something horrid like on
> first write fault to a MAP_SYNC try to look ahead in the binary for the
> correct sync sequence and kill the application otherwise. That would at
> least provide some enforcement and safety without requiring other
> architectures to consider what MAP_SYNC_ENABLE means to them.

Thanks for explanation. So I absolutely agree that other architectures (and
even older versions of POWER architecture) must not be influenced by the new
tunable. That's why I wrote in my reply to Aneesh that I'd be for checking
during mmap(2) with MAP_SYNC, whether we are in a situation where new PPC
flush instructions are required and *only in that case* decide based on the
prctl value whether MAP_SYNC should be allowed or not.

Whether this solution is overengineering or not depends on how you think
it's likely there will be applications trying to use old flush instructions
with MAP_SYNC on POWER10 platforms...
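
In pseudo-code, the mmap(2)-time check I mean would be roughly (a sketch;
the daxdev_requires_new_flush() helper is made up for illustration):

	/* In the MAP_SHARED_VALIDATE | MAP_SYNC validation path: */
	if (daxdev_requires_new_flush(dax_dev) &&
	    (current->mm->flags & MMF_DISABLE_MAP_SYNC_MASK))
		return -EOPNOTSUPP;

Other architectures and pre-POWER10 devices never pass the first check, so
they are completely unaffected by the prctl.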

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Mon 01-06-20 17:31:50, Aneesh Kumar K.V wrote:
> On 6/1/20 3:39 PM, Jan Kara wrote:
> > On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> > > On 5/29/20 3:22 PM, Jan Kara wrote:
> > > > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > > > Thanks Michal. I also missed Jeff in this email thread.
> > > > 
> > > > And I think you'll also need some of the sched maintainers for the prctl
> > > > bits...
> > > > 
> > > > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > > > Adding Jan
> > > > > > 
> > > > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > > > instructions.
> > > > > > > The kernel should prevent the usage of MAP_SYNC if applications 
> > > > > > > are not using
> > > > > > > the new instructions on newer hardware.
> > > > > > > 
> > > > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used 
> > > > > > > to enable
> > > > > > > the usage of MAP_SYNC. The kernel config option is added to allow 
> > > > > > > the user
> > > > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > > > 
> > > > > > > Signed-off-by: Aneesh Kumar K.V 
> > > > ...
> > > > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > > > --- a/kernel/fork.c
> > > > > > > +++ b/kernel/fork.c
> > > > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > > > > > > static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > > > +#else
> > > > > > > +unsigned long default_map_sync_mask = 0;
> > > > > > > +#endif
> > > > > > > +
> > > > 
> > > > I'm not sure CONFIG is really the right approach here. For a distro 
> > > > that would
> > > > basically mean to disable MAP_SYNC for all PPC kernels unless 
> > > > application
> > > > explicitly uses the right prctl. Shouldn't we rather initialize
> > > > default_map_sync_mask on boot based on whether the CPU we run on 
> > > > requires
> > > > new flush instructions or not? Otherwise the patch looks sensible.
> > > > 
> > > 
> > > yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> > > But on a virtualized platform there is no easy way to detect that. We 
> > > could
> > > ideally hook this into the nvdimm driver where we look at the new compat
> > > string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > > if we find a device with the specific value.
> > 
> > Hum, couldn't we set some flag for nvdimm devices with
> > "ibm,persistent-memory-v2" property and then check it during mmap(2) time
> > and when the device has this propery and the mmap(2) caller doesn't have
> > the prctl set, we'd disallow MAP_SYNC? That should make things mostly
> > seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
> > devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
> > applications need to be aware of new instructions so this isn't that much
> > additional burden...
> 
> I am not sure application would want to add that much details/knowledge
> about a platform in their code. I was expecting application to do
> 
> #ifdef __ppc64__
> prctl(MAP_SYNC_ENABLE, 1, 0, 0, 0);
> #endif
> a = mmap(NULL, PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
> 
> 
> For that code all the complexity that we add w.r.t ibm,persistent-memory-v2
> is not useful. Do you see a value in making all these device specific rather
> than a conditional on  __ppc64__?

Yes, from the application POV the code would look like this plus the
application would use instructions appropriate for POWER10 for flushing
caches...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Fri 29-05-20 16:25:35, Aneesh Kumar K.V wrote:
> On 5/29/20 3:22 PM, Jan Kara wrote:
> > On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > > Thanks Michal. I also missed Jeff in this email thread.
> > 
> > And I think you'll also need some of the sched maintainers for the prctl
> > bits...
> > 
> > > On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > > > Adding Jan
> > > > 
> > > > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > > > With POWER10, architecture is adding new pmem flush and sync 
> > > > > instructions.
> > > > > The kernel should prevent the usage of MAP_SYNC if applications are 
> > > > > not using
> > > > > the new instructions on newer hardware.
> > > > > 
> > > > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> > > > > enable
> > > > > the usage of MAP_SYNC. The kernel config option is added to allow the 
> > > > > user
> > > > > to control whether MAP_SYNC should be enabled by default or not.
> > > > > 
> > > > > Signed-off-by: Aneesh Kumar K.V 
> > ...
> > > > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > > > index 8c700f881d92..d5a9a363e81e 100644
> > > > > --- a/kernel/fork.c
> > > > > +++ b/kernel/fork.c
> > > > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp DEFINE_SPINLOCK(mmlist_lock);
> > > > > static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > > > +#else
> > > > > +unsigned long default_map_sync_mask = 0;
> > > > > +#endif
> > > > > +
> > 
> > I'm not sure CONFIG is really the right approach here. For a distro that 
> > would
> > basically mean to disable MAP_SYNC for all PPC kernels unless application
> > explicitly uses the right prctl. Shouldn't we rather initialize
> > default_map_sync_mask on boot based on whether the CPU we run on requires
> > new flush instructions or not? Otherwise the patch looks sensible.
> > 
> 
> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t POWER10.
> But on a virtualized platform there is no easy way to detect that. We could
> ideally hook this into the nvdimm driver where we look at the new compat
> string ibm,persistent-memory-v2 and then disable MAP_SYNC
> if we find a device with the specific value.

Hum, couldn't we set some flag for nvdimm devices with
"ibm,persistent-memory-v2" property and then check it during mmap(2) time
and when the device has this propery and the mmap(2) caller doesn't have
the prctl set, we'd disallow MAP_SYNC? That should make things mostly
seamless, shouldn't it? Only apps that want to use MAP_SYNC on these
devices would need to use prctl(MMF_DISABLE_MAP_SYNC, 0) but then these
applications need to be aware of new instructions so this isn't that much
additional burden...
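
In code, very roughly something like the sketch below in the driver's mmap
handler - dax_dev_needs_v2_flush() is a made-up name for whatever helper
would report the flag set from the "ibm,persistent-memory-v2" compat string:

static int dax_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* Sketch only: deny MAP_SYNC on v2 devices unless the app opted in */
	if ((vma->vm_flags & VM_SYNC) &&
	    dax_dev_needs_v2_flush(filp) &&
	    (current->mm->flags & MMF_DISABLE_MAP_SYNC_MASK))
		return -EOPNOTSUPP;	/* no prctl opt-in -> no MAP_SYNC */

	/* ... the usual mmap setup follows ... */
	return 0;
}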

> With that I am wondering: should we even have this patch? Can we expect
> userspace to get updated to use the new instructions?
> 
> With ppc64 we never had a real persistent memory device available for end
> user to try. The available persistent memory stack was using vPMEM which was
> presented as a volatile memory region for which there is no need to use any
> of the flush instructions. We could safely assume that as we get
> applications certified/verified for working with pmem device on ppc64, they
> would all be using the new instructions?

This is a bit of a gamble... I don't have too much trust in certification /
verification because only the "big players" may do powerfail testing
thoroughly enough that they'd uncover these problems. So the question
really is: How many apps are out there using MAP_SYNC on ppc64? Hopefully
not many given the HW didn't ship yet as you wrote but I have no real clue.
Similarly there's a question: How many app writers will read manual for
older ppc64 architecture and write apps that won't work reliably on
POWER10? Again, I have no idea.

So the prctl would be IMHO a nice safety belt but I'm not 100% certain it
will be needed...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-06-01 Thread Jan Kara
On Sat 30-05-20 09:35:19, Dan Williams wrote:
> On Sat, May 30, 2020 at 12:18 AM Aneesh Kumar K.V
>  wrote:
> >
> > On 5/30/20 12:52 AM, Dan Williams wrote:
> > > On Fri, May 29, 2020 at 3:55 AM Aneesh Kumar K.V
> > >  wrote:
> > >>
> > >> On 5/29/20 3:22 PM, Jan Kara wrote:
> > >>> Hi!
> > >>>
> > >>> On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> > >>>> Thanks Michal. I also missed Jeff in this email thread.
> > >>>
> > >>> And I think you'll also need some of the sched maintainers for the prctl
> > >>> bits...
> > >>>
> > >>>> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > >>>>> Adding Jan
> > >>>>>
> > >>>>> On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > >>>>>> With POWER10, architecture is adding new pmem flush and sync 
> > >>>>>> instructions.
> > >>>>>> The kernel should prevent the usage of MAP_SYNC if applications are 
> > >>>>>> not using
> > >>>>>> the new instructions on newer hardware.
> > >>>>>>
> > >>>>>> This patch adds a prctl option MAP_SYNC_ENABLE that can be used to 
> > >>>>>> enable
> > >>>>>> the usage of MAP_SYNC. The kernel config option is added to allow 
> > >>>>>> the user
> > >>>>>> to control whether MAP_SYNC should be enabled by default or not.
> > >>>>>>
> > >>>>>> Signed-off-by: Aneesh Kumar K.V 
> > >>> ...
> > >>>>>> diff --git a/kernel/fork.c b/kernel/fork.c
> > >>>>>> index 8c700f881d92..d5a9a363e81e 100644
> > >>>>>> --- a/kernel/fork.c
> > >>>>>> +++ b/kernel/fork.c
> > >>>>>> @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > >>>>>> DEFINE_SPINLOCK(mmlist_lock);
> > >>>>>> static unsigned long default_dump_filter = 
> > >>>>>> MMF_DUMP_FILTER_DEFAULT;
> > >>>>>> +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > >>>>>> +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > >>>>>> +#else
> > >>>>>> +unsigned long default_map_sync_mask = 0;
> > >>>>>> +#endif
> > >>>>>> +
> > >>>
> > >>> I'm not sure CONFIG is really the right approach here. For a distro 
> > >>> that would
> > >>> basically mean to disable MAP_SYNC for all PPC kernels unless 
> > >>> application
> > >>> explicitly uses the right prctl. Shouldn't we rather initialize
> > >>> default_map_sync_mask on boot based on whether the CPU we run on 
> > >>> requires
> > >>> new flush instructions or not? Otherwise the patch looks sensible.
> > >>>
> > >>
> > >> yes that is correct. We ideally want to deny MAP_SYNC only w.r.t
> > >> POWER10. But on a virtualized platform there is no easy way to detect
> > >> that. We could ideally hook this into the nvdimm driver where we look at
> > >> the new compat string ibm,persistent-memory-v2 and then disable MAP_SYNC
> > >> if we find a device with the specific value.
> > >>
> > >> BTW with the recent changes I posted for the nvdimm driver, older kernel
> > >> won't initialize persistent memory device on newer hardware. Newer
> > >> hardware will present the device to OS with a different device tree
> > >> compat string.
> > >>
> > >> My expectation  w.r.t this patch was, Distro would want to  mark
> > >> CONFIG_ARCH_MAP_SYNC_DISABLE=n based on the different application
> > >> certification.  Otherwise application will have to end up calling the
> > >> prctl(MMF_DISABLE_MAP_SYNC, 0) any way. If that is the case, should this
> > >> be dependent on P10?
> > >>
> > >> With that I am wondering: should we even have this patch? Can we expect
> > >> userspace to get updated to use the new instructions?
> > >>
> > >> With ppc64 we never had a real persistent memory device available for
> > >> end user to try. The available persistent memory stack was using vPMEM
> > >> which was presented as a volatile memory

Re: [RFC PATCH 1/2] libnvdimm: Add prctl control for disabling synchronous fault support.

2020-05-29 Thread Jan Kara
Hi!

On Fri 29-05-20 15:07:31, Aneesh Kumar K.V wrote:
> Thanks Michal. I also missed Jeff in this email thread.

And I think you'll also need some of the sched maintainers for the prctl
bits...

> On 5/29/20 3:03 PM, Michal Suchánek wrote:
> > Adding Jan
> > 
> > On Fri, May 29, 2020 at 11:11:39AM +0530, Aneesh Kumar K.V wrote:
> > > With POWER10, architecture is adding new pmem flush and sync instructions.
> > > The kernel should prevent the usage of MAP_SYNC if applications are not 
> > > using
> > > the new instructions on newer hardware.
> > > 
> > > This patch adds a prctl option MAP_SYNC_ENABLE that can be used to enable
> > > the usage of MAP_SYNC. The kernel config option is added to allow the user
> > > to control whether MAP_SYNC should be enabled by default or not.
> > > 
> > > Signed-off-by: Aneesh Kumar K.V 
...
> > > diff --git a/kernel/fork.c b/kernel/fork.c
> > > index 8c700f881d92..d5a9a363e81e 100644
> > > --- a/kernel/fork.c
> > > +++ b/kernel/fork.c
> > > @@ -963,6 +963,12 @@ __cacheline_aligned_in_smp 
> > > DEFINE_SPINLOCK(mmlist_lock);
> > >   static unsigned long default_dump_filter = MMF_DUMP_FILTER_DEFAULT;
> > > +#ifdef CONFIG_ARCH_MAP_SYNC_DISABLE
> > > +unsigned long default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
> > > +#else
> > > +unsigned long default_map_sync_mask = 0;
> > > +#endif
> > > +

I'm not sure CONFIG is really the right approach here. For a distro that would
basically mean to disable MAP_SYNC for all PPC kernels unless application
explicitly uses the right prctl. Shouldn't we rather initialize
default_map_sync_mask on boot based on whether the CPU we run on requires
new flush instructions or not? Otherwise the patch looks sensible.
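
E.g. a sketch of the boot-time variant (whether CPU_FTR_ARCH_31 - the ISA
v3.1 / POWER10 feature bit - is a reliable test here, especially under
virtualization, is a separate question):

#include <asm/cputable.h>

static int __init pmem_map_sync_init(void)
{
	/* Deny MAP_SYNC by default only where the new flush insns are needed */
	if (cpu_has_feature(CPU_FTR_ARCH_31))
		default_map_sync_mask = MMF_DISABLE_MAP_SYNC_MASK;
	return 0;
}
early_initcall(pmem_map_sync_init);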

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2020-01-06 Thread Jan Kara
On Sat 28-12-19 20:33:32, John Hubbard wrote:
> On 12/27/19 1:56 PM, John Hubbard wrote:
> ...
> >> It is an ancient verification test (~10y) which is not an easy task to
> >> make understandable and standalone :).
> >>
> > 
> > Is this the only test that fails, btw? No other test failures or hints of
> > problems?
> > 
> > (Also, maybe hopeless, but can *anyone* on the RDMA list provide some
> > characterization of the test, such as how many pins per page, what page
> > sizes are used? I'm still hoping to write a test to trigger something
> > close to this...)
> > 
> > I do have a couple more ideas for test runs:
> > 
> > 1. Reduce GUP_PIN_COUNTING_BIAS to 1. That would turn the whole override of
> > page->_refcount into a no-op, and so if all is well (it may not be!) with 
> > the
> > rest of the patch, then we'd expect this problem to not reappear.
> > 
> > 2. Activate /proc/vmstat *foll_pin* statistics unconditionally (just for these
> > tests, of course), so we can see if there is a get/put mismatch. However, 
> > that
> > will change the timing, and so it must be attempted independently of (1), in
> > order to see if it ends up hiding the repro.
> > 
> > I've updated this branch to implement (1), but not (2), hoping you can give
> > this one a spin?
> > 
> >     g...@github.com:johnhubbard/linux.git  
> > pin_user_pages_tracking_v11_with_diags
> > 
> > 
> 
> Also, looking ahead:
> 
> a) if the problem disappears with the latest above test, then we likely have
>a huge page refcount overflow, and there are a couple of different ways to
>fix it. 
> 
> b) if it still reproduces with the above, then it's some other random mistake,
>and in that case I'd be inclined to do a sort of guided (or classic, 
> unguided)
>git bisect of the series. Because it could be any of several patches.
> 
>If that's too much trouble, then I'd have to fall back to submitting a few
>patches at a time and working my way up to the tracking patch...

It could also be that an ordinary page reference is dropped with 'unpin'
thus underflowing the page refcount...
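
I.e. a mismatch of this kind somewhere in the converted paths (illustration
only, not a real call site):

static void buggy_mixed_apis(unsigned long addr)
{
	struct page *page;

	/* Takes an ordinary +1 reference... */
	if (get_user_pages(addr, 1, FOLL_WRITE, &page, NULL) != 1)
		return;
	/* ...but drops GUP_PIN_COUNTING_BIAS -> refcount underflow */
	unpin_user_page(page);
}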

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-20 Thread Jan Kara
On Fri 20-12-19 16:06:05, Alexey Kardashevskiy wrote:
> 
> 
> On 11/12/2019 21:42, Jan Kara wrote:
> > The last jump to free_exit in mm_iommu_do_alloc() happens after page
> > pointers in struct mm_iommu_table_group_mem_t were already converted to
> > physical addresses. Thus calling put_page() on these physical addresses
> > will likely crash. Convert physical addresses back to page pointers
> > during the error cleanup.
> > 
> > Signed-off-by: Jan Kara 
> > ---
> >  arch/powerpc/mm/book3s64/iommu_api.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> >  Beware, this is completely untested, spotted just by code audit.
> > 
> > diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> > b/arch/powerpc/mm/book3s64/iommu_api.c
> > index 56cc84520577..06c403381c9c 100644
> > --- a/arch/powerpc/mm/book3s64/iommu_api.c
> > +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> > @@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> > unsigned long ua,
> >    (mem2->entries << PAGE_SHIFT)))) {
> > ret = -EINVAL;
> > mutex_unlock(&mem_list_mutex);
> > -   goto free_exit;
> > +   goto convert_exit;
> > }
> > }
> >  
> > @@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> > unsigned long ua,
> >  
> > return 0;
> >  
> > +convert_exit:
> > +   for (i = 0; i < pinned; i++)
> > +           mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
> 
> 
> Good find. Although doing it where you added "goto convert_exit" seems
> cleaner imho. Thanks,

I don't really care :). V2 attached.

Honza
-- 
Jan Kara 
SUSE Labs, CR
>From 947c7a893c282b829a8623da73276a2fe56fdcd3 Mon Sep 17 00:00:00 2001
From: Jan Kara 
Date: Wed, 11 Dec 2019 11:36:54 +0100
Subject: [PATCH v2] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

The last jump to free_exit in mm_iommu_do_alloc() happens after page
pointers in struct mm_iommu_table_group_mem_t were already converted to
physical addresses. Thus calling put_page() on these physical addresses
will likely crash. Convert physical addresses back to page pointers
during the error cleanup.

Signed-off-by: Jan Kara 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..1aa06584d783 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -154,6 +154,11 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, unsigned long ua,
 		    (mem2->entries << PAGE_SHIFT)))) {
 			ret = -EINVAL;
 			mutex_unlock(&mem_list_mutex);
+			/* Convert back to page pointers */
+			for (i = 0; i < pinned; i++) {
+				mem->hpages[i] =
+					pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
+			}
 			goto free_exit;
 		}
 	}
-- 
2.16.4



Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-20 Thread Jan Kara
On Thu 19-12-19 12:30:31, John Hubbard wrote:
> On 12/19/19 5:26 AM, Leon Romanovsky wrote:
> > On Mon, Dec 16, 2019 at 02:25:12PM -0800, John Hubbard wrote:
> > > Hi,
> > > 
> > > This implements an API naming change (put_user_page*() -->
> > > unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> > > extends that tracking to a few select subsystems. More subsystems will
> > > be added in follow up work.
> > 
> > Hi John,
> > 
> > The patchset generates kernel panics in our IB testing. In our tests, we
> > allocated single memory block and registered multiple MRs using the single
> > block.
> > 
> > The possible bad flow is:
> >   ib_umem_geti() ->
> >pin_user_pages_fast(FOLL_WRITE) ->
> > internal_get_user_pages_fast(FOLL_WRITE) ->
> >  gup_pgd_range() ->
> >   gup_huge_pd() ->
> >gup_hugepte() ->
> > try_grab_compound_head() ->
> 
> Hi Leon,
> 
> Thanks very much for the detailed report! So we're overflowing...
> 
> At first look, this seems likely to be hitting a weak point in the
> GUP_PIN_COUNTING_BIAS-based design, one that I believed could be deferred
> (there's a writeup in Documentation/core-api/pin_user_page.rst, lines
> 99-121). Basically it's pretty easy to overflow the page->_refcount
> with huge pages if the pages have a *lot* of subpages.
> 
> We can only do about 7 pins on 1GB huge pages that use 4KB subpages.
> Do you have any idea how many pins (repeated pins on the same page, which
> it sounds like you have) might be involved in your test case,
> and the huge page and system page sizes? That would allow calculating
> if we're likely overflowing for that reason.
> 
> So, ideas and next steps:
> 
> 1. Assuming that you *are* hitting this, I think I may have to fall back to
> implementing the "deferred" part of this design, as part of this series, after
> all. That means:
> 
>   For the pin/unpin calls at least, stop treating all pages as if they are
>   a cluster of PAGE_SIZE pages; instead, retrieve a huge page as one page.
>   That's not how it works now, and the need to hand back a huge array of
>   subpages is part of the problem. This affects the callers too, so it's not
>   a super quick change to make. (I was really hoping not to have to do this
>   yet.)

Does that mean that you would need to make all GUP users huge page aware?
Otherwise I don't see how what you suggest would work... And I don't think
making all GUP users huge page aware is realistic (effort-wise) or even
wanted (maintenance overhead in all those places).

I believe there might also be a different solution for this: for
transparent huge pages, we could find space in the 'struct page' of the
second page in the huge page for a proper pin counter and just account pins
there, so we'd have the full 32-bit width for it.
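
Roughly like this (a sketch - the field name is made up; the point is just
that tail page 1 of the compound page has room for a full-width counter):

/* Hypothetical pin counter parked in tail page 1 of the huge page */
static inline atomic_t *compound_pincount_ptr(struct page *head)
{
	return &head[1].hpage_pincount;		/* made-up field name */
}

static inline void hpage_pin(struct page *head)
{
	atomic_inc(compound_pincount_ptr(head));	/* full 32 bits */
}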

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v11 00/25] mm/gup: track dma-pinned pages: FOLL_PIN

2019-12-16 Thread Jan Kara
Hi!

On Mon 16-12-19 14:25:12, John Hubbard wrote:
> Hi,
> 
> This implements an API naming change (put_user_page*() -->
> unpin_user_page*()), and also implements tracking of FOLL_PIN pages. It
> extends that tracking to a few select subsystems. More subsystems will
> be added in follow up work.

Just a note for Andrew and others watching this series: At this point I'm fine
with the series so if someone still has some review feedback or wants to
check the series, now is the right time. Otherwise I think Andrew can push
the series to MM tree so that it will get wider testing exposure and is
prepared for the next merge window.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread Jan Kara
On Mon 16-12-19 14:18:59, John Hubbard wrote:
> On 12/16/19 4:53 AM, Jan Kara wrote:
> > With this fixed, the patch looks good to me so you can then add:
> > 
> > Reviewed-by: Jan Kara 
> > 
> > Honza
> > 
> 
> btw, thanks for the thorough review of this critical patch (and for your
> patience with my mistakes). I really appreciate it, and this patchset would
> not have made it this far without your detailed help and explanations.

You're welcome! I'd also like to thank you for persistently driving this
series :)

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v11 23/25] mm/gup: track FOLL_PIN pages

2019-12-16 Thread Jan Kara
On Fri 13-12-19 19:26:17, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 
> ---
> 
> Hi Jan,
> 
> This should address all of your comments for patch 23!

Thanks. One comment below:

> @@ -1486,6 +1500,10 @@ struct page *follow_trans_huge_pmd(struct 
> vm_area_struct *vma,
>   VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);
>   if (flags & FOLL_TOUCH)
>   touch_pmd(vma, addr, pmd, flags);
> +
> + if (!try_grab_page(page, flags))
> + return ERR_PTR(-ENOMEM);
> +
>   if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
>   /*
>* We don't mlock() pte-mapped THPs. This way we can avoid

I'd move this still a bit higher - just after the VM_BUG_ON_PAGE() and before
the "if (flags & FOLL_TOUCH)" test, because touch_pmd() can update page tables
and we don't want that if we're going to fail the fault.
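
I.e. something like (a sketch of the reordered hunk):

	VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page);

	/* Grab the reference before touch_pmd() can modify page tables */
	if (!try_grab_page(page, flags))
		return ERR_PTR(-ENOMEM);

	if (flags & FOLL_TOUCH)
		touch_pmd(vma, addr, pmd, flags);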

With this fixed, the patch looks good to me so you can then add:

Reviewed-by: Jan Kara 

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v10 23/25] mm/gup: track FOLL_PIN pages

2019-12-12 Thread Jan Kara
On Thu 12-12-19 00:19:15, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Thanks for the patch. As a side note, given this series is rather big, it
may be better to send just individual updated patches (as replies to the
review comments) instead of resending the whole series every time. And then
you can resend the whole series once enough changes accumulate or we reach
seemingly final state.  That way people don't have to crawl through lots of
uninteresting emails...  Just something to keep in mind for the future.

I've spotted just one issue in this patch (see below), the rest are just
small style nits.

> +#define page_ref_zero_or_close_to_bias_overflow(page) \
> + ((unsigned int) page_ref_count(page) + \
> + GUP_PIN_COUNTING_BIAS <= GUP_PIN_COUNTING_BIAS)
> +

...

> +/**
> + * page_dma_pinned() - report if a page is pinned for DMA.
> + *
> + * This function checks if a page has been pinned via a call to
> + * pin_user_pages*().
> + *
> + * The return value is partially fuzzy: false is not fuzzy, because it means
> + * "definitely not pinned for DMA", but true means "probably pinned for DMA, 
> but
> + * possibly a false positive due to having at least GUP_PIN_COUNTING_BIAS 
> worth
> + * of normal page references".
> + *
> + * False positives are OK, because: a) it's unlikely for a page to get that 
> many
> + * refcounts, and b) all the callers of this routine are expected to be able 
> to
> + * deal gracefully with a false positive.
> + *
> + * For more information, please see Documentation/vm/pin_user_pages.rst.
> + *
> + * @page:pointer to page to be queried.
> + * @Return:  True, if it is likely that the page has been "dma-pinned".
> + *   False, if the page is definitely not dma-pinned.
> + */
> +static inline bool page_dma_pinned(struct page *page)
> +{
> + return (page_ref_count(compound_head(page))) >= GUP_PIN_COUNTING_BIAS;
> +}
> +

I realized one think WRT handling of page refcount overflow: Page refcount is
signed and e.g. try_get_page() fails once the refcount is negative. That
means that:

a) page_ref_zero_or_close_to_bias_overflow() is not necessary - the checks
in all the places that use pinning (i.e., advance the refcount by
GUP_PIN_COUNTING_BIAS) are not needed; we should just rely on the check for
a negative value, for consistency.

b) page_dma_pinned() has to be careful and cast page_ref_count() to an
unsigned type for the comparison, as otherwise an overflowed refcount would
suddenly appear as not-pinned.
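
E.g. a sketch of (b):

static inline bool page_dma_pinned(struct page *page)
{
	/* Unsigned compare: an overflowed (negative) refcount stays "pinned" */
	return (unsigned int)page_ref_count(compound_head(page)) >=
			GUP_PIN_COUNTING_BIAS;
}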

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:pointer to page to be marked
> + * @Return:  the compound head page, with ref appropriately incremented,
> + * or NULL upon failure.
> + */
> +__must_check struct page *try_pin_compound_head(struct page *page, int refs)
> +{
> + struct page *head = try_get_compound_head(page,
> +   GUP_PIN_COUNTING_BIAS * refs);
> + if (!head)
> + return NULL;
> +
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> + return head;
> +}
> +
> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) must be set, but not both at the
> + * same time. (That's true throughout the get_user_pages*() and
> + * 

Re: [PATCH v9 23/25] mm/gup: track FOLL_PIN pages

2019-12-11 Thread Jan Kara
On Tue 10-12-19 18:53:16, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/

The patch looks mostly good to me now. Just a few smaller comments below.

> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Reviewed-by: Jan Kara 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 

I think you inherited here the Reviewed-by tags from the "add flags" patch
you've merged into this one but that's not really fair since this patch
does much more... In particular I didn't give my Reviewed-by tag for this
patch yet.

> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * This has a default assumption of "use FOLL_GET behavior, if FOLL_PIN is 
> not
> + * set".
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + */
> +static __maybe_unused struct page *try_grab_compound_head(struct page *page,
> +   int refs,
> +   unsigned int flags)
> +{
> + if (flags & FOLL_PIN)
> + return try_pin_compound_head(page, refs);
> +
> + return try_get_compound_head(page, refs);
> +}

I somewhat wonder about the asymmetry of try_grab_compound_head() vs
try_grab_page() in the treatment of 'flags'. How costly would it be to make
them symmetric (i.e., either set FOLL_GET for try_grab_compound_head()
callers or make sure one of FOLL_GET, FOLL_PIN is set for try_grab_page())?

Because this difference looks like a subtle catch in the long run...
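
E.g. a sketch of the symmetric variant (warning when a caller passes neither
flag):

static __maybe_unused struct page *try_grab_compound_head(struct page *page,
							   int refs,
							   unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_compound_head(page, refs);
	else if (flags & FOLL_PIN)
		return try_pin_compound_head(page, refs);

	WARN_ON_ONCE(1);	/* callers must set FOLL_GET or FOLL_PIN */
	return NULL;
}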

> +
> +/**
> + * try_grab_page() - elevate a page's refcount by a flag-dependent amount
> + *
> + * This might not do anything at all, depending on the flags argument.
> + *
> + * "grab" names in this file mean, "look at flags to decide whether to use
> + * FOLL_PIN or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * @page:pointer to page to be grabbed
> + * @flags:   gup flags: these are the FOLL_* flag values.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the 
> same
> + * time. (That's true throughout the get_user_pages*() and pin_user_pages*()
> + * APIs.) Cases:
> + *
> + *   FOLL_GET: page's refcount will be incremented by 1.
> + *  FOLL_PIN: page's refcount will be incremented by 
> GUP_PIN_COUNTING_BIAS.
> + *
> + * Return: true for success, or if no action was required (if neither 
> FOLL_PIN
> + * nor FOLL_GET was set, nothing is done). False for failure: FOLL_GET or
> + * FOLL_PIN was set, but the page could not be grabbed.
> + */
> +bool __must_check try_grab_page(struct page *page, unsigned int flags)
> +{
> + if (flags & FOLL_GET)
> + return try_get_page(page);
> + else if (flags & FOLL_PIN) {
> + page = compound_head(page);
> + WARN_ON_ONCE(flags & FOLL_GET);
> +
> + if (WARN_ON_ONCE(page_ref_zero_or_close_to_bias_overflow(page)))
> + return false;
> +
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS);
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
> + }
> +
> + return true;
> +}

...

> @@ -1522,8 +1536,8 @@ struct page *follow_trans_huge_pmd(struct 
> vm_area_struct *vma,
>  skip_mlock:
>   page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
>   VM_BUG_ON_PAGE(!PageCompound(page) && !is_zone_device_page(page), page);
> - if (flags & FOLL_GET)
> - get_page(page);
> + if (!try_grab_page(page, flags))
> + page = ERR_PTR(-EFAULT);

I think you need to also mo

Re: [PATCH v9 20/25] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-12-11 Thread Jan Kara
On Tue 10-12-19 18:53:13, John Hubbard wrote:
> 1. Convert from get_user_pages() to pin_user_pages().
> 
> 2. As required by pin_user_pages(), release these pages via
> put_user_page().
> 
> Cc: Jan Kara 
> Signed-off-by: John Hubbard 

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

I'd just note that mm_iommu_do_alloc() has a pre-existing bug: the last
jump to 'free_exit' (at line 157) happens after the page pointers were
already converted to physical addresses, so the put_page() calls there will
just crash.
But that's completely unrelated to your change. I'll send a fix separately.

Honza

> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..a86547822034 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   for (entry = 0; entry < entries; entry += chunk) {
>   unsigned long n = min(entries - entry, chunk);
>  
> - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
> + ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
>   FOLL_WRITE | FOLL_LONGTERM,
>   mem->hpages + entry, NULL);
>   if (ret == n) {
> @@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   return 0;
>  
>  free_exit:
> - /* free the reference taken */
> - for (i = 0; i < pinned; i++)
> - put_page(mem->hpages[i]);
> + /* free the references taken */
> + put_user_pages(mem->hpages, pinned);
>  
>   vfree(mem->hpas);
>   kfree(mem);
> @@ -215,7 +214,8 @@ static void mm_iommu_unpin(struct 
> mm_iommu_table_group_mem_t *mem)
>   if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
>   SetPageDirty(page);
>  
> - put_page(page);
> + put_user_page(page);
> +
>   mem->hpas[i] = 0;
>   }
>  }
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


[PATCH] powerpc/book3s64: Fix error handling in mm_iommu_do_alloc()

2019-12-11 Thread Jan Kara
The last jump to free_exit in mm_iommu_do_alloc() happens after page
pointers in struct mm_iommu_table_group_mem_t were already converted to
physical addresses. Thus calling put_page() on these physical addresses
will likely crash. Convert physical addresses back to page pointers
during the error cleanup.

Signed-off-by: Jan Kara 
---
 arch/powerpc/mm/book3s64/iommu_api.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

 Beware, this is completely untested, spotted just by code audit.

diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
b/arch/powerpc/mm/book3s64/iommu_api.c
index 56cc84520577..06c403381c9c 100644
--- a/arch/powerpc/mm/book3s64/iommu_api.c
+++ b/arch/powerpc/mm/book3s64/iommu_api.c
@@ -154,7 +154,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 		    (mem2->entries << PAGE_SHIFT)))) {
 			ret = -EINVAL;
 			mutex_unlock(&mem_list_mutex);
-			goto free_exit;
+			goto convert_exit;
 		}
 	}
 
@@ -166,6 +166,9 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
unsigned long ua,
 
return 0;
 
+convert_exit:
+	for (i = 0; i < pinned; i++)
+		mem->hpages[i] = pfn_to_page(mem->hpas[i] >> PAGE_SHIFT);
 free_exit:
/* free the reference taken */
for (i = 0; i < pinned; i++)
-- 
2.16.4



Re: [PATCH v8 24/26] mm/gup: track FOLL_PIN pages

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:42, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via unpin_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: John Hubbard 

Looks nice, some comments below...

> +/*
> + * try_grab_compound_head() - attempt to elevate a page's refcount, by a
> + * flags-dependent amount.
> + *
> + * This has a default assumption of "use FOLL_GET behavior, if FOLL_PIN is 
> not
> + * set".
> + *
> + * "grab" names in this file mean, "look at flags to decide with to use 
> FOLL_PIN
> + * or FOLL_GET behavior, when incrementing the page's refcount.
> + */
> +static struct page *try_grab_compound_head(struct page *page, int refs,
> +unsigned int flags)
> +{
> + if (flags & FOLL_PIN)
> + return try_pin_compound_head(page, refs);
> +
> + return try_get_compound_head(page, refs);
> +}
> +
> +/**
> + * grab_page() - elevate a page's refcount by a flag-dependent amount
> + *
> + * This might not do anything at all, depending on the flags argument.
> + *
> + * "grab" names in this file mean, "look at flags to decide with to use 
> FOLL_PIN
   ^^^ whether

> + * or FOLL_GET behavior, when incrementing the page's refcount.
> + *
> + * @page:pointer to page to be grabbed
> + * @flags:   gup flags: these are the FOLL_* flag values.
> + *
> + * Either FOLL_PIN or FOLL_GET (or neither) may be set, but not both at the 
> same
> + * time. (That's true throughout the get_user_pages*() and pin_user_pages*()
> + * APIs.) Cases:
> + *
> + *   FOLL_GET: page's refcount will be incremented by 1.
> + *   FOLL_PIN: page's refcount will be incremented by GUP_PIN_COUNTING_BIAS.
> + */
> +void grab_page(struct page *page, unsigned int flags)
> +{
> + if (flags & FOLL_GET)
> + get_page(page);
> + else if (flags & FOLL_PIN) {
> + get_page(page);
> + WARN_ON_ONCE(flags & FOLL_GET);
> + /*
> +  * Use get_page(), above, to do the refcount error
> +  * checking. Then just add in the remaining references:
> +  */
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);

This is wrong for two reasons:

1) You miss compound_head() indirection from get_page() for this
page_ref_add().

2) page_ref_add() could overflow the counter without noticing.

Especially with GUP_PIN_COUNTING_BIAS being non-trivial, it is realistic
that an attacker might try to overflow the page refcount and we have to
protect the kernel against that. So I think that all the places that would
use grab_page() actually need to use try_grab_page() and then gracefully
deal with the failure.
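
E.g. something along these lines (a sketch only; the exact shape of the
overflow check is up for discussion):

bool __must_check try_grab_page(struct page *page, unsigned int flags)
{
	if (flags & FOLL_GET)
		return try_get_page(page);
	else if (flags & FOLL_PIN) {
		page = compound_head(page);		/* fixes 1) */
		/* Fixes 2): refuse instead of overflowing the refcount */
		if (WARN_ON_ONCE(page_ref_count(page) <= 0))
			return false;
		page_ref_add(page, GUP_PIN_COUNTING_BIAS);
	}
	return true;
}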

> @@ -278,11 +425,23 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   goto retry;
>   }
>  
> - if (flags & FOLL_GET) {
> + if (flags & (FOLL_PIN | FOLL_GET)) {
> + /*
> +  * Allow try_get_page() to take care of error handling, for
> +  * both cases: FOLL_GET or FOLL_PIN:
> +  */
>   if (unlikely(!try_get_page(page))) {
>   page = ERR_PTR(-ENOMEM);
>   goto out;
>   }
> +
> + if (flags & FOLL_PIN) {
> + WARN_ON_ONCE(flags & FOLL_GET);
> +
> + /* We got a +1 refcount from try_get_page(), above. */
> + page_ref_add(page, GUP_PIN_COUNTING_BIAS - 1);
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, 1);
> + }
>   }

The same problem here as above, plus this place should use th

Re: [PATCH v8 23/26] mm/gup: pass flags arg to __gup_device_* functions

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:41, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so pass the flags
> argument through to the __gup_device_* functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> TODO: Christoph Hellwig requested folding this into the patch the uses
> the gup flags arguments.

You should probably implement this TODO? :)

Honza

> 
> Reviewed-by: Jan Kara 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 
> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 73aedcefa4bd..687d48506f04 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1957,7 +1957,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1983,13 +1984,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -2000,13 +2002,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -2017,14 +2020,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2136,7 +2141,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> @@ -2157,7 +2163,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   

Re: [PATCH v8 08/26] mm/gup: allow FOLL_FORCE for get_user_pages_fast()

2019-12-10 Thread Jan Kara
On Mon 09-12-19 14:53:26, John Hubbard wrote:
> Commit 817be129e6f2 ("mm: validate get_user_pages_fast flags") allowed
> only FOLL_WRITE and FOLL_LONGTERM to be passed to get_user_pages_fast().
> This, combined with the fact that get_user_pages_fast() falls back to
> "slow gup", which *does* accept FOLL_FORCE, leads to an odd situation:
> if you need FOLL_FORCE, you cannot call get_user_pages_fast().
> 
> There does not appear to be any reason for filtering out FOLL_FORCE.
> There is nothing in the _fast() implementation that requires that we
> avoid writing to the pages. So it appears to have been an oversight.
> 
> Fix by allowing FOLL_FORCE to be set for get_user_pages_fast().
> 
> Fixes: 817be129e6f2 ("mm: validate get_user_pages_fast flags")
> Cc: Christoph Hellwig 
> Reviewed-by: Leon Romanovsky 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index c0c56888e7cc..958ab0757389 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2414,7 +2414,8 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   unsigned long addr, len, end;
>   int nr = 0, ret = 0;
>  
> - if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM)))
> + if (WARN_ON_ONCE(gup_flags & ~(FOLL_WRITE | FOLL_LONGTERM |
> +    FOLL_FORCE)))
>   return -EINVAL;
>  
>   start = untagged_addr(start) & PAGE_MASK;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v8 17/26] media/v4l2-core: set pages dirty upon releasing DMA buffers

2019-12-10 Thread Jan Kara
On Mon 09-12-19 16:56:27, Andrew Morton wrote:
> On Mon, 9 Dec 2019 14:53:35 -0800 John Hubbard  wrote:
> 
> > After DMA is complete, and the device and CPU caches are synchronized,
> > it's still required to mark the CPU pages as dirty, if the data was
> > coming from the device. However, this driver was just issuing a
> > bare put_page() call, without any set_page_dirty*() call.
> > 
> > Fix the problem, by calling set_page_dirty_lock() if the CPU pages
> > were potentially receiving data from the device.
> > 
> > Reviewed-by: Christoph Hellwig 
> > Acked-by: Hans Verkuil 
> > Cc: Mauro Carvalho Chehab 
> > Cc: 
> 
> What are the user-visible effects of this change?

Presumably loss of captured video data if the page writeback hits at the
wrong moment (i.e., after the page was faulted in but before the video HW
stored data in the page) and the page then gets evicted from the page cache.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2 17/19] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-11-29 Thread Jan Kara
On Mon 25-11-19 15:10:33, John Hubbard wrote:
> 1. Convert from get_user_pages() to pin_user_pages().
> 
> 2. As required by pin_user_pages(), release these pages via
> put_user_page(). In this case, do so via put_user_pages_dirty_lock().
> 
> That has the side effect of calling set_page_dirty_lock(), instead
> of set_page_dirty(). This is probably more accurate.

Maybe more accurate, but it doesn't work for mm_iommu_unpin(). As far as I
can tell, mm_iommu_unpin() gets called from an RCU callback, which is
executed in interrupt context, and you cannot lock pages from such a
context. So you need to queue work from the RCU callback and then do the
real work from a workqueue...
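
Roughly the usual pattern, i.e. something like (a sketch - the work struct
and function names are invented, and I'm assuming the rcu_head embedded in
mm_iommu_table_group_mem_t):

struct unpin_work {
	struct work_struct work;
	struct mm_iommu_table_group_mem_t *mem;
};

static void mm_iommu_unpin_workfn(struct work_struct *work)
{
	struct unpin_work *uw = container_of(work, struct unpin_work, work);

	mm_iommu_unpin(uw->mem);	/* process context, locking pages is fine */
	vfree(uw->mem->hpas);
	kfree(uw->mem);
	kfree(uw);
}

/* RCU callback: just hand the real work off to a workqueue */
static void mm_iommu_free(struct rcu_head *head)
{
	struct mm_iommu_table_group_mem_t *mem = container_of(head,
			struct mm_iommu_table_group_mem_t, rcu);
	struct unpin_work *uw = kmalloc(sizeof(*uw), GFP_ATOMIC);

	/* Error handling omitted in this sketch */
	uw->mem = mem;
	INIT_WORK(&uw->work, mm_iommu_unpin_workfn);
	schedule_work(&uw->work);
}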

Honza

> 
> As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
> dealing with a file backed page where we have reference on the inode it
> hangs off." [1]
> 
> [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
> 
> Cc: Jan Kara 
> Signed-off-by: John Hubbard 
> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..fc1670a6fc3c 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   for (entry = 0; entry < entries; entry += chunk) {
>   unsigned long n = min(entries - entry, chunk);
>  
> - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
> + ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
>   FOLL_WRITE | FOLL_LONGTERM,
>   mem->hpages + entry, NULL);
>   if (ret == n) {
> @@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   return 0;
>  
>  free_exit:
> - /* free the reference taken */
> - for (i = 0; i < pinned; i++)
> - put_page(mem->hpages[i]);
> + /* free the references taken */
> + put_user_pages(mem->hpages, pinned);
>  
>   vfree(mem->hpas);
>   kfree(mem);
> @@ -212,10 +211,9 @@ static void mm_iommu_unpin(struct 
> mm_iommu_table_group_mem_t *mem)
>   if (!page)
>   continue;
>  
> - if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
> - SetPageDirty(page);
> + put_user_pages_dirty_lock(&page, 1,
> + mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
>  
> - put_page(page);
>   mem->hpas[i] = 0;
>   }
>  }
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 17/19] powerpc: book3s64: convert to pin_user_pages() and put_user_page()

2019-11-25 Thread Jan Kara
On Sun 24-11-19 20:20:09, John Hubbard wrote:
> 1. Convert from get_user_pages() to pin_user_pages().
> 
> 2. As required by pin_user_pages(), release these pages via
> put_user_page(). In this case, do so via put_user_pages_dirty_lock().
> 
> That has the side effect of calling set_page_dirty_lock(), instead
> of set_page_dirty(). This is probably more accurate.
> 
> As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
> dealing with a file backed page where we have reference on the inode it
> hangs off." [1]
> 
> 3. Release each page in mem->hpages[] (instead of mem->hpas[]), because
> that is the array that pin_longterm_pages() filled in. This is more
> accurate and should be a little safer from a maintenance point of
> view.

Except that this breaks the code. hpages is unioned with hpas...

> [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
> 
> Signed-off-by: John Hubbard 
> ---
>  arch/powerpc/mm/book3s64/iommu_api.c | 12 +---
>  1 file changed, 5 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 56cc84520577..196383e8e5a9 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -103,7 +103,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   for (entry = 0; entry < entries; entry += chunk) {
>   unsigned long n = min(entries - entry, chunk);
>  
> - ret = get_user_pages(ua + (entry << PAGE_SHIFT), n,
> + ret = pin_user_pages(ua + (entry << PAGE_SHIFT), n,
>   FOLL_WRITE | FOLL_LONGTERM,
>   mem->hpages + entry, NULL);
>   if (ret == n) {
> @@ -167,9 +167,8 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>   return 0;
>  
>  free_exit:
> - /* free the reference taken */
> - for (i = 0; i < pinned; i++)
> - put_page(mem->hpages[i]);
> + /* free the references taken */
> + put_user_pages(mem->hpages, pinned);
>  
>   vfree(mem->hpas);
>   kfree(mem);
> @@ -212,10 +211,9 @@ static void mm_iommu_unpin(struct 
> mm_iommu_table_group_mem_t *mem)
>   if (!page)
>   continue;
>  
> - if (mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY)
> - SetPageDirty(page);
> + put_user_pages_dirty_lock(&mem->hpages[i], 1,
> +   MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);

And the dirtying condition is wrong here as well. Currently it is always
true.
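
I'd expect it to test the per-page dirty bit instead, i.e. (using the local
'page' pointer, since hpages[] is unioned with hpas[] as noted above):

	put_user_pages_dirty_lock(&page, 1,
			mem->hpas[i] & MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);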

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v7 02/24] mm/gup: factor out duplicate code from four routines

2019-11-22 Thread Jan Kara
On Thu 21-11-19 18:54:02, John Hubbard wrote:
> On 11/21/19 1:54 AM, Jan Kara wrote:
> > On Thu 21-11-19 00:29:59, John Hubbard wrote:
> > > > 
> > > > Otherwise this looks fine and might be a worthwhile cleanup to feed
> > > > Andrew for 5.5 independent of the gut of the changes.
> > > > 
> > > > Reviewed-by: Christoph Hellwig 
> > > > 
> > > 
> > > Thanks for the reviews! Say, it sounds like your view here is that this
> > > series should be targeted at 5.6 (not 5.5), is that what you have in mind?
> > > And get the preparatory patches (1-9, and maybe even 10-16) into 5.5?
> > 
> > One more note :) If you are going to push pin_user_pages() interfaces
> > (which I'm fine with), it would probably make sense to push also the
> > put_user_pages() -> unpin_user_pages() renaming so that that inconsistency
> > in naming does not exist in the released upstream kernel.
> > 
> > Honza
> 
> Yes, that's what this patch series does. But I'm not sure if "push" here
> means, "push out: defer to 5.6", "push (now) into 5.5", or "advocate for"?

I meant to include the patch in the "for 5.5" batch.

> I will note that it's not going to be easy to rename in one step, now
> that this is being split up. Because various put_user_pages()-based items
> are going into 5.5 via different maintainer trees now. Probably I'd need
> to introduce unpin_user_page() alongside put_user_page()...thoughts?

Yes, I understand that moving that patch from the end of the series would
cause a fair amount of conflicts. I was hoping that you could generate the
patch with sed/Coccinelle; rebasing what remains for 5.6 on top of that
patch should not be that painful, so overall it should not be that much
work. But I may be wrong, so if it proves too tedious, let's just postpone
the renaming to 5.6. I don't find having both unpin_user_page() and
put_user_page() a better alternative to the current state. Thanks!

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v7 02/24] mm/gup: factor out duplicate code from four routines

2019-11-21 Thread Jan Kara
On Thu 21-11-19 00:29:59, John Hubbard wrote:
> > 
> > Otherwise this looks fine and might be a worthwhile cleanup to feed
> > Andrew for 5.5 independent of the gut of the changes.
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> 
> Thanks for the reviews! Say, it sounds like your view here is that this
> series should be targeted at 5.6 (not 5.5), is that what you have in mind?
> And get the preparatory patches (1-9, and maybe even 10-16) into 5.5?

One more note :) If you are going to push pin_user_pages() interfaces
(which I'm fine with), it would probably make sense to push also the
put_user_pages() -> unpin_user_pages() renaming so that that inconsistency
in naming does not exist in the released upstream kernel.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v7 02/24] mm/gup: factor out duplicate code from four routines

2019-11-21 Thread Jan Kara
On Thu 21-11-19 00:29:59, John Hubbard wrote:
> On 11/21/19 12:03 AM, Christoph Hellwig wrote:
> > Otherwise this looks fine and might be a worthwhile cleanup to feed
> > Andrew for 5.5 independent of the gut of the changes.
> > 
> > Reviewed-by: Christoph Hellwig 
> > 
> 
> Thanks for the reviews! Say, it sounds like your view here is that this
> series should be targeted at 5.6 (not 5.5), is that what you have in mind?
> And get the preparatory patches (1-9, and maybe even 10-16) into 5.5?

Yeah, actually I feel the same. The merge window is going to open on Sunday
and the series isn't still fully baked and happily sitting in linux-next
(and larger changes should really sit in linux-next for at least a week,
preferably two, before the merge window opens to get some reasonable test
coverage).  So I'd take out the independent easy patches that are already
reviewed, get them merged into Andrew's (or whatever other appropriate
tree) now so that they get at least a week of testing in linux-next before
going upstream.  And the more involved bits will have to wait for 5.6 -
which means let's just continue working on them as we do now because
ideally in 4 weeks we should have them ready with all the reviews so that
they can be picked up and integrated into linux-next.

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v7 17/24] mm/gup: track FOLL_PIN pages

2019-11-21 Thread Jan Kara
On Wed 20-11-19 23:13:47, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1], [2], and [3].
> 
> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> [1] Some slow progress on get_user_pages() (Apr 2, 2019):
> https://lwn.net/Articles/784574/
> [2] DMA and get_user_pages() (LPC: Dec 12, 2018):
> https://lwn.net/Articles/774411/
> [3] The trouble with get_user_pages() (Apr 30, 2018):
> https://lwn.net/Articles/753027/
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: John Hubbard 

Thanks for the patch! We are mostly getting there. Some smaller comments
below.

> +/**
> + * try_pin_compound_head() - mark a compound page as being used by
> + * pin_user_pages*().
> + *
> + * This is the FOLL_PIN counterpart to try_get_compound_head().
> + *
> + * @page:pointer to page to be marked
> + * @Return:  true for success, false for failure
> + */
> +__must_check bool try_pin_compound_head(struct page *page, int refs)
> +{
> + page = try_get_compound_head(page, GUP_PIN_COUNTING_BIAS * refs);
> + if (!page)
> + return false;
> +
> + __update_proc_vmstat(page, NR_FOLL_PIN_REQUESTED, refs);
> + return true;
> +}
> +
> +#ifdef CONFIG_DEV_PAGEMAP_OPS
> +static bool __put_devmap_managed_user_page(struct page *page)

Probably call this __unpin_devmap_managed_user_page()? To match the later
conversion of put_user_page() to unpin_user_page()?

> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS);
> +
> + __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +#else
> +static bool __put_devmap_managed_user_page(struct page *page)
> +{
> + return false;
> +}
> +#endif /* CONFIG_DEV_PAGEMAP_OPS */
> +
> +/**
> + * put_user_page() - release a dma-pinned page
> + * @page:pointer to page to be released
> + *
> + * Pages that were pinned via pin_user_pages*() must be released via either
> + * put_user_page(), or one of the put_user_pages*() routines. This is so that
> + * such pages can be separately tracked and uniquely handled. In particular,
> + * interactions with RDMA and filesystems need special handling.
> + */
> +void put_user_page(struct page *page)
> +{
> + page = compound_head(page);
> +
> + /*
> +  * For devmap managed pages we need to catch refcount transition from
> +  * GUP_PIN_COUNTING_BIAS to 1, when refcount reach one it means the
> +  * page is free and we need to inform the device driver through
> +  * callback. See include/linux/memremap.h and HMM for details.
> +  */
> + if (__put_devmap_managed_user_page(page))
> + return;
> +
> + if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS))
> + __put_page(page);
> +
> + __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, 1);
> +}
> +EXPORT_SYMBOL(put_user_page);
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
> pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -237,10 +327,11 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   }
>  
>   page = vm_normal_page(vma, address, pte);
> - if (!page && pte_devmap(pte) && (flags & FOLL_GET)) {
> + if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
>   /*
> -  * Only return device mapping pages in the FOLL_GET case since
> -  * they are only valid while holding t

Re: [PATCH v6 24/24] mm, tree-wide: rename put_user_page*() to unpin_user_page*()

2019-11-19 Thread Jan Kara
On Tue 19-11-19 00:16:43, John Hubbard wrote:
> In order to provide a clearer, more symmetric API for pinning
> and unpinning DMA pages. This way, pin_user_pages*() calls
> match up with unpin_user_pages*() calls, and the API is a lot
> closer to being self-explanatory.
> 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  Documentation/core-api/pin_user_pages.rst   |  2 +-
>  arch/powerpc/mm/book3s64/iommu_api.c|  6 +--
>  drivers/gpu/drm/via/via_dmablit.c   |  4 +-
>  drivers/infiniband/core/umem.c  |  2 +-
>  drivers/infiniband/hw/hfi1/user_pages.c |  2 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c |  6 +--
>  drivers/infiniband/hw/qib/qib_user_pages.c  |  2 +-
>  drivers/infiniband/hw/qib/qib_user_sdma.c   |  6 +--
>  drivers/infiniband/hw/usnic/usnic_uiom.c|  2 +-
>  drivers/infiniband/sw/siw/siw_mem.c |  2 +-
>  drivers/media/v4l2-core/videobuf-dma-sg.c   |  4 +-
>  drivers/platform/goldfish/goldfish_pipe.c   |  4 +-
>  drivers/vfio/vfio_iommu_type1.c |  2 +-
>  fs/io_uring.c   |  4 +-
>  include/linux/mm.h  | 30 +++---
>  include/linux/mmzone.h  |  2 +-
>  mm/gup.c| 46 ++---
>  mm/gup_benchmark.c  |  2 +-
>  mm/process_vm_access.c  |  4 +-
>  net/xdp/xdp_umem.c  |  2 +-
>  20 files changed, 67 insertions(+), 67 deletions(-)
> 
> diff --git a/Documentation/core-api/pin_user_pages.rst 
> b/Documentation/core-api/pin_user_pages.rst
> index baa288a44a77..6d93ef203561 100644
> --- a/Documentation/core-api/pin_user_pages.rst
> +++ b/Documentation/core-api/pin_user_pages.rst
> @@ -220,7 +220,7 @@ since the system was booted, via two new /proc/vmstat 
> entries: ::
>  /proc/vmstat/nr_foll_pin_requested
>  
>  Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is
> -because there is a noticeable performance drop in put_user_page(), when they
> +because there is a noticeable performance drop in unpin_user_page(), when 
> they
>  are activated.
>  
>  References
> diff --git a/arch/powerpc/mm/book3s64/iommu_api.c 
> b/arch/powerpc/mm/book3s64/iommu_api.c
> index 196383e8e5a9..dd7aa5a4f33c 100644
> --- a/arch/powerpc/mm/book3s64/iommu_api.c
> +++ b/arch/powerpc/mm/book3s64/iommu_api.c
> @@ -168,7 +168,7 @@ static long mm_iommu_do_alloc(struct mm_struct *mm, 
> unsigned long ua,
>  
>  free_exit:
>   /* free the references taken */
> - put_user_pages(mem->hpages, pinned);
> + unpin_user_pages(mem->hpages, pinned);
>  
>   vfree(mem->hpas);
>   kfree(mem);
> @@ -211,8 +211,8 @@ static void mm_iommu_unpin(struct 
> mm_iommu_table_group_mem_t *mem)
>   if (!page)
>   continue;
>  
> - put_user_pages_dirty_lock(&mem->hpages[i], 1,
> -   MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
> + unpin_user_pages_dirty_lock(&mem->hpages[i], 1,
> + MM_IOMMU_TABLE_GROUP_PAGE_DIRTY);
>  
>   mem->hpas[i] = 0;
>   }
> diff --git a/drivers/gpu/drm/via/via_dmablit.c 
> b/drivers/gpu/drm/via/via_dmablit.c
> index 37c5e572993a..719d036c9384 100644
> --- a/drivers/gpu/drm/via/via_dmablit.c
> +++ b/drivers/gpu/drm/via/via_dmablit.c
> @@ -188,8 +188,8 @@ via_free_sg_info(struct pci_dev *pdev, drm_via_sg_info_t 
> *vsg)
>   kfree(vsg->desc_pages);
>   /* fall through */
>   case dr_via_pages_locked:
> - put_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
> -   (vsg->direction == DMA_FROM_DEVICE));
> + unpin_user_pages_dirty_lock(vsg->pages, vsg->num_pages,
> +(vsg->direction == DMA_FROM_DEVICE));
>   /* fall through */
>   case dr_via_pages_alloc:
>   vfree(vsg->pages);
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 2c287ced3439..119a147da904 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -54,7 +54,7 @@ static void __ib_umem_release(struct ib_device *dev, struct 
> ib_umem *umem, int d
>  
>   for_each_sg_page(umem->sg_head.sgl, &sg_iter, umem->sg_nents, 0) {
>   page = sg_page_iter_page(&sg_iter);
> - put_user_pages_dirty_lock(&page, 1, umem->writable && dirty);
> + unpin_user_pages_dirty_lock(&page, 1, umem->writable && dirty);

Re: [PATCH v6 17/24] mm/gup: track FOLL_PIN pages

2019-11-19 Thread Jan Kara
On Tue 19-11-19 00:16:36, John Hubbard wrote:
> @@ -2025,6 +2149,20 @@ static int __record_subpages(struct page *page, 
> unsigned long addr,
>   return nr;
>  }
>  
> +static bool __pin_compound_head(struct page *head, int refs, unsigned int 
> flags)
> +{

I don't quite like the proliferation of names starting with __. I don't
think there's a good reason for that, particularly in this case. Also 'pin'
here is somewhat misleading, as we already use the term "pin" for the
particular way of pinning the page. We could have grab_compound_head() or
maybe nail_compound_head() :), but you're a native speaker so you may come
up with a better word.
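
Something like (a sketch of just the rename, body unchanged):

        static __must_check bool grab_compound_head(struct page *head, int refs,
                                                    unsigned int flags)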

> + if (flags & FOLL_PIN) {
> + if (unlikely(!try_pin_compound_head(head, refs)))
> + return false;
> + } else {
> + head = try_get_compound_head(head, refs);
> + if (!head)
> + return false;
> + }
> +
> + return true;
> +}
> +
>  static void put_compound_head(struct page *page, int refs)
>  {
>   /* Do a get_page() first, in case refs == page->_refcount */

put_compound_head() needs similar treatment to undo_dev_pagemap(), doesn't
it?
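
Something like this, perhaps (an untested sketch, reusing the names this
series already introduces):

        static void put_compound_head(struct page *page, int refs,
                                      unsigned int flags)
        {
                if (flags & FOLL_PIN) {
                        __update_proc_vmstat(page, NR_FOLL_PIN_RETURNED, refs);
                        refs *= GUP_PIN_COUNTING_BIAS;
                }
                /* Do a get_page() first, in case refs == page->_refcount */
                get_page(page);
                page_ref_sub(page, refs);
                put_page(page);
        }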

> @@ -968,7 +973,18 @@ struct page *follow_devmap_pmd(struct vm_area_struct 
> *vma, unsigned long addr,
>   if (!*pgmap)
>   return ERR_PTR(-EFAULT);
>   page = pfn_to_page(pfn);
> - get_page(page);
> +
> + if (flags & FOLL_GET)
> + get_page(page);
> + else if (flags & FOLL_PIN) {
> + /*
> +  * try_pin_page() is not actually expected to fail here because
> +  * we hold the pmd lock so no one can unmap the pmd and free the
> +  * page that it points to.
> +  */
> + if (unlikely(!try_pin_page(page)))
> + page = ERR_PTR(-EFAULT);
> + }

This pattern is rather common. So maybe I'd add a helper grab_page(page,
flags) doing

        if (flags & FOLL_GET)
                get_page(page);
        else if (flags & FOLL_PIN)
                return try_pin_page(page);
        return true;
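
In full, that would be something like (untested sketch):

        static __must_check bool grab_page(struct page *page, unsigned int flags)
        {
                if (flags & FOLL_GET)
                        get_page(page);
                else if (flags & FOLL_PIN)
                        return try_pin_page(page);
                return true;
        }

and the hunk above would then reduce to:

        if (unlikely(!grab_page(page, flags)))
                page = ERR_PTR(-EFAULT);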

Otherwise the patch looks good to me now.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v6 02/24] mm/gup: factor out duplicate code from four routines

2019-11-19 Thread Jan Kara
On Tue 19-11-19 00:16:21, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly), by waiting until the successful
> end of each routine, to increment *nr.
> 
> Reviewed-by: Jérôme Glisse 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 

Looks good to me now! You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 91 ++--
>  1 file changed, 36 insertions(+), 55 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..f3c7d6625817 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,25 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages)
> +{
> + int nr;
> +
> + for (nr = 0; addr != end; addr += PAGE_SIZE)
> + pages[nr++] = page++;
> +
> + return nr;
> +}
> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}
> +
>  #ifdef CONFIG_ARCH_HAS_HUGEPD
>  static unsigned long hugepte_addr_end(unsigned long addr, unsigned long end,
> unsigned long sz)
> @@ -1998,32 +2017,20 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>   /* hugepages are never "special" */
>   VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>  
> - refs = 0;
>   head = pte_page(pte);
> -
>   page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
> - do {
> - VM_BUG_ON(compound_head(page) != head);
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages + *nr);
>  
>   head = try_get_compound_head(head, refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> - /* Could be optimized better */
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> + *nr += refs;
>   SetPageReferenced(head);
>   return 1;
>  }
> @@ -2071,28 +2078,19 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> - refs++;
> - } while (addr += PAGE_SIZE, addr != end);
> + refs = __record_subpages(page, addr, end, pages + *nr);
>  
>   head = try_get_compound_head(pmd_page(orig), refs);
> - if (!head) {
> - *nr -= refs;
> + if (!head)
>   return 0;
> - }
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> - *nr -= refs;
> - while (refs--)
> - put_page(head);
> + put_compound_head(head, refs);
>   return 0;
>   }
>  
> + *nr += refs;
>   SetPageReferenced(head);
>   return 1;
>  }
> @@ -2114,28 +2112,19 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>pages, nr);
>   }
>  
> - refs = 0;
>   page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - do {
> - pages[*nr] = page;
> - (*nr)++;
> - page++;
> 

Re: [PATCH v5 11/24] goldish_pipe: convert to pin_user_pages() and put_user_page()

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:27, John Hubbard wrote:
> 1. Call the new global pin_user_pages_fast(), from pin_goldfish_pages().
> 
> 2. As required by pin_user_pages(), release these pages via
> put_user_page(). In this case, do so via put_user_pages_dirty_lock().
> 
> That has the side effect of calling set_page_dirty_lock(), instead
> of set_page_dirty(). This is probably more accurate.
> 
> As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
> dealing with a file backed page where we have reference on the inode it
> hangs off." [1]
> 
> Another side effect is that the release code is simplified because
> the page[] loop is now in gup.c instead of here, so just delete the
> local release_user_pages() entirely, and call
> put_user_pages_dirty_lock() directly, instead.
> 
> [1] https://lore.kernel.org/r/20190723153640.gb...@lst.de
> 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/platform/goldfish/goldfish_pipe.c | 17 +++--
>  1 file changed, 3 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
> b/drivers/platform/goldfish/goldfish_pipe.c
> index 7ed2a21a0bac..635a8bc1b480 100644
> --- a/drivers/platform/goldfish/goldfish_pipe.c
> +++ b/drivers/platform/goldfish/goldfish_pipe.c
> @@ -274,7 +274,7 @@ static int pin_goldfish_pages(unsigned long first_page,
>   *iter_last_page_size = last_page_size;
>   }
>  
> - ret = get_user_pages_fast(first_page, requested_pages,
> + ret = pin_user_pages_fast(first_page, requested_pages,
> !is_write ? FOLL_WRITE : 0,
> pages);
>   if (ret <= 0)
> @@ -285,18 +285,6 @@ static int pin_goldfish_pages(unsigned long first_page,
>   return ret;
>  }
>  
> -static void release_user_pages(struct page **pages, int pages_count,
> -int is_write, s32 consumed_size)
> -{
> - int i;
> -
> - for (i = 0; i < pages_count; i++) {
> - if (!is_write && consumed_size > 0)
> - set_page_dirty(pages[i]);
> - put_page(pages[i]);
> - }
> -}
> -
>  /* Populate the call parameters, merging adjacent pages together */
>  static void populate_rw_params(struct page **pages,
>  int pages_count,
> @@ -372,7 +360,8 @@ static int transfer_max_buffers(struct goldfish_pipe 
> *pipe,
>  
>   *consumed_size = pipe->command_buffer->rw_params.consumed_size;
>  
> - release_user_pages(pipe->pages, pages_count, is_write, *consumed_size);
> + put_user_pages_dirty_lock(pipe->pages, pages_count,
> +   !is_write && *consumed_size > 0);
>  
>   mutex_unlock(&pipe->lock);
>   return 0;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 15/24] fs/io_uring: set FOLL_PIN via pin_user_pages()

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:31, John Hubbard wrote:
> Convert fs/io_uring to use the new pin_user_pages() call, which sets
> FOLL_PIN. Setting FOLL_PIN is now required for code that requires
> tracking of pinned pages, and therefore for any code that calls
> put_user_page().
> 
> In partial anticipation of this work, the io_uring code was already
> calling put_user_page() instead of put_page(). Therefore, in order to
> convert from the get_user_pages()/put_page() model, to the
> pin_user_pages()/put_user_page() model, the only change required
> here is to change get_user_pages() to pin_user_pages().
> 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  fs/io_uring.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index f9a38998f2fc..cff64bd00db9 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -3433,7 +3433,7 @@ static int io_sqe_buffer_register(struct io_ring_ctx 
> *ctx, void __user *arg,
>  
>   ret = 0;
>   down_read(&current->mm->mmap_sem);
> - pret = get_user_pages(ubuf, nr_pages,
> + pret = pin_user_pages(ubuf, nr_pages,
> FOLL_WRITE | FOLL_LONGTERM,
>     pages, vmas);
>   if (pret == nr_pages) {
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 06/24] goldish_pipe: rename local pin_user_pages() routine

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:22, John Hubbard wrote:
> 1. Avoid naming conflicts: rename local static function from
> "pin_user_pages()" to "pin_goldfish_pages()".
> 
> An upcoming patch will introduce a global pin_user_pages()
> function.
> 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  drivers/platform/goldfish/goldfish_pipe.c | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/platform/goldfish/goldfish_pipe.c 
> b/drivers/platform/goldfish/goldfish_pipe.c
> index cef0133aa47a..7ed2a21a0bac 100644
> --- a/drivers/platform/goldfish/goldfish_pipe.c
> +++ b/drivers/platform/goldfish/goldfish_pipe.c
> @@ -257,12 +257,12 @@ static int goldfish_pipe_error_convert(int status)
>   }
>  }
>  
> -static int pin_user_pages(unsigned long first_page,
> -   unsigned long last_page,
> -   unsigned int last_page_size,
> -   int is_write,
> -   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> -   unsigned int *iter_last_page_size)
> +static int pin_goldfish_pages(unsigned long first_page,
> +   unsigned long last_page,
> +   unsigned int last_page_size,
> +   int is_write,
> +   struct page *pages[MAX_BUFFERS_PER_COMMAND],
> +   unsigned int *iter_last_page_size)
>  {
>   int ret;
>   int requested_pages = ((last_page - first_page) >> PAGE_SHIFT) + 1;
> @@ -354,9 +354,9 @@ static int transfer_max_buffers(struct goldfish_pipe 
> *pipe,
>   if (mutex_lock_interruptible(&pipe->lock))
>   return -ERESTARTSYS;
>  
> - pages_count = pin_user_pages(first_page, last_page,
> -  last_page_size, is_write,
> -  pipe->pages, &iter_last_page_size);
> + pages_count = pin_goldfish_pages(first_page, last_page,
> +  last_page_size, is_write,
> +          pipe->pages, &iter_last_page_size);
>   if (pages_count < 0) {
>   mutex_unlock(>lock);
>   return pages_count;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 13/24] mm/process_vm_access: set FOLL_PIN via pin_user_pages_remote()

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:29, John Hubbard wrote:
> Convert process_vm_access to use the new pin_user_pages_remote()
> call, which sets FOLL_PIN. Setting FOLL_PIN is now required for
> code that requires tracking of pinned pages.
> 
> Also, release the pages via put_user_page*().
> 
> Also, rename "pages" to "pinned_pages", as this makes for
> easier reading of process_vm_rw_single_vec().
> 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 
> ---
>  mm/process_vm_access.c | 28 +++-
>  1 file changed, 15 insertions(+), 13 deletions(-)

The patch looks good to me. You can add:

Reviewed-by: Jan Kara 

        Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 10/24] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:26, John Hubbard wrote:
>  /*
> - * NOTE on FOLL_LONGTERM:
> + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> + * other. Here is what they mean, and how to use them:
>   *
>   * FOLL_LONGTERM indicates that the page will be held for an indefinite time
> - * period _often_ under userspace control.  This is contrasted with
> - * iov_iter_get_pages() where usages which are transient.
> + * period _often_ under userspace control.  This is in contrast to
> + * iov_iter_get_pages(), where usages which are transient.
  ^^^ when you touch this, please also fix the
second sentence. It doesn't quite make sense to me... I'd probably write
there "whose usages are transient" but maybe you can come up with something
even better.
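
I.e. the sentence could read (wording suggestion only):

 * period _often_ under userspace control.  This is in contrast to
 * iov_iter_get_pages(), whose usages are transient.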

Otherwise the patch looks good to me so feel free to add:

Reviewed-by: Jan Kara 

        Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 07/24] IB/umem: use get_user_pages_fast() to pin DMA pages

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:23, John Hubbard wrote:
> And get rid of the mmap_sem calls, as part of that. Note
> that get_user_pages_fast() will, if necessary, fall back to
> __gup_longterm_unlocked(), which takes the mmap_sem as needed.
> 
> Reviewed-by: Jason Gunthorpe 
> Reviewed-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza


> ---
>  drivers/infiniband/core/umem.c | 17 ++---
>  1 file changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> index 24244a2f68cc..3d664a2539eb 100644
> --- a/drivers/infiniband/core/umem.c
> +++ b/drivers/infiniband/core/umem.c
> @@ -271,16 +271,13 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
> unsigned long addr,
>   sg = umem->sg_head.sgl;
>  
>   while (npages) {
> - down_read(&mm->mmap_sem);
> - ret = get_user_pages(cur_base,
> -  min_t(unsigned long, npages,
> -PAGE_SIZE / sizeof (struct page *)),
> -  gup_flags | FOLL_LONGTERM,
> -  page_list, NULL);
> - if (ret < 0) {
> - up_read(&mm->mmap_sem);
> + ret = get_user_pages_fast(cur_base,
> +   min_t(unsigned long, npages,
> + PAGE_SIZE /
> + sizeof(struct page *)),
> +   gup_flags | FOLL_LONGTERM, page_list);
> + if (ret < 0)
>   goto umem_release;
> - }
>  
>   cur_base += ret * PAGE_SIZE;
>   npages   -= ret;
> @@ -288,8 +285,6 @@ struct ib_umem *ib_umem_get(struct ib_udata *udata, 
> unsigned long addr,
>   sg = ib_umem_add_sg_table(sg, page_list, ret,
>   dma_get_max_seg_size(context->device->dma_device),
>   &umem->sg_nents);
> -
> - up_read(&mm->mmap_sem);
>   }
>  
>   sg_mark_end(sg);
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 02/24] mm/gup: factor out duplicate code from four routines

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:18, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly), by waiting until the successful
> end of each routine, to increment *nr.
> 
> Reviewed-by: Jérôme Glisse 
> Cc: Jan Kara 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 
> ---
>  mm/gup.c | 95 
>  1 file changed, 40 insertions(+), 55 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..858541ea30ce 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,29 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages)
> +{
> + int nr = 0;
> + int nr_recorded_pages = 0;
> +
> + do {
> + pages[nr] = page;
> + nr++;
> + page++;
> + nr_recorded_pages++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return nr_recorded_pages;

nr == nr_recorded_pages so no need for both... BTW, structuring this as a
for loop would probably be more logical and shorter now:

        for (nr = 0; addr != end; addr += PAGE_SIZE)
                pages[nr++] = page++;
        return nr;

The rest of the patch looks good to me.

Honza

-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v5 17/24] mm/gup: track FOLL_PIN pages

2019-11-18 Thread Jan Kara
On Thu 14-11-19 21:53:33, John Hubbard wrote:
> Add tracking of pages that were pinned via FOLL_PIN.
> 
> As mentioned in the FOLL_PIN documentation, callers who effectively set
> FOLL_PIN are required to ultimately free such pages via put_user_page().
> The effect is similar to FOLL_GET, and may be thought of as "FOLL_GET
> for DIO and/or RDMA use".
> 
> Pages that have been pinned via FOLL_PIN are identifiable via a
> new function call:
> 
>bool page_dma_pinned(struct page *page);
> 
> What to do in response to encountering such a page, is left to later
> patchsets. There is discussion about this in [1].
^^ missing this reference
in the changelog...

> This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().
> 
> Suggested-by: Jan Kara 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: John Hubbard 
> ---
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6588d2e02628..db872766480f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1054,6 +1054,8 @@ static inline __must_check bool try_get_page(struct 
> page *page)
>   return true;
>  }
>  
> +__must_check bool user_page_ref_inc(struct page *page);
> +
>  static inline void put_page(struct page *page)
>  {
>   page = compound_head(page);
> @@ -1071,29 +1073,70 @@ static inline void put_page(struct page *page)
>   __put_page(page);
>  }
>  
> -/**
> - * put_user_page() - release a gup-pinned page
> - * @page:pointer to page to be released
> +/*
> + * GUP_PIN_COUNTING_BIAS, and the associated functions that use it, overload
> + * the page's refcount so that two separate items are tracked: the original 
> page
> + * reference count, and also a new count of how many get_user_pages() calls 
> were
^^ pin_user_pages()

> + * made against the page. ("gup-pinned" is another term for the latter).
> + *
> + * With this scheme, get_user_pages() becomes special: such pages are marked
^^^ pin_user_pages()

> + * as distinct from normal pages. As such, the put_user_page() call (and its
> + * variants) must be used in order to release gup-pinned pages.
> + *
> + * Choice of value:
>   *
> - * Pages that were pinned via pin_user_pages*() must be released via either
> - * put_user_page(), or one of the put_user_pages*() routines. This is so that
> - * eventually such pages can be separately tracked and uniquely handled. In
> - * particular, interactions with RDMA and filesystems need special handling.
> + * By making GUP_PIN_COUNTING_BIAS a power of two, debugging of page 
> reference
> + * counts with respect to get_user_pages() and put_user_page() becomes 
> simpler,
^^^ pin_user_pages()

> + * due to the fact that adding an even power of two to the page refcount has
> + * the effect of using only the upper N bits, for the code that counts up 
> using
> + * the bias value. This means that the lower bits are left for the exclusive
> + * use of the original code that increments and decrements by one (or at 
> least,
> + * by much smaller values than the bias value).
>   *
> - * put_user_page() and put_page() are not interchangeable, despite this early
> - * implementation that makes them look the same. put_user_page() calls must
> - * be perfectly matched up with pin*() calls.
> + * Of course, once the lower bits overflow into the upper bits (and this is
> + * OK, because subtraction recovers the original values), then visual 
> inspection
> + * no longer suffices to directly view the separate counts. However, for 
> normal
> + * applications that don't have huge page reference counts, this won't be an
> + * issue.
> + *
> + * Locking: the lockless algorithm described in page_cache_get_speculative()
> + * and page_cache_gup_pin_speculative() provides safe operation for
> + * get_user_pages and page_mkclean and other calls that race to set up page
> + * table entries.
>   */
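
(For concreteness, with an assumed bias of 1024 purely for illustration: a
freshly allocated page that then gets two get_page() references and three
pins ends up with

        refcount = 1 + 2 + 3 * 1024 = 3075
        pins     = 3075 / 1024      = 3
        ordinary = 3075 % 1024      = 3

so both counts stay separately readable until the low bits overflow into
the bias, as the text above notes.)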
...
> @@ -2070,9 +2191,16 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, 
> unsigned long addr,
>   page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
>   refs = __record_subpages(page, addr, end, pages + *nr);
>  
> - head = try_get_compound_head(head, refs);
> - if (!head)
> - return 0;
> + if (flags & FOLL_PIN) {
> + head = page;
> + if (unlikely(!user_page_ref_inc(head)))
> + return 0;
> + head = page;

Why do you assign 'head' twice? Also the refcounting logic is repeated
several times, so perhaps you can factor it out into a helper function?
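
Something like (an untested sketch, using the names from this patch):

        static struct page *grab_compound_head(struct page *page, int refs,
                                               unsigned int flags)
        {
                if (flags & FOLL_PIN) {
                        if (unlikely(!user_page_ref_inc(page)))
                                return NULL;
                        return page;
                }
                return try_get_compound_head(page, refs);
        }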

Re: [PATCH v4 02/23] mm/gup: factor out duplicate code from four routines

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:49, John Hubbard wrote:
> There are four locations in gup.c that have a fair amount of code
> duplication. This means that changing one requires making the same
> changes in four places, not to mention reading the same code four
> times, and wondering if there are subtle differences.
> 
> Factor out the common code into static functions, thus reducing the
> overall line count and the code's complexity.
> 
> Also, take the opportunity to slightly improve the efficiency of the
> error cases, by doing a mass subtraction of the refcount, surrounded
> by get_page()/put_page().
> 
> Also, further simplify (slightly), by waiting until the successful
> end of each routine, to increment *nr.
> 
> Reviewed-by: Jérôme Glisse 
> Cc: Ira Weiny 
> Cc: Christoph Hellwig 
> Cc: Aneesh Kumar K.V 
> Signed-off-by: John Hubbard 

> diff --git a/mm/gup.c b/mm/gup.c
> index 85caf76b3012..199da99e8ffc 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1969,6 +1969,34 @@ static int __gup_device_huge_pud(pud_t pud, pud_t 
> *pudp, unsigned long addr,
>  }
>  #endif
>  
> +static int __record_subpages(struct page *page, unsigned long addr,
> +  unsigned long end, struct page **pages, int nr)
> +{
> + int nr_recorded_pages = 0;
> +
> + do {
> + pages[nr] = page;
> + nr++;
> + page++;
> + nr_recorded_pages++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return nr_recorded_pages;
> +}

Why don't you pass in pages + nr directly?
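
I.e. (sketch):

        refs = __record_subpages(page, addr, end, pages + *nr);

and then the helper doesn't need an nr argument at all.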

> +
> +static void put_compound_head(struct page *page, int refs)
> +{
> + /* Do a get_page() first, in case refs == page->_refcount */
> + get_page(page);
> + page_ref_sub(page, refs);
> + put_page(page);
> +}
> +
> +static void __huge_pt_done(struct page *head, int nr_recorded_pages, int *nr)
> +{
> + *nr += nr_recorded_pages;
> + SetPageReferenced(head);
> +}

I don't find this last helper very useful. It seems to muddy the water more
than necessary...
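
I.e. I'd just open-code the two lines at each call site (sketch):

        *nr += refs;
        SetPageReferenced(head);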

Other than that the cleanup looks nice to me.

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 01/23] mm/gup: pass flags arg to __gup_device_* functions

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:48, John Hubbard wrote:
> A subsequent patch requires access to gup flags, so
> pass the flags argument through to the __gup_device_*
> functions.
> 
> Also placate checkpatch.pl by shortening a nearby line.
> 
> Reviewed-by: Jérôme Glisse 
> Reviewed-by: Ira Weiny 
> Cc: Kirill A. Shutemov 
> Signed-off-by: John Hubbard 

Looks good! You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 28 ++--
>  1 file changed, 18 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 8f236a335ae9..85caf76b3012 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1890,7 +1890,8 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, 
> unsigned long end,
>  
>  #if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && 
> defined(CONFIG_TRANSPARENT_HUGEPAGE)
>  static int __gup_device_huge(unsigned long pfn, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   int nr_start = *nr;
>   struct dev_pagemap *pgmap = NULL;
> @@ -1916,13 +1917,14 @@ static int __gup_device_huge(unsigned long pfn, 
> unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> @@ -1933,13 +1935,14 @@ static int __gup_device_huge_pmd(pmd_t orig, pmd_t 
> *pmdp, unsigned long addr,
>  }
>  
>  static int __gup_device_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   unsigned long fault_pfn;
>   int nr_start = *nr;
>  
>   fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
> - if (!__gup_device_huge(fault_pfn, addr, end, pages, nr))
> + if (!__gup_device_huge(fault_pfn, addr, end, flags, pages, nr))
>   return 0;
>  
>   if (unlikely(pud_val(orig) != pud_val(*pudp))) {
> @@ -1950,14 +1953,16 @@ static int __gup_device_huge_pud(pud_t orig, pud_t 
> *pudp, unsigned long addr,
>  }
>  #else
>  static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
>  }
>  
>  static int __gup_device_huge_pud(pud_t pud, pud_t *pudp, unsigned long addr,
> - unsigned long end, struct page **pages, int *nr)
> +  unsigned long end, unsigned int flags,
> +  struct page **pages, int *nr)
>  {
>   BUILD_BUG();
>   return 0;
> @@ -2062,7 +2067,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>   if (pmd_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pmd(orig, pmdp, addr, end, pages, nr);
> + return __gup_device_huge_pmd(orig, pmdp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> @@ -2092,7 +2098,8 @@ static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, 
> unsigned long addr,
>  }
>  
>  static int gup_huge_pud(pud_t orig, pud_t *pudp, unsigned long addr,
> - unsigned long end, unsigned int flags, struct page **pages, int 
> *nr)
> + unsigned long end, unsigned int flags,
> + struct page **pages, int *nr)
>  {
>   struct page *head, *page;
>   int refs;
> @@ -2103,7 +2110,8 @@ static int gup_huge_pud(pud_t orig, pud_t *pudp, 
> unsigned long addr,
>   if (pud_devmap(orig)) {
>   if (unlikely(flags & FOLL_LONGTERM))
>   return 0;
> - return __gup_device_huge_pud(orig, pudp, addr, end, pages, nr);
> + return __gup_device_huge_pud(orig, pudp, addr, end, flags,
> +  pages, nr);
>   }
>  
>   refs = 0;
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 03/23] mm/gup: move try_get_compound_head() to top, fix minor issues

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:50, John Hubbard wrote:
> An upcoming patch uses try_get_compound_head() more widely,
> so move it to the top of gup.c.
> 
> Also fix a tiny spelling error and a checkpatch.pl warning.
> 
> Signed-off-by: John Hubbard 

Looks good. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/gup.c | 29 +++--
>  1 file changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 199da99e8ffc..933524de6249 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -29,6 +29,21 @@ struct follow_page_context {
>   unsigned int page_mask;
>  };
>  
> +/*
> + * Return the compound head page with ref appropriately incremented,
> + * or NULL if that failed.
> + */
> +static inline struct page *try_get_compound_head(struct page *page, int refs)
> +{
> + struct page *head = compound_head(page);
> +
> + if (WARN_ON_ONCE(page_ref_count(head) < 0))
> + return NULL;
> + if (unlikely(!page_cache_add_speculative(head, refs)))
> + return NULL;
> + return head;
> +}
> +
>  /**
>   * put_user_pages_dirty_lock() - release and optionally dirty gup-pinned 
> pages
>   * @pages:  array of pages to be maybe marked dirty, and definitely released.
> @@ -1793,20 +1808,6 @@ static void __maybe_unused undo_dev_pagemap(int *nr, 
> int nr_start,
>   }
>  }
>  
> -/*
> - * Return the compund head page with ref appropriately incremented,
> - * or NULL if that failed.
> - */
> -static inline struct page *try_get_compound_head(struct page *page, int refs)
> -{
> - struct page *head = compound_head(page);
> - if (WARN_ON_ONCE(page_ref_count(head) < 0))
> - return NULL;
> - if (unlikely(!page_cache_add_speculative(head, refs)))
> - return NULL;
> - return head;
> -}
> -
>  #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
>  static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>unsigned int flags, struct page **pages, int *nr)
> -- 
> 2.24.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v4 04/23] mm: devmap: refactor 1-based refcounting for ZONE_DEVICE pages

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:51, John Hubbard wrote:
> An upcoming patch changes and complicates the refcounting and
> especially the "put page" aspects of it. In order to keep
> everything clean, refactor the devmap page release routines:
> 
> * Rename put_devmap_managed_page() to page_is_devmap_managed(),
>   and limit the functionality to "read only": return a bool,
>   with no side effects.
> 
> * Add a new routine, put_devmap_managed_page(), to handle checking
>   what kind of page it is, and what kind of refcount handling it
>   requires.
> 
> * Rename __put_devmap_managed_page() to free_devmap_managed_page(),
>   and limit the functionality to unconditionally freeing a devmap
>   page.
> 
> This is originally based on a separate patch by Ira Weiny, which
> applied to an early version of the put_user_page() experiments.
> Since then, Jérôme Glisse suggested the refactoring described above.
> 
> Suggested-by: Jérôme Glisse 
> Signed-off-by: Ira Weiny 
> Signed-off-by: John Hubbard 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
>  include/linux/mm.h | 27 ---
>  mm/memremap.c  | 67 --
>  2 files changed, 53 insertions(+), 41 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a2adf95b3f9c..96228376139c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -967,9 +967,10 @@ static inline bool is_zone_device_page(const struct page 
> *page)
>  #endif
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page);
> +void free_devmap_managed_page(struct page *page);
>  DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -static inline bool put_devmap_managed_page(struct page *page)
> +
> +static inline bool page_is_devmap_managed(struct page *page)
>  {
>   if (!static_branch_unlikely(&devmap_managed_key))
>   return false;
> @@ -978,7 +979,6 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   switch (page->pgmap->type) {
>   case MEMORY_DEVICE_PRIVATE:
>   case MEMORY_DEVICE_FS_DAX:
> - __put_devmap_managed_page(page);
>   return true;
>   default:
>   break;
> @@ -986,6 +986,27 @@ static inline bool put_devmap_managed_page(struct page 
> *page)
>   return false;
>  }
>  
> +static inline bool put_devmap_managed_page(struct page *page)
> +{
> + bool is_devmap = page_is_devmap_managed(page);
> +
> + if (is_devmap) {
> + int count = page_ref_dec_return(page);
> +
> + /*
> +  * devmap page refcounts are 1-based, rather than 0-based: if
> +  * refcount is 1, then the page is free and the refcount is
> +  * stable because nobody holds a reference on the page.
> +  */
> + if (count == 1)
> + free_devmap_managed_page(page);
> + else if (!count)
> + __put_page(page);
> + }
> +
> + return is_devmap;
> +}
> +
>  #else /* CONFIG_DEV_PAGEMAP_OPS */
>  static inline bool put_devmap_managed_page(struct page *page)
>  {
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 03ccbdfeb697..bc7e2a27d025 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -410,48 +410,39 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
>  EXPORT_SYMBOL_GPL(get_dev_pagemap);
>  
>  #ifdef CONFIG_DEV_PAGEMAP_OPS
> -void __put_devmap_managed_page(struct page *page)
> +void free_devmap_managed_page(struct page *page)
>  {
> - int count = page_ref_dec_return(page);
> + /* Clear Active bit in case of parallel mark_page_accessed */
> + __ClearPageActive(page);
> + __ClearPageWaiters(page);
> +
> + mem_cgroup_uncharge(page);
>  
>   /*
> -  * If refcount is 1 then page is freed and refcount is stable as nobody
> -  * holds a reference on the page.
> +  * When a device_private page is freed, the page->mapping field
> +  * may still contain a (stale) mapping value. For example, the
> +  * lower bits of page->mapping may still identify the page as
> +  * an anonymous page. Ultimately, this entire field is just
> +  * stale and wrong, and it will cause errors if not cleared.
> +  * One example is:
> +  *
> +  *  migrate_vma_pages()
> +  *migrate_vma_insert_page()
> +  *  page_add_new_anon_rmap()
> +  *__page_set_anon_rmap()
> +  *  ...checks page->mapping, via PageAnon(page) call,
> +  *    and incorrectly concludes that the page is an anonymous page.

Re: [PATCH v4 09/23] mm/gup: introduce pin_user_pages*() and FOLL_PIN

2019-11-13 Thread Jan Kara
On Tue 12-11-19 20:26:56, John Hubbard wrote:
> Introduce pin_user_pages*() variations of get_user_pages*() calls,
> and also pin_longterm_pages*() variations.
> 
> These variants all set FOLL_PIN, which is also introduced, and
> thoroughly documented.
> 
> The pin_longterm*() variants also set FOLL_LONGTERM, in addition
> to FOLL_PIN:
> 
> pin_user_pages()
> pin_user_pages_remote()
> pin_user_pages_fast()
> 
> pin_longterm_pages()
> pin_longterm_pages_remote()
> pin_longterm_pages_fast()
> 
> All pages that are pinned via the above calls, must be unpinned via
> put_user_page().
> 
> The underlying rules are:
> 
> * These are gup-internal flags, so the call sites should not directly
> set FOLL_PIN nor FOLL_LONGTERM. That behavior is enforced with
> assertions, for the new FOLL_PIN flag. However, for the pre-existing
> FOLL_LONGTERM flag, which has some call sites that still directly
> set FOLL_LONGTERM, there is no assertion yet.
> 
> * Call sites that want to indicate that they are going to do DirectIO
>   ("DIO") or something with similar characteristics, should call a
>   get_user_pages()-like wrapper call that sets FOLL_PIN. These wrappers
>   will:
> * Start with "pin_user_pages" instead of "get_user_pages". That
>   makes it easy to find and audit the call sites.
>     * Set FOLL_PIN
> 
> * For pages that are received via FOLL_PIN, those pages must be returned
>   via put_user_page().
> 
> Thanks to Jan Kara and Vlastimil Babka for explaining the 4 cases
> in this documentation. (I've reworded it and expanded upon it.)
> 
> Reviewed-by: Mike Rapoport   # Documentation
> Reviewed-by: Jérôme Glisse 
> Cc: Jonathan Corbet 
> Cc: Ira Weiny 
> Signed-off-by: John Hubbard 

Thanks for the documentation. It looks great!

> diff --git a/mm/gup.c b/mm/gup.c
> index 83702b2e86c8..4409e84dff51 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -201,6 +201,10 @@ static struct page *follow_page_pte(struct 
> vm_area_struct *vma,
>   spinlock_t *ptl;
>   pte_t *ptep, pte;
>  
> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> + if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> +  (FOLL_PIN | FOLL_GET)))
> + return ERR_PTR(-EINVAL);
>  retry:
>   if (unlikely(pmd_bad(*pmd)))
>   return no_page_table(vma, flags);

How does FOLL_PIN result in grabbing (at least normal, for now) page reference?
I didn't find that anywhere in this patch but it is a prerequisite to
converting any user to pin_user_pages() interface, right?

> +/**
> + * pin_user_pages_fast() - pin user pages in memory without taking locks
> + *
> + * Nearly the same as get_user_pages_fast(), except that FOLL_PIN is set. See
> + * get_user_pages_fast() for documentation on the function arguments, because
> + * the arguments here are identical.
> + *
> + * FOLL_PIN means that the pages must be released via put_user_page(). Please
> + * see Documentation/vm/pin_user_pages.rst for further details.
> + *
> + * This is intended for Case 1 (DIO) in Documentation/vm/pin_user_pages.rst. 
> It
> + * is NOT intended for Case 2 (RDMA: long-term pins).
> + */
> +int pin_user_pages_fast(unsigned long start, int nr_pages,
> + unsigned int gup_flags, struct page **pages)
> +{
> + /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> + if (WARN_ON_ONCE(gup_flags & FOLL_GET))
> + return -EINVAL;
> +
> + gup_flags |= FOLL_PIN;
> + return internal_get_user_pages_fast(start, nr_pages, gup_flags, pages);
> +}
> +EXPORT_SYMBOL_GPL(pin_user_pages_fast);

I was somewhat wondering about the number of functions you add here. So we
have:

pin_user_pages()
pin_user_pages_fast()
pin_user_pages_remote()

and then longterm variants:

pin_longterm_pages()
pin_longterm_pages_fast()
pin_longterm_pages_remote()

and obviously we have gup like:
get_user_pages()
get_user_pages_fast()
get_user_pages_remote()
... and some other gup variants ...

I think we really should have pin_* vs get_* variants as they are very
different in terms of guarantees and after conversion, any use of get_*
variant in non-mm code should be closely scrutinized. OTOH pin_longterm_*
don't look *that* useful to me and just using pin_* instead with
FOLL_LONGTERM flag would look OK to me and somewhat reduce the number of
functions which is already large enough? What do people think? I don't feel
too strongly about this but wanted to bring it up; see the sketch below.
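
I.e. callers would do (sketch):

        pin_user_pages_fast(start, nr_pages, gup_flags | FOLL_LONGTERM, pages);

instead of calling a dedicated pin_longterm_pages_fast() wrapper, keeping the
pin_*/get_* split while halving the number of new entry points.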

Honza



-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 00/23] mm/gup: track dma-pinned pages: FOLL_PIN, FOLL_LONGTERM

2019-11-13 Thread Jan Kara
On Wed 13-11-19 01:02:02, John Hubbard wrote:
> On 11/13/19 12:22 AM, Daniel Vetter wrote:
> ...
> > > > Why are we doing this? I think things got confused here someplace, as
> > > 
> > > 
> > > Because:
> > > 
> > > a) These need put_page() calls,  and
> > > 
> > > b) there is no put_pages() call, but there is a release_pages() call that
> > > is, arguably, what put_pages() would be.
> > > 
> > > 
> > > > the comment still says:
> > > > 
> > > > /**
> > > >   * put_user_page() - release a gup-pinned page
> > > >   * @page:pointer to page to be released
> > > >   *
> > > >   * Pages that were pinned via get_user_pages*() must be released via
> > > >   * either put_user_page(), or one of the put_user_pages*() routines
> > > >   * below.
> > > 
> > > 
> > > Ohhh, I missed those comments. They need to all be changed over to
> > > say "pages that were pinned via pin_user_pages*() or
> > > pin_longterm_pages*() must be released via put_user_page*()."
> > > 
> > > The get_user_pages*() pages must still be released via put_page.
> > > 
> > > The churn is due to a fairly significant change in strategy, which
> > > is: instead of changing all get_user_pages*() sites to call
> > > put_user_page(), change selected sites to call pin_user_pages*() or
> > > pin_longterm_pages*(), plus put_user_page().
> > 
> > Can't we call this unpin_user_page then, for some symmetry? Or is that
> > even more churn?
> > 
> > Looking from afar the naming here seems really confusing.
> 
> 
> That look from afar is valuable, because I'm too close to the problem to see
> how the naming looks. :)
> 
> unpin_user_page() sounds symmetrical. It's true that it would cause more
> churn (which is why I started off with a proposal that avoids changing the
> names of put_user_page*() APIs). But OTOH, the amount of churn is proportional
> to the change in direction here, and it's really only 10 or 20 lines changed,
> in the end.
> 
> So I'm open to changing to that naming. It would be nice to hear what others
> prefer, too...

FWIW I'd find unpin_user_page() also better than put_user_page() as a
counterpart to pin_user_pages().

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Jan Kara
On Mon 14-10-19 08:23:39, Eric Sandeen wrote:
> On 10/14/19 4:43 AM, Jan Kara wrote:
> > On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> > > On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > > > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > > > When using fadump (firmware assist dump) mode on powerpc, a mismatch
> > > > > between grub xfs driver and kernel xfs driver has been observed.  
> > > > > Note:
> > > > > fadump boots up in the following sequence: firmware -> grub reads 
> > > > > kernel
> > > > > and initramfs -> kernel boots.
> > > > > 
> > > > > The process to reproduce this mismatch:
> > > > >- On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > > > >- Replacing "path /var/crash" with "path /var/crashnew", then, 
> > > > > "kdumpctl
> > > > >  restart" to rebuild the initramfs. Detail about the rebuilding 
> > > > > looks
> > > > >  like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > > > >mv /boot/initramfs-`uname -r`.img.tmp 
> > > > > /boot/initramfs-`uname -r`.img
> > > > >sync
> > > > >- "echo c >/proc/sysrq-trigger".
> > > > > 
> > > > > The result:
> > > > > The dump image will not be saved under /var/crashnew/* as expected, 
> > > > > but
> > > > > still saved under /var/crash.
> > > > > 
> > > > > The root cause:
> > > > > As Eric pointed out that on xfs, 'sync' ensures the consistency by 
> > > > > writing
> > > > > back metadata to xlog, but not necessary to fsblock. This raises 
> > > > > issue if
> > > > > grub can not replay the xlog before accessing the xfs files. Since the
> > > > > above dir entry of initramfs should be saved as inline data with 
> > > > > xfs_inode,
> > > > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > > > 
> > > > > umount can be used to write metadata fsblock, but the filesystem can 
> > > > > not be
> > > > > umounted if still in use.
> > > > > 
> > > > > There are two ways to fix this mismatch, either grub or xfs. It may be
> > > > > easier to do this in xfs side by introducing an interface to flush 
> > > > > metadata
> > > > > to fsblock explicitly.
> > > > > 
> > > > > With this patch, metadata can be written to fsblock by:
> > > > ># update AIL
> > > > >sync
> > > > ># new introduced interface to flush metadata to fsblock
> > > > >mount -o remount,metasync mountpoint
> > > > 
> > > > I think this ought to be an ioctl or some sort of generic call since the
> > > > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > > > is too dumb to recover logs but still wants to write to the fs"
> > > > checkpointing problem.
> > > Yes, a syscall sounds more reasonable.
> > > > 
> > > > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > > > don't know...)
> > > I think it is unavoidable to involve in each fs' implementation. What
> > > about introducing an interface sync_to_fsblock(struct super_block *sb) in
> > > the struct super_operations, then let each fs manage its own case?
> > 
> > Well, we already have a way to achieve what you need: fsfreeze.
> > Traditionally, that is guaranteed to put fs into a "clean" state very much
> > equivalent to the fs being unmounted and that seems to be what the
> > bootloader wants so that it can access the filesystem without worrying
> > about some recovery details. So do you see any problem with replacing
> > 'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
> > 
> > Honza
> 
> The problem with fsfreeze is that if the device you want to quiesce is, say,
> the root fs, freeze isn't really a good option.

I agree you need to be really careful not to deadlock against yourself in
that case. But this particular use actually has a chance to work.

> But the other thing I want to highlight about this approach is that it does 
> not
> solve the root problem.

Re: [PATCH] xfs: introduce "metasync" api to sync metadata to fsblock

2019-10-14 Thread Jan Kara
On Mon 14-10-19 16:33:15, Pingfan Liu wrote:
> On Sun, Oct 13, 2019 at 09:34:17AM -0700, Darrick J. Wong wrote:
> > On Sun, Oct 13, 2019 at 10:37:00PM +0800, Pingfan Liu wrote:
> > > When using fadump (fireware assist dump) mode on powerpc, a mismatch
> > > between grub xfs driver and kernel xfs driver has been obsevered.  Note:
> > > fadump boots up in the following sequence: fireware -> grub reads kernel
> > > and initramfs -> kernel boots.
> > > 
> > > The process to reproduce this mismatch:
> > >   - On powerpc, boot kernel with fadump=on and edit /etc/kdump.conf.
> > >   - Replacing "path /var/crash" with "path /var/crashnew", then, "kdumpctl
> > > restart" to rebuild the initramfs. Detail about the rebuilding looks
> > > like: mkdumprd /boot/initramfs-`uname -r`.img.tmp;
> > >   mv /boot/initramfs-`uname -r`.img.tmp /boot/initramfs-`uname 
> > > -r`.img
> > >   sync
> > >   - "echo c >/proc/sysrq-trigger".
> > > 
> > > The result:
> > > The dump image will not be saved under /var/crashnew/* as expected, but
> > > still saved under /var/crash.
> > > 
> > > The root cause:
> > > As Eric pointed out, on xfs 'sync' ensures consistency by writing
> > > back metadata to the xlog, but not necessarily to fsblock. This raises an issue if
> > > grub can not replay the xlog before accessing the xfs files. Since the
> > > above dir entry of initramfs should be saved as inline data with 
> > > xfs_inode,
> > > so xfs_fs_sync_fs() does not guarantee it written to fsblock.
> > > 
> > > umount can be used to write metadata fsblock, but the filesystem can not 
> > > be
> > > umounted if still in use.
> > > 
> > > There are two ways to fix this mismatch, either grub or xfs. It may be
> > > easier to do this in xfs side by introducing an interface to flush 
> > > metadata
> > > to fsblock explicitly.
> > > 
> > > With this patch, metadata can be written to fsblock by:
> > >   # update AIL
> > >   sync
> > >   # new introduced interface to flush metadata to fsblock
> > >   mount -o remount,metasync mountpoint
> > 
> > I think this ought to be an ioctl or some sort of generic call since the
> > jbd2 filesystems (ext3, ext4, ocfs2) suffer from the same "$BOOTLOADER
> > is too dumb to recover logs but still wants to write to the fs"
> > checkpointing problem.
> Yes, a syscall sounds more reasonable.
> > 
> > (Or maybe we should just put all that stuff in a vfat filesystem, I
> > don't know...)
> I think it is unavoidable to involve in each fs' implementation. What
> about introducing an interface sync_to_fsblock(struct super_block *sb) in
> the struct super_operations, then let each fs manage its own case?

Well, we already have a way to achieve what you need: fsfreeze.
Traditionally, that is guaranteed to put fs into a "clean" state very much
equivalent to the fs being unmounted and that seems to be what the
bootloader wants so that it can access the filesystem without worrying
about some recovery details. So do you see any problem with replacing
'sync' in your example above with 'fsfreeze /boot && fsfreeze -u /boot'?
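
(For reference, fsfreeze(8) is essentially a thin wrapper around the
FIFREEZE/FITHAW ioctls; a minimal equivalent, error handling omitted:)

        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        static int quiesce(const char *mnt)
        {
                int fd = open(mnt, O_RDONLY);

                if (fd < 0)
                        return -1;
                ioctl(fd, FIFREEZE, 0); /* write back data + metadata, quiesce fs */
                ioctl(fd, FITHAW, 0);   /* resume normal operation */
                return close(fd);
        }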

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v3 3/6] mm/nvdimm: Add page size and struct page size to pfn superblock

2019-06-11 Thread Jan Kara
On Tue 04-06-19 14:43:54, Aneesh Kumar K.V wrote:
> This is needed so that we don't wrongly initialize a namespace
> which doesn't have enough space reserved for holding struct pages
> with the current kernel.
> 
> We also increment PFN_MIN_VERSION to make sure that older kernel
> won't initialize namespace created with newer kernel.
> 
> Signed-off-by: Aneesh Kumar K.V 
...
> diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
> index 00c57805cad3..e01eee9efafe 100644
> --- a/drivers/nvdimm/pfn_devs.c
> +++ b/drivers/nvdimm/pfn_devs.c
> @@ -467,6 +467,15 @@ int nd_pfn_validate(struct nd_pfn *nd_pfn, const char 
> *sig)
>   if (__le16_to_cpu(pfn_sb->version_minor) < 2)
>   pfn_sb->align = 0;
>  
> + if (__le16_to_cpu(pfn_sb->version_minor) < 3) {
> + /*
> +  * For a large part we use PAGE_SIZE. But we
> +  * do have some accounting code using SZ_4K.
> +  */
> + pfn_sb->page_struct_size = cpu_to_le16(64);
> + pfn_sb->page_size = cpu_to_le32(SZ_4K);
> + }
> +
>   switch (le32_to_cpu(pfn_sb->mode)) {
>   case PFN_MODE_RAM:
>   case PFN_MODE_PMEM:

As we discussed with Aneesh privately, this actually means that existing
NVDIMM namespaces on PPC64 will stop working due to these defaults for old
superblocks. I don't think that's a good thing, as upgrading kernels is
going to be a nightmare on PPC64 because of this. So I believe we should
make the defaults for old superblocks such that working setups keep working,
without the sysadmin having to touch anything.
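
One way to do that (an untested sketch) would be to derive the pre-v3
defaults from the running kernel instead of hardcoding the x86 values:

        if (__le16_to_cpu(pfn_sb->version_minor) < 3) {
                pfn_sb->page_struct_size = cpu_to_le16(sizeof(struct page));
                pfn_sb->page_size = cpu_to_le32(PAGE_SIZE);
        }

so that a namespace created by an older 64K-page kernel keeps validating.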

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Move MAP_SYNC to asm-generic/mman-common.h

2019-05-28 Thread Jan Kara
On Tue 28-05-19 14:41:20, Aneesh Kumar K.V wrote:
> This enables support for synchronous DAX fault on powerpc
> 
> The generic changes are added as part of
> commit b6fb293f2497 ("mm: Define MAP_SYNC and VM_SYNC flags")
> 
> Without this, mmap returns EOPNOTSUPP for MAP_SYNC with MAP_SHARED_VALIDATE
> 
> Instead of adding MAP_SYNC with the same value to
> arch/powerpc/include/uapi/asm/mman.h, I am moving the #define to
> asm-generic/mman-common.h. Two architectures using mman-common.h directly are
> sparc and powerpc. We should be able to consolidate more #defines to
> mman-common.h. That can be done as a separate patch.
> 
> Signed-off-by: Aneesh Kumar K.V 

Looks good to me FWIW (I don't have much experience with mmap flags and
their peculiarities). So feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
> Changes from V1:
> * Move #define to mman-common.h instead of powerpc specific mman.h change
> 
> 
>  include/uapi/asm-generic/mman-common.h | 3 ++-
>  include/uapi/asm-generic/mman.h| 1 -
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/uapi/asm-generic/mman-common.h 
> b/include/uapi/asm-generic/mman-common.h
> index abd238d0f7a4..bea0278f65ab 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -25,7 +25,8 @@
>  # define MAP_UNINITIALIZED 0x0   /* Don't support this flag */
>  #endif
>  
> -/* 0x0100 - 0x8 flags are defined in asm-generic/mman.h */
> +/* 0x0100 - 0x4 flags are defined in asm-generic/mman.h */
> +#define MAP_SYNC 0x08 /* perform synchronous page faults for 
> the mapping */
>  #define MAP_FIXED_NOREPLACE  0x10/* MAP_FIXED which doesn't 
> unmap underlying mapping */
>  
>  /*
> diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
> index 653687d9771b..2dffcbf705b3 100644
> --- a/include/uapi/asm-generic/mman.h
> +++ b/include/uapi/asm-generic/mman.h
> @@ -13,7 +13,6 @@
>  #define MAP_NONBLOCK 0x1 /* do not block on IO */
>  #define MAP_STACK0x2 /* give out an address that is best 
> suited for process/thread stacks */
>  #define MAP_HUGETLB  0x4 /* create a huge page mapping */
> -#define MAP_SYNC 0x8 /* perform synchronous page faults for 
> the mapping */
>  
>  /* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
>  
> -- 
> 2.21.0
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-26 Thread Jan Kara
On Thu 25-04-19 17:33:04, Dan Williams wrote:
> On Thu, Apr 25, 2019 at 12:32 AM Jan Kara  wrote:
> >
> > On Wed 24-04-19 11:13:48, Dan Williams wrote:
> > > On Wed, Apr 24, 2019 at 10:38 AM Matthew Wilcox  
> > > wrote:
> > > >
> > > > On Wed, Apr 24, 2019 at 10:13:15AM -0700, Dan Williams wrote:
> > > > > I think unaligned addresses have always been passed to
> > > > > vmf_insert_pfn_pmd(), but nothing cared until this patch. I *think*
> > > > > the only change needed is the following, thoughts?
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index ca0671d55aa6..82aee9a87efa 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -1560,7 +1560,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > > > vm_fault *vmf, pfn_t *pfnp,
> > > > > }
> > > > >
> > > > > trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, 
> > > > > pfn, entry);
> > > > > -   result = vmf_insert_pfn_pmd(vma, vmf->address, 
> > > > > vmf->pmd, pfn,
> > > > > +   result = vmf_insert_pfn_pmd(vma, pmd_addr, vmf->pmd, 
> > > > > pfn,
> > > > > write);
> > > >
> > > > We also call vmf_insert_pfn_pmd() in dax_insert_pfn_mkwrite() -- does
> > > > that need to change too?
> > >
> > > It wasn't clear to me that it was a problem. I think that one already
> > > happens to be pmd-aligned.
> >
> > Why would it need to be? The address is taken from vmf->address and that's
> > set up in __handle_mm_fault() like .address = address & PAGE_MASK. So I
> > don't see anything forcing PMD alignment of the virtual address...
> 
> True. So now I'm wondering if the masking should be done internal to
> the routine. Given it's prefixed vmf_ it seems to imply the api is
> prepared to take raw 'struct vm_fault' parameters. I think I'll go
> that route unless someone sees a reason to require the caller to
> handle this responsibility.

Yeah, that sounds good to me. Thanks for fixing this.
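
For illustration, the two alignments in a standalone form (a sketch;
PMD_SIZE is taken as 2 MiB here, the kernel's real value is per-arch):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL
    #define PMD_SIZE  (2UL * 1024 * 1024)   /* assumed huge page size */
    #define PAGE_MASK (~(PAGE_SIZE - 1))
    #define PMD_MASK  (~(PMD_SIZE - 1))

    int main(void)
    {
            unsigned long address = 0x7f0012345678UL;        /* arbitrary fault address */
            unsigned long vmf_address = address & PAGE_MASK; /* all vmf->address guarantees */
            unsigned long pmd_addr = address & PMD_MASK;     /* what a PMD-sized insert needs */

            printf("page-aligned: %#lx\n", vmf_address);
            printf("pmd-aligned:  %#lx\n", pmd_addr);
            return 0;
    }

Masking inside the vmf_*() helper means every caller gets the PMD-aligned
value automatically instead of each call site having to remember it.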

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-25 Thread Jan Kara
On Wed 24-04-19 11:13:48, Dan Williams wrote:
> On Wed, Apr 24, 2019 at 10:38 AM Matthew Wilcox  wrote:
> >
> > On Wed, Apr 24, 2019 at 10:13:15AM -0700, Dan Williams wrote:
> > > I think unaligned addresses have always been passed to
> > > vmf_insert_pfn_pmd(), but nothing cared until this patch. I *think*
> > > the only change needed is the following, thoughts?
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index ca0671d55aa6..82aee9a87efa 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -1560,7 +1560,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct
> > > vm_fault *vmf, pfn_t *pfnp,
> > > }
> > >
> > > trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, 
> > > entry);
> > > -   result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, 
> > > pfn,
> > > +   result = vmf_insert_pfn_pmd(vma, pmd_addr, vmf->pmd, pfn,
> > > write);
> >
> > We also call vmf_insert_pfn_pmd() in dax_insert_pfn_mkwrite() -- does
> > that need to change too?
> 
> It wasn't clear to me that it was a problem. I think that one already
> happens to be pmd-aligned.

Why would it need to be? The address is taken from vmf->address and that's
set up in __handle_mm_fault() like .address = address & PAGE_MASK. So I
don't see anything forcing PMD alignment of the virtual address...

Honza
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-02 Thread Jan Kara
On Tue 02-04-19 17:21:25, Aneesh Kumar K.V wrote:
> With some architectures like ppc64, set_pmd_at() cannot cope with
> a situation where there is already some (different) valid entry present.
> 
> Use pmdp_set_access_flags() instead to modify the pfn which is built to
> deal with modifying existing PMD entries.
> 
> This is similar to
> commit cae85cb8add3 ("mm/memory.c: fix modifying of page protection by 
> insert_pfn()")
> 
> We also do a similar update w.r.t. insert_pfn_pud even though ppc64 doesn't
> support pud pfn entries now.
> 
> Without this patch we also see the below message in kernel log
> "BUG: non-zero pgtables_bytes on freeing mm:"
> 
> CC: sta...@vger.kernel.org
> Reported-by: Chandan Rajendra 
> Signed-off-by: Aneesh Kumar K.V 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> ---
> Changes from v1:
> * Fix the pgtable leak 
> 
>  mm/huge_memory.c | 36 
>  1 file changed, 36 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 404acdcd0455..165ea46bf149 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -755,6 +755,21 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   spinlock_t *ptl;
>  
>   ptl = pmd_lock(mm, pmd);
> + if (!pmd_none(*pmd)) {
> + if (write) {
> + if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
> + WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
> + goto out_unlock;
> + }
> + entry = pmd_mkyoung(*pmd);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
> + update_mmu_cache_pmd(vma, addr, pmd);
> + }
> +
> + goto out_unlock;
> + }
> +
>   entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pmd_mkdevmap(entry);
> @@ -766,11 +781,16 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   if (pgtable) {
>   pgtable_trans_huge_deposit(mm, pmd, pgtable);
>   mm_inc_nr_ptes(mm);
> + pgtable = NULL;
>   }
>  
>   set_pmd_at(mm, addr, pmd, entry);
>   update_mmu_cache_pmd(vma, addr, pmd);
> +
> +out_unlock:
>   spin_unlock(ptl);
> + if (pgtable)
> + pte_free(mm, pgtable);
>  }
>  
>  vm_fault_t vmf_insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> @@ -821,6 +841,20 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
> unsigned long addr,
>   spinlock_t *ptl;
>  
>   ptl = pud_lock(mm, pud);
> + if (!pud_none(*pud)) {
> + if (write) {
> + if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
> + WARN_ON_ONCE(!is_huge_zero_pud(*pud));
> + goto out_unlock;
> + }
> + entry = pud_mkyoung(*pud);
> + entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
> + if (pudp_set_access_flags(vma, addr, pud, entry, 1))
> + update_mmu_cache_pud(vma, addr, pud);
> + }
> + goto out_unlock;
> + }
> +
>   entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pud_mkdevmap(entry);
> @@ -830,6 +864,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
> unsigned long addr,
>   }
>   set_pud_at(mm, addr, pud, entry);
>   update_mmu_cache_pud(vma, addr, pud);
> +
> +out_unlock:
>   spin_unlock(ptl);
>  }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] mm: Fix modifying of page protection by insert_pfn_pmd()

2019-04-01 Thread Jan Kara
On Sat 30-03-19 11:11:21, Aneesh Kumar K.V wrote:
> With some architectures like ppc64, set_pmd_at() cannot cope with
> a situation where there is already some (different) valid entry present.
> 
> Use pmdp_set_access_flags() instead to modify the pfn which is built to
> deal with modifying existing PMD entries.
> 
> This is similar to
> commit cae85cb8add3 ("mm/memory.c: fix modifying of page protection by 
> insert_pfn()")
> 
> We also do a similar update w.r.t. insert_pfn_pud even though ppc64 doesn't
> support pud pfn entries now.
> 
> CC: sta...@vger.kernel.org
> Signed-off-by: Aneesh Kumar K.V 

Thanks for fixing this! The patch looks good to me. Feel free to add:

Reviewed-by: Jan Kara 

Honza

> ---
>  mm/huge_memory.c | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 404acdcd0455..f7dca413c4b2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -755,6 +755,20 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   spinlock_t *ptl;
>  
>   ptl = pmd_lock(mm, pmd);
> + if (!pmd_none(*pmd)) {
> + if (write) {
> + if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
> + WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
> + goto out_unlock;
> + }
> + entry = pmd_mkyoung(*pmd);
> + entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> + if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
> + update_mmu_cache_pmd(vma, addr, pmd);
> + }
> + goto out_unlock;
> + }
> +
>   entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pmd_mkdevmap(entry);
> @@ -770,6 +784,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>  
>   set_pmd_at(mm, addr, pmd, entry);
>   update_mmu_cache_pmd(vma, addr, pmd);
> +out_unlock:
>   spin_unlock(ptl);
>  }
>  
> @@ -821,6 +836,20 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
> unsigned long addr,
>   spinlock_t *ptl;
>  
>   ptl = pud_lock(mm, pud);
> + if (!pud_none(*pud)) {
> + if (write) {
> + if (pud_pfn(*pud) != pfn_t_to_pfn(pfn)) {
> + WARN_ON_ONCE(!is_huge_zero_pud(*pud));
> + goto out_unlock;
> + }
> + entry = pud_mkyoung(*pud);
> + entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
> + if (pudp_set_access_flags(vma, addr, pud, entry, 1))
> + update_mmu_cache_pud(vma, addr, pud);
> + }
> + goto out_unlock;
> + }
> +
>   entry = pud_mkhuge(pfn_t_pud(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pud_mkdevmap(entry);
> @@ -830,6 +859,8 @@ static void insert_pfn_pud(struct vm_area_struct *vma, 
> unsigned long addr,
>   }
>   set_pud_at(mm, addr, pud, entry);
>   update_mmu_cache_pud(vma, addr, pud);
> +
> +out_unlock:
>   spin_unlock(ptl);
>  }
>  
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] fs/dax: deposit pagetable even when installing zero page

2019-03-13 Thread Jan Kara
On Wed 13-03-19 10:17:17, Aneesh Kumar K.V wrote:
> 
> Hi Dan/Andrew/Jan,
> 
> "Aneesh Kumar K.V"  writes:
> 
> > Architectures like ppc64 use the deposited page table to store hardware
> > page table slot information. Make sure we deposit a page table when
> > using zero page at the pmd level for hash.
> >
> > Without this we hit
> >
> > Unable to handle kernel paging request for data at address 0x
> > Faulting instruction address: 0xc0082a74
> > Oops: Kernel access of bad area, sig: 11 [#1]
> > 
> >
> > NIP [c0082a74] __hash_page_thp+0x224/0x5b0
> > LR [c00829a4] __hash_page_thp+0x154/0x5b0
> > Call Trace:
> >  hash_page_mm+0x43c/0x740
> >  do_hash_page+0x2c/0x3c
> >  copy_from_iter_flushcache+0xa4/0x4a0
> >  pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
> >  dax_copy_from_iter+0x40/0x70
> >  dax_iomap_actor+0x134/0x360
> >  iomap_apply+0xfc/0x1b0
> >  dax_iomap_rw+0xac/0x130
> >  ext4_file_write_iter+0x254/0x460 [ext4]
> >  __vfs_write+0x120/0x1e0
> >  vfs_write+0xd8/0x220
> >  SyS_write+0x6c/0x110
> >  system_call+0x3c/0x130
> >
> > Fixes: b5beae5e224f ("powerpc/pseries: Add driver for PAPR SCM regions")
> > Reviewed-by: Jan Kara 
> > Signed-off-by: Aneesh Kumar K.V 
> 
> Any suggestion on which tree this patch should go to? Also, since this
> fixes a kernel crash, we may want to get this into 5.1?

I think this should go through Dan's tree...

Honza

> > ---
> > Changes from v1:
> > * Add reviewed-by:
> > * Add Fixes:
> >
> >  fs/dax.c | 15 +++
> >  1 file changed, 15 insertions(+)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 6959837cc465..01bfb2ac34f9 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -33,6 +33,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include "internal.h"
> >  
> >  #define CREATE_TRACE_POINTS
> > @@ -1410,7 +1411,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> > *xas, struct vm_fault *vmf,
> >  {
> > struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > unsigned long pmd_addr = vmf->address & PMD_MASK;
> > +   struct vm_area_struct *vma = vmf->vma;
> > struct inode *inode = mapping->host;
> > +   pgtable_t pgtable = NULL;
> > struct page *zero_page;
> > spinlock_t *ptl;
> > pmd_t pmd_entry;
> > @@ -1425,12 +1428,22 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> > *xas, struct vm_fault *vmf,
> > *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
> > DAX_PMD | DAX_ZERO_PAGE, false);
> >  
> > +   if (arch_needs_pgtable_deposit()) {
> > +   pgtable = pte_alloc_one(vma->vm_mm);
> > +   if (!pgtable)
> > +   return VM_FAULT_OOM;
> > +   }
> > +
> > ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
> > if (!pmd_none(*(vmf->pmd))) {
> > spin_unlock(ptl);
> > goto fallback;
> > }
> >  
> > +   if (pgtable) {
> > +   pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> > +   mm_inc_nr_ptes(vma->vm_mm);
> > +   }
> > pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
> > pmd_entry = pmd_mkhuge(pmd_entry);
> > set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> > @@ -1439,6 +1452,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> > *xas, struct vm_fault *vmf,
> > return VM_FAULT_NOPAGE;
> >  
> >  fallback:
> > +   if (pgtable)
> > +   pte_free(vma->vm_mm, pgtable);
> > trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
> > return VM_FAULT_FALLBACK;
> >  }
> > -- 
> > 2.20.1
> 
> -aneesh
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default

2019-02-28 Thread Jan Kara
On Thu 28-02-19 14:05:22, Aneesh Kumar K.V wrote:
> Add a flag to indicate the ability to do huge page dax mapping. On
> architectures like ppc64, the hypervisor can disable huge page support
> in the guest. In
> such a case, we should not enable huge page dax mapping. This patch adds
> a flag which the architecture code will update to indicate huge page
> dax mapping support.
> 
> Architectures mostly do transparent_hugepage_flags = 0; if they can't
> do hugepages. That also takes care of disabling dax hugepage mapping
> with this change.
> 
> Without this patch we get the below error with kvm on ppc64.
> 
> [  118.849975] lpar: Failed hash pte insert with error -4
> 
> NOTE: The patch also uses
> 
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> to disable dax huge page mapping.
> 
> Signed-off-by: Aneesh Kumar K.V 

Added Dan to CC for opinion. I kind of fail to see why you don't use
TRANSPARENT_HUGEPAGE_FLAG for this. I know that technically DAX huge pages
and normal THPs are different things but so far we've tried to avoid making
that distinction visible to userspace.

Honza
> ---
> TODO:
> * Add Fixes: tag
> 
>  include/linux/huge_mm.h | 4 +++-
>  mm/huge_memory.c| 4 
>  2 files changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..01ad5258545e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, 
> unsigned long addr,
>   pud_t *pud, pfn_t pfn, bool write);
>  enum transparent_hugepage_flag {
>   TRANSPARENT_HUGEPAGE_FLAG,
> + TRANSPARENT_HUGEPAGE_DAX_FLAG,
>   TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>   TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct 
> vm_area_struct *vma)
>   if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>   return true;
>  
> - if (vma_is_dax(vma))
> + if (vma_is_dax(vma) &&
> + (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>   return true;
>  
>   if (transparent_hugepage_flags &
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faf357eaf0ce..43d742fe0341 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>  	(1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>  #endif
> +	(1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>  	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>  			  &transparent_hugepage_flags);
>  		clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>  			  &transparent_hugepage_flags);
> +		clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> +			  &transparent_hugepage_flags);
>   ret = 1;
>   }
>  out:
> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, 
> unsigned long addr,
>   spinlock_t *ptl;
>  
>   ptl = pmd_lock(mm, pmd);
> + /* should we check for none here again? */
>   entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>   if (pfn_t_devmap(pfn))
>   entry = pmd_mkdevmap(entry);
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH 1/2] fs/dax: deposit pagetable even when installing zero page

2019-02-28 Thread Jan Kara
On Thu 28-02-19 14:05:21, Aneesh Kumar K.V wrote:
> Architectures like ppc64 use the deposited page table to store hardware
> page table slot information. Make sure we deposit a page table when
> using zero page at the pmd level for hash.
> 
> Without this we hit
> 
> Unable to handle kernel paging request for data at address 0x
> Faulting instruction address: 0xc0082a74
> Oops: Kernel access of bad area, sig: 11 [#1]
> 
> 
> NIP [c0082a74] __hash_page_thp+0x224/0x5b0
> LR [c00829a4] __hash_page_thp+0x154/0x5b0
> Call Trace:
>  hash_page_mm+0x43c/0x740
>  do_hash_page+0x2c/0x3c
>  copy_from_iter_flushcache+0xa4/0x4a0
>  pmem_copy_from_iter+0x2c/0x50 [nd_pmem]
>  dax_copy_from_iter+0x40/0x70
>  dax_iomap_actor+0x134/0x360
>  iomap_apply+0xfc/0x1b0
>  dax_iomap_rw+0xac/0x130
>  ext4_file_write_iter+0x254/0x460 [ext4]
>  __vfs_write+0x120/0x1e0
>  vfs_write+0xd8/0x220
>  SyS_write+0x6c/0x110
>  system_call+0x3c/0x130
> 
> Signed-off-by: Aneesh Kumar K.V 

Thanks for the patch. It looks good to me. You can add:

Reviewed-by: Jan Kara 

> ---
> TODO:
> * Add fixes tag 

Probably this is a problem since initial PPC PMEM support, isn't it?

Honza

> 
>  fs/dax.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/fs/dax.c b/fs/dax.c
> index 6959837cc465..01bfb2ac34f9 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "internal.h"
>  
>  #define CREATE_TRACE_POINTS
> @@ -1410,7 +1411,9 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>  {
>   struct address_space *mapping = vmf->vma->vm_file->f_mapping;
>   unsigned long pmd_addr = vmf->address & PMD_MASK;
> + struct vm_area_struct *vma = vmf->vma;
>   struct inode *inode = mapping->host;
> + pgtable_t pgtable = NULL;
>   struct page *zero_page;
>   spinlock_t *ptl;
>   pmd_t pmd_entry;
> @@ -1425,12 +1428,22 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>   *entry = dax_insert_entry(xas, mapping, vmf, *entry, pfn,
>   DAX_PMD | DAX_ZERO_PAGE, false);
>  
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
> +
>   ptl = pmd_lock(vmf->vma->vm_mm, vmf->pmd);
>   if (!pmd_none(*(vmf->pmd))) {
>   spin_unlock(ptl);
>   goto fallback;
>   }
>  
> + if (pgtable) {
> + pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
> + mm_inc_nr_ptes(vma->vm_mm);
> + }
>   pmd_entry = mk_pmd(zero_page, vmf->vma->vm_page_prot);
>   pmd_entry = pmd_mkhuge(pmd_entry);
>   set_pmd_at(vmf->vma->vm_mm, pmd_addr, vmf->pmd, pmd_entry);
> @@ -1439,6 +1452,8 @@ static vm_fault_t dax_pmd_load_hole(struct xa_state 
> *xas, struct vm_fault *vmf,
>   return VM_FAULT_NOPAGE;
>  
>  fallback:
> + if (pgtable)
> + pte_free(vma->vm_mm, pgtable);
>   trace_dax_pmd_load_hole_fallback(inode, vmf, zero_page, *entry);
>   return VM_FAULT_FALLBACK;
>  }
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH] powerpc/64s: Fix possible corruption on big endian due to pgd/pud_present()

2019-02-14 Thread Jan Kara
On Thu 14-02-19 17:23:39, Michael Ellerman wrote:
> In v4.20 we changed our pgd/pud_present() to check for _PAGE_PRESENT
> rather than just checking that the value is non-zero, e.g.:
> 
>   static inline int pgd_present(pgd_t pgd)
>   {
>  -   return !pgd_none(pgd);
>  +   return (pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
>   }
> 
> Unfortunately this is broken on big endian, as the result of the
> bitwise & is truncated to int, which is always zero because
> _PAGE_PRESENT is 0x8000000000000000ul. This means pgd_present() and
> pud_present() are always false at compile time, and the compiler
> elides the subsequent code.
> 
> Remarkably with that bug present we are still able to boot and run
> with few noticeable effects. However under some work loads we are able
> to trigger a warning in the ext4 code:

Wow, good catch. I would never have believed there could be so few bad
effects from such a major breakage! :)
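
For reference, the truncation is easy to demonstrate in a standalone program
(a sketch; assumes a 64-bit build and that conversion to int keeps only the
low 32 bits, as the usual compilers do):

    #include <stdint.h>
    #include <stdio.h>

    #define _PAGE_PRESENT 0x8000000000000000ul

    /* On big endian cpu_to_be64() is a no-op, so the masked value keeps
     * the top bit set and the implicit conversion to int discards it. */
    static int pgd_present_bad(uint64_t pgd)
    {
            return (pgd & _PAGE_PRESENT);   /* truncated: always 0 */
    }

    static int pgd_present_good(uint64_t pgd)
    {
            return !!(pgd & _PAGE_PRESENT); /* normalised first: 0 or 1 */
    }

    int main(void)
    {
            uint64_t pgd = _PAGE_PRESENT;   /* a present entry */

            printf("bad:  %d\n", pgd_present_bad(pgd));  /* prints 0 */
            printf("good: %d\n", pgd_present_good(pgd)); /* prints 1 */
            return 0;
    }

On little endian the byte swap in cpu_to_be64() moves the bit into the low
word, which is why the bug only bites big endian.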

Honza

> 
>   WARNING: CPU: 11 PID: 29593 at fs/ext4/inode.c:3927 
> .ext4_set_page_dirty+0x70/0xb0
>   CPU: 11 PID: 29593 Comm: debugedit Not tainted 4.20.0-rc1 #1
>   ...
>   NIP .ext4_set_page_dirty+0x70/0xb0
>   LR  .set_page_dirty+0xa0/0x150
>   Call Trace:
>.set_page_dirty+0xa0/0x150
>.unmap_page_range+0xbf0/0xe10
>.unmap_vmas+0x84/0x130
>.unmap_region+0xe8/0x190
>.__do_munmap+0x2f0/0x510
>.__vm_munmap+0x80/0x110
>.__se_sys_munmap+0x14/0x30
>system_call+0x5c/0x70
> 
> The fix is simple, we need to convert the result of the bitwise && to
> an int before returning it.
> 
> Thanks to Jan Kara and Aneesh for help with debugging.
> 
> Fixes: da7ad366b497 ("powerpc/mm/book3s: Update pmd_present to look at 
> _PAGE_PRESENT bit")
> Cc: sta...@vger.kernel.org # v4.20+
> Reported-by: Erhard F. 
> Reviewed-by: Aneesh Kumar K.V 
> Signed-off-by: Michael Ellerman 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index c9bfe526ca9d..d8c8d7c9df15 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -904,7 +904,7 @@ static inline int pud_none(pud_t pud)
>  
>  static inline int pud_present(pud_t pud)
>  {
> - return (pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
> + return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PRESENT));
>  }
>  
>  extern struct page *pud_page(pud_t pud);
> @@ -951,7 +951,7 @@ static inline int pgd_none(pgd_t pgd)
>  
>  static inline int pgd_present(pgd_t pgd)
>  {
> - return (pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
> + return !!(pgd_raw(pgd) & cpu_to_be64(_PAGE_PRESENT));
>  }
>  
>  static inline pte_t pgd_pte(pgd_t pgd)
> -- 
> 2.20.1
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [linux-next][EXT4][Oops]kernel panics when running fsfuzzer

2017-09-25 Thread Jan Kara
On Fri 22-09-17 15:46:21, Abdul Haleem wrote:
> On Wed, 2017-09-20 at 16:44 +1000, Michael Ellerman wrote:
> > Is it reproducible?
> 
> No, the bug has not been seen once in 3 re-runs since yesterday.
> 
> Hope to hit it again in the future CI runs.

I was also looking into this but I couldn't figure out what has happened.
From the crash it seems that jh->b_bh->b_private was NULL although it
should point back to 'jh'. But that just doesn't seem possible so maybe it
is something different. If you ever happen to hit it again, please save
the fuzzed image so that it can be used for reproduction. Thanks!

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH] fs: convert a pile of fsync routines to errseq_t based reporting

2017-07-31 Thread Jan Kara
On Fri 28-07-17 10:23:21, Jeff Layton wrote:
> From: Jeff Layton <jlay...@redhat.com>
> 
> This patch converts most of the in-kernel filesystems that do writeback
> out of the pagecache to report errors using the errseq_t-based
> infrastructure that was recently added. This allows them to report
> errors once for each open file description.
> 
> Most filesystems have a fairly straightforward fsync operation. They
> call filemap_write_and_wait_range to write back all of the data and
> wait on it, and then (sometimes) sync out the metadata.
> 
> For those filesystems this is a straightforward conversion from calling
> filemap_write_and_wait_range in their fsync operation to calling
> file_write_and_wait_range.
> 
> Signed-off-by: Jeff Layton <jlay...@redhat.com>

This all looks rather obvious. Feel free to add:

Acked-by: Jan Kara <j...@suse.cz>

Honza
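
For reference, the shape of a converted fsync operation (a generic sketch
with a made-up foofs, not a quote from the patch; the metadata step is
filesystem specific):

    static int foofs_fsync(struct file *file, loff_t start, loff_t end,
                           int datasync)
    {
            /* Write back and wait on the data range; this also samples
             * the per-file errseq_t (file->f_wb_err), so a writeback
             * error is reported once per open file description. */
            int err = file_write_and_wait_range(file, start, end);

            if (err)
                    return err;
            /* ... sync metadata / journal here if needed ... */
            return 0;
    }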


> ---
>  arch/powerpc/platforms/cell/spufs/file.c   | 2 +-
>  drivers/staging/lustre/lustre/llite/file.c | 2 +-
>  drivers/video/fbdev/core/fb_defio.c| 2 +-
>  fs/9p/vfs_file.c   | 4 ++--
>  fs/affs/file.c | 2 +-
>  fs/afs/write.c | 2 +-
>  fs/cifs/file.c | 4 ++--
>  fs/exofs/file.c| 2 +-
>  fs/f2fs/file.c | 2 +-
>  fs/hfs/inode.c | 2 +-
>  fs/hfsplus/inode.c | 2 +-
>  fs/hostfs/hostfs_kern.c| 2 +-
>  fs/hpfs/file.c | 2 +-
>  fs/jffs2/file.c| 2 +-
>  fs/jfs/file.c  | 2 +-
>  fs/ncpfs/file.c| 2 +-
>  fs/ntfs/dir.c  | 2 +-
>  fs/ntfs/file.c | 2 +-
>  fs/ocfs2/file.c| 2 +-
>  fs/reiserfs/dir.c  | 2 +-
>  fs/reiserfs/file.c | 2 +-
>  fs/ubifs/file.c| 2 +-
>  22 files changed, 24 insertions(+), 24 deletions(-)
> 
> Rolling up all of these conversions into a single patch, as Christoph
> Hellwig suggested. Most of these are not tested, but the conversion
> here is fairly straightforward.
> 
> Any maintainers who object, please let me know and I'll yank that
> part out of this patch.
> 
> diff --git a/arch/powerpc/platforms/cell/spufs/file.c 
> b/arch/powerpc/platforms/cell/spufs/file.c
> index ae2f740a82f1..5ffcdeb1eb17 100644
> --- a/arch/powerpc/platforms/cell/spufs/file.c
> +++ b/arch/powerpc/platforms/cell/spufs/file.c
> @@ -1749,7 +1749,7 @@ static int spufs_mfc_flush(struct file *file, 
> fl_owner_t id)
>  static int spufs_mfc_fsync(struct file *file, loff_t start, loff_t end, int 
> datasync)
>  {
>   struct inode *inode = file_inode(file);
> - int err = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + int err = file_write_and_wait_range(file, start, end);
>   if (!err) {
>   inode_lock(inode);
>   err = spufs_mfc_flush(file, NULL);
> diff --git a/drivers/staging/lustre/lustre/llite/file.c 
> b/drivers/staging/lustre/lustre/llite/file.c
> index ab1c85c1ed38..f7d07735ac66 100644
> --- a/drivers/staging/lustre/lustre/llite/file.c
> +++ b/drivers/staging/lustre/lustre/llite/file.c
> @@ -2364,7 +2364,7 @@ int ll_fsync(struct file *file, loff_t start, loff_t 
> end, int datasync)
>  PFID(ll_inode2fid(inode)), inode);
>   ll_stats_ops_tally(ll_i2sbi(inode), LPROC_LL_FSYNC, 1);
>  
> - rc = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + rc = file_write_and_wait_range(file, start, end);
>   inode_lock(inode);
>  
>   /* catch async errors that were recorded back when async writeback
> diff --git a/drivers/video/fbdev/core/fb_defio.c 
> b/drivers/video/fbdev/core/fb_defio.c
> index 37f69c061210..487d5e336e1b 100644
> --- a/drivers/video/fbdev/core/fb_defio.c
> +++ b/drivers/video/fbdev/core/fb_defio.c
> @@ -69,7 +69,7 @@ int fb_deferred_io_fsync(struct file *file, loff_t start, 
> loff_t end, int datasy
>  {
>   struct fb_info *info = file->private_data;
>   struct inode *inode = file_inode(file);
> - int err = filemap_write_and_wait_range(inode->i_mapping, start, end);
> + int err = file_write_and_wait_range(file, start, end);
>   if (err)
>   return err;
>  
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index 3de3b4a89d89..4802d75b3cf7 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -445,7 +445,7 @@ static int v9fs

[PATCH] axonram: Fix gendisk handling

2017-03-08 Thread Jan Kara
It is invalid to call del_gendisk() when disk->queue is NULL. Fix error
handling in axon_ram_probe() to avoid doing that.

Also del_gendisk() does not drop a reference to gendisk allocated by
alloc_disk(). That has to be done by put_disk(). Add that call where
needed.
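
In pattern form, the lifetime rules being fixed (a sketch with a made-up
helper; the driver-specific version is in the diff below):

    /* Teardown helper for a possibly partially set up gendisk. */
    static void foo_put_gendisk(struct gendisk *disk)
    {
            if (!disk)
                    return;
            if (disk->flags & GENHD_FL_UP)  /* add_disk() already ran */
                    del_gendisk(disk);
            put_disk(disk);                 /* drop the alloc_disk() reference */
    }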

Reported-by: Dan Carpenter <dan.carpen...@oracle.com>
Signed-off-by: Jan Kara <j...@suse.cz>
---
 arch/powerpc/sysdev/axonram.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

Warning: The patch is not tested in any way. I just based the fix on Smatch
warning and how things should be...

diff --git a/arch/powerpc/sysdev/axonram.c b/arch/powerpc/sysdev/axonram.c
index ada29eaed6e2..f523ac883150 100644
--- a/arch/powerpc/sysdev/axonram.c
+++ b/arch/powerpc/sysdev/axonram.c
@@ -274,7 +274,9 @@ static int axon_ram_probe(struct platform_device *device)
if (bank->disk->major > 0)
unregister_blkdev(bank->disk->major,
bank->disk->disk_name);
-   del_gendisk(bank->disk);
+   if (bank->disk->flags & GENHD_FL_UP)
+   del_gendisk(bank->disk);
+   put_disk(bank->disk);
}
device->dev.platform_data = NULL;
if (bank->io_addr != 0)
@@ -299,6 +301,7 @@ axon_ram_remove(struct platform_device *device)
device_remove_file(&device->dev, &dev_attr_ecc);
free_irq(bank->irq_id, device);
del_gendisk(bank->disk);
+   put_disk(bank->disk);
iounmap((void __iomem *) bank->io_addr);
kfree(bank);
 
-- 
2.10.2



Re: [PATCH] do_direct_IO: Use inode->i_blkbits to compute block count to be cleaned

2017-01-09 Thread Jan Kara
On Sun 08-01-17 20:17:10, Chandan Rajendra wrote:
> The code currently uses sdio->blkbits to compute the number of blocks to
> be cleaned. However sdio->blkbits is derived from the logical block size
> of the underlying block device (Refer to the definition of
> do_blockdev_direct_IO()). Due to this, generic/299 test would rarely
> fail when executed on an ext4 filesystem with 64k as the block size and
> when using a virtio based disk (having 512 byte as the logical block
> size) inside a kvm guest.
> 
> This commit fixes the bug by using inode->i_blkbits to compute the
> number of blocks to be cleaned.

Ah, good catch. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
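
A quick worked example of the difference (assuming 64k filesystem blocks on
a virtio disk with 512-byte logical blocks, i.e. i_blkbits = 16 while
sdio->blkbits = 9): for a single mapped filesystem block, map_bh->b_size is
65536, so b_size >> blkbits yields 128 - a count of device sectors - whereas
b_size >> i_blkbits yields 1, the number of filesystem blocks whose aliases
actually need cleaning.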

> Signed-off-by: Chandan Rajendra <chan...@linux.vnet.ibm.com>
> ---
>  fs/direct-io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index aeae8c0..b20adf9 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -905,6 +905,7 @@ static inline void dio_zero_block(struct dio *dio, struct 
> dio_submit *sdio,
>  static int do_direct_IO(struct dio *dio, struct dio_submit *sdio,
>   struct buffer_head *map_bh)
>  {
> + const unsigned i_blkbits = dio->inode->i_blkbits;
>   const unsigned blkbits = sdio->blkbits;
>   int ret = 0;
>  
> @@ -949,7 +950,7 @@ static int do_direct_IO(struct dio *dio, struct 
> dio_submit *sdio,
>   clean_bdev_aliases(
>   map_bh->b_bdev,
>   map_bh->b_blocknr,
> - map_bh->b_size >> blkbits);
> +         map_bh->b_size >> i_blkbits);
>   }
>  
>   if (!sdio->blkfactor)
> -- 
> 2.5.5
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 08/10] mm: replace __access_remote_vm() write parameter with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:18, Lorenzo Stoakes wrote:
> This patch removes the write parameter from __access_remote_vm() and replaces 
> it
> with a gup_flags parameter as use of this function previously _implied_
> FOLL_FORCE, whereas after this patch callers explicitly pass this flag.
> 
> We make this explicit as use of FOLL_FORCE can result in surprising behaviour
> (and hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

So I'm not convinced this (and the following two patches) is actually
helping much. By grepping for FOLL_FORCE we will easily see that any caller
of access_remote_vm() gets those semantics and can continue the search
accordingly (it is much simpler than searching for all get_user_pages()
users and extracting from the parameter lists what they actually pass as
the 'force' argument). Sure, it makes it somewhat more visible to callers of
access_remote_vm() that they get FOLL_FORCE semantics, but OTOH it also
opens space for issues where a caller of access_remote_vm() actually
wants FOLL_FORCE (and currently all of them want it) and just mistakenly
does not set it. All in all I'd prefer to keep access_remote_vm() and
friends as is...

Honza

> ---
>  mm/memory.c | 23 +++
>  mm/nommu.c  |  9 ++---
>  2 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 20a9adb..79ebed3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3869,14 +3869,11 @@ EXPORT_SYMBOL_GPL(generic_access_phys);
>   * given task for page fault accounting.
>   */
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> - unsigned long addr, void *buf, int len, int write)
> + unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>   struct vm_area_struct *vma;
>   void *old_buf = buf;
> - unsigned int flags = FOLL_FORCE;
> -
> - if (write)
> - flags |= FOLL_WRITE;
> + int write = gup_flags & FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
>   /* ignore errors, just check how much was successfully transferred */
> @@ -3886,7 +3883,7 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>   struct page *page = NULL;
>  
>   ret = get_user_pages_remote(tsk, mm, addr, 1,
> - flags, &page, &vma);
> + gup_flags, &page, &vma);
>   if (ret <= 0) {
>  #ifndef CONFIG_HAVE_IOREMAP_PROT
>   break;
> @@ -3945,7 +3942,12 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>   void *buf, int len, int write)
>  {
> - return __access_remote_vm(NULL, mm, addr, buf, len, write);
> + unsigned int flags = FOLL_FORCE;
> +
> + if (write)
> + flags |= FOLL_WRITE;
> +
> + return __access_remote_vm(NULL, mm, addr, buf, len, flags);
>  }
>  
>  /*
> @@ -3958,12 +3960,17 @@ int access_process_vm(struct task_struct *tsk, 
> unsigned long addr,
>  {
>   struct mm_struct *mm;
>   int ret;
> + unsigned int flags = FOLL_FORCE;
>  
>   mm = get_task_mm(tsk);
>   if (!mm)
>   return 0;
>  
> - ret = __access_remote_vm(tsk, mm, addr, buf, len, write);
> + if (write)
> + flags |= FOLL_WRITE;
> +
> + ret = __access_remote_vm(tsk, mm, addr, buf, len, flags);
> +
>   mmput(mm);
>  
>   return ret;
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 70cb844..bde7df3 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -1809,9 +1809,10 @@ void filemap_map_pages(struct fault_env *fe,
>  EXPORT_SYMBOL(filemap_map_pages);
>  
>  static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
> - unsigned long addr, void *buf, int len, int write)
> + unsigned long addr, void *buf, int len, unsigned int gup_flags)
>  {
>   struct vm_area_struct *vma;
> + int write = gup_flags & FOLL_WRITE;
>  
>   down_read(&mm->mmap_sem);
>  
> @@ -1853,7 +1854,8 @@ static int __access_remote_vm(struct task_struct *tsk, 
> struct mm_struct *mm,
>  int access_remote_vm(struct mm_struct *mm, unsigned long addr,
>   void *buf, int len, int write)
>  {
> - return __access_remote_vm(NULL, mm, addr, buf, len, write);
> + return __access_remote_vm(NULL, mm, addr, buf, len,
> + write ? FOLL_WRITE : 0);
>  }
>  
>  /*
> @@ -1871,7 +1873,8 @@ int access_process_vm(struct task_struct *tsk, unsigned 
> long addr, void *buf, in
>   if (!mm)
>   return 0;
>  
> - len = __access_remote_vm(tsk, mm, addr, buf, len, write);
> + len = __access_remote_vm(tsk, mm, addr, buf, len,
> + write ? FOLL_WRITE : 0);
>  
>   mmput(mm);
>   return len;
> -- 
> 2.10.0
> 
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 07/10] mm: replace get_user_pages_remote() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:17, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages_remote()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  drivers/gpu/drm/etnaviv/etnaviv_gem.c   |  7 +--
>  drivers/gpu/drm/i915/i915_gem_userptr.c |  6 +-
>  drivers/infiniband/core/umem_odp.c  |  7 +--
>  fs/exec.c   |  9 +++--
>  include/linux/mm.h  |  2 +-
>  kernel/events/uprobes.c |  6 --
>  mm/gup.c| 22 +++---
>  mm/memory.c |  6 +-
>  security/tomoyo/domain.c|  2 +-
>  9 files changed, 40 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_gem.c 
> b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> index 5ce3603..0370b84 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_gem.c
> @@ -748,19 +748,22 @@ static struct page **etnaviv_gem_userptr_do_get_pages(
>   int ret = 0, pinned, npages = etnaviv_obj->base.size >> PAGE_SHIFT;
>   struct page **pvec;
>   uintptr_t ptr;
> + unsigned int flags = 0;
>  
>   pvec = drm_malloc_ab(npages, sizeof(struct page *));
>   if (!pvec)
>   return ERR_PTR(-ENOMEM);
>  
> + if (!etnaviv_obj->userptr.ro)
> + flags |= FOLL_WRITE;
> +
>   pinned = 0;
>   ptr = etnaviv_obj->userptr.ptr;
>  
>   down_read(&mm->mmap_sem);
>   while (pinned < npages) {
>   ret = get_user_pages_remote(task, mm, ptr, npages - pinned,
> - !etnaviv_obj->userptr.ro, 0,
> - pvec + pinned, NULL);
> + flags, pvec + pinned, NULL);
>   if (ret < 0)
>   break;
>  
> diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c 
> b/drivers/gpu/drm/i915/i915_gem_userptr.c
> index e537930..c6f780f 100644
> --- a/drivers/gpu/drm/i915/i915_gem_userptr.c
> +++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
> @@ -508,6 +508,10 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
> *_work)
>   pvec = drm_malloc_gfp(npages, sizeof(struct page *), GFP_TEMPORARY);
>   if (pvec != NULL) {
>   struct mm_struct *mm = obj->userptr.mm->mm;
> + unsigned int flags = 0;
> +
> + if (!obj->userptr.read_only)
> + flags |= FOLL_WRITE;
>  
>   ret = -EFAULT;
>   if (atomic_inc_not_zero(&mm->mm_users)) {
> @@ -517,7 +521,7 @@ __i915_gem_userptr_get_pages_worker(struct work_struct 
> *_work)
>   (work->task, mm,
>obj->userptr.ptr + pinned * PAGE_SIZE,
>npages - pinned,
> -  !obj->userptr.read_only, 0,
> +  flags,
>pvec + pinned, NULL);
>   if (ret < 0)
>   break;
> diff --git a/drivers/infiniband/core/umem_odp.c 
> b/drivers/infiniband/core/umem_odp.c
> index 75077a0..1f0fe32 100644
> --- a/drivers/infiniband/core/umem_odp.c
> +++ b/drivers/infiniband/core/umem_odp.c
> @@ -527,6 +527,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>   u64 off;
>   int j, k, ret = 0, start_idx, npages = 0;
>   u64 base_virt_addr;
> + unsigned int flags = 0;
>  
>   if (access_mask == 0)
>   return -EINVAL;
> @@ -556,6 +557,9 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>   goto out_put_task;
>   }
>  
> + if (access_mask & ODP_WRITE_ALLOWED_BIT)
> + flags |= FOLL_WRITE;
> +
>   start_idx = (user_virt - ib_umem_start(umem)) >> PAGE_SHIFT;
>   k = start_idx;
>  
> @@ -574,8 +578,7 @@ int ib_umem_odp_map_dma_pages(struct ib_umem *umem, u64 
> user_virt, u64 bcnt,
>*/
>   npages = get_user_pages_remote(owning_process, owning_mm,
>   user_virt, gup_num_pages,
> -   

Re: [PATCH 06/10] mm: replace get_user_pages() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:16, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages() and
> replaces them with a gup_flags parameter to make the use of FOLL_FORCE 
> explicit
> in callers as use of this flag can result in surprising behaviour (and hence
> bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

The patch looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza
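
For reference, the shape of the conversion at a typical call site (a
schematic before/after, not a quote from the patch; exact prototypes differ
across the gup family):

    /* before: intent hidden in two int parameters */
    ret = get_user_pages(addr, nr_pages, 1 /* write */, 0 /* force */,
                         pages, NULL);

    /* after: the flags are spelled out, and FOLL_FORCE appears only
     * where a caller really wants it */
    ret = get_user_pages(addr, nr_pages, FOLL_WRITE, pages, NULL);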

> ---
>  arch/cris/arch-v32/drivers/cryptocop.c |  4 +---
>  arch/ia64/kernel/err_inject.c  |  2 +-
>  arch/x86/mm/mpx.c  |  5 ++---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c|  7 +--
>  drivers/gpu/drm/radeon/radeon_ttm.c|  3 ++-
>  drivers/gpu/drm/via/via_dmablit.c  |  4 ++--
>  drivers/infiniband/core/umem.c |  6 +-
>  drivers/infiniband/hw/mthca/mthca_memfree.c|  2 +-
>  drivers/infiniband/hw/qib/qib_user_pages.c |  3 ++-
>  drivers/infiniband/hw/usnic/usnic_uiom.c   |  5 -
>  drivers/media/v4l2-core/videobuf-dma-sg.c  |  7 +--
>  drivers/misc/mic/scif/scif_rma.c   |  3 +--
>  drivers/misc/sgi-gru/grufault.c|  2 +-
>  drivers/platform/goldfish/goldfish_pipe.c  |  3 ++-
>  drivers/rapidio/devices/rio_mport_cdev.c   |  3 ++-
>  .../vc04_services/interface/vchiq_arm/vchiq_2835_arm.c |  3 +--
>  .../vc04_services/interface/vchiq_arm/vchiq_arm.c  |  3 +--
>  drivers/virt/fsl_hypervisor.c  |  4 ++--
>  include/linux/mm.h |  2 +-
>  mm/gup.c   | 12 +++-
>  mm/mempolicy.c |  2 +-
>  mm/nommu.c | 18 
> --
>  22 files changed, 49 insertions(+), 54 deletions(-)
> 
> diff --git a/arch/cris/arch-v32/drivers/cryptocop.c 
> b/arch/cris/arch-v32/drivers/cryptocop.c
> index b5698c8..099e170 100644
> --- a/arch/cris/arch-v32/drivers/cryptocop.c
> +++ b/arch/cris/arch-v32/drivers/cryptocop.c
> @@ -2722,7 +2722,6 @@ static int cryptocop_ioctl_process(struct inode *inode, 
> struct file *filp, unsig
>   err = get_user_pages((unsigned long int)(oper.indata + prev_ix),
>noinpages,
>0,  /* read access only for in data */
> -  0, /* no force */
>inpages,
>NULL);
>  
> @@ -2736,8 +2735,7 @@ static int cryptocop_ioctl_process(struct inode *inode, 
> struct file *filp, unsig
>   if (oper.do_cipher){
>   err = get_user_pages((unsigned long int)oper.cipher_outdata,
>nooutpages,
> -  1, /* write access for out data */
> -  0, /* no force */
> +  FOLL_WRITE, /* write access for out data */
>outpages,
>NULL);
>   up_read(>mm->mmap_sem);
> diff --git a/arch/ia64/kernel/err_inject.c b/arch/ia64/kernel/err_inject.c
> index 09f8457..5ed0ea9 100644
> --- a/arch/ia64/kernel/err_inject.c
> +++ b/arch/ia64/kernel/err_inject.c
> @@ -142,7 +142,7 @@ store_virtual_to_phys(struct device *dev, struct 
> device_attribute *attr,
>   u64 virt_addr=simple_strtoull(buf, NULL, 16);
>   int ret;
>  
> - ret = get_user_pages(virt_addr, 1, VM_READ, 0, NULL, NULL);
> + ret = get_user_pages(virt_addr, 1, FOLL_WRITE, NULL, NULL);
>   if (ret<=0) {
>  #ifdef ERR_INJ_DEBUG
>   printk("Virtual address %lx is not existing.\n",virt_addr);
> diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
> index 8047687..e4f8009 100644
> --- a/arch/x86/mm/mpx.c
> +++ b/arch/x86/mm/mpx.c
> @@ -544,10 +544,9 @@ static int mpx_resolve_fault(long __user *addr, int 
> write)
>  {
>   long gup_ret;
>   int nr_pages = 1;
> - int force = 0;
>  
> - gup_ret = get_user_pages((unsigned long)addr, nr_pages, write,
> - force, NULL, NULL);
> + gup_ret = get_user_pages((unsigned long)addr, nr_pages,
> + write ? FOLL_WRITE : 0, NULL, NULL);
>   /*
>* get_user_pages() returns number of pages gotten.
>* 0 means we failed to fault in and get anything,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdg

Re: [PATCH 05/10] mm: replace get_vaddr_frames() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:15, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_vaddr_frames() and
> replaces them with a gup_flags parameter to make the use of FOLL_FORCE 
> explicit
> in callers as use of this flag can result in surprising behaviour (and hence
> bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

Honza

> ---
>  drivers/gpu/drm/exynos/exynos_drm_g2d.c|  3 ++-
>  drivers/media/platform/omap/omap_vout.c|  2 +-
>  drivers/media/v4l2-core/videobuf2-memops.c |  6 +-
>  include/linux/mm.h |  2 +-
>  mm/frame_vector.c  | 13 ++---
>  5 files changed, 11 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/exynos/exynos_drm_g2d.c 
> b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> index aa92dec..fbd13fa 100644
> --- a/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> +++ b/drivers/gpu/drm/exynos/exynos_drm_g2d.c
> @@ -488,7 +488,8 @@ static dma_addr_t *g2d_userptr_get_dma_addr(struct 
> drm_device *drm_dev,
>   goto err_free;
>   }
>  
> - ret = get_vaddr_frames(start, npages, true, true, g2d_userptr->vec);
> + ret = get_vaddr_frames(start, npages, FOLL_FORCE | FOLL_WRITE,
> + g2d_userptr->vec);
>   if (ret != npages) {
>   DRM_ERROR("failed to get user pages from userptr.\n");
>   if (ret < 0)
> diff --git a/drivers/media/platform/omap/omap_vout.c 
> b/drivers/media/platform/omap/omap_vout.c
> index e668dde..a31b95c 100644
> --- a/drivers/media/platform/omap/omap_vout.c
> +++ b/drivers/media/platform/omap/omap_vout.c
> @@ -214,7 +214,7 @@ static int omap_vout_get_userptr(struct videobuf_buffer 
> *vb, u32 virtp,
>   if (!vec)
>   return -ENOMEM;
>  
> - ret = get_vaddr_frames(virtp, 1, true, false, vec);
> + ret = get_vaddr_frames(virtp, 1, FOLL_WRITE, vec);
>   if (ret != 1) {
>   frame_vector_destroy(vec);
>   return -EINVAL;
> diff --git a/drivers/media/v4l2-core/videobuf2-memops.c 
> b/drivers/media/v4l2-core/videobuf2-memops.c
> index 3c3b517..1cd322e 100644
> --- a/drivers/media/v4l2-core/videobuf2-memops.c
> +++ b/drivers/media/v4l2-core/videobuf2-memops.c
> @@ -42,6 +42,10 @@ struct frame_vector *vb2_create_framevec(unsigned long 
> start,
>   unsigned long first, last;
>   unsigned long nr;
>   struct frame_vector *vec;
> + unsigned int flags = FOLL_FORCE;
> +
> + if (write)
> + flags |= FOLL_WRITE;
>  
>   first = start >> PAGE_SHIFT;
>   last = (start + length - 1) >> PAGE_SHIFT;
> @@ -49,7 +53,7 @@ struct frame_vector *vb2_create_framevec(unsigned long 
> start,
>   vec = frame_vector_create(nr);
>   if (!vec)
>   return ERR_PTR(-ENOMEM);
> - ret = get_vaddr_frames(start & PAGE_MASK, nr, write, true, vec);
> + ret = get_vaddr_frames(start & PAGE_MASK, nr, flags, vec);
>   if (ret < 0)
>   goto out_destroy;
>   /* We accept only complete set of PFNs */
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 27ab538..5ff084f6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1305,7 +1305,7 @@ struct frame_vector {
>  struct frame_vector *frame_vector_create(unsigned int nr_frames);
>  void frame_vector_destroy(struct frame_vector *vec);
>  int get_vaddr_frames(unsigned long start, unsigned int nr_pfns,
> -  bool write, bool force, struct frame_vector *vec);
> +  unsigned int gup_flags, struct frame_vector *vec);
>  void put_vaddr_frames(struct frame_vector *vec);
>  int frame_vector_to_pages(struct frame_vector *vec);
>  void frame_vector_to_pfns(struct frame_vector *vec);
> diff --git a/mm/frame_vector.c b/mm/frame_vector.c
> index 81b6749..db77dcb 100644
> --- a/mm/frame_vector.c
> +++ b/mm/frame_vector.c
> @@ -11,10 +11,7 @@
>   * get_vaddr_frames() - map virtual addresses to pfns
>   * @start:   starting user address
>   * @nr_frames:   number of pages / pfns from start to map
> - * @write:   whether pages will be written to by the caller
> - * @force:   whether to force write access even if user mapping is
> - *   readonly. See description of the same argument of
> - *   get_user_pages().
> + * @gup_flags:   flags modifying lookup behaviour
>   * @vec: structure which receives pages / pfns of the addresses mapped.
>   *   It should have space for at least nr_frames entries.

Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Thu 13-10-16 01:20:14, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages_locked()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

After our discussion the patch looks good to me. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags

2016-10-19 Thread Jan Kara
On Tue 18-10-16 14:56:09, Lorenzo Stoakes wrote:
> On Tue, Oct 18, 2016 at 02:54:25PM +0200, Jan Kara wrote:
> > > @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned 
> > > long nr_pages,
> > >   int write, int force, struct page **pages,
> > >   struct vm_area_struct **vmas);
> > >  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> > > - int write, int force, struct page **pages, int *locked);
> > > + unsigned int gup_flags, struct page **pages, int *locked);
> >
> > Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked()
> > where gup_flags come after **pages argument. Actually it makes more sense
> > to have it before **pages so that input arguments come first and output
> > arguments second but I don't care that much. But it definitely should be
> > consistent...
> 
> It was difficult to decide quite how to arrange parameters as there was
> inconsitency with regards to parameter ordering already - for example
> __get_user_pages() places its flags argument before pages whereas, as you 
> note,
> __get_user_pages_unlocked() puts them afterwards.
> 
> I ended up compromising by trying to match the existing ordering of the 
> function
> as much as I could by replacing write, force pairs with gup_flags in the same
> location (with the exception of get_user_pages_unlocked() which I felt should
> match __get_user_pages_unlocked() in signature) or if there was already a
> gup_flags parameter as in the case of __get_user_pages_unlocked() I simply
> removed the write, force pair and left the flags as the last parameter.
> 
> I am happy to rearrange parameters as needed, however I am not sure if it'd be
> worthwhile for me to do so (I am keen to try to avoid adding too much noise 
> here
> :)
> 
> If we were to rearrange parameters for consistency I'd suggest adjusting
> __get_user_pages_unlocked() to put gup_flags before pages and do the same with
> get_user_pages_unlocked(), let me know what you think.

Yeah, ok. If the inconsistency is already there, just leave it for now.

Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 04/10] mm: replace get_user_pages_locked() write/force parameters with gup_flags

2016-10-18 Thread Jan Kara
On Thu 13-10-16 01:20:14, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from get_user_pages_locked()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>
> ---
>  include/linux/mm.h |  2 +-
>  mm/frame_vector.c  |  8 +++-
>  mm/gup.c   | 12 +++-
>  mm/nommu.c |  5 -
>  4 files changed, 15 insertions(+), 12 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6adc4bc..27ab538 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1282,7 +1282,7 @@ long get_user_pages(unsigned long start, unsigned long 
> nr_pages,
>   int write, int force, struct page **pages,
>   struct vm_area_struct **vmas);
>  long get_user_pages_locked(unsigned long start, unsigned long nr_pages,
> - int write, int force, struct page **pages, int *locked);
> + unsigned int gup_flags, struct page **pages, int *locked);

Hum, the prototype is inconsistent with e.g. __get_user_pages_unlocked()
where gup_flags come after **pages argument. Actually it makes more sense
to have it before **pages so that input arguments come first and output
arguments second but I don't care that much. But it definitely should be
consistent...

    Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 03/10] mm: replace get_user_pages_unlocked() write/force parameters with gup_flags

2016-10-18 Thread Jan Kara
On Thu 13-10-16 01:20:13, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from 
> get_user_pages_unlocked()
> and replaces them with a gup_flags parameter to make the use of FOLL_FORCE
> explicit in callers as use of this flag can result in surprising behaviour 
> (and
> hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 02/10] mm: remove write/force parameters from __get_user_pages_unlocked()

2016-10-18 Thread Jan Kara
On Thu 13-10-16 01:20:12, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from
> __get_user_pages_unlocked() to make the use of FOLL_FORCE explicit in callers 
> as
> use of this flag can result in surprising behaviour (and hence bugs) within 
> the
> mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

The patch looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: [PATCH 01/10] mm: remove write/force parameters from __get_user_pages_locked()

2016-10-18 Thread Jan Kara
On Thu 13-10-16 01:20:11, Lorenzo Stoakes wrote:
> This patch removes the write and force parameters from 
> __get_user_pages_locked()
> to make the use of FOLL_FORCE explicit in callers as use of this flag can 
> result
> in surprising behaviour (and hence bugs) within the mm subsystem.
> 
> Signed-off-by: Lorenzo Stoakes <lstoa...@gmail.com>

Looks good. You can add:

Reviewed-by: Jan Kara <j...@suse.cz>

        Honza
-- 
Jan Kara <j...@suse.com>
SUSE Labs, CR


Re: Crash (ext3 ) during 2.6.29-rc6 boot

2009-02-24 Thread Jan Kara
  Hello,

On Tue 24-02-09 12:08:37, Sachin P. Sant wrote:
 Jan Kara wrote:
   Hmm, OK. But then I'm not sure how that can happen. Obviously, memcpy
 somehow got beyond end of the page referenced by bh->b_data. So it means
 that le16_to_cpu(entry->e_value_offs) + size > page_size. But
 ext3_xattr_find_entry() calls ext3_xattr_check_entry() which in
 particular checks whether e_value_offs + e_value_size isn't greater than
 bh->b_size. So I see no way how memcpy can get beyond end of the page.
   Sachin, is the problem reproducible? If yes, can you send us contents
   
 Yes, I am able to recreate this problem easily. As I had mentioned, if the
 earlier kernel is booted with selinux enabled and then 2.6.29-rc6 is booted
 i get this crash. But if i specify selinux=0 at command line, 2.6.29-rc6 boots
 without any problem.

 of the page just before the faulting address (i.e., for current fault it
 would be 0xc0003f37-0xc0003f37). As far as I can
 remember powerpc monitor could dump it.
   
 Here is the page dump. This time it crashed while accessing address
 0xc0002d67.
  Thanks for the dump.

 Unable to handle kernel paging request for data at address 0xc0002d67
 Faulting instruction address: 0xc0039574
 cpu 0x1: Vector: 300 (Data Access) at [c0004288b0b0]
pc: c0039574: .memcpy+0x74/0x244
lr: c01b497c: .ext3_xattr_get+0x288/0x2f4
sp: c0004288b330
   msr: 80009032

 1:mon d 0xc0002d66
 ... SNIP ...

 c0002d66efd0    ||
 c0002d66efe0    ||
 c0002d66eff0    ||
 c0002d66f000 02ea0004 0100e200d20a  ||
 c0002d66f010    ||
 c0002d66f020 0706e40f 1b00e200d20a  ||
 c0002d66f030 73656c696e757800   |selinux.|
 c0002d66f040    ||
 c0002d66f050    ||
 c0002d66f060    ||

 ... SNIP ...

 c0002d66ff60    ||
 c0002d66ff70    ||
 c0002d66ff80    ||
 c0002d66ff90    ||
 c0002d66ffa0    ||
 c0002d66ffb0    ||
 c0002d66ffc0    ||
 c0002d66ffd0    ||
 c0002d66ffe0 73797374 656d5f753a6f626a  |system_u:obj|
 c0002d66fff0 6563745f723a7573 725f743a7330  |ect_r:usr_t:s0..|
 c0002d67    ||
 1:mon r
 R00 = e40f   R16 = 005d
 R01 = c0004288b330   R17 = 
 R02 = c09f59b8   R18 = fffbfe9e
 R03 = c00044aa34a0   R19 = 10042638
 R04 = c0002d66fff4   R20 = 10041610
 R05 = 0003   R21 = 00ff
 R06 =    R22 = 0006
 R07 = 0001   R23 = c07d27c1
 R08 = 723a7573725f743a   R24 = c0002c0cd758
 R09 = 3a6f626a6563745f   R25 = c00044aa3488
 R10 = c017b43c   R26 = c0002c0cd6f0
 R11 = c0002d66f020   R27 = c0002c0cd860
 R12 = d23c14b0   R28 = c0002c0b0840
 R13 = c0a93680   R29 = 001b
 R14 = 41ed   R30 = c09880b0
 R15 = 1004   R31 = ffde
 pc  = c0039574 .memcpy+0x74/0x244
 lr  = c01b497c .ext3_xattr_get+0x288/0x2f4
 msr = 80009032   cr  = 4400044b
 ctr =    xer = 2001   trap =  300
 dar = c0002d67   dsisr = 4000
 1:mon zr

   BTW, I suppose you use 4KB blocksize on the filesystem, right?
   
 Yes.

 dumpe2fs /dev/sda3 | grep -i "block size"
 dumpe2fs 1.39 (29-May-2006)
 Block size:   4096
  OK. The xattr block causing the oops is completely correct. To me it seems
more like some problem in powerpc memcpy() (I saw some changes went into it
at the end of December) - we call it to copy 27 bytes from address
0xc0002d66ffe4 (a range that ends one byte before the end of the page).
Could some of the powerpc guys have a look whether this could be the case?
I'm not quite fluent in the powerpc assembly so it would take me ages ;).

Honza
-- 
Jan Kara j...@suse.cz
SUSE Labs, CR

Re: Crash (ext3 ) during 2.6.29-rc6 boot

2009-02-24 Thread Jan Kara
 Andrew Morton wrote:
 hm, I wonder what could have caused that - we haven't altered
 fs/ext3/xattr.c in ages.
 
 What is the most recent kernel version you know of which didn't do
 this?  Bear in mind that this crash might be triggered by the
 current contents of the filesystem, so if possible, please test
 some other kernel versions on that disk.
   
 I am trying to boot a vanilla kernel on this machine for the first
 time. Haven't tried any other kernels. Will give it a try.
 
 It looks like we died in ext3_xattr_block_get():
 
  memcpy(buffer, bh->b_data + le16_to_cpu(entry->e_value_offs),
 size);
 
 Perhaps entry->e_value_offs is no good.  I wonder if the filesystem is
 corrupted and this snuck through the defenses.
 
 I also wonder if there is enough info in that trace for a ppc person to
 be able to determine whether the faulting address is in the source or
 destination of the memcpy() (please)?
   
 Some more information if this could be of any help.
 
 0:mon di 0xc0039574
 c0039574  e9240008  ld      r9,8(r4)
 c0039578  409d0010  ble     cr7,c0039588	# .memcpy+0x88/0x244
 c003957c  79290002  rotldi  r9,r9,32
 c0039580  9123      stw     r9,0(r3)
 c0039584  38630004  addi    r3,r3,4
 c0039588  409e0010  bne     cr7,c0039598	# .memcpy+0x98/0x244
 c003958c  79298000  rotldi  r9,r9,16
 c0039590  b123      sth     r9,0(r3)
 c0039594  38630002  addi    r3,r3,2
 c0039598  409f000c  bns     cr7,c00395a4	# .memcpy+0xa4/0x244
 c003959c  79294000  rotldi  r9,r9,8
 c00395a0  9923      stb     r9,0(r3)
 c00395a4  e8610030  ld      r3,48(r1)
 c00395a8  4e800020  blr
 c00395ac  78a6e8c2  rldicl  r6,r5,61,3
 c00395b0  38a5fff0  addi    r5,r5,-16
 0:mon r
 R00 = e40f   R16 = 100edbc8
 R01 = c0003e59b3e0   R17 = 100b
 R02 = c09c2110   R18 = 0005
 R03 = c00044bc90e0   R19 = fff0d7a8
 R04 = c00039c4   R20 = fff0d708
 R05 = 0003   R21 = 00ff
 R06 =    R22 = 0006
 R07 = 0001   R23 = c079ab49
 R08 = 723a7573725f743a   R24 = c000372fe2a8
 R09 = 3a6f626a6563745f   R25 = c00044bc90c8
 R10 = c0003b250968   R26 = c000372fe240
 R11 = c0039500   R27 = c000372fe3b0
 R12 = d244c590   R28 = c000372c5280
 R13 = c0a53480   R29 = 001b
 R14 = 100d   R30 = d24654d0
 R15 =    R31 = ffde
 pc  = c0039574 .memcpy+0x74/0x244
 lr  = d244916c .ext3_xattr_get+0x288/0x2f4 [ext3]
 msr = 80009032   cr  = 4400844b
 ctr =    xer = 0001   trap =  300
 dar = c00039d0   dsisr = 4000
 0:mon
  Yes, this makes me even more suspicious that memcpy() on powerpc could
be at fault. The instruction (ld r9,8(r4)) is loading the last 8 bytes to
copy, but in fact it should load only 3 bytes in our case, because the
remaining 5 bytes are not in the range we specified, and thus the larger
load can cause a page fault...
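
In other words, a correct word-at-a-time copy must never issue a load that
extends past src + n; the tail has to be finished with narrower accesses.
A sketch of that invariant in generic C (illustrative only, not the actual
powerpc assembly fix):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void copy_no_overread(unsigned char *d, const unsigned char *s,
			     size_t n)
{
	size_t i = 0;

	/* Full 8-byte words only while the whole word lies inside
	 * [s, s + n). */
	for (; i + 8 <= n; i += 8) {
		uint64_t w;
		memcpy(&w, s + i, 8);
		memcpy(d + i, &w, 8);
	}
	/* Remaining 1..7 bytes: byte accesses only, so the routine
	 * never touches memory outside the requested range. */
	for (; i < n; i++)
		d[i] = s[i];
}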

Honza
-- 
Jan Kara j...@suse.cz
SuSE CR Labs


Re: Crash (ext3 ) during 2.6.29-rc6 boot

2009-02-23 Thread Jan Kara
 Andrew Morton writes:
 
  It looks like we died in ext3_xattr_block_get():
  
   memcpy(buffer, bh->b_data + le16_to_cpu(entry->e_value_offs),
  size);
  
   Perhaps entry->e_value_offs is no good.  I wonder if the filesystem is
  corrupted and this snuck through the defenses.
  
  I also wonder if there is enough info in that trace for a ppc person to
  be able to determine whether the faulting address is in the source or
  destination of the memcpy() (please)?
 
 It appears to have faulted on a load, implicating the source.  The
 address being referenced (0xc0003f38) doesn't look
 outlandish.  I wonder if this kernel has CONFIG_DEBUG_PAGEALLOC turned
 on, and what page size is selected?
   Hmm, OK. But then I'm not sure how that can happen. Obviously, memcpy()
somehow got beyond the end of the page referenced by bh->b_data. So it means
that le16_to_cpu(entry->e_value_offs) + size > page_size. But
ext3_xattr_find_entry() calls ext3_xattr_check_entry(), which in
particular checks that e_value_offs + e_value_size isn't greater than
bh->b_size. So I see no way memcpy() can get beyond the end of the page.
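
The check boils down to something like this (a simplified, self-contained
sketch of the bound that ext3_xattr_check_entry() enforces, not the
verbatim kernel code):

/* The value must lie entirely within the xattr block; with a 4KB
 * blocksize an entry passing this check can never make an in-range
 * memcpy() cross the end of the page backing bh->b_data. */
static int xattr_value_in_block(size_t value_offs, size_t value_size,
				size_t block_size)
{
	if (value_offs + value_size > block_size)
		return -1;	/* corrupt entry: rejected before memcpy() */
	return 0;
}
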
   Sachin, is the problem reproducible? If yes, can you send us the contents
of the page just before the faulting address (i.e., for the current fault it
would be 0xc0003f37-0xc0003f37)? As far as I can
remember, the powerpc monitor can dump it.
  BTW, I suppose you use 4KB blocksize on the filesystem, right?

Honza
-- 
Jan Kara j...@suse.cz
SuSE CR Labs


Re: [BUG] 2.6.24-rc3-mm2 kernel bug on nfs cifs mounted partitions

2007-11-29 Thread Jan Kara
On Thu 29-11-07 17:27:08, Kamalesh Babulal wrote:
 Andrew Morton wrote:
  On Thu, 29 Nov 2007 14:30:14 +0530 Kamalesh Babulal [EMAIL PROTECTED] 
  wrote:
  
  Hi Andrew,
 
  While running file system stress on nfs and cifs mounted partitions, the 
  machine
  drops to xmon
 
  1:mon e
  cpu 0x1: Vector: 300 (Data Access) at [c00080a9f880]
  pc: c01392c8: .inotify_inode_queue_event+0x50/0x158
  lr: c01074d0: .vfs_link+0x204/0x298
  sp: c00080a9fb00
 msr: 80009032
 dar: 280 
   dsisr: 4001
current = 0xc000c8e6f670
paca= 0xc0512c00
  pid   = 2848, comm = fsstress
  1:mon t
  [c00080a9fbd0] c01074d0 .vfs_link+0x204/0x298
  [c00080a9fc70] c010b6e0 .sys_linkat+0x134/0x1b4
  [c00080a9fe30] c000872c syscall_exit+0x0/0x40
  --- Exception: c00 (System Call) at 0ff1bdfc
  SP (ffeaed10) is in userspace
  1:mon r
  R00 = c01074d0   R16 = 
  R01 = c00080a9fb00   R17 = 
  R02 = c060c380   R18 = 
  R03 =    R19 = 
  R04 = 0004   R20 = 
  R05 =    R21 = 
  R06 =    R22 = 
  R07 =    R23 = 0004
  R08 =    R24 = 0280
  R09 =    R25 = f000
  R10 = 0001   R26 = c00082827790
  R11 = c03963e8   R27 = c000828275a0
  R12 = d0deec78   R28 = 
  R13 = c0512c00   R29 = c0007b18fcf0
  R14 =    R30 = c05bc088
  R15 =    R31 = 
  pc  = c01392c8 .inotify_inode_queue_event+0x50/0x158
  lr  = c01074d0 .vfs_link+0x204/0x298
  msr = 80009032   cr  = 24000882
  ctr = c03963e8   xer =    trap =  300 
  dar = 0280   dsisr = 4001
 
 
  The gdb output shows 
 
  0xc01076d4 is in vfs_symlink (include/linux/fsnotify.h:108).
  103  * fsnotify_create - 'name' was linked in
  104  */  
  105 static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
  106 {
  107 inode_dir_notify(inode, DN_CREATE);
  108 inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
  109   dentry->d_inode);
  110 audit_inode_child(dentry->d_name.name, dentry, inode);
  111 }   
  112
 
  
  If it is reproducible can you please try reverting
  inotify-send-in_attrib-events-when-link-count-changes.patch?
 
 Hi Andrew,
 
 reverting the patch 
 inotify-send-in_attrib-events-when-link-count-changes.patch, the 
 bug is not reproduced.
  OK, thanks for testing. I was trying to reproduce the problem locally but
without success so far - I guess it's either NFS or CIFS which makes the
problem appear. I'll investigate more.

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR


Re: [BUG] 2.6.24-rc3-mm2 kernel bug on nfs cifs mounted partitions

2007-11-29 Thread Jan Kara
On Thu 29-11-07 17:27:08, Kamalesh Babulal wrote:
 Andrew Morton wrote:
  On Thu, 29 Nov 2007 14:30:14 +0530 Kamalesh Babulal [EMAIL PROTECTED] 
  wrote:
  
  Hi Andrew,
 
  While running file system stress on nfs and cifs mounted partitions, the 
  machine
  drops to xmon
 
  1:mon e
  cpu 0x1: Vector: 300 (Data Access) at [c00080a9f880]
  pc: c01392c8: .inotify_inode_queue_event+0x50/0x158
  lr: c01074d0: .vfs_link+0x204/0x298
  sp: c00080a9fb00
 msr: 80009032
 dar: 280 
   dsisr: 4001
current = 0xc000c8e6f670
paca= 0xc0512c00
  pid   = 2848, comm = fsstress
  1:mon t
  [c00080a9fbd0] c01074d0 .vfs_link+0x204/0x298
  [c00080a9fc70] c010b6e0 .sys_linkat+0x134/0x1b4
  [c00080a9fe30] c000872c syscall_exit+0x0/0x40
  --- Exception: c00 (System Call) at 0ff1bdfc
  SP (ffeaed10) is in userspace
  1:mon r
  R00 = c01074d0   R16 = 
  R01 = c00080a9fb00   R17 = 
  R02 = c060c380   R18 = 
  R03 =    R19 = 
  R04 = 0004   R20 = 
  R05 =    R21 = 
  R06 =    R22 = 
  R07 =    R23 = 0004
  R08 =    R24 = 0280
  R09 =    R25 = f000
  R10 = 0001   R26 = c00082827790
  R11 = c03963e8   R27 = c000828275a0
  R12 = d0deec78   R28 = 
  R13 = c0512c00   R29 = c0007b18fcf0
  R14 =    R30 = c05bc088
  R15 =    R31 = 
  pc  = c01392c8 .inotify_inode_queue_event+0x50/0x158
  lr  = c01074d0 .vfs_link+0x204/0x298
  msr = 80009032   cr  = 24000882
  ctr = c03963e8   xer =    trap =  300 
  dar = 0280   dsisr = 4001
 
 
  The gdb output shows 
 
  0xc01076d4 is in vfs_symlink (include/linux/fsnotify.h:108).
  103  * fsnotify_create - 'name' was linked in
  104  */  
  105 static inline void fsnotify_create(struct inode *inode, struct dentry *dentry)
  106 {
  107 inode_dir_notify(inode, DN_CREATE);
  108 inotify_inode_queue_event(inode, IN_CREATE, 0, dentry->d_name.name,
  109   dentry->d_inode);
  110 audit_inode_child(dentry->d_name.name, dentry, inode);
  111 }   
  112
 
  
  If it is reproducible can you please try reverting
  inotify-send-in_attrib-events-when-link-count-changes.patch?
 
 Hi Andrew,
 
 reverting the patch 
 inotify-send-in_attrib-events-when-link-count-changes.patch, the 
 bug is not reproduced.
  OK, it's a problem with CIFS. Its cifs_hardlink() function doesn't call
d_instantiate() and thus returns a dentry with d_inode set to NULL. I'm not
sure if such behavior is really correct but anyway, attached is a new
version of the patch which should handle it gracefully. Kamalesh, can you
please give it a try? Thanks.
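
The gist of the graceful handling (a sketch of the idea only - the real
hunk is in the attached patch, and the exact helpers used there may differ):
report the events against old_dentry->d_inode, which the VFS guarantees to
be instantiated, rather than against new_dentry->d_inode, which CIFS leaves
NULL:

	/* In vfs_link(), after ->link() has succeeded: */
	if (!error) {
		/* old_dentry->d_inode is the linked inode and is never
		 * NULL here; new_dentry->d_inode may be NULL when the
		 * filesystem (e.g. CIFS) skips d_instantiate(). */
		fsnotify_link_count(old_dentry->d_inode);
		fsnotify_create(dir, new_dentry);
	}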

Honza
-- 
Jan Kara [EMAIL PROTECTED]
SUSE Labs, CR
---

  Currently, no notification event is sent when an inode's link count
changes. This is inconvenient for applications in some cases:
  Suppose you have the following directory structure

foo/test
bar/

  and you watch test. If someone does mv foo/test bar/, you get the event
IN_MOVE_SELF and you know something has happened with the file test.
However, if someone does ln foo/test bar/test and rm foo/test, you get no
inotify event for the file test (only the directories foo and bar receive
events).
  Furthermore, it could be argued that the link count belongs to a file's
metadata and thus IN_ATTRIB should be sent when it changes.
  The following patch implements sending of IN_ATTRIB inotify events when
the link count of an inode changes, i.e., when a hardlink to the inode is
created or when it is removed. This event is sent in addition to all the
events sent so far. In particular, when the last link to a file is removed,
an IN_ATTRIB event is sent in addition to the IN_DELETE_SELF event.
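
To observe the new event from userspace, a watcher along these lines (a
minimal sketch with error handling omitted; run "ln foo/test bar/test" or
"rm foo/test" from another shell while it blocks in read()) would do:

#include <stdio.h>
#include <unistd.h>
#include <sys/inotify.h>

int main(int argc, char **argv)
{
	char buf[4096] __attribute__ ((aligned(8)));

	if (argc < 2)
		return 1;
	int fd = inotify_init();
	inotify_add_watch(fd, argv[1], IN_ATTRIB | IN_DELETE_SELF);

	/* One read() may return several packed events. */
	ssize_t len = read(fd, buf, sizeof(buf));
	for (char *p = buf; p < buf + len; ) {
		struct inotify_event *ev = (struct inotify_event *)p;

		if (ev->mask & IN_ATTRIB)
			printf("IN_ATTRIB (metadata/link count changed)\n");
		if (ev->mask & IN_DELETE_SELF)
			printf("IN_DELETE_SELF (last link removed)\n");
		p += sizeof(*ev) + ev->len;
	}
	close(fd);
	return 0;
}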

Signed-off-by: Jan Kara [EMAIL PROTECTED]

diff --git a/fs/namei.c b/fs/namei.c
index 3b993db..c1839d1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2188,6 +2188,7 @@ int vfs_unlink(struct inode *dir, struct dentry *dentry)
 
/* We don't d_delete() NFS sillyrenamed files--they still exist. */
	if (!error && !(dentry->d_flags & DCACHE_NFSFS_RENAMED)) {
+		fsnotify_link_count(dentry->d_inode);
d_delete(dentry);
}
 
@@ -2360,7 +2361,7 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_dentry)
	error = dir->i_op->link(old_dentry, dir, new_dentry);
	mutex_unlock(&old_dentry->d_inode->i_mutex);
if (!error)
-   fsnotify_create(dir