from:"Christian Brauner"

Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()

2024-02-08 Thread Christian Brauner

On Wed, Feb 07, 2024 at 03:34:59PM +0100, Paolo Bonzini wrote:
> On Wed, Nov 22, 2023 at 1:49 PM Christian Brauner  wrote:
> >
> > Ever since the evenfd type was introduced back in 2007 in commit
> > e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
> > function only ever passed 1 as a value for @n. There's no point in
> > keeping that additional argument.
> >
> > Signed-off-by: Christian Brauner 
> > ---
> >  arch/x86/kvm/hyperv.c |  2 +-
> >  arch/x86/kvm/xen.c|  2 +-
> >  virt/kvm/eventfd.c|  4 ++--
> >  30 files changed, 60 insertions(+), 63 deletions(-)
> 
> For KVM:
> 
> Acked-by: Paolo Bonzini 

I really appreciate all of the ACKs but just fyi that this was merged
for v6.8-rc1. Just so that there's no confusion!

Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-28 Thread Christian Brauner

On Mon, Nov 27, 2023 at 09:10:54AM -0800, Linus Torvalds wrote:
> On Mon, 27 Nov 2023 at 02:27, Christian Brauner  wrote:
> >
> > So I've picked up your patch (vfs.misc). It's clever alright so thanks
> > for the comments in there otherwise I would've stared at this for far
> > too long.
> 
> Note that I should probably have commented on one other thing: that
> whole "just load from fd[0] is always safe, because the fd[] array
> always exists".

I added a comment to that effect in the code.

> 
> IOW, that whole "load and mask" thing only works when you know the
> array exists at all.
> 
> Doing that "just mask the index" wouldn't be valid if "size = 0" is an
> option and might mean that we don't have an array at all (ie if "->fd"
> itself could be NULL.
> 
> But we never have a completely empty file descriptor array, and
> fdp->fd is never NULL.  At a minimum 'max_fds' is NR_OPEN_DEFAULT.
> 
> (The whole 'tsk->files' could be NULL, but only for kernel threads or
> when exiting, so fget_task() will check for *that*, but it's a
> separate thing)

Yep.

> 
> So that's why it's safe to *entirely* remove the whole
> 
> if (unlikely(fd >= fdt->max_fds))
> 
> test, and do it *all* with just "mask the index, and mask the resulting load".

Yep.

Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Christian Brauner

> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

So I've picked up your patch (vfs.misc). It's clever alright so thanks
for the comments in there otherwise I would've stared at this for far
too long.

It's a little unpleasant because of the cast-orama going on before we
check the file pointer but I don't see that it's in any way wrong. And
given how focussed people are with __fget_* performance I think it might
even be the right thing to do.

But the cleverness means we have the same logic slightly differently
twice. Not too bad ofc but not too nice either especially because that
rcu lookup is pretty complicated already.

A few days ago I did just write a long explanatory off-list email to
someone who had questions about this and who is fairly experienced so
we're not making it easy on people. But performance or simplicity; one
can't necessarily always have both.

Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

2023-11-27 Thread Christian Brauner

> I took a look at the code generation, and honestly, I think we're
> better off just making __fget_files_rcu() have special logic for this
> all, and not use __get_file_rcu().

My initial massaging of the patch did that btw. Then I sat there
wondering whether it would matter if we just made it possible to reuse
that code and I went through a bunch of iterations. Oh well, it seems to
matter.

> Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that

Concept looks sane to me.

> __get_file_rcu() does, and I don't get it. Both things come from
> volatile accesses, I don't see the point of those games, but I also
> didn't care, since it's no longer in a critical code path.
> 
> Christian?

Puts his completely imagined "I understand RCU head on".
SLAB_TYPESAFE_BY_RCU makes the RCU consume memory ordering that the
compiler doesn't officialy support (afaik) a bit wonky.

So the thinking was that we could have code patterns where you could
free the object and reallocate it while legitimatly passing the pointer
recheck. In that case there is no memory ordering between the allocation
and the pointer recheck because the last (re)allocation could have been
after the rcu_dereference().

To combat that all future loads were made to have a dependency on the
first load using the hidevar trick.

I guess that might only be theoretically possible but not in practice?
But then I liked that we explicitly commented on it as a reminder.

Re: [PATCH v2 0/4] eventfd: simplify signal helpers

2023-11-24 Thread Christian Brauner

On Wed, 22 Nov 2023 13:48:21 +0100, Christian Brauner wrote:
> Hey everyone,
> 
> This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
> significantly. They can be made void and not take any unnecessary
> arguments.
> 
> I've added a few more simplifications based on Sean's suggestion.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/4] i915: make inject_virtual_interrupt() void
  https://git.kernel.org/vfs/vfs/c/858848719210
[2/4] eventfd: simplify eventfd_signal()
  https://git.kernel.org/vfs/vfs/c/ded0f31f825f
[3/4] eventfd: simplify eventfd_signal_mask()
  https://git.kernel.org/vfs/vfs/c/45ee1c990e88
[4/4] eventfd: make eventfd_signal{_mask}() void
  https://git.kernel.org/vfs/vfs/c/37d5d473e749

Re: [PATCH v2 2/4] eventfd: simplify eventfd_signal()

2023-11-23 Thread Christian Brauner

> >   * eventfd_signal - Adds @n to the eventfd counter.
> 
> This still refers to @n here, and in patch 4.

Fixed and folded. Thanks!

Re: [PATCH v2 1/4] i915: make inject_virtual_interrupt() void

2023-11-23 Thread Christian Brauner

> > +   if (!vgpu->msi_trigger)
> > +   return;
> > +   eventfd_signal(vgpu->msi_trigger, 1);
> >  }
> 
> I think it's a little simpler to write as
> if (vgpu->msi_trigger)
> eventfd_signal(vgpu->msi_trigger, 1);

Good point. I folded that suggestion into the patch.

[PATCH v2 4/4] eventfd: make eventfd_signal{_mask}() void

2023-11-22 Thread Christian Brauner

No caller care about the return value.

Signed-off-by: Christian Brauner 
---
 fs/eventfd.c| 40 +++-
 include/linux/eventfd.h | 16 +++-
 2 files changed, 22 insertions(+), 34 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index a9a6de920fb4..13be2fb7fc96 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -43,10 +43,19 @@ struct eventfd_ctx {
int id;
 };
 
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
+/**
+ * eventfd_signal - Adds @n to the eventfd counter.
+ * @ctx: [in] Pointer to the eventfd context.
+ * @mask: [in] poll mask
+ *
+ * This function is supposed to be called by the kernel in paths that do not
+ * allow sleeping. In this function we allow the counter to reach the 
ULLONG_MAX
+ * value, and we signal this as overflow condition by returning a EPOLLERR
+ * to poll(2).
+ */
+void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
 {
unsigned long flags;
-   __u64 n = 1;
 
/*
 * Deadlock or stack overflow issues can happen if we recurse here
@@ -57,37 +66,18 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t 
mask)
 * safe context.
 */
if (WARN_ON_ONCE(current->in_eventfd))
-   return 0;
+   return;
 
spin_lock_irqsave(>wqh.lock, flags);
current->in_eventfd = 1;
-   if (ULLONG_MAX - ctx->count < n)
-   n = ULLONG_MAX - ctx->count;
-   ctx->count += n;
+   if (ctx->count < ULLONG_MAX)
+   ctx->count++;
if (waitqueue_active(>wqh))
wake_up_locked_poll(>wqh, EPOLLIN | mask);
current->in_eventfd = 0;
spin_unlock_irqrestore(>wqh.lock, flags);
-
-   return n == 1;
-}
-
-/**
- * eventfd_signal - Adds @n to the eventfd counter.
- * @ctx: [in] Pointer to the eventfd context.
- *
- * This function is supposed to be called by the kernel in paths that do not
- * allow sleeping. In this function we allow the counter to reach the 
ULLONG_MAX
- * value, and we signal this as overflow condition by returning a EPOLLERR
- * to poll(2).
- *
- * Returns the amount by which the counter was incremented.
- */
-__u64 eventfd_signal(struct eventfd_ctx *ctx)
-{
-   return eventfd_signal_mask(ctx, 0);
 }
-EXPORT_SYMBOL_GPL(eventfd_signal);
+EXPORT_SYMBOL_GPL(eventfd_signal_mask);
 
 static void eventfd_free_ctx(struct eventfd_ctx *ctx)
 {
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 4f8aac7eb62a..fea7c4eb01d6 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -35,8 +35,7 @@ void eventfd_ctx_put(struct eventfd_ctx *ctx);
 struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
-__u64 eventfd_signal(struct eventfd_ctx *ctx);
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
+void eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t 
*wait,
  __u64 *cnt);
 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
@@ -58,14 +57,8 @@ static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
return ERR_PTR(-ENOSYS);
 }
 
-static inline int eventfd_signal(struct eventfd_ctx *ctx)
+static inline void eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
 {
-   return -ENOSYS;
-}
-
-static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
-{
-   return -ENOSYS;
 }
 
 static inline void eventfd_ctx_put(struct eventfd_ctx *ctx)
@@ -91,5 +84,10 @@ static inline void eventfd_ctx_do_read(struct eventfd_ctx 
*ctx, __u64 *cnt)
 
 #endif
 
+static inline void eventfd_signal(struct eventfd_ctx *ctx)
+{
+   eventfd_signal_mask(ctx, 0);
+}
+
 #endif /* _LINUX_EVENTFD_H */
 

-- 
2.42.0

[PATCH v2 3/4] eventfd: simplify eventfd_signal_mask()

2023-11-22 Thread Christian Brauner

The eventfd_signal_mask() helper was introduced for io_uring and similar
to eventfd_signal() it always passed 1 for @n. So don't bother with that
argument at all.

Signed-off-by: Christian Brauner 
---
 fs/eventfd.c| 7 ---
 include/linux/eventfd.h | 5 ++---
 io_uring/io_uring.c | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index dc9e01053235..a9a6de920fb4 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -43,9 +43,10 @@ struct eventfd_ctx {
int id;
 };
 
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
 {
unsigned long flags;
+   __u64 n = 1;
 
/*
 * Deadlock or stack overflow issues can happen if we recurse here
@@ -68,7 +69,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, 
__poll_t mask)
current->in_eventfd = 0;
spin_unlock_irqrestore(>wqh.lock, flags);
 
-   return n;
+   return n == 1;
 }
 
 /**
@@ -84,7 +85,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, 
__poll_t mask)
  */
 __u64 eventfd_signal(struct eventfd_ctx *ctx)
 {
-   return eventfd_signal_mask(ctx, 1, 0);
+   return eventfd_signal_mask(ctx, 0);
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 562089431551..4f8aac7eb62a 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -36,7 +36,7 @@ struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
 __u64 eventfd_signal(struct eventfd_ctx *ctx);
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask);
+__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t 
*wait,
  __u64 *cnt);
 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
@@ -63,8 +63,7 @@ static inline int eventfd_signal(struct eventfd_ctx *ctx)
return -ENOSYS;
 }
 
-static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n,
- unsigned mask)
+static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
 {
return -ENOSYS;
 }
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ed254076c723..70170a41eac4 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -558,7 +558,7 @@ static void io_eventfd_ops(struct rcu_head *rcu)
int ops = atomic_xchg(_fd->ops, 0);
 
if (ops & BIT(IO_EVENTFD_OP_SIGNAL_BIT))
-   eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
+   eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
 
/* IO_EVENTFD_OP_FREE_BIT may not be set here depending on callback
 * ordering in a race but if references are 0 we know we have to free
@@ -594,7 +594,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
goto out;
 
if (likely(eventfd_signal_allowed())) {
-   eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
+   eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
} else {
atomic_inc(_fd->refs);
if (!atomic_fetch_or(BIT(IO_EVENTFD_OP_SIGNAL_BIT), 
_fd->ops))

-- 
2.42.0

[PATCH v2 2/4] eventfd: simplify eventfd_signal()

2023-11-22 Thread Christian Brauner

Ever since the evenfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Signed-off-by: Christian Brauner 
---
 arch/x86/kvm/hyperv.c |  2 +-
 arch/x86/kvm/xen.c|  2 +-
 drivers/accel/habanalabs/common/device.c  |  2 +-
 drivers/fpga/dfl.c|  2 +-
 drivers/gpu/drm/drm_syncobj.c |  6 +++---
 drivers/gpu/drm/i915/gvt/interrupt.c  |  2 +-
 drivers/infiniband/hw/mlx5/devx.c |  2 +-
 drivers/misc/ocxl/file.c  |  2 +-
 drivers/s390/cio/vfio_ccw_chp.c   |  2 +-
 drivers/s390/cio/vfio_ccw_drv.c   |  4 ++--
 drivers/s390/cio/vfio_ccw_ops.c   |  6 +++---
 drivers/s390/crypto/vfio_ap_ops.c |  2 +-
 drivers/usb/gadget/function/f_fs.c|  4 ++--
 drivers/vdpa/vdpa_user/vduse_dev.c|  6 +++---
 drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c|  2 +-
 drivers/vfio/pci/vfio_pci_core.c  |  6 +++---
 drivers/vfio/pci/vfio_pci_intrs.c | 12 ++--
 drivers/vfio/platform/vfio_platform_irq.c |  4 ++--
 drivers/vhost/vdpa.c  |  4 ++--
 drivers/vhost/vhost.c | 10 +-
 drivers/vhost/vhost.h |  2 +-
 drivers/virt/acrn/ioeventfd.c |  2 +-
 drivers/xen/privcmd.c |  2 +-
 fs/aio.c  |  2 +-
 fs/eventfd.c  |  9 +++--
 include/linux/eventfd.h   |  4 ++--
 mm/memcontrol.c   | 10 +-
 mm/vmpressure.c   |  2 +-
 samples/vfio-mdev/mtty.c  |  4 ++--
 virt/kvm/eventfd.c|  4 ++--
 30 files changed, 60 insertions(+), 63 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 238afd7335e4..4943f6b2bbee 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2388,7 +2388,7 @@ static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, 
struct kvm_hv_hcall *h
if (!eventfd)
return HV_STATUS_INVALID_PORT_ID;
 
-   eventfd_signal(eventfd, 1);
+   eventfd_signal(eventfd);
return HV_STATUS_SUCCESS;
 }
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index e53fad915a62..523bb6df5ac9 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2088,7 +2088,7 @@ static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu 
*vcpu, u64 param, u64 *r)
if (ret < 0 && ret != -ENOTCONN)
return false;
} else {
-   eventfd_signal(evtchnfd->deliver.eventfd.ctx, 1);
+   eventfd_signal(evtchnfd->deliver.eventfd.ctx);
}
 
*r = 0;
diff --git a/drivers/accel/habanalabs/common/device.c 
b/drivers/accel/habanalabs/common/device.c
index 9711e8fc979d..3a89644f087c 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -2044,7 +2044,7 @@ static void hl_notifier_event_send(struct 
hl_notifier_event *notifier_event, u64
notifier_event->events_mask |= event_mask;
 
if (notifier_event->eventfd)
-   eventfd_signal(notifier_event->eventfd, 1);
+   eventfd_signal(notifier_event->eventfd);
 
mutex_unlock(_event->lock);
 }
diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
index dd7a783d53b5..e73f88050f08 100644
--- a/drivers/fpga/dfl.c
+++ b/drivers/fpga/dfl.c
@@ -1872,7 +1872,7 @@ static irqreturn_t dfl_irq_handler(int irq, void *arg)
 {
struct eventfd_ctx *trigger = arg;
 
-   eventfd_signal(trigger, 1);
+   eventfd_signal(trigger);
return IRQ_HANDLED;
 }
 
diff --git a/drivers/gpu/drm/drm_syncobj.c b/drivers/gpu/drm/drm_syncobj.c
index 01da6789d044..b9cc62982196 100644
--- a/drivers/gpu/drm/drm_syncobj.c
+++ b/drivers/gpu/drm/drm_syncobj.c
@@ -1365,7 +1365,7 @@ static void syncobj_eventfd_entry_fence_func(struct 
dma_fence *fence,
struct syncobj_eventfd_entry *entry =
container_of(cb, struct syncobj_eventfd_entry, fence_cb);
 
-   eventfd_signal(entry->ev_fd_ctx, 1);
+   eventfd_signal(entry->ev_fd_ctx);
syncobj_eventfd_entry_free(entry);
 }
 
@@ -1388,13 +1388,13 @@ syncobj_eventfd_entry_func(struct drm_syncobj *syncobj,
entry->fence = fence;
 
if (entry->flags & DRM_SYNCOBJ_WAIT_FLAGS_WAIT_AVAILABLE) {
-   eventfd_signal(entry->ev_fd_ctx, 1);
+   eventfd_signal(entry->ev_fd_ctx);
syncobj_eventfd_entry_free(entry);
} else {
ret = dma_fence_add_callback(fence, >fence_cb,
 syncobj_eventfd_entry_fence_func);
if (ret == -ENOENT) {
-   eventfd_

[PATCH v2 1/4] i915: make inject_virtual_interrupt() void

2023-11-22 Thread Christian Brauner

The single caller of inject_virtual_interrupt() ignores the return value
anyway. This allows us to simplify eventfd_signal() in follow-up
patches.

Signed-off-by: Christian Brauner 
---
 drivers/gpu/drm/i915/gvt/interrupt.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/interrupt.c 
b/drivers/gpu/drm/i915/gvt/interrupt.c
index de3f5903d1a7..9665876b4b13 100644
--- a/drivers/gpu/drm/i915/gvt/interrupt.c
+++ b/drivers/gpu/drm/i915/gvt/interrupt.c
@@ -422,7 +422,7 @@ static void init_irq_map(struct intel_gvt_irq *irq)
 #define MSI_CAP_DATA(offset) (offset + 8)
 #define MSI_CAP_EN 0x1
 
-static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
+static void inject_virtual_interrupt(struct intel_vgpu *vgpu)
 {
unsigned long offset = vgpu->gvt->device_info.msi_cap_offset;
u16 control, data;
@@ -434,10 +434,10 @@ static int inject_virtual_interrupt(struct intel_vgpu 
*vgpu)
 
/* Do not generate MSI if MSIEN is disabled */
if (!(control & MSI_CAP_EN))
-   return 0;
+   return;
 
if (WARN(control & GENMASK(15, 1), "only support one MSI format\n"))
-   return -EINVAL;
+   return;
 
trace_inject_msi(vgpu->id, addr, data);
 
@@ -451,10 +451,10 @@ static int inject_virtual_interrupt(struct intel_vgpu 
*vgpu)
 * returned and don't inject interrupt into guest.
 */
if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
-   return -ESRCH;
-   if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger, 1) != 1)
-   return -EFAULT;
-   return 0;
+   return;
+   if (!vgpu->msi_trigger)
+   return;
+   eventfd_signal(vgpu->msi_trigger, 1);
 }
 
 static void propagate_event(struct intel_gvt_irq *irq,

-- 
2.42.0

[PATCH v2 0/4] eventfd: simplify signal helpers

2023-11-22 Thread Christian Brauner

Hey everyone,

This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
significantly. They can be made void and not take any unnecessary
arguments.

I've added a few more simplifications based on Sean's suggestion.

Signed-off-by: Christian Brauner 

Changes in v2:
- further simplify helpers
- Link to v1: 
https://lore.kernel.org/r/20230713-vfs-eventfd-signal-v1-0-7fda6c5d2...@kernel.org

---



---
base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
change-id: 20230713-vfs-eventfd-signal-0b0d167ad6ec

Re: [PATCH 14/34] fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure()

2023-11-06 Thread Christian Brauner

On Sun, Nov 05, 2023 at 05:30:17PM +0100, Paolo Bonzini wrote:
> The call to the inode_init_security_anon() LSM hook is not the sole
> reason to use anon_inode_getfile_secure() or anon_inode_getfd_secure().
> For example, the functions also allow one to create a file with non-zero
> size, without needing a full-blown filesystem.  In this case, you don't
> need a "secure" version, just unique inodes; the current name of the
> functions is confusing and does not explain well the difference with
> the more "standard" anon_inode_getfile() and anon_inode_getfd().
> 
> Of course, there is another side of the coin; neither io_uring nor
> userfaultfd strictly speaking need distinct inodes, and it is not
> that clear anymore that anon_inode_create_get{file,fd}() allow the LSM
> to intercept and block the inode's creation.  If one was so inclined,
> anon_inode_getfile_secure() and anon_inode_getfd_secure() could be kept,
> using the shared inode or a new one depending on CONFIG_SECURITY.
> However, this is probably overkill, and potentially a cause of bugs in
> different configurations.  Therefore, just add a comment to io_uring
> and userfaultfd explaining the choice of the function.
> 
> While at it, remove the export for what is now anon_inode_create_getfd().
> There is no in-tree module that uses it, and the old name is gone anyway.

That's great, thanks.

> If anybody actually needs the symbol, they can ask or they can just use
> anon_inode_create_getfile(), which will be exported very soon for use
> in KVM.
> 
> Suggested-by: Christian Brauner 
> Signed-off-by: Paolo Bonzini 
> ---

Looks good to me,
Reviewed-by: Christian Brauner

Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-11-02 Thread Christian Brauner

On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:
> Export anon_inode_getfile_secure() so that it can be used by KVM to create
> and manage file-based guest memory without need a fullblow filesystem.
> The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
> needs a unique inode for each file, e.g. to be able to independently
> manage the size and lifecycle of a given file.
> 
> Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
> the name.
> 
> Signed-off-by: Sean Christopherson 
> ---

Before we enshrine this misleading name let's rename this to:

create_anon_inode_getfile()

I don't claim it's a great name but it's better than *_secure() which is
very confusing. So just:

struct file *create_anon_inode_getfile(const char *name,
   const struct file_operations *fops,
   void *priv, int flags)

May also just remove that context_inode argument from the exported
function. The only other caller is io_uring. And neither it nor this
patchset need the context_inode thing afaict. Merge conflict risk is
extremely low so carrying that as part of this patchset is fine and
shouldn't cause huge issues for you.

Re: [PATCH v2 00/89] fs: new accessor methods for inode atime and mtime

2023-10-09 Thread Christian Brauner

On Wed, Oct 04, 2023 at 02:52:21PM -0400, Jeff Layton wrote:
> v2:
> - bugfix in mtime handling
> - incorporate _sec and _nsec accessor functions (Chuck Lever)
> - move i_generation to plug hole after changing timestamps (Amir Goldstein)
> 
> While working on the multigrain timestamp changes, Linus suggested
> adding some similar wrappers for accessing the atime and mtime that we
> have for the ctime. With that, we could then move to using discrete
> integers instead of struct timespec64 in struct inode and shrink it.
> 
> This patch implements this. Linus suggested using macros for the new
> accessors, but the existing ctime wrappers were static inlines and since
> there are only 3 different timestamps, I didn't see that trying to
> fiddle with macros would gain us anything (other than less verbosity in
> fs.h).
> 
> [...]

This was applied on top of -next but vfs.ctime is now based on v6.6-rc3
as stable tag otherwise this is too much of a moving target with other
stuff in -next. Anything that had to be dropped and requires fixups
should just be explained in the pr. The sooner this sees some -next, the
better, I think.

---

Applied to the vfs.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.ctime branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.ctime

[01/86] fs: new accessor methods for atime and mtime
https://git.kernel.org/vfs/vfs/c/22f45fee808d
[02/86] fs: convert core infrastructure to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/6ac95fb71485
[03/86] spufs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/9953073d5f20
[04/86] hypfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/1d64bfe22112
[05/86] android: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/a8a74b6b4f2c
[06/86] char: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/671ffa0775a7
[07/86] qib: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/ebd5458f3b52
[08/86] ibmasm: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/1d4257d57a41
[09/86] misc: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/d4bf8378b9cb
[10/86] x86: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/070601b1e496
[11/86] tty: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/5c9f26b87bed
[12/86] function: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/092f46404245
[13/86] legacy: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/5c51d80e51d0
[14/86] usb: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/4707a33afd6f
[15/86] 9p: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/20fc454b4493
[16/86] adfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/3e8d59046f6d
[17/86] affs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/60d4d0d37086
[18/86] afs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/6471772aa6fe
[19/86] autofs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/3eaad981548b
[20/86] befs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/21d0433caf69
[21/86] bfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/06e502c123a6
[22/86] btrfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/f62049d7838d
[23/86] ceph: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/ac7750d84e38
[24/86] coda: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/5c4bf2507baa
[25/86] configfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/3b930e187f16
[26/86] cramfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/bb0bf9d3bda8
[27/86] debugfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/7dc950e659d6
[28/86] devpts: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/a1eb5c26d5a1
[29/86] efivarfs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/17b5652aa824
[30/86] efs: convert to new timestamp accessors
https://git.kernel.org/vfs/vfs/c/a3cfbea29e7d
[31/86] erofs: convert to new timestamp accessors

Re: [PATCH 86/87] fs: switch timespec64 fields in inode to discrete integers

2023-09-29 Thread Christian Brauner

> It is a lot of churn though.

I think that i_{a,c,m}time shouldn't be accessed directly by
filesystems same as no filesystem should really access i_{g,u}id which
we also provide i_{g,u}id_{read,write}() accessors for. The mode is
another example where really most often should use helpers because of all
the set*id stripping that we need to do (and the bugs that we had
because of this...).

The interdependency between ctime and mtime is enough to hide this in
accessors. The other big advantage is simply grepability. So really I
would like to see this change even without the type switch.

In other words, there's no need to lump the two changes together. Do the
conversion part and we can argue about the switch to discrete integers
separately.

The other adavantage is that we have a cycle to see any possible
regression from the conversion.

Thoughts anyone?

Re: [PATCH 87/87] fs: move i_blocks up a few places in struct inode

2023-09-29 Thread Christian Brauner

On Thu, Sep 28, 2023 at 10:41:34AM -0700, Linus Torvalds wrote:
> On Thu, 28 Sept 2023 at 04:06, Jeff Layton  wrote:
> >
> > Move i_blocks up above the i_lock, which moves the new 4 byte hole to
> > just after the timestamps, without changing the size of the structure.
> 
> I'm sure others have mentioned this, but 'struct inode' is marked with
> __randomize_layout, so the actual layout may end up being very
> different.
> 
> I'm personally not convinced the whole structure randomization is
> worth it - it's easy enough to figure out for any distro kernel since
> the seed has to be the same across machines for modules to work, so
> even if the seed isn't "public", any layout is bound to be fairly
> easily discoverable.
> 
> So the whole randomization only really works for private kernel
> builds, and it adds this kind of pain where "optimizing" the structure
> layout is kind of pointless depending on various options.
> 
> I certainly *hope* no distro enables that pointless thing, but it's a worry.

They don't last we checked. Just last cycle we moved stuff in struct
file around to optimize things and we explicitly said we don't give a
damn about struct randomization. Anyone who enables this will bleed
performance pretty badly, I would reckon.

Re: [PATCH v4 2/5] fs: Add fchmodat2()

2023-07-27 Thread Christian Brauner

On Thu, Jul 27, 2023 at 01:13:37PM -0400, dal...@libc.org wrote:
> On Thu, Jul 27, 2023 at 07:02:53PM +0200, Christian Brauner wrote:
> > On Thu, Jul 27, 2023 at 06:28:53PM +0200, Andreas Schwab wrote:
> > > On Jul 27 2023, David Laight wrote:
> > > 
> > > > From: Aleksa Sarai
> > > >> Sent: 25 July 2023 17:36
> > > > ...
> > > >> We almost certainly want to support AT_EMPTY_PATH at the same time.
> > > >> Otherwise userspace will still need to go through /proc when trying to
> > > >> chmod a file handle they have.
> > > >
> > > > That can't be allowed.
> > > 
> > > IIUC, fchmodat2(fd, "", m, AT_EMPTY_PATH) is equivalent to fchmod(fd,
> > > m).  With that, new architectures only need to implement the fchmodat2
> > > syscall to cover all chmod variants.
> > 
> > There's a difference though as fchmod() doesn't work with O_PATH file
> > descriptors while AT_EMPTY_PATH does. Similar to how fchown() doesn't
> > work with O_PATH file descriptors.
> > 
> > However, we do allow AT_EMPTY_PATH with fchownat() so there's no reason
> > to not allow it for fchmodat2().
> > 
> > But it's a bit of a shame that O_PATH looks less and less like O_PATH.
> > It came from can-do-barely-anything to can-do-quite-a-lot-of-things over
> > the years.
> > 
> > In any case, AT_EMPTY_PATH for fchmodat2() can be an additional patch on
> > top.
> 
> From a standpoint of implementing O_SEARCH/O_EXEC using it, I don't
> see any reason fchown/fchmod should not work on O_PATH file
> descriptors. And indeed when you have procfs available to emulate them
> via procfs, it already does. So I don't see this as unwanted

I'm really not talking about the fact that proc is a giant loophole for
basically everyhing related to O_PATH and reopening fds.

I'm saying that both fchmod() and fchown() don't work on O_PATH fds.
They explicitly reject them.

AT_EMPTY_PATH and therefore O_PATH for fchmodat2() is fine given that we
do it for fchownat() already.

Re: [PATCH v4 2/5] fs: Add fchmodat2()

2023-07-27 Thread Christian Brauner

On Thu, Jul 27, 2023 at 06:28:53PM +0200, Andreas Schwab wrote:
> On Jul 27 2023, David Laight wrote:
> 
> > From: Aleksa Sarai
> >> Sent: 25 July 2023 17:36
> > ...
> >> We almost certainly want to support AT_EMPTY_PATH at the same time.
> >> Otherwise userspace will still need to go through /proc when trying to
> >> chmod a file handle they have.
> >
> > That can't be allowed.
> 
> IIUC, fchmodat2(fd, "", m, AT_EMPTY_PATH) is equivalent to fchmod(fd,
> m).  With that, new architectures only need to implement the fchmodat2
> syscall to cover all chmod variants.

There's a difference though as fchmod() doesn't work with O_PATH file
descriptors while AT_EMPTY_PATH does. Similar to how fchown() doesn't
work with O_PATH file descriptors.

However, we do allow AT_EMPTY_PATH with fchownat() so there's no reason
to not allow it for fchmodat2().

But it's a bit of a shame that O_PATH looks less and less like O_PATH.
It came from can-do-barely-anything to can-do-quite-a-lot-of-things over
the years.

In any case, AT_EMPTY_PATH for fchmodat2() can be an additional patch on
top.

Re: [PATCH v4 3/5] arch: Register fchmodat2, usually as syscall 452

2023-07-27 Thread Christian Brauner

On Wed, Jul 26, 2023 at 02:43:41AM +1000, Aleksa Sarai wrote:
> On 2023-07-11, Alexey Gladkov  wrote:
> > From: Palmer Dabbelt 
> > 
> > This registers the new fchmodat2 syscall in most places as nuber 452,
> > with alpha being the exception where it's 562.  I found all these sites
> > by grepping for fspick, which I assume has found me everything.
> 
> Shouldn't this patch be squashed with the patch that adds the syscall?
> At least, that's how I've usually seen it done...

Depends. Iirc, someone said they'd prefer for doing it in one patch
in some circumstances on some system call we added years ago. But otoh,
having the syscall wiring done separately makes it easy for arch
maintainers to ack only the wiring up part. Both ways are valid imho.
(cachestat() did it for x86 and then all the others separately. So
really it seems a bit all over the place depending on the scenario.)

Re: [PATCH v4 2/5] fs: Add fchmodat2()

2023-07-27 Thread Christian Brauner

> > I think it'd be much neater to do the conversion of AT_ flags here and
> > pass 0 as a flags argument for all of the wrappers (this is how most of
> > the other xyz(), fxyz(), fxyzat() syscall wrappers are done IIRC).

I've fixed that up in-tree.

Re: Add fchmodat2() - or add a more general syscall?

2023-07-27 Thread Christian Brauner

On Wed, Jul 26, 2023 at 08:57:10PM -0700, Eric Biggers wrote:
> On Tue, Jul 25, 2023 at 04:58:34PM +0100, David Howells wrote:
> > Rather than adding a fchmodat2() syscall, should we add a "set_file_attrs()"
> > syscall that takes a mask and allows you to set a bunch of stuff all in one
> > go?  Basically, an interface to notify_change() in the kernel that would 
> > allow
> > several stats to be set atomically.  This might be of particular interest to
> > network filesystems.
> > 
> > David
> > 
> 
> fchmodat2() is a simple addition that fits well with the existing syscalls.
> It fixes an oversight in fchmodat().
> 
> IMO we should just add fchmodat2(), and not get sidetracked by trying to add
> some super-generalized syscall instead.  That can always be done later.

Agreed.

Re: Add fchmodat2() - or add a more general syscall?

2023-07-26 Thread Christian Brauner

On Tue, Jul 25, 2023 at 07:39:51PM +0100, David Howells wrote:
> Florian Weimer  wrote:
> 
> > > Rather than adding a fchmodat2() syscall, should we add a
> > > "set_file_attrs()" syscall that takes a mask and allows you to set a bunch
> > > of stuff all in one go?  Basically, an interface to notify_change() in the

That system call would likely be blocked in seccomp sandboxes completely
as seccomp cannot filter structs. I don't consider this an argument to
block new good functionality in general as that would mean arbitrarily
limiting us but it is something to keep in mind. If there's additional
benefit other than just being able to set mutliple values at once then
yeah might be something to discuss.

> > > kernel that would allow several stats to be set atomically.  This might be
> > > of particular interest to network filesystems.
> > 
> > Do you mean atomically as in compare-and-swap (update only if old values
> > match), or just a way to update multiple file attributes with a single
> > system call?
> 
> I was thinking more in terms of the latter.  AFAIK, there aren't any network
> filesystems support a CAS interface on file attributes like that.  To be able
> to do a CAS operation, we'd need to pass in the old values as well as the new.
> 
> Another thing we could look at is doing "create_and_set_attrs()", possibly
> allowing it to take a list of xattrs also.

That would likely require variable sized pointers in a struct which is
something we really try to stay away from. I also think it's not a good
idea to lump xattrs toegether with generic file attributes. They should
remain a separate api imho.

Re: [PATCH v3 0/5] Add a new fchmodat4() syscall

2023-07-25 Thread Christian Brauner

On Tue, Jul 25, 2023 at 01:05:40PM +0200, Alexey Gladkov wrote:
> On Tue, Jul 11, 2023 at 05:14:24PM +0200, Christian Brauner wrote:
> > On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> > > * Alexey Gladkov:
> > > 
> > > > This patch set adds fchmodat4(), a new syscall. The actual
> > > > implementation is super simple: essentially it's just the same as
> > > > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > > > I've attempted to make this match "man 2 fchmodat" as closely as
> > > > possible, which says EINVAL is returned for invalid flags (as opposed to
> > > > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > > > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > > > seems fairly straight-forward:
> > > >
> > > > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c 
> > > > b/sysdeps/unix/sysv/linux/fchmodat.c
> > > > index 6d9cbc1ce9e0..b1beab76d56c 100644
> > > > --- a/sysdeps/unix/sysv/linux/fchmodat.c
> > > > +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> > > > @@ -29,12 +29,36 @@
> > > >  int
> > > >  fchmodat (int fd, const char *file, mode_t mode, int flag)
> > > >  {
> > > > -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> > > > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> > > > -#ifndef __NR_lchmod/* Linux so far has no lchmod 
> > > > syscall.  */
> > > > +  /* There are four paths through this code:
> > > > +  - The flags are zero.  In this case it's fine to call 
> > > > fchmodat.
> > > > +  - The flags are non-zero and glibc doesn't have access to
> > > > +   __NR_fchmodat4.  In this case all we can do is emulate the 
> > > > error codes
> > > > +   defined by the glibc interface from userspace.
> > > > +  - The flags are non-zero, glibc has __NR_fchmodat4, and the 
> > > > kernel has
> > > > +   fchmodat4.  This is the simplest case, as the fchmodat4 syscall 
> > > > exactly
> > > > +   matches glibc's library interface so it can be called directly.
> > > > +  - The flags are non-zero, glibc has __NR_fchmodat4, but the 
> > > > kernel does
> > > 
> > > If you define __NR_fchmodat4 on all architectures, we can use these
> > > constants directly in glibc.  We no longer depend on the UAPI
> > > definitions of those constants, to cut down the number of code variants,
> > > and to make glibc's system call profile independent of the kernel header
> > > version at build time.
> > > 
> > > Your version is based on 2.31, more recent versions have some reasonable
> > > emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> > > describing the same buggy behavior that you witnessed:
> > > 
> > > +  /* Some Linux versions with some file systems can actually
> > > +change symbolic link permissions via /proc, but this is not
> > > +intentional, and it gives inconsistent results (e.g., error
> > > +return despite mode change).  The expected behavior is that
> > > +symbolic link modes cannot be changed at all, and this check
> > > +enforces that.  */
> > > +  if (S_ISLNK (st.st_mode))
> > > +   {
> > > + __close_nocancel (pathfd);
> > > + __set_errno (EOPNOTSUPP);
> > > + return -1;
> > > +   }
> > > 
> > > I think there was some kernel discussion about that behavior before, but
> > > apparently, it hasn't led to fixes.
> > 
> > I think I've explained this somewhere else a couple of months ago but
> > just in case you weren't on that thread or don't remember and apologies
> > if you should already know.
> > 
> > A lot of filesystem will happily update the mode of a symlink. The VFS
> > doesn't do anything to prevent this from happening. This is filesystem
> > specific.
> > 
> > The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
> > Specifically it comes from filesystems that call posix_acl_chmod(),
> > e.g., btrfs via
> > 
> > if (!err && attr->ia_valid & ATTR_MODE)
> > err = posix_acl_chmod(idmap, dentry, inode->i_mode);
> > 
> > Most filesystems don't impleme

Re: [PATCH 0/2] eventfd: simplify signal helpers

2023-07-14 Thread Christian Brauner

On Thu, Jul 13, 2023 at 11:10:54AM -0600, Alex Williamson wrote:
> On Thu, 13 Jul 2023 12:05:36 +0200
> Christian Brauner  wrote:
> 
> > Hey everyone,
> > 
> > This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
> > by removing the count argument which is effectively unused.
> 
> We have a patch under review which does in fact make use of the
> signaling value:
> 
> https://lore.kernel.org/all/20230630155936.3015595-1-...@semihalf.com/

Huh, thanks for the link.

Quoting from
https://patchwork.kernel.org/project/kvm/patch/20230307220553.631069-1-...@semihalf.com/#25266856

> Reading an eventfd returns an 8-byte value, we generally only use it
> as a counter, but it's been discussed previously and IIRC, it's possible
> to use that value as a notification value.

So the goal is to pipe a specific value through eventfd? But it is
explicitly a counter. The whole thing is written around a counter and
each write and signal adds to the counter.

The consequences are pretty well described in the cover letter of
v6 https://lore.kernel.org/all/20230630155936.3015595-1-...@semihalf.com/

> Since the eventfd counter is used as ACPI notification value
> placeholder, the eventfd signaling needs to be serialized in order to
> not end up with notification values being coalesced. Therefore ACPI
> notification values are buffered and signalized one by one, when the
> previous notification value has been consumed.

But isn't this a good indication that you really don't want an eventfd
but something that's explicitly designed to associate specific data with
a notification? Using eventfd in that manner requires serialization,
buffering, and enforces ordering.

I have no skin in the game aside from having to drop this conversion
which I'm fine to do if there are actually users for this btu really,
that looks a lot like abusing an api that really wasn't designed for
this.

Re: [PATCH 2/2] eventfd: simplify eventfd_signal_mask()

2023-07-13 Thread Christian Brauner

On Thu, Jul 13, 2023 at 07:33:05AM -0700, Sean Christopherson wrote:
> On Thu, Jul 13, 2023, Christian Brauner wrote:
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index dc9e01053235..077be5da72bd 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -43,9 +43,10 @@ struct eventfd_ctx {
> > int id;
> >  };
> >  
> > -__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
> > +bool eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
> >  {
> > unsigned long flags;
> > +   __u64 n = 1;
> >  
> > /*
> >  * Deadlock or stack overflow issues can happen if we recurse here
> > @@ -68,7 +69,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 
> > n, __poll_t mask)
> > current->in_eventfd = 0;
> > spin_unlock_irqrestore(>wqh.lock, flags);
> >  
> > -   return n;
> > +   return n == 1;
> >  }
> 
> ...
> 
> > @@ -58,13 +58,12 @@ static inline struct eventfd_ctx *eventfd_ctx_fdget(int 
> > fd)
> > return ERR_PTR(-ENOSYS);
> >  }
> >  
> > -static inline int eventfd_signal(struct eventfd_ctx *ctx)
> > +static inline bool eventfd_signal(struct eventfd_ctx *ctx)
> >  {
> > return -ENOSYS;
> >  }
> >  
> > -static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n,
> > - unsigned mask)
> > +static inline bool eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned 
> > mask)
> >  {
> > return -ENOSYS;
> 
> This will morph to "true" for what should be an error case.  One option would 
> be

Ewww, that means it did return -ENOSYS before any of this.

> to have eventfd_signal_mask() return 0/-errno instead of the count, but 
> looking
> at all the callers, nothing ever actually consumes the result.
> 
> KVMGT morphs failure into -EFAULT
> 
>   if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger, 1) != 1)
>   return -EFAULT;
> 
> but the only caller of that user ignores the return value.
> 
>   if (vgpu_vreg(vgpu, i915_mmio_reg_offset(GEN8_MASTER_IRQ))
>   & ~GEN8_MASTER_IRQ_CONTROL)
>   inject_virtual_interrupt(vgpu);
> 
> The sample driver in samples/vfio-mdev/mtty.c uses a similar pattern: prints 
> an
> error but otherwise ignores the result.
> 
> So why not return nothing?  That will simplify eventfd_signal_mask() a wee bit
> more, and eliminate that bizarre return value confusion for the ugly stubs, 
> e.g.

Yeah, it used to return an int in the non-eventfd and a __u64 in the
eventfd case.

> 
> void eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
> {
>   unsigned long flags;
> 
>   /*
>* Deadlock or stack overflow issues can happen if we recurse here
>* through waitqueue wakeup handlers. If the caller users potentially
>* nested waitqueues with custom wakeup handlers, then it should
>* check eventfd_signal_allowed() before calling this function. If
>* it returns false, the eventfd_signal() call should be deferred to a
>* safe context.
>*/
>   if (WARN_ON_ONCE(current->in_eventfd))
>   return;
> 
>   spin_lock_irqsave(>wqh.lock, flags);
>   current->in_eventfd = 1;
>   if (ctx->count < ULLONG_MAX)
>   ctx->count++;
>   if (waitqueue_active(>wqh))
>   wake_up_locked_poll(>wqh, EPOLLIN | mask);
>   current->in_eventfd = 0;
>   spin_unlock_irqrestore(>wqh.lock, flags);
> }
> 
> You could even go further and unify the real and stub versions of 
> eventfd_signal().

The reason I didn't make eventfd_signal_mask() return void was that it
was called from eventfd_signal() which did, I didn't realize the caller
didn't actually consume the return value.

If we can let both return void it gets simpler.

Thanks for that.

[PATCH 2/2] eventfd: simplify eventfd_signal_mask()

2023-07-13 Thread Christian Brauner

The eventfd_signal_mask() helper was introduced for io_uring and similar
to eventfd_signal() it always passed 1 for @n. So don't bother with that
argument at all.

Signed-off-by: Christian Brauner 
---
 drivers/gpu/drm/i915/gvt/interrupt.c | 2 +-
 fs/eventfd.c | 9 +
 include/linux/eventfd.h  | 9 -
 io_uring/io_uring.c  | 4 ++--
 4 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gvt/interrupt.c 
b/drivers/gpu/drm/i915/gvt/interrupt.c
index 3d9e09c2add4..31aff6f733d4 100644
--- a/drivers/gpu/drm/i915/gvt/interrupt.c
+++ b/drivers/gpu/drm/i915/gvt/interrupt.c
@@ -435,7 +435,7 @@ static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
 */
if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
return -ESRCH;
-   if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger) != 1)
+   if (vgpu->msi_trigger && !eventfd_signal(vgpu->msi_trigger))
return -EFAULT;
return 0;
 }
diff --git a/fs/eventfd.c b/fs/eventfd.c
index dc9e01053235..077be5da72bd 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -43,9 +43,10 @@ struct eventfd_ctx {
int id;
 };
 
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask)
+bool eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask)
 {
unsigned long flags;
+   __u64 n = 1;
 
/*
 * Deadlock or stack overflow issues can happen if we recurse here
@@ -68,7 +69,7 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, 
__poll_t mask)
current->in_eventfd = 0;
spin_unlock_irqrestore(>wqh.lock, flags);
 
-   return n;
+   return n == 1;
 }
 
 /**
@@ -82,9 +83,9 @@ __u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, 
__poll_t mask)
  *
  * Returns the amount by which the counter was incremented.
  */
-__u64 eventfd_signal(struct eventfd_ctx *ctx)
+bool eventfd_signal(struct eventfd_ctx *ctx)
 {
-   return eventfd_signal_mask(ctx, 1, 0);
+   return eventfd_signal_mask(ctx, 0);
 }
 EXPORT_SYMBOL_GPL(eventfd_signal);
 
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index 562089431551..0155ee25f7c8 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -35,8 +35,8 @@ void eventfd_ctx_put(struct eventfd_ctx *ctx);
 struct file *eventfd_fget(int fd);
 struct eventfd_ctx *eventfd_ctx_fdget(int fd);
 struct eventfd_ctx *eventfd_ctx_fileget(struct file *file);
-__u64 eventfd_signal(struct eventfd_ctx *ctx);
-__u64 eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n, __poll_t mask);
+bool eventfd_signal(struct eventfd_ctx *ctx);
+bool eventfd_signal_mask(struct eventfd_ctx *ctx, __poll_t mask);
 int eventfd_ctx_remove_wait_queue(struct eventfd_ctx *ctx, wait_queue_entry_t 
*wait,
  __u64 *cnt);
 void eventfd_ctx_do_read(struct eventfd_ctx *ctx, __u64 *cnt);
@@ -58,13 +58,12 @@ static inline struct eventfd_ctx *eventfd_ctx_fdget(int fd)
return ERR_PTR(-ENOSYS);
 }
 
-static inline int eventfd_signal(struct eventfd_ctx *ctx)
+static inline bool eventfd_signal(struct eventfd_ctx *ctx)
 {
return -ENOSYS;
 }
 
-static inline int eventfd_signal_mask(struct eventfd_ctx *ctx, __u64 n,
- unsigned mask)
+static inline bool eventfd_signal_mask(struct eventfd_ctx *ctx, unsigned mask)
 {
return -ENOSYS;
 }
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index e8096d502a7c..a9359ef73935 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -537,7 +537,7 @@ static void io_eventfd_ops(struct rcu_head *rcu)
int ops = atomic_xchg(_fd->ops, 0);
 
if (ops & BIT(IO_EVENTFD_OP_SIGNAL_BIT))
-   eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
+   eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
 
/* IO_EVENTFD_OP_FREE_BIT may not be set here depending on callback
 * ordering in a race but if references are 0 we know we have to free
@@ -573,7 +573,7 @@ static void io_eventfd_signal(struct io_ring_ctx *ctx)
goto out;
 
if (likely(eventfd_signal_allowed())) {
-   eventfd_signal_mask(ev_fd->cq_ev_fd, 1, EPOLL_URING_WAKE);
+   eventfd_signal_mask(ev_fd->cq_ev_fd, EPOLL_URING_WAKE);
} else {
atomic_inc(_fd->refs);
if (!atomic_fetch_or(BIT(IO_EVENTFD_OP_SIGNAL_BIT), 
_fd->ops))

-- 
2.34.1

[PATCH 1/2] eventfd: simplify eventfd_signal()

2023-07-13 Thread Christian Brauner

Ever since the evenfd type was introduced back in 2007 in commit
e1ad7468c77d ("signal/timer/event: eventfd core") the eventfd_signal()
function only ever passed 1 as a value for @n. There's no point in
keeping that additional argument.

Signed-off-by: Christian Brauner 
---
 arch/x86/kvm/hyperv.c |  2 +-
 arch/x86/kvm/xen.c|  2 +-
 drivers/accel/habanalabs/common/device.c  |  2 +-
 drivers/fpga/dfl.c|  2 +-
 drivers/gpu/drm/i915/gvt/interrupt.c  |  2 +-
 drivers/infiniband/hw/mlx5/devx.c |  2 +-
 drivers/misc/ocxl/file.c  |  2 +-
 drivers/s390/cio/vfio_ccw_chp.c   |  2 +-
 drivers/s390/cio/vfio_ccw_drv.c   |  4 ++--
 drivers/s390/cio/vfio_ccw_ops.c   |  6 +++---
 drivers/s390/crypto/vfio_ap_ops.c |  2 +-
 drivers/usb/gadget/function/f_fs.c|  4 ++--
 drivers/vdpa/vdpa_user/vduse_dev.c|  6 +++---
 drivers/vfio/fsl-mc/vfio_fsl_mc_intr.c|  2 +-
 drivers/vfio/pci/vfio_pci_core.c  |  6 +++---
 drivers/vfio/pci/vfio_pci_intrs.c | 12 ++--
 drivers/vfio/platform/vfio_platform_irq.c |  4 ++--
 drivers/vhost/vdpa.c  |  4 ++--
 drivers/vhost/vhost.c | 10 +-
 drivers/vhost/vhost.h |  2 +-
 drivers/virt/acrn/ioeventfd.c |  2 +-
 fs/aio.c  |  2 +-
 fs/eventfd.c  |  9 +++--
 include/linux/eventfd.h   |  4 ++--
 mm/memcontrol.c   | 10 +-
 mm/vmpressure.c   |  2 +-
 samples/vfio-mdev/mtty.c  |  4 ++--
 virt/kvm/eventfd.c|  4 ++--
 28 files changed, 56 insertions(+), 59 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index b28fd020066f..2f4bd74b482c 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -2387,7 +2387,7 @@ static u16 kvm_hvcall_signal_event(struct kvm_vcpu *vcpu, 
struct kvm_hv_hcall *h
if (!eventfd)
return HV_STATUS_INVALID_PORT_ID;
 
-   eventfd_signal(eventfd, 1);
+   eventfd_signal(eventfd);
return HV_STATUS_SUCCESS;
 }
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 40edf4d1974c..a7b62bafd57b 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2043,7 +2043,7 @@ static bool kvm_xen_hcall_evtchn_send(struct kvm_vcpu 
*vcpu, u64 param, u64 *r)
if (ret < 0 && ret != -ENOTCONN)
return false;
} else {
-   eventfd_signal(evtchnfd->deliver.eventfd.ctx, 1);
+   eventfd_signal(evtchnfd->deliver.eventfd.ctx);
}
 
*r = 0;
diff --git a/drivers/accel/habanalabs/common/device.c 
b/drivers/accel/habanalabs/common/device.c
index b97339d1f7c6..30357b371d61 100644
--- a/drivers/accel/habanalabs/common/device.c
+++ b/drivers/accel/habanalabs/common/device.c
@@ -1963,7 +1963,7 @@ static void hl_notifier_event_send(struct 
hl_notifier_event *notifier_event, u64
notifier_event->events_mask |= event_mask;
 
if (notifier_event->eventfd)
-   eventfd_signal(notifier_event->eventfd, 1);
+   eventfd_signal(notifier_event->eventfd);
 
mutex_unlock(_event->lock);
 }
diff --git a/drivers/fpga/dfl.c b/drivers/fpga/dfl.c
index dd7a783d53b5..e73f88050f08 100644
--- a/drivers/fpga/dfl.c
+++ b/drivers/fpga/dfl.c
@@ -1872,7 +1872,7 @@ static irqreturn_t dfl_irq_handler(int irq, void *arg)
 {
struct eventfd_ctx *trigger = arg;
 
-   eventfd_signal(trigger, 1);
+   eventfd_signal(trigger);
return IRQ_HANDLED;
 }
 
diff --git a/drivers/gpu/drm/i915/gvt/interrupt.c 
b/drivers/gpu/drm/i915/gvt/interrupt.c
index 68eca023bbc6..3d9e09c2add4 100644
--- a/drivers/gpu/drm/i915/gvt/interrupt.c
+++ b/drivers/gpu/drm/i915/gvt/interrupt.c
@@ -435,7 +435,7 @@ static int inject_virtual_interrupt(struct intel_vgpu *vgpu)
 */
if (!test_bit(INTEL_VGPU_STATUS_ATTACHED, vgpu->status))
return -ESRCH;
-   if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger, 1) != 1)
+   if (vgpu->msi_trigger && eventfd_signal(vgpu->msi_trigger) != 1)
return -EFAULT;
return 0;
 }
diff --git a/drivers/infiniband/hw/mlx5/devx.c 
b/drivers/infiniband/hw/mlx5/devx.c
index db5fb196c728..ad50487790ff 100644
--- a/drivers/infiniband/hw/mlx5/devx.c
+++ b/drivers/infiniband/hw/mlx5/devx.c
@@ -2498,7 +2498,7 @@ static void dispatch_event_fd(struct list_head *fd_list,
 
list_for_each_entry_rcu(item, fd_list, xa_list) {
if (item->eventfd)
-   eventfd_signal(item->eventfd, 1);
+   eventfd_signal(item->eventfd);
else
deliver_event(item, data);
}
diff --git a/dr

[PATCH 0/2] eventfd: simplify signal helpers

2023-07-13 Thread Christian Brauner

Hey everyone,

This simplifies the eventfd_signal() and eventfd_signal_mask() helpers
by removing the count argument which is effectively unused.

---



---
base-commit: 6be357f00aad4189130147fdc6f568cf776a4909
change-id: 20230713-vfs-eventfd-signal-0b0d167ad6ec

Re: [PATCH v3 0/5] Add a new fchmodat4() syscall

2023-07-11 Thread Christian Brauner

On Tue, Jul 11, 2023 at 02:24:51PM +0200, Florian Weimer wrote:
> * Alexey Gladkov:
> 
> > This patch set adds fchmodat4(), a new syscall. The actual
> > implementation is super simple: essentially it's just the same as
> > fchmodat(), but LOOKUP_FOLLOW is conditionally set based on the flags.
> > I've attempted to make this match "man 2 fchmodat" as closely as
> > possible, which says EINVAL is returned for invalid flags (as opposed to
> > ENOTSUPP, which is currently returned by glibc for AT_SYMLINK_NOFOLLOW).
> > I have a sketch of a glibc patch that I haven't even compiled yet, but
> > seems fairly straight-forward:
> >
> > diff --git a/sysdeps/unix/sysv/linux/fchmodat.c 
> > b/sysdeps/unix/sysv/linux/fchmodat.c
> > index 6d9cbc1ce9e0..b1beab76d56c 100644
> > --- a/sysdeps/unix/sysv/linux/fchmodat.c
> > +++ b/sysdeps/unix/sysv/linux/fchmodat.c
> > @@ -29,12 +29,36 @@
> >  int
> >  fchmodat (int fd, const char *file, mode_t mode, int flag)
> >  {
> > -  if (flag & ~AT_SYMLINK_NOFOLLOW)
> > -return INLINE_SYSCALL_ERROR_RETURN_VALUE (EINVAL);
> > -#ifndef __NR_lchmod/* Linux so far has no lchmod syscall.  
> > */
> > +  /* There are four paths through this code:
> > +  - The flags are zero.  In this case it's fine to call fchmodat.
> > +  - The flags are non-zero and glibc doesn't have access to
> > +   __NR_fchmodat4.  In this case all we can do is emulate the 
> > error codes
> > +   defined by the glibc interface from userspace.
> > +  - The flags are non-zero, glibc has __NR_fchmodat4, and the 
> > kernel has
> > +   fchmodat4.  This is the simplest case, as the fchmodat4 syscall 
> > exactly
> > +   matches glibc's library interface so it can be called directly.
> > +  - The flags are non-zero, glibc has __NR_fchmodat4, but the 
> > kernel does
> 
> If you define __NR_fchmodat4 on all architectures, we can use these
> constants directly in glibc.  We no longer depend on the UAPI
> definitions of those constants, to cut down the number of code variants,
> and to make glibc's system call profile independent of the kernel header
> version at build time.
> 
> Your version is based on 2.31, more recent versions have some reasonable
> emulation for fchmodat based on /proc/self/fd.  I even wrote a comment
> describing the same buggy behavior that you witnessed:
> 
> +  /* Some Linux versions with some file systems can actually
> +change symbolic link permissions via /proc, but this is not
> +intentional, and it gives inconsistent results (e.g., error
> +return despite mode change).  The expected behavior is that
> +symbolic link modes cannot be changed at all, and this check
> +enforces that.  */
> +  if (S_ISLNK (st.st_mode))
> +   {
> + __close_nocancel (pathfd);
> + __set_errno (EOPNOTSUPP);
> + return -1;
> +   }
> 
> I think there was some kernel discussion about that behavior before, but
> apparently, it hasn't led to fixes.

I think I've explained this somewhere else a couple of months ago but
just in case you weren't on that thread or don't remember and apologies
if you should already know.

A lot of filesystem will happily update the mode of a symlink. The VFS
doesn't do anything to prevent this from happening. This is filesystem
specific.

The EOPNOTSUPP you're seeing very likely comes from POSIX ACLs.
Specifically it comes from filesystems that call posix_acl_chmod(),
e.g., btrfs via

if (!err && attr->ia_valid & ATTR_MODE)
err = posix_acl_chmod(idmap, dentry, inode->i_mode);

Most filesystems don't implement i_op->set_acl() for POSIX ACLs.
So posix_acl_chmod() will report EOPNOTSUPP. By the time
posix_acl_chmod() is called, most filesystems will have finished
updating the inode. POSIX ACLs also often aren't integrated into
transactions so a rollback wouldn't even be possible on some
filesystems.

Any filesystem that doesn't implement POSIX ACLs at all will obviously
never fail unless it blocks mode changes on symlinks. Or filesystems
that do have a way to rollback failures from posix_acl_chmod(), or
filesystems that do return an error on chmod() on symlinks such as 9p,
ntfs, ocfs2.

> 
> I wonder if it makes sense to add a similar error return to the system
> call implementation?

Hm, blocking symlink mode changes is pretty regression prone. And just
blocking it through one interface seems weird and makes things even more
inconsistent.

So two options I see:
(1) minimally invasive:
Filesystems that do call posix_acl_chmod() on symlinks need to be
changed to stop doing that.
(2) might hit us on the head invasive:
Try and block symlink mode changes in chmod_common().

Thoughts?

Re: [PATCH v3 2/5] fs: Add fchmodat4()

2023-07-11 Thread Christian Brauner

On Tue, Jul 11, 2023 at 02:51:01PM +0200, Alexey Gladkov wrote:
> On Tue, Jul 11, 2023 at 01:52:01PM +0200, Christian Brauner wrote:
> > On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote:
> > > On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote:
> > > > From: Palmer Dabbelt 
> > > >
> > > > On the userspace side fchmodat(3) is implemented as a wrapper
> > > > function which implements the POSIX-specified interface. This
> > > > interface differs from the underlying kernel system call, which does not
> > > > have a flags argument. Most implementations require procfs [1][2].
> > > >
> > > > There doesn't appear to be a good userspace workaround for this issue
> > > > but the implementation in the kernel is pretty straight-forward.
> > > >
> > > > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
> > > > unlike existing fchmodat.
> > > >
> > > > [1] 
> > > > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
> > > > [2] 
> > > > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
> > > >
> > > > Signed-off-by: Palmer Dabbelt 
> > > > Signed-off-by: Alexey Gladkov 
> > > 
> > > I don't know the history of why we ended up with the different
> > > interface, or whether this was done intentionally in the kernel
> > > or if we want this syscall.
> > > 
> > > Assuming this is in fact needed, I double-checked that the
> > > implementation looks correct to me and is portable to all the
> > > architectures, without the need for a compat wrapper.
> > > 
> > > Acked-by: Arnd Bergmann 
> > 
> > The system call itself is useful afaict. But please,
> > 
> > s/fchmodat4/fchmodat2/
> 
> Sure. I will.

Thanks. Can you also wire this up for every architecture, please?
I don't see that this has been done in this series.

Re: (subset) [PATCH v4 0/5] Add a new fchmodat2() syscall

2023-07-11 Thread Christian Brauner

On Tue, 11 Jul 2023 18:16:02 +0200, Alexey Gladkov wrote:
> In glibc, the fchmodat(3) function has a flags argument according to the
> POSIX specification [1], but kernel syscalls has no such argument.
> Therefore, libc implementations do workarounds using /proc. However,
> this requires procfs to be mounted and accessible.
> 
> This patch set adds fchmodat2(), a new syscall. The syscall allows to
> pass the AT_SYMLINK_NOFOLLOW flag to disable LOOKUP_FOLLOW. In all other
> respects, this syscall is no different from fchmodat().
> 
> [...]

Tools updates usually go separately.
Flags argument ported to unsigned int; otherwise unchanged.

---

Applied to the master branch of the vfs/vfs.git tree.
Patches in the master branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: master

[1/5] Non-functional cleanup of a "__user * filename"
  https://git.kernel.org/vfs/vfs/c/0f05a6af6b7e
[2/5] fs: Add fchmodat2()
  https://git.kernel.org/vfs/vfs/c/8d593559ec09
[3/5] arch: Register fchmodat2, usually as syscall 452
  https://git.kernel.org/vfs/vfs/c/2ee63b04f206
[5/5] selftests: Add fchmodat2 selftest
  https://git.kernel.org/vfs/vfs/c/f175b92081ec

Re: [PATCH v4 2/5] fs: Add fchmodat2()

2023-07-11 Thread Christian Brauner

On Tue, Jul 11, 2023 at 06:16:04PM +0200, Alexey Gladkov wrote:
> On the userspace side fchmodat(3) is implemented as a wrapper
> function which implements the POSIX-specified interface. This
> interface differs from the underlying kernel system call, which does not
> have a flags argument. Most implementations require procfs [1][2].
> 
> There doesn't appear to be a good userspace workaround for this issue
> but the implementation in the kernel is pretty straight-forward.
> 
> The new fchmodat2() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
> unlike existing fchmodat.
> 
> [1] 
> https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
> [2] 
> https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
> 
> Co-developed-by: Palmer Dabbelt 
> Signed-off-by: Palmer Dabbelt 
> Signed-off-by: Alexey Gladkov 
> Acked-by: Arnd Bergmann 
> ---
>  fs/open.c| 18 ++
>  include/linux/syscalls.h |  2 ++
>  2 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index 0c55c8e7f837..39a7939f0d00 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -671,11 +671,11 @@ SYSCALL_DEFINE2(fchmod, unsigned int, fd, umode_t, mode)
>   return err;
>  }
>  
> -static int do_fchmodat(int dfd, const char __user *filename, umode_t mode)
> +static int do_fchmodat(int dfd, const char __user *filename, umode_t mode, 
> int lookup_flags)

Should all be unsigned instead of int here for flags. We also had a
documentation update to that effect but smh never sent it.
user_path_at() itself takes an unsigned as well.

I'll fix that up though.

Re: [PATCH v3 2/5] fs: Add fchmodat4()

2023-07-11 Thread Christian Brauner

On Tue, Jul 11, 2023 at 01:42:19PM +0200, Arnd Bergmann wrote:
> On Tue, Jul 11, 2023, at 13:25, Alexey Gladkov wrote:
> > From: Palmer Dabbelt 
> >
> > On the userspace side fchmodat(3) is implemented as a wrapper
> > function which implements the POSIX-specified interface. This
> > interface differs from the underlying kernel system call, which does not
> > have a flags argument. Most implementations require procfs [1][2].
> >
> > There doesn't appear to be a good userspace workaround for this issue
> > but the implementation in the kernel is pretty straight-forward.
> >
> > The new fchmodat4() syscall allows to pass the AT_SYMLINK_NOFOLLOW flag,
> > unlike existing fchmodat.
> >
> > [1] 
> > https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/fchmodat.c;h=17eca54051ee28ba1ec3f9aed170a62630959143;hb=a492b1e5ef7ab50c6fdd4e4e9879ea5569ab0a6c#l35
> > [2] 
> > https://git.musl-libc.org/cgit/musl/tree/src/stat/fchmodat.c?id=718f363bc2067b6487900eddc9180c84e7739f80#n28
> >
> > Signed-off-by: Palmer Dabbelt 
> > Signed-off-by: Alexey Gladkov 
> 
> I don't know the history of why we ended up with the different
> interface, or whether this was done intentionally in the kernel
> or if we want this syscall.
> 
> Assuming this is in fact needed, I double-checked that the
> implementation looks correct to me and is portable to all the
> architectures, without the need for a compat wrapper.
> 
> Acked-by: Arnd Bergmann 

The system call itself is useful afaict. But please,

s/fchmodat4/fchmodat2/

With very few exceptions we don't version by argument number but by
revision and we should stick to one scheme:

openat()->openat2()
eventfd()->eventfd2()
clone()/clone2()->clone3()
dup()->dup2()->dup3() // coincides with nr of arguments
pipe()->pipe2() // coincides with nr of arguments
renameat()->renameat2()

Re: [PATCH v2 00/89] fs: new accessors for inode->i_ctime

2023-07-10 Thread Christian Brauner

On Fri, Jul 07, 2023 at 08:42:31AM -0400, Jeff Layton wrote:
> On Wed, 2023-07-05 at 14:58 -0400, Jeff Layton wrote:
> > v2:
> > - prepend patches to add missing ctime updates
> > - add simple_rename_timestamp helper function
> > - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> > - drop individual inode_ctime_set_{sec,nsec} helpers
> > 
> 
> After review by Jan and others, and Jan's ext4 rework, the diff on top
> of the series I posted a couple of days ago is below. I don't really
> want to spam everyone with another ~100 patch v3 series, but I can if
> you think that's best.
> 
> Christian, what would you like me to do here?

I picked up the series from the list and folded the fixups you posted
here into the respective fs conversion patches. I hope that helps you
avoid a resend. You should have received a separate "thank you" mail for
all of this.

To each patch that I folded one of the fixlets from below into I added a
git note that records a link to your mail here and the respective patch
hunk from this mail that I folded into the patch. git.kernel.org will
show notes by default. For example,
https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git/commit/?h=vfs.ctime=8b0e3c2e99004609a16ba145bcbdfdddb78e220e
should show you the note I added. You can also fetch them via
git fetch $remote refs/notes/*:refs/notes/*
(You probably know that ofc but jic.) if you're interested.

Based on v6.5-rc1 as of today.

Btw, both b4 and patchwork somehow treat the series in weird was.
IOW, based on the message id of the cover letter I was able to pull most
messages except for:

[07/92] fs: add ctime accessors infrastructure
[08/92] fs: new helper: simple_rename_timestamp
[92/92] fs: rename i_ctime field to __i_ctime

which I pulled in separately. Not sure what the cause of this is.

Re: [PATCH v2 00/92] fs: new accessors for inode->i_ctime

2023-07-10 Thread Christian Brauner

On Wed, 05 Jul 2023 14:58:09 -0400, Jeff Layton wrote:
> v2:
> - prepend patches to add missing ctime updates
> - add simple_rename_timestamp helper function
> - rename ctime accessor functions as inode_get_ctime/inode_set_ctime_*
> - drop individual inode_ctime_set_{sec,nsec} helpers
> 
> I've been working on a patchset to change how the inode->i_ctime is
> accessed in order to give us conditional, high-res timestamps for the
> ctime and mtime. struct timespec64 has unused bits in it that we can use
> to implement this. In order to do that however, we need to wrap all
> accesses of inode->i_ctime to ensure that bits used as flags are
> appropriately handled.
> 
> [...]

Applied to the vfs.ctime branch of the vfs/vfs.git tree.
Patches in the vfs.ctime branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.ctime

[01/92] ibmvmc: update ctime in conjunction with mtime on write
https://git.kernel.org/vfs/vfs/c/ead310563ad2
[02/92] bfs: update ctime in addition to mtime when adding entries
https://git.kernel.org/vfs/vfs/c/f42faf14b838
[03/92] efivarfs: update ctime when mtime changes on a write
https://git.kernel.org/vfs/vfs/c/d8d026e0d1f2
[04/92] exfat: ensure that ctime is updated whenever the mtime is
https://git.kernel.org/vfs/vfs/c/d84bd8fa48d7
[05/92] apparmor: update ctime whenever the mtime changes on an inode
https://git.kernel.org/vfs/vfs/c/73955caedfae
[06/92] cifs: update the ctime on a partial page write
https://git.kernel.org/vfs/vfs/c/c2f784379c99
[07/92] fs: add ctime accessors infrastructure
https://git.kernel.org/vfs/vfs/c/64f0367de800
[08/92] fs: new helper: simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/54ced54a0239
[09/92] btrfs: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/218e0f662fee
[10/92] ubifs: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/caac4f65568d
[11/92] shmem: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/d3d11e9927b6
[12/92] exfat: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/71534b484c63
[13/92] ntfs3: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/140880821ce0
[14/92] reiserfs: convert to simple_rename_timestamp
https://git.kernel.org/vfs/vfs/c/1a1a4df5e8fc
[15/92] spufs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/784e5a93c9cf
[16/92] s390: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/1cece1f8e5c2
[17/92] binderfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/0bcd830a76f3
[18/92] infiniband: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/811f97f80b01
[19/92] ibm: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/b447ed7597f0
[20/92] usb: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/2557dc7f2dde
[21/92] 9p: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/4cd4b11385ef
[22/92] adfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/e257d7ade66e
[23/92] affs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/770619f19a77
[24/92] afs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/758506e44668
[25/92] fs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/a0a5a9810b37
[26/92] autofs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/d7d1363cc3f6
[27/92] befs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/d6218773de2d
[28/92] bfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/368b313ac2ab
[29/92] btrfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/d3d15221956a
[30/92] ceph: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/818fc6e0129a
[31/92] coda: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/4e0b22fbc012
[32/92] configfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/69c977798a6a
[33/92] cramfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/911f086eae23
[34/92] debugfs: convert to ctime accessor functions
https://git.kernel.org/vfs/vfs/c/634a50390dbb
[35/92] devpts: convert to ctime accessor

Re: [PATCH 00/79] fs: new accessors for inode->i_ctime

2023-06-23 Thread Christian Brauner

On Wed, Jun 21, 2023 at 03:52:27PM -0400, Jeff Layton wrote:
> On Wed, 2023-06-21 at 15:21 -0400, Steven Rostedt wrote:
> > On Wed, 21 Jun 2023 10:45:05 -0400
> > Jeff Layton  wrote:
> > 
> > > Most of this conversion was done via coccinelle, with a few of the more
> > > non-standard accesses done by hand. There should be no behavioral
> > > changes with this set. That will come later, as we convert individual
> > > filesystems to use multigrain timestamps.
> > 
> > BTW, Linus has suggested to me that whenever a conccinelle script is used,
> > it should be included in the change log.
> > 
> 
> Ok, here's what I have. I note again that my usage of coccinelle is
> pretty primitive, so I ended up doing a fair bit of by-hand fixing after
> applying these.
> 
> Given the way that this change is broken up into 77 patches by
> subsystem, to which changelogs should I add it? I could add it to the
> "infrastructure" patch, but that's the one where I _didn't_ use it. 
> 
> Maybe to patch #79 (the one that renames i_ctime)?

That works. I can also put this into a merge commit or pr message.

Re: [PATCH] procfs: consolidate arch_report_meminfo declaration

2023-05-17 Thread Christian Brauner

On Tue, 16 May 2023 21:57:29 +0200, Arnd Bergmann wrote:
> The arch_report_meminfo() function is provided by four architectures,
> with a __weak fallback in procfs itself. On architectures that don't
> have a custom version, the __weak version causes a warning because
> of the missing prototype.
> 
> Remove the architecture specific prototypes and instead add one
> in linux/proc_fs.h.
> 
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] procfs: consolidate arch_report_meminfo declaration
  https://git.kernel.org/vfs/vfs/c/edb0469aa6e8

Re: [RFC PATCH v2] fs/xattr: add *at family syscalls

2023-05-15 Thread Christian Brauner

On Mon, May 15, 2023 at 04:04:21PM +0300, Amir Goldstein wrote:
> On Mon, May 15, 2023 at 1:33 PM Christian Brauner  wrote:
> >
> > On Thu, May 11, 2023 at 05:08:02PM +0200, Christian Göttsche wrote:
> > > Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
> > > removexattrat().  Those can be used to operate on extended attributes,
> > > especially security related ones, either relative to a pinned directory
> > > or on a file descriptor without read access, avoiding a
> > > /proc//fd/ detour, requiring a mounted procfs.
> > >
> > > One use case will be setfiles(8) setting SELinux file contexts
> > > ("security.selinux") without race conditions.
> > >
> > > Add XATTR flags to the private namespace of AT_* flags.
> > >
> > > Use the do_{name}at() pattern from fs/open.c.
> > >
> > > Use a single flag parameter for extended attribute flags (currently
> > > XATTR_CREATE and XATTR_REPLACE) and *at() flags to not exceed six
> > > syscall arguments in setxattrat().
> > >
> > > Previous approach ("f*xattr: allow O_PATH descriptors"): 
> > > https://lore.kernel.org/all/20220607153139.35588-1-cgzo...@googlemail.com/
> > > v1 discussion: 
> > > https://lore.kernel.org/all/20220830152858.14866-2-cgzo...@googlemail.com/
> > >
> > > Signed-off-by: Christian Göttsche 
> > > CC: x...@kernel.org
> > > CC: linux-al...@vger.kernel.org
> > > CC: linux-ker...@vger.kernel.org
> > > CC: linux-arm-ker...@lists.infradead.org
> > > CC: linux-i...@vger.kernel.org
> > > CC: linux-m...@lists.linux-m68k.org
> > > CC: linux-m...@vger.kernel.org
> > > CC: linux-par...@vger.kernel.org
> > > CC: linuxppc-dev@lists.ozlabs.org
> > > CC: linux-s...@vger.kernel.org
> > > CC: linux...@vger.kernel.org
> > > CC: sparcli...@vger.kernel.org
> > > CC: linux-fsde...@vger.kernel.org
> > > CC: au...@vger.kernel.org
> > > CC: linux-a...@vger.kernel.org
> > > CC: linux-...@vger.kernel.org
> > > CC: linux-security-mod...@vger.kernel.org
> > > CC: seli...@vger.kernel.org
> > > ---
> >
> > Fwiw, your header doesn't let me see who the mail was directly sent to
> > so I'm only able to reply to lists which is a bit pointless...
> >
> > > v2:
> > >   - squash syscall introduction and wire up commits
> > >   - add AT_XATTR_CREATE and AT_XATTR_REPLACE constants
> >
> > > +#define AT_XATTR_CREATE  0x1 /* setxattrat(2): set 
> > > value, fail if attr already exists */
> > > +#define AT_XATTR_REPLACE 0x2 /* setxattrat(2): set value, fail 
> > > if attr does not exist */
> >
> > We really shouldn't waste any AT_* flags for this. Otherwise we'll run
> > out of them rather quickly. Two weeks ago we added another AT_* flag
> > which is up for merging for v6.5 iirc and I've glimpsed another AT_*
> > flag proposal in one of the talks at last weeks Vancouver conference
> > extravaganza.
> >
> > Even if we reuse 0x200 for AT_XATTR_CREATE (like we did for AT_EACCESS
> > and AT_REMOVEDIR) we still need another bit for AT_XATTR_REPLACE.
> >
> > Plus, this is really ugly since AT_XATTR_{CREATE,REPLACE} really isn't
> > in any way related to lookup and we're mixing it in with lookup
> > modifying flags.
> >
> > So my proposal for {g,s}etxattrat() would be:
> >
> > struct xattr_args {
> > __aligned_u64 value;
> > __u32 size;
> > __u32 cmd;
> > };
> >
> > So everything's nicely 64bit aligned in the struct. Use the @cmd member
> > to set either XATTR_REPLACE or XATTR_CREATE and treat it as a proper
> > enum and not as a flag argument like the old calls did.
> >
> > So then we'd have:
> >
> > setxattrat(int dfd, const char *path, const char __user *name,
> >struct xattr_args __user *args, size_t size, unsigned int flags)
> > getxattrat(int dfd, const char *path, const char __user *name,
> >struct xattr_args __user *args, size_t size, unsigned int flags)
> >
> > The current in-kernel struct xattr_ctx would be renamed to struct
> > kernel_xattr_args and then we do the usual copy_struct_from_user()
> > dance:
> >
> > struct xattr_args args;
> > err = copy_struct_from_user(, sizeof(args), uargs, usize);
> >
> > and then go on to handle value/size for setxattrat()/getxattrat()
> > accordingly.
> >
> > getxattr()/setxattr() aren't meaningful

Re: [RFC PATCH v2] fs/xattr: add *at family syscalls

2023-05-15 Thread Christian Brauner

On Thu, May 11, 2023 at 05:08:02PM +0200, Christian Göttsche wrote:
> Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
> removexattrat().  Those can be used to operate on extended attributes,
> especially security related ones, either relative to a pinned directory
> or on a file descriptor without read access, avoiding a
> /proc//fd/ detour, requiring a mounted procfs.
> 
> One use case will be setfiles(8) setting SELinux file contexts
> ("security.selinux") without race conditions.
> 
> Add XATTR flags to the private namespace of AT_* flags.
> 
> Use the do_{name}at() pattern from fs/open.c.
> 
> Use a single flag parameter for extended attribute flags (currently
> XATTR_CREATE and XATTR_REPLACE) and *at() flags to not exceed six
> syscall arguments in setxattrat().
> 
> Previous approach ("f*xattr: allow O_PATH descriptors"): 
> https://lore.kernel.org/all/20220607153139.35588-1-cgzo...@googlemail.com/
> v1 discussion: 
> https://lore.kernel.org/all/20220830152858.14866-2-cgzo...@googlemail.com/
> 
> Signed-off-by: Christian Göttsche 
> CC: x...@kernel.org
> CC: linux-al...@vger.kernel.org
> CC: linux-ker...@vger.kernel.org
> CC: linux-arm-ker...@lists.infradead.org
> CC: linux-i...@vger.kernel.org
> CC: linux-m...@lists.linux-m68k.org
> CC: linux-m...@vger.kernel.org
> CC: linux-par...@vger.kernel.org
> CC: linuxppc-dev@lists.ozlabs.org
> CC: linux-s...@vger.kernel.org
> CC: linux...@vger.kernel.org
> CC: sparcli...@vger.kernel.org
> CC: linux-fsde...@vger.kernel.org
> CC: au...@vger.kernel.org
> CC: linux-a...@vger.kernel.org
> CC: linux-...@vger.kernel.org
> CC: linux-security-mod...@vger.kernel.org
> CC: seli...@vger.kernel.org
> ---

Fwiw, your header doesn't let me see who the mail was directly sent to
so I'm only able to reply to lists which is a bit pointless...

> v2:
>   - squash syscall introduction and wire up commits
>   - add AT_XATTR_CREATE and AT_XATTR_REPLACE constants

> +#define AT_XATTR_CREATE  0x1 /* setxattrat(2): set value, 
> fail if attr already exists */
> +#define AT_XATTR_REPLACE 0x2 /* setxattrat(2): set value, fail if 
> attr does not exist */

We really shouldn't waste any AT_* flags for this. Otherwise we'll run
out of them rather quickly. Two weeks ago we added another AT_* flag
which is up for merging for v6.5 iirc and I've glimpsed another AT_*
flag proposal in one of the talks at last weeks Vancouver conference
extravaganza.

Even if we reuse 0x200 for AT_XATTR_CREATE (like we did for AT_EACCESS
and AT_REMOVEDIR) we still need another bit for AT_XATTR_REPLACE.

Plus, this is really ugly since AT_XATTR_{CREATE,REPLACE} really isn't
in any way related to lookup and we're mixing it in with lookup
modifying flags.

So my proposal for {g,s}etxattrat() would be:

struct xattr_args {
__aligned_u64 value;
__u32 size;
__u32 cmd;
};

So everything's nicely 64bit aligned in the struct. Use the @cmd member
to set either XATTR_REPLACE or XATTR_CREATE and treat it as a proper
enum and not as a flag argument like the old calls did.

So then we'd have:

setxattrat(int dfd, const char *path, const char __user *name,
   struct xattr_args __user *args, size_t size, unsigned int flags)
getxattrat(int dfd, const char *path, const char __user *name,
   struct xattr_args __user *args, size_t size, unsigned int flags)

The current in-kernel struct xattr_ctx would be renamed to struct
kernel_xattr_args and then we do the usual copy_struct_from_user()
dance:

struct xattr_args args;
err = copy_struct_from_user(, sizeof(args), uargs, usize);

and then go on to handle value/size for setxattrat()/getxattrat()
accordingly.

getxattr()/setxattr() aren't meaningfully filterable by seccomp already
so there's not point in not using a struct.

If that isn't very appealing then another option is to add a new flag
namespace just for setxattrat() similar to fspick() and move_mount()
duplicating the needed lookup modifying flags.
Thoughts?

Re: [RFC PATCH 2/2] fs/xattr: wire up syscalls

2022-08-30 Thread Christian Brauner

On Tue, Aug 30, 2022 at 05:28:38PM +0200, Christian Göttsche wrote:
> Enable the new added extended attribute related syscalls.
> 
> Signed-off-by: Christian Göttsche 
> ---

Fwiw, I think a while ago it was pointed out that for most syscall
additions you can just fold the hookup patch in. It probably also
wouldn't hurt to trim that Cc list significantly down to mostly the
lists...

Re: [RFC PATCH v2 2/3] fs: define a firmware security filesystem named fwsecurityfs

2022-06-28 Thread Christian Brauner

On Mon, Jun 27, 2022 at 09:37:28AM +0200, Greg Kroah-Hartman wrote:
> On Sun, Jun 26, 2022 at 11:48:06AM -0400, Mimi Zohar wrote:
> > On Thu, 2022-06-23 at 09:23 -0400, James Bottomley wrote:
> > > On Thu, 2022-06-23 at 10:54 +0200, Greg Kroah-Hartman wrote:
> > > [...]
> > > > > diff --git a/fs/fwsecurityfs/inode.c b/fs/fwsecurityfs/inode.c
> > > > > new file mode 100644
> > > > > index ..5d06dc0de059
> > > > > --- /dev/null
> > > > > +++ b/fs/fwsecurityfs/inode.c
> > > > > @@ -0,0 +1,159 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0-only
> > > > > +/*
> > > > > + * Copyright (C) 2022 IBM Corporation
> > > > > + * Author: Nayna Jain 
> > > > > + */
> > > > > +
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +#include "internal.h"
> > > > > +
> > > > > +int fwsecurityfs_remove_file(struct dentry *dentry)
> > > > > +{
> > > > > + drop_nlink(d_inode(dentry));
> > > > > + dput(dentry);
> > > > > + return 0;
> > > > > +};
> > > > > +EXPORT_SYMBOL_GPL(fwsecurityfs_remove_file);
> > > > > +
> > > > > +int fwsecurityfs_create_file(const char *name, umode_t mode,
> > > > > + u16 filesize, struct dentry
> > > > > *parent,
> > > > > + struct dentry *dentry,
> > > > > + const struct file_operations
> > > > > *fops)
> > > > > +{
> > > > > + struct inode *inode;
> > > > > + int error;
> > > > > + struct inode *dir;
> > > > > +
> > > > > + if (!parent)
> > > > > + return -EINVAL;
> > > > > +
> > > > > + dir = d_inode(parent);
> > > > > + pr_debug("securityfs: creating file '%s'\n", name);
> > > > 
> > > > Did you forget to call simple_pin_fs() here or anywhere else?
> > > > 
> > > > And this can be just one function with the directory creation file,
> > > > just check the mode and you will be fine.  Look at securityfs as an
> > > > example of how to make this simpler.
> > > 
> > > Actually, before you go down this route can you consider the namespace
> > > ramifications.  In fact we're just having to rework securityfs to pull
> > > out all the simple_pin_... calls because simple_pin_... is completely
> > > inimical to namespaces.

I described this at length in the securityfs namespacing thread at
various points. simple_pin_*() should be avoided if possible. Ideally
the filesystem will just be cleaned up on umount. There might be a
reason to make it survive umounts if you have state that stays around
and somehow is intimately tied to that filesystem.

> > > 
> > > The first thing to consider is if you simply use securityfs you'll
> > > inherit all the simple_pin_... removal work and be namespace ready.  It
> > > could be that creating a new filesystem that can't be namespaced is the
> > > right thing to do here, but at least ask the question: would we ever
> > > want any of these files to be presented selectively inside containers? 
> > > If the answer is "yes" then simple_pin_... is the wrong interface.
> > 
> > Greg, the securityfs changes James is referring to are part of the IMA
> > namespacing patch set:
> > https://lore.kernel.org/linux-integrity/20220420140633.753772-1-stef...@linux.ibm.com/
> > 
> > I'd really appreciate your reviewing the first two patches:
> > [PATCH v12 01/26] securityfs: rework dentry creation
> > [PATCH v12 02/26] securityfs: Extend securityfs with namespacing
> > support
> 
> Looks like others have already reviewed them, they seem sane to me if
> they past testing.

Thanks for taking a look.

Re: [PATCH v4 2/3] audit: add support for the openat2 syscall

2021-05-20 Thread Christian Brauner

gt; + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_NATIVE;
>   }
> diff --git a/arch/s390/kernel/compat_audit.c b/arch/s390/kernel/compat_audit.c
> index fc3d1c7ad21c..4b3d463e7d97 100644
> --- a/arch/s390/kernel/compat_audit.c
> +++ b/arch/s390/kernel/compat_audit.c
> @@ -40,6 +40,8 @@ int s390_classify_syscall(unsigned syscall)
>   return AUDITSC_SOCKETCALL;
>   case __NR_execve:
>   return AUDITSC_EXECVE;
> + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_COMPAT;
>   }
> diff --git a/arch/sparc/kernel/audit.c b/arch/sparc/kernel/audit.c
> index 50fab35bdaba..b092274eca79 100644
> --- a/arch/sparc/kernel/audit.c
> +++ b/arch/sparc/kernel/audit.c
> @@ -55,6 +55,8 @@ int audit_classify_syscall(int abi, unsigned int syscall)
>   return AUDITSC_SOCKETCALL;
>   case __NR_execve:
>   return AUDITSC_EXECVE;
> + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_NATIVE;
>   }
> diff --git a/arch/sparc/kernel/compat_audit.c 
> b/arch/sparc/kernel/compat_audit.c
> index 1c1b6d075421..2a3f71206fc5 100644
> --- a/arch/sparc/kernel/compat_audit.c
> +++ b/arch/sparc/kernel/compat_audit.c
> @@ -40,6 +40,8 @@ int sparc32_classify_syscall(unsigned int syscall)
>   return AUDITSC_SOCKETCALL;
>   case __NR_execve:
>   return AUDITSC_EXECVE;
> + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_COMPAT;
>   }
> diff --git a/arch/x86/ia32/audit.c b/arch/x86/ia32/audit.c
> index eedc37a1ee13..efc7d832fefb 100644
> --- a/arch/x86/ia32/audit.c
> +++ b/arch/x86/ia32/audit.c
> @@ -40,6 +40,8 @@ int ia32_classify_syscall(unsigned syscall)
>   case __NR_execve:
>   case __NR_execveat:
>   return AUDITSC_EXECVE;
> + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_COMPAT;
>   }
> diff --git a/arch/x86/kernel/audit_64.c b/arch/x86/kernel/audit_64.c
> index 2a6cc9c9c881..44c3601cfdc4 100644
> --- a/arch/x86/kernel/audit_64.c
> +++ b/arch/x86/kernel/audit_64.c
> @@ -53,6 +53,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   case __NR_execve:
>   case __NR_execveat:
>   return AUDITSC_EXECVE;
> + case __NR_openat2:
> + return AUDITSC_OPENAT2;
>   default:
>   return AUDITSC_NATIVE;
>   }
> diff --git a/include/linux/auditsc_classmacros.h 
> b/include/linux/auditsc_classmacros.h
> index 18757d270961..dc8e72536dbd 100644
> --- a/include/linux/auditsc_classmacros.h
> +++ b/include/linux/auditsc_classmacros.h
> @@ -16,6 +16,7 @@ enum auditsc_class_t {
>   AUDITSC_OPENAT,
>   AUDITSC_SOCKETCALL,
>   AUDITSC_EXECVE,
> + AUDITSC_OPENAT2,
>  
>   AUDITSC_NVALS /* count */
>  };
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index d775ea16505b..3f59ab209dfd 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -76,6 +76,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "audit.h"
>  
> @@ -196,6 +197,8 @@ static int audit_match_perm(struct audit_context *ctx, 
> int mask)
>   return ((mask & AUDIT_PERM_WRITE) && ctx->argv[0] == SYS_BIND);
>   case AUDITSC_EXECVE:
>   return mask & AUDIT_PERM_EXEC;
> + case AUDITSC_OPENAT2:
> + return mask & ACC_MODE((u32)((struct open_how 
> *)ctx->argv[2])->flags);

That's a lot of dereferncing, casting and masking all at once. Maybe a
small static inline helper would be good for the sake of legibility? Sm
like:

static inline u32 audit_openat2_acc(struct open_how *how, int mask)
{
u32 flags = how->flags;
return mask & ACC_MODE(flags);
}

but not sure. Just seems more legible to me.
Otherwise.
Acked-by: Christian Brauner

Re: [PATCH v4 1/3] audit: replace magic audit syscall class numbers with macros

2021-05-20 Thread Christian Brauner

On Wed, May 19, 2021 at 04:00:20PM -0400, Richard Guy Briggs wrote:
> Replace audit syscall class magic numbers with macros.
> 
> This required putting the macros into new header file
> include/linux/auditsc_classmacros.h since the syscall macros were
> included for both 64 bit and 32 bit in any compat code, causing
> redefinition warnings.
> 
> Signed-off-by: Richard Guy Briggs 
> Link: 
> https://lore.kernel.org/r/2300b1083a32aade7ae7efb95826e8f3f260b1df.1621363275.git@redhat.com

Looks good.
Acked-by: Christian Brauner 

Fwiw, I would explicitly number all enum values in auditsc_class_t not
just the first one.

> ---
>  MAINTAINERS |  1 +
>  arch/alpha/kernel/audit.c   |  8 
>  arch/ia64/kernel/audit.c|  8 
>  arch/parisc/kernel/audit.c  |  8 
>  arch/parisc/kernel/compat_audit.c   |  9 +
>  arch/powerpc/kernel/audit.c | 10 +-
>  arch/powerpc/kernel/compat_audit.c  | 11 ++-
>  arch/s390/kernel/audit.c| 10 +-
>  arch/s390/kernel/compat_audit.c | 11 ++-
>  arch/sparc/kernel/audit.c   | 10 +-
>  arch/sparc/kernel/compat_audit.c| 11 ++-
>  arch/x86/ia32/audit.c   | 11 ++-
>  arch/x86/kernel/audit_64.c  |  8 
>  include/linux/audit.h   |  1 +
>  include/linux/auditsc_classmacros.h | 23 +++
>  kernel/auditsc.c| 12 ++--
>  lib/audit.c | 10 +-
>  lib/compat_audit.c  | 11 ++-
>  18 files changed, 102 insertions(+), 71 deletions(-)
>  create mode 100644 include/linux/auditsc_classmacros.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index bd7aff0c120f..3348d12019f9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3036,6 +3036,7 @@ W:  https://github.com/linux-audit
>  T:   git git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit.git
>  F:   include/asm-generic/audit_*.h
>  F:   include/linux/audit.h
> +F:   include/linux/auditsc_classmacros.h
>  F:   include/uapi/linux/audit.h
>  F:   kernel/audit*
>  F:   lib/*audit.c
> diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> index 96a9d18ff4c4..81cbd804e375 100644
> --- a/arch/alpha/kernel/audit.c
> +++ b/arch/alpha/kernel/audit.c
> @@ -37,13 +37,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
>  {
>   switch(syscall) {
>   case __NR_open:
> - return 2;
> + return AUDITSC_OPEN;
>   case __NR_openat:
> - return 3;
> + return AUDITSC_OPENAT;
>   case __NR_execve:
> - return 5;
> + return AUDITSC_EXECVE;
>   default:
> - return 0;
> + return AUDITSC_NATIVE;
>   }
>  }
>  
> diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> index 5192ca899fe6..dba6a74c9ab3 100644
> --- a/arch/ia64/kernel/audit.c
> +++ b/arch/ia64/kernel/audit.c
> @@ -38,13 +38,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
>  {
>   switch(syscall) {
>   case __NR_open:
> - return 2;
> + return AUDITSC_OPEN;
>   case __NR_openat:
> - return 3;
> + return AUDITSC_OPENAT;
>   case __NR_execve:
> - return 5;
> + return AUDITSC_EXECVE;
>   default:
> - return 0;
> + return AUDITSC_NATIVE;
>   }
>  }
>  
> diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> index 9eb47b2225d2..14244e83db75 100644
> --- a/arch/parisc/kernel/audit.c
> +++ b/arch/parisc/kernel/audit.c
> @@ -47,13 +47,13 @@ int audit_classify_syscall(int abi, unsigned syscall)
>  #endif
>   switch (syscall) {
>   case __NR_open:
> - return 2;
> + return AUDITSC_OPEN;
>   case __NR_openat:
> - return 3;
> + return AUDITSC_OPENAT;
>   case __NR_execve:
> - return 5;
> + return AUDITSC_EXECVE;
>   default:
> - return 0;
> + return AUDITSC_NATIVE;
>   }
>  }
>  
> diff --git a/arch/parisc/kernel/compat_audit.c 
> b/arch/parisc/kernel/compat_audit.c
> index 20c39c9d86a9..1d6347d37d92 100644
> --- a/arch/parisc/kernel/compat_audit.c
> +++ b/arch/parisc/kernel/compat_audit.c
> @@ -1,4 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> +#include 
>  #include 
>  
>  unsigned int parisc32_dir_class[] = {
> @@ -30,12 +31,12 @@ int parisc32_classify_syscall(unsigned syscall)
>  {
>   switch (syscall) {
>   case __NR_

Re: [PATCH 1/2] audit: add support for the openat2 syscall

2021-04-23 Thread Christian Brauner

On Thu, Apr 22, 2021 at 10:34:08PM -0400, Richard Guy Briggs wrote:
> On 2021-03-18 08:08, Richard Guy Briggs wrote:
> > On 2021-03-18 11:48, Christian Brauner wrote:
> > > [+Cc Aleksa, the author of openat2()]
> > 
> > Ah!  Thanks for pulling in Aleksa.  I thought I caught everyone...
> > 
> > > and a comment below. :)
> > 
> > Same...
> > 
> > > On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote:
> > > > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> > > > ("open: introduce openat2(2) syscall")
> > > > 
> > > > Add the openat2(2) syscall to the audit syscall classifier.
> > > > 
> > > > See the github issue
> > > > https://github.com/linux-audit/audit-kernel/issues/67
> > > > 
> > > > Signed-off-by: Richard Guy Briggs 
> > > > ---
> > > >  arch/alpha/kernel/audit.c  | 2 ++
> > > >  arch/ia64/kernel/audit.c   | 2 ++
> > > >  arch/parisc/kernel/audit.c | 2 ++
> > > >  arch/parisc/kernel/compat_audit.c  | 2 ++
> > > >  arch/powerpc/kernel/audit.c| 2 ++
> > > >  arch/powerpc/kernel/compat_audit.c | 2 ++
> > > >  arch/s390/kernel/audit.c   | 2 ++
> > > >  arch/s390/kernel/compat_audit.c| 2 ++
> > > >  arch/sparc/kernel/audit.c  | 2 ++
> > > >  arch/sparc/kernel/compat_audit.c   | 2 ++
> > > >  arch/x86/ia32/audit.c  | 2 ++
> > > >  arch/x86/kernel/audit_64.c | 2 ++
> > > >  kernel/auditsc.c   | 3 +++
> > > >  lib/audit.c| 4 
> > > >  lib/compat_audit.c | 4 
> > > >  15 files changed, 35 insertions(+)
> > > > 
> > > > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> > > > index 96a9d18ff4c4..06a911b685d1 100644
> > > > --- a/arch/alpha/kernel/audit.c
> > > > +++ b/arch/alpha/kernel/audit.c
> > > > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> > > > index 5192ca899fe6..5eaa888c8fd3 100644
> > > > --- a/arch/ia64/kernel/audit.c
> > > > +++ b/arch/ia64/kernel/audit.c
> > > > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> > > > index 9eb47b2225d2..fc721a7727ba 100644
> > > > --- a/arch/parisc/kernel/audit.c
> > > > +++ b/arch/parisc/kernel/audit.c
> > > > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 0;
> > > > }
> > > > diff --git a/arch/parisc/kernel/compat_audit.c 
> > > > b/arch/parisc/kernel/compat_audit.c
> > > > index 20c39c9d86a9..fc6d35918c44 100644
> > > > --- a/arch/parisc/kernel/compat_audit.c
> > > > +++ b/arch/parisc/kernel/compat_audit.c
> > > > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall)
> > > > return 3;
> > > > case __NR_execve:
> > > > return 5;
> > > > +   case __NR_openat2:
> > > > +   return 6;
> > > > default:
> > > > return 1;
> > > > }
> > > > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
> > > > index a27f3d09..8f32700b0baa 100644
> > > > --- a/arch/po

Re: [PATCH v1 1/1] kernel.h: Split out panic and oops helpers

2021-04-06 Thread Christian Brauner

On Tue, Apr 06, 2021 at 04:31:58PM +0300, Andy Shevchenko wrote:
> kernel.h is being used as a dump for all kinds of stuff for a long time.
> Here is the attempt to start cleaning it up by splitting out panic and
> oops helpers.
> 
> At the same time convert users in header and lib folder to use new header.
> Though for time being include new header back to kernel.h to avoid twisted
> indirected includes for existing users.
> 
> Signed-off-by: Andy Shevchenko 
> ---

(I think David has tried something like this a few years ago too?)
Good idea in any case. (Be good to see kbuild do an allmodconfig build
of this though.)
Acked-by: Christian Brauner 

>  arch/powerpc/kernel/setup-common.c   |  1 +
>  arch/x86/include/asm/desc.h  |  1 +
>  arch/x86/kernel/cpu/mshyperv.c   |  1 +
>  arch/x86/kernel/setup.c  |  1 +
>  drivers/char/ipmi/ipmi_msghandler.c  |  1 +
>  drivers/remoteproc/remoteproc_core.c |  1 +
>  include/asm-generic/bug.h|  3 +-
>  include/linux/kernel.h   | 84 +---
>  include/linux/panic.h| 98 
>  include/linux/panic_notifier.h   | 12 
>  kernel/hung_task.c   |  1 +
>  kernel/kexec_core.c  |  1 +
>  kernel/panic.c   |  1 +
>  kernel/rcu/tree.c|  2 +
>  kernel/sysctl.c  |  1 +
>  kernel/trace/trace.c |  1 +
>  16 files changed, 126 insertions(+), 84 deletions(-)
>  create mode 100644 include/linux/panic.h
>  create mode 100644 include/linux/panic_notifier.h
> 
> diff --git a/arch/powerpc/kernel/setup-common.c 
> b/arch/powerpc/kernel/setup-common.c
> index 74a98fff2c2f..046fe21b5c3b 100644
> --- a/arch/powerpc/kernel/setup-common.c
> +++ b/arch/powerpc/kernel/setup-common.c
> @@ -9,6 +9,7 @@
>  #undef DEBUG
>  
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> index 476082a83d1c..ceb12683b6d1 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -9,6 +9,7 @@
>  #include 
>  #include 
>  
> +#include 
>  #include 
>  #include 
>  
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 22f13343b5da..9e5c6f2b044d 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 59e5e0903b0c..570699eecf90 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/char/ipmi/ipmi_msghandler.c 
> b/drivers/char/ipmi/ipmi_msghandler.c
> index 8a0e97b33cae..e96cb5c4f97a 100644
> --- a/drivers/char/ipmi/ipmi_msghandler.c
> +++ b/drivers/char/ipmi/ipmi_msghandler.c
> @@ -16,6 +16,7 @@
>  
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/drivers/remoteproc/remoteproc_core.c 
> b/drivers/remoteproc/remoteproc_core.c
> index 626a6b90fba2..76dd8e2b1e7e 100644
> --- a/drivers/remoteproc/remoteproc_core.c
> +++ b/drivers/remoteproc/remoteproc_core.c
> @@ -20,6 +20,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> diff --git a/include/asm-generic/bug.h b/include/asm-generic/bug.h
> index 76a10e0dca9f..719410b93f99 100644
> --- a/include/asm-generic/bug.h
> +++ b/include/asm-generic/bug.h
> @@ -17,7 +17,8 @@
>  #endif
>  
>  #ifndef __ASSEMBLY__
> -#include 
> +#include 
> +#include 
>  
>  #ifdef CONFIG_BUG
>  
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 09035ac67d4b..6c5a05ac1ecb 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -14,6 +14,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -70,7 +71,6 @@
>  #define lower_32_bits(n) ((u32)((n) & 0x))
>  
>  struct completion;
> -struct pt_regs;
>  struct user;
>  
>  #ifdef CONFIG_PREEMPT_VOLUNTARY
> @@ -175,14 +175,6 @@ void __might_fault(const char *file, int line);
>  static inline void might_fault(void) { }
>  #endif
>  
> -extern struct atomic_notifier_head panic_notifier_list;
> -extern long (*panic_blink)(int state);
> -__printf(1, 2)
> -void panic(const char *fmt, ...) __noreturn __cold;
> -vo

Re: [PATCH 1/2] audit: add support for the openat2 syscall

2021-03-18 Thread Christian Brauner

On Thu, Mar 18, 2021 at 11:48:45AM +0100, Christian Brauner wrote:
> [+Cc Aleksa, the author of openat2()]
> 
> and a comment below. :)
> 
> On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote:
> > The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> > ("open: introduce openat2(2) syscall")
> > 
> > Add the openat2(2) syscall to the audit syscall classifier.
> > 
> > See the github issue
> > https://github.com/linux-audit/audit-kernel/issues/67
> > 
> > Signed-off-by: Richard Guy Briggs 
> > ---
> >  arch/alpha/kernel/audit.c  | 2 ++
> >  arch/ia64/kernel/audit.c   | 2 ++
> >  arch/parisc/kernel/audit.c | 2 ++
> >  arch/parisc/kernel/compat_audit.c  | 2 ++
> >  arch/powerpc/kernel/audit.c| 2 ++
> >  arch/powerpc/kernel/compat_audit.c | 2 ++
> >  arch/s390/kernel/audit.c   | 2 ++
> >  arch/s390/kernel/compat_audit.c| 2 ++
> >  arch/sparc/kernel/audit.c  | 2 ++
> >  arch/sparc/kernel/compat_audit.c   | 2 ++
> >  arch/x86/ia32/audit.c  | 2 ++
> >  arch/x86/kernel/audit_64.c | 2 ++
> >  kernel/auditsc.c   | 3 +++
> >  lib/audit.c| 4 
> >  lib/compat_audit.c | 4 
> >  15 files changed, 35 insertions(+)
> > 
> > diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> > index 96a9d18ff4c4..06a911b685d1 100644
> > --- a/arch/alpha/kernel/audit.c
> > +++ b/arch/alpha/kernel/audit.c
> > @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > return 3;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 0;
> > }
> > diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> > index 5192ca899fe6..5eaa888c8fd3 100644
> > --- a/arch/ia64/kernel/audit.c
> > +++ b/arch/ia64/kernel/audit.c
> > @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > return 3;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 0;
> > }
> > diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> > index 9eb47b2225d2..fc721a7727ba 100644
> > --- a/arch/parisc/kernel/audit.c
> > +++ b/arch/parisc/kernel/audit.c
> > @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > return 3;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 0;
> > }
> > diff --git a/arch/parisc/kernel/compat_audit.c 
> > b/arch/parisc/kernel/compat_audit.c
> > index 20c39c9d86a9..fc6d35918c44 100644
> > --- a/arch/parisc/kernel/compat_audit.c
> > +++ b/arch/parisc/kernel/compat_audit.c
> > @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall)
> > return 3;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 1;
> > }
> > diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
> > index a27f3d09..8f32700b0baa 100644
> > --- a/arch/powerpc/kernel/audit.c
> > +++ b/arch/powerpc/kernel/audit.c
> > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
> > return 4;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 0;
> > }
> > diff --git a/arch/powerpc/kernel/compat_audit.c 
> > b/arch/powerpc/kernel/compat_audit.c
> > index 55c6ccda0a85..ebe45534b1c9 100644
> > --- a/arch/powerpc/kernel/compat_audit.c
> > +++ b/arch/powerpc/kernel/compat_audit.c
> > @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall)
> > return 4;
> > case __NR_execve:
> > return 5;
> > +   case __NR_openat2:
> > +   return 6;
> > default:
> > return 1;
> > }
> > diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
> > index d395c6c9944c..d964cb94cfaf 100644
> > --- a/arch/s390/kernel/audit.c
> > +++ b/arch/s390/kernel/audit.c
> > @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned sysc

Re: [PATCH 1/2] audit: add support for the openat2 syscall

2021-03-18 Thread Christian Brauner

[+Cc Aleksa, the author of openat2()]

and a comment below. :)

On Wed, Mar 17, 2021 at 09:47:17PM -0400, Richard Guy Briggs wrote:
> The openat2(2) syscall was added in kernel v5.6 with commit fddb5d430ad9
> ("open: introduce openat2(2) syscall")
> 
> Add the openat2(2) syscall to the audit syscall classifier.
> 
> See the github issue
> https://github.com/linux-audit/audit-kernel/issues/67
> 
> Signed-off-by: Richard Guy Briggs 
> ---
>  arch/alpha/kernel/audit.c  | 2 ++
>  arch/ia64/kernel/audit.c   | 2 ++
>  arch/parisc/kernel/audit.c | 2 ++
>  arch/parisc/kernel/compat_audit.c  | 2 ++
>  arch/powerpc/kernel/audit.c| 2 ++
>  arch/powerpc/kernel/compat_audit.c | 2 ++
>  arch/s390/kernel/audit.c   | 2 ++
>  arch/s390/kernel/compat_audit.c| 2 ++
>  arch/sparc/kernel/audit.c  | 2 ++
>  arch/sparc/kernel/compat_audit.c   | 2 ++
>  arch/x86/ia32/audit.c  | 2 ++
>  arch/x86/kernel/audit_64.c | 2 ++
>  kernel/auditsc.c   | 3 +++
>  lib/audit.c| 4 
>  lib/compat_audit.c | 4 
>  15 files changed, 35 insertions(+)
> 
> diff --git a/arch/alpha/kernel/audit.c b/arch/alpha/kernel/audit.c
> index 96a9d18ff4c4..06a911b685d1 100644
> --- a/arch/alpha/kernel/audit.c
> +++ b/arch/alpha/kernel/audit.c
> @@ -42,6 +42,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   return 3;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 0;
>   }
> diff --git a/arch/ia64/kernel/audit.c b/arch/ia64/kernel/audit.c
> index 5192ca899fe6..5eaa888c8fd3 100644
> --- a/arch/ia64/kernel/audit.c
> +++ b/arch/ia64/kernel/audit.c
> @@ -43,6 +43,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   return 3;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 0;
>   }
> diff --git a/arch/parisc/kernel/audit.c b/arch/parisc/kernel/audit.c
> index 9eb47b2225d2..fc721a7727ba 100644
> --- a/arch/parisc/kernel/audit.c
> +++ b/arch/parisc/kernel/audit.c
> @@ -52,6 +52,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   return 3;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 0;
>   }
> diff --git a/arch/parisc/kernel/compat_audit.c 
> b/arch/parisc/kernel/compat_audit.c
> index 20c39c9d86a9..fc6d35918c44 100644
> --- a/arch/parisc/kernel/compat_audit.c
> +++ b/arch/parisc/kernel/compat_audit.c
> @@ -35,6 +35,8 @@ int parisc32_classify_syscall(unsigned syscall)
>   return 3;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 1;
>   }
> diff --git a/arch/powerpc/kernel/audit.c b/arch/powerpc/kernel/audit.c
> index a27f3d09..8f32700b0baa 100644
> --- a/arch/powerpc/kernel/audit.c
> +++ b/arch/powerpc/kernel/audit.c
> @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   return 4;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 0;
>   }
> diff --git a/arch/powerpc/kernel/compat_audit.c 
> b/arch/powerpc/kernel/compat_audit.c
> index 55c6ccda0a85..ebe45534b1c9 100644
> --- a/arch/powerpc/kernel/compat_audit.c
> +++ b/arch/powerpc/kernel/compat_audit.c
> @@ -38,6 +38,8 @@ int ppc32_classify_syscall(unsigned syscall)
>   return 4;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 1;
>   }
> diff --git a/arch/s390/kernel/audit.c b/arch/s390/kernel/audit.c
> index d395c6c9944c..d964cb94cfaf 100644
> --- a/arch/s390/kernel/audit.c
> +++ b/arch/s390/kernel/audit.c
> @@ -54,6 +54,8 @@ int audit_classify_syscall(int abi, unsigned syscall)
>   return 4;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 0;
>   }
> diff --git a/arch/s390/kernel/compat_audit.c b/arch/s390/kernel/compat_audit.c
> index 444fb1f66944..f7b32933ce0e 100644
> --- a/arch/s390/kernel/compat_audit.c
> +++ b/arch/s390/kernel/compat_audit.c
> @@ -39,6 +39,8 @@ int s390_classify_syscall(unsigned syscall)
>   return 4;
>   case __NR_execve:
>   return 5;
> + case __NR_openat2:
> + return 6;
>   default:
>   return 1;
>   }
> diff --git a/arch/sparc/kernel/audit.c b/arch/sparc/kernel/audit.c
> index a6e91bf34d48..b6dcca9c6520 100644
> --- a/arch/sparc/kernel/audit.c
> +++ b/arch/sparc/kernel/audit.c
> @@ -55,6 +55,8 @@ int audit_classify_syscall(int abi, unsigned int syscall)
>

Re: [PATCH 2/9] fs: add an argument-less alloc_anon_inode

2021-03-10 Thread Christian Brauner

On Tue, Mar 09, 2021 at 04:53:41PM +0100, Christoph Hellwig wrote:
> Add a new alloc_anon_inode helper that allocates an inode on
> the anon_inode file system.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good!
Reviewed-by: Christian Brauner

Re: [PATCH 1/9] fs: rename alloc_anon_inode to alloc_anon_inode_sb

2021-03-10 Thread Christian Brauner

On Tue, Mar 09, 2021 at 04:53:40PM +0100, Christoph Hellwig wrote:
> Rename alloc_inode to free the name for a new variant that does not
> need boilerplate to create a super_block first.
> 
> Signed-off-by: Christoph Hellwig 
> ---

Looks good (with the metioned fix in
https://lore.kernel.org/lkml/20210310083040.ga5...@lst.de)

Reviewed-by: Christian Brauner 

>  arch/powerpc/platforms/pseries/cmm.c | 2 +-
>  drivers/dma-buf/dma-buf.c| 2 +-
>  drivers/gpu/drm/drm_drv.c| 2 +-
>  drivers/misc/cxl/api.c   | 2 +-
>  drivers/misc/vmw_balloon.c   | 2 +-
>  drivers/scsi/cxlflash/ocxl_hw.c  | 2 +-
>  drivers/virtio/virtio_balloon.c  | 2 +-
>  fs/aio.c | 2 +-
>  fs/anon_inodes.c | 4 ++--
>  fs/libfs.c   | 2 +-
>  include/linux/fs.h   | 2 +-
>  kernel/resource.c| 2 +-
>  mm/z3fold.c  | 2 +-
>  mm/zsmalloc.c| 2 +-
>  14 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/cmm.c 
> b/arch/powerpc/platforms/pseries/cmm.c
> index 45a3a3022a85c9..6d36b858b14df1 100644
> --- a/arch/powerpc/platforms/pseries/cmm.c
> +++ b/arch/powerpc/platforms/pseries/cmm.c
> @@ -580,7 +580,7 @@ static int cmm_balloon_compaction_init(void)
>   return rc;
>   }
>  
> - b_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
> + b_dev_info.inode = alloc_anon_inode_sb(balloon_mnt->mnt_sb);
>   if (IS_ERR(b_dev_info.inode)) {
>   rc = PTR_ERR(b_dev_info.inode);
>   b_dev_info.inode = NULL;
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index f264b70c383eb4..dedcc9483352dc 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -445,7 +445,7 @@ static inline int is_dma_buf_file(struct file *file)
>  static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
>  {
>   struct file *file;
> - struct inode *inode = alloc_anon_inode(dma_buf_mnt->mnt_sb);
> + struct inode *inode = alloc_anon_inode_sb(dma_buf_mnt->mnt_sb);
>  
>   if (IS_ERR(inode))
>   return ERR_CAST(inode);
> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> index 20d22e41d7ce74..87e7214a8e3565 100644
> --- a/drivers/gpu/drm/drm_drv.c
> +++ b/drivers/gpu/drm/drm_drv.c
> @@ -519,7 +519,7 @@ static struct inode *drm_fs_inode_new(void)
>   return ERR_PTR(r);
>   }
>  
> - inode = alloc_anon_inode(drm_fs_mnt->mnt_sb);
> + inode = alloc_anon_inode_sb(drm_fs_mnt->mnt_sb);
>   if (IS_ERR(inode))
>   simple_release_fs(_fs_mnt, _fs_cnt);
>  
> diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
> index b493de962153ba..2efbf6c98028ef 100644
> --- a/drivers/misc/cxl/api.c
> +++ b/drivers/misc/cxl/api.c
> @@ -73,7 +73,7 @@ static struct file *cxl_getfile(const char *name,
>   goto err_module;
>   }
>  
> - inode = alloc_anon_inode(cxl_vfs_mount->mnt_sb);
> + inode = alloc_anon_inode_sb(cxl_vfs_mount->mnt_sb);
>   if (IS_ERR(inode)) {
>   file = ERR_CAST(inode);
>   goto err_fs;
> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
> index b837e7eba5f7dc..5d057a05ddbee8 100644
> --- a/drivers/misc/vmw_balloon.c
> +++ b/drivers/misc/vmw_balloon.c
> @@ -1900,7 +1900,7 @@ static __init int vmballoon_compaction_init(struct 
> vmballoon *b)
>   return PTR_ERR(vmballoon_mnt);
>  
>   b->b_dev_info.migratepage = vmballoon_migratepage;
> - b->b_dev_info.inode = alloc_anon_inode(vmballoon_mnt->mnt_sb);
> + b->b_dev_info.inode = alloc_anon_inode_sb(vmballoon_mnt->mnt_sb);
>  
>   if (IS_ERR(b->b_dev_info.inode))
>   return PTR_ERR(b->b_dev_info.inode);
> diff --git a/drivers/scsi/cxlflash/ocxl_hw.c b/drivers/scsi/cxlflash/ocxl_hw.c
> index 244fc27215dc79..40184ed926b557 100644
> --- a/drivers/scsi/cxlflash/ocxl_hw.c
> +++ b/drivers/scsi/cxlflash/ocxl_hw.c
> @@ -88,7 +88,7 @@ static struct file *ocxlflash_getfile(struct device *dev, 
> const char *name,
>   goto err2;
>   }
>  
> - inode = alloc_anon_inode(ocxlflash_vfs_mount->mnt_sb);
> + inode = alloc_anon_inode_sb(ocxlflash_vfs_mount->mnt_sb);
>   if (IS_ERR(inode)) {
>   rc = PTR_ERR(inode);
>   dev_err(dev, "%s: alloc_anon_inode failed rc=%d\n",
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8985fc2cea8615..cae76ee5bdd688 1006

Re: [PATCH v2 4/4] selftests/clone3: Avoid OS-defined clone_args

2020-09-21 Thread Christian Brauner

On Sat, Sep 19, 2020 at 01:06:37AM -0700, Kees Cook wrote:
> As the UAPI headers start to appear in distros, we need to avoid outdated
> versions of struct clone_args to be able to test modern features;
> rename to "struct __clone_args". Additionally update the struct size
> macro names to match UAPI names.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good, thanks!
Acked-by: Christian Brauner

Re: [PATCH v2 3/4] selftests/seccomp: powerpc: Set syscall return during ptrace syscall exit

2020-09-21 Thread Christian Brauner

On Sat, Sep 19, 2020 at 01:06:36AM -0700, Kees Cook wrote:
> Some archs (like powerpc) only support changing the return code during
> syscall exit when ptrace is used. Test entry vs exit phases for which
> portions of the syscall number and return values need to be set at which
> different phases. For non-powerpc, all changes are made during ptrace
> syscall entry, as before. For powerpc, the syscall number is changed at
> ptrace syscall entry and the syscall return value is changed on ptrace
> syscall exit.
> 
> Reported-by: Thadeu Lima de Souza Cascardo 
> Suggested-by: Thadeu Lima de Souza Cascardo 
> Link: 
> https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-casca...@canonical.com/
> Fixes: 58d0a862f573 ("seccomp: add tests for ptrace hole")
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH v2 2/4] selftests/seccomp: Allow syscall nr and ret value to be set separately

2020-09-21 Thread Christian Brauner

On Sat, Sep 19, 2020 at 01:06:35AM -0700, Kees Cook wrote:
> In preparation for setting syscall nr and ret values separately, refactor
> the helpers to take a pointer to a value, so that a NULL can indicate
> "do not change this respective value". This is done to keep the regset
> read/write happening once and in one code path.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH v2 1/4] selftests/seccomp: Record syscall during ptrace entry

2020-09-21 Thread Christian Brauner

On Sat, Sep 19, 2020 at 01:06:34AM -0700, Kees Cook wrote:
> In preparation for performing actions during ptrace syscall exit, save
> the syscall number during ptrace syscall entry. Some architectures do
> no have the syscall number available during ptrace syscall exit.
> 
> Suggested-by: Thadeu Lima de Souza Cascardo 
> Link: 
> https://lore.kernel.org/linux-kselftest/20200911181012.171027-1-casca...@canonical.com/
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 40 +--
>  1 file changed, 27 insertions(+), 13 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index bc0fb463c709..c0311b4c736b 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1949,12 +1949,19 @@ void tracer_seccomp(struct __test_metadata 
> *_metadata, pid_t tracee,
>  
>  }
>  
> +FIXTURE(TRACE_syscall) {
> + struct sock_fprog prog;
> + pid_t tracer, mytid, mypid, parent;
> + long syscall_nr;
> +};
> +
>  void tracer_ptrace(struct __test_metadata *_metadata, pid_t tracee,
>  int status, void *args)
>  {
> - int ret, nr;
> + int ret;
>   unsigned long msg;
>   static bool entry;
> + FIXTURE_DATA(TRACE_syscall) *self = args;
>  
>   /*
>* The traditional way to tell PTRACE_SYSCALL entry/exit
> @@ -1968,24 +1975,31 @@ void tracer_ptrace(struct __test_metadata *_metadata, 
> pid_t tracee,
>   EXPECT_EQ(entry ? PTRACE_EVENTMSG_SYSCALL_ENTRY
>   : PTRACE_EVENTMSG_SYSCALL_EXIT, msg);
>  
> - if (!entry)
> + /*
> +  * Some architectures only support setting return values during
> +  * syscall exit under ptrace, and on exit the syscall number may
> +  * no longer be available. Therefore, save the initial sycall

s/sycall/syscall/

Otherwise looks good. Thanks!
Acked-by: Christian Brauner 

> +  * number here, so it can be examined during both entry and exit
> +  * phases.
> +  */
> + if (entry)
> + self->syscall_nr = get_syscall(_metadata, tracee);
> + else
>   return;
>  
> - nr = get_syscall(_metadata, tracee);
> -
> - if (nr == __NR_getpid)
> + switch (self->syscall_nr) {
> + case __NR_getpid:
>   change_syscall(_metadata, tracee, __NR_getppid, 0);
> - if (nr == __NR_gettid)
> + break;
> + case __NR_gettid:
>   change_syscall(_metadata, tracee, -1, 45000);
> - if (nr == __NR_openat)
> + break;
> + case __NR_openat:
>   change_syscall(_metadata, tracee, -1, -ESRCH);
> + break;
> + }
>  }
>  
> -FIXTURE(TRACE_syscall) {
> - struct sock_fprog prog;
> - pid_t tracer, mytid, mypid, parent;
> -};
> -
>  FIXTURE_VARIANT(TRACE_syscall) {
>   /*
>* All of the SECCOMP_RET_TRACE behaviors can be tested with either
> @@ -2044,7 +2058,7 @@ FIXTURE_SETUP(TRACE_syscall)
>   self->tracer = setup_trace_fixture(_metadata,
>  variant->use_ptrace ? tracer_ptrace
>  : tracer_seccomp,
> -NULL, variant->use_ptrace);
> +self, variant->use_ptrace);
>  
>   ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
>   ASSERT_EQ(0, ret);
> -- 
> 2.25.1
>

Re: [PATCH 14/15] selftests/clone3: Avoid OS-defined clone_args

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:19AM -0700, Kees Cook wrote:
> As the UAPI headers start to appear in distros, we need to avoid
> outdated versions of struct clone_args to be able to test modern
> features. Additionally pull in the syscall numbers correctly.
> 
> Signed-off-by: Kees Cook 
> ---

Hm, with this patch applied I'm getting:

gcc -g -I../../../../usr/include/clone3_set_tid.c 
/home/brauner/src/git/linux/linux/tools/testing/selftests/kselftest_harness.h 
/home/brauner/src/git/linux/linux/tools/testing/selftests/kselftest.h -lcap -o 
/home/brauner/src/git/linux/linux/tools/testing/selftests/clone3/clone3_set_tid
In file included from clone3_set_tid.c:24:
clone3_selftests.h:37:8: error: redefinition of ‘struct clone_args’
   37 | struct clone_args {
  |^~
In file included from clone3_set_tid.c:12:
/usr/include/linux/sched.h:92:8: note: originally defined here
   92 | struct clone_args {
  |^~
make: *** [../lib.mk:140: 
/home/brauner/src/git/linux/linux/tools/testing/selftests/clone3/clone3_set_tid]
 Error 1

One trick to avoid this could be:

#ifndef CLONE_ARGS_SIZE_VER0
#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
#endif

#ifndef CLONE_ARGS_SIZE_VER1
#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
#endif

#ifndef CLONE_ARGS_SIZE_VER2
#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
#endif

struct __clone_args {
__aligned_u64 flags;
__aligned_u64 pidfd;
__aligned_u64 child_tid;
__aligned_u64 parent_tid;
__aligned_u64 exit_signal;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
__aligned_u64 set_tid;
__aligned_u64 set_tid_size;
__aligned_u64 cgroup;
};

static pid_t sys_clone3(struct __clone_args *args, size_t size)
{
return syscall(__NR_clone3, args, size);
}

Christian

Re: [PATCH 15/15] selftests/seccomp: Use __NR_mknodat instead of __NR_mknod

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:20AM -0700, Kees Cook wrote:
> The __NR_mknod syscall doesn't exist on arm64 (only __NR_mknodat).
> Switch to the modern syscall.
> 
> Fixes: ad5682184a81 ("selftests/seccomp: Check for EPOLLHUP for user_notif")
> Signed-off-by: Kees Cook 
> ---

Thanks! Looks good.
Acked-by: Christian Brauner

Re: [PATCH 11/15] selftests/seccomp: Remove SYSCALL_NUM_RET_SHARE_REG in favor of SYSCALL_RET_SET

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:16AM -0700, Kees Cook wrote:
> Instead of special-casing the specific case of shared registers, create
> a default SYSCALL_RET_SET() macro (mirroring SYSCALL_NUM_SET()), that
> writes to the SYSCALL_RET register. For architectures that can't set the
> return value (for whatever reason), they can define SYSCALL_RET_SET()
> without an associated SYSCALL_RET() macro. This also paves the way for
> architectures that need to do special things to set the return value
> (e.g. powerpc).
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 10/15] selftests/seccomp: Avoid redundant register flushes

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:15AM -0700, Kees Cook wrote:
> When none of the registers have changed, don't flush them back. This can
> happen if the architecture uses a non-register way to change the syscall
> (e.g. arm64) , and a return value hasn't been written.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 09/15] selftests/seccomp: Convert REGSET calls into ARCH_GETREG/ARCH_SETREG

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:14AM -0700, Kees Cook wrote:
> Consolidate the REGSET logic into the new ARCH_GETREG() and
> ARCH_SETREG() macros, avoiding more #ifdef code in function bodies.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 08/15] selftests/seccomp: Convert HAVE_GETREG into ARCH_GETREG/ARCH_SETREG

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:13AM -0700, Kees Cook wrote:
> Instead of special-casing the get/set-registers routines, move the
> HAVE_GETREG logic into the new ARCH_GETREG() and ARCH_SETREG() macros.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 07/15] selftests/seccomp: Remove syscall setting #ifdefs

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:12AM -0700, Kees Cook wrote:
> With all architectures now using the common SYSCALL_NUM_SET() macro, the
> arch-specific #ifdef can be removed from change_syscall() itself.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 06/15] selftests/seccomp: mips: Remove O32-specific macro

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:11AM -0700, Kees Cook wrote:
> Instead of having the mips O32 macro special-cased, pull the logic into
> the SYSCALL_NUM() macro. Additionally include the ABI headers, since
> these appear to have been missing, leaving __NR_O32_Linux undefined.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 05/15] selftests/seccomp: arm64: Define SYSCALL_NUM_SET macro

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:10AM -0700, Kees Cook wrote:
> Remove the arm64 special-case in change_syscall().
> 
> Signed-off-by: Kees Cook 
> ---

We're using iovecs in ptrace()??

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 04/15] selftests/seccomp: arm: Define SYSCALL_NUM_SET macro

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:09AM -0700, Kees Cook wrote:
> Remove the arm special-case in change_syscall().
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 03/15] selftests/seccomp: mips: Define SYSCALL_NUM_SET macro

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:08AM -0700, Kees Cook wrote:
> Remove the mips special-case in change_syscall().
> 
> Signed-off-by: Kees Cook 
> ---
>  tools/testing/selftests/seccomp/seccomp_bpf.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c 
> b/tools/testing/selftests/seccomp/seccomp_bpf.c
> index 1c83e743bfb1..02a9a6599746 100644
> --- a/tools/testing/selftests/seccomp/seccomp_bpf.c
> +++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
> @@ -1742,6 +1742,13 @@ TEST_F(TRACE_poke, getpid_runs_normally)
>  # define ARCH_REGS   struct pt_regs
>  # define SYSCALL_NUM(_regs)  (_regs).regs[2]
>  # define SYSCALL_SYSCALL_NUM regs[4]
> +# define SYSCALL_NUM_SET(_regs, _nr) \
> + do {\
> + if ((_regs).regs[2] == __NR_O32_Linux)  \
> + (_regs).regs[4] = _nr;  \
> + else\
> + (_regs).regs[2] = _nr;  \
> + } while (0)

I think that

# define SYSCALL_NUM_SET(_regs, _nr)\
do {\
if (SYSCALL_NUM(_regs) == __NR_O32_Linux)   \
(_regs).regs[4] = _nr;  \
else\
(_regs).regs[2] = _nr;  \
} while (0)

would read better but that's just a matter of taste. :)

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 02/15] selftests/seccomp: Provide generic syscall setting macro

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:07AM -0700, Kees Cook wrote:
> In order to avoid "#ifdef"s in the main function bodies, create a new
> macro, SYSCALL_NUM_SET(), where arch-specific logic can live.
> 
> Signed-off-by: Kees Cook 
> ---

SYSCALL_SWITCH(_regs, nr)?

But looks good either way!
Acked-by: Christian Brauner

Re: [PATCH 01/15] selftests/seccomp: Refactor arch register macros to avoid xtensa special case

2020-09-15 Thread Christian Brauner

On Sat, Sep 12, 2020 at 04:08:06AM -0700, Kees Cook wrote:
> To avoid an xtensa special-case, refactor all arch register macros to
> take the register variable instead of depending on the macro expanding
> as a struct member name.
> 
> Signed-off-by: Kees Cook 
> ---

Looks good!
Acked-by: Christian Brauner

Re: [PATCH 17/17] arch: rename copy_thread_tls() back to copy_thread()

2020-07-04 Thread Christian Brauner

On Fri, Jun 26, 2020 at 06:17:49AM +0900, Stafford Horne wrote:
> On Tue, Jun 23, 2020 at 01:43:26AM +0200, Christian Brauner wrote:
> 
> > diff --git a/arch/openrisc/kernel/process.c b/arch/openrisc/kernel/process.c
> > index d7010e72450c..19045a3efb8a 100644
> > --- a/arch/openrisc/kernel/process.c
> > +++ b/arch/openrisc/kernel/process.c
> > @@ -116,7 +116,7 @@ void release_thread(struct task_struct *dead_task)
> >  extern asmlinkage void ret_from_fork(void);
> >  
> >  /*
> > - * copy_thread_tls
> > + * copy_thread
> >   * @clone_flags: flags
> >   * @usp: user stack pointer or fn for kernel thread
> >   * @arg: arg to fn for kernel thread; always NULL for userspace thread
> > @@ -147,7 +147,7 @@ extern asmlinkage void ret_from_fork(void);
> >   */
> >  
> >  int
> > -copy_thread_tls(unsigned long clone_flags, unsigned long usp,
> > +copy_thread(unsigned long clone_flags, unsigned long usp,
> > unsigned long arg, struct task_struct *p, unsigned long tls)
> >  {
> 
> For the OpenRISC bit.
> 
> Acked-by: Stafford Horne 
> 
> Also, I would agree with what Kees mentioned about aligning the parameters.
> Sure that would be more work but it would be appreciated.

Sure, this is a mechanical change. I'll update the individual patches to 
account for that.

Thanks!
Christian

[PATCH 17/17] arch: rename copy_thread_tls() back to copy_thread()

2020-06-22 Thread Christian Brauner

Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
back simply copy_thread(). It's a simpler name, and doesn't imply that only
tls is copied here. This finishes an outstanding chunk of internal process
creation work since we've added clone3().

Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Vineet Gupta 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Salter 
Cc: Aurelien Jacquiot 
Cc: Guo Ren 
Cc: Yoshinori Sato 
Cc: Brian Cain 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Michal Simek 
Cc: Thomas Bogendoerfer 
Cc: Nick Hu 
Cc: Greentime Hu 
Cc: Vincent Chen 
Cc: Ley Foon Tan 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Christian Borntraeger 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: Guan Xuetao 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
Cc: "H. Peter Anvin" 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: Kees Cook 
Cc: "Peter Zijlstra (Intel)" 
Cc: "Matthew Wilcox (Oracle)" 
Cc: Al Viro 
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c6x-...@linux-c6x.org
Cc: linux-c...@vger.kernel.org
Cc: uclinux-h8-de...@lists.sourceforge.jp
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Signed-off-by: Christian Brauner 
---
 arch/alpha/kernel/process.c  | 2 +-
 arch/arc/kernel/process.c| 2 +-
 arch/arm/kernel/process.c| 2 +-
 arch/arm64/kernel/process.c  | 2 +-
 arch/c6x/kernel/process.c| 2 +-
 arch/csky/kernel/process.c   | 2 +-
 arch/h8300/kernel/process.c  | 2 +-
 arch/hexagon/kernel/process.c| 2 +-
 arch/ia64/kernel/process.c   | 4 ++--
 arch/m68k/kernel/process.c   | 2 +-
 arch/microblaze/kernel/process.c | 2 +-
 arch/mips/kernel/process.c   | 2 +-
 arch/nds32/kernel/process.c  | 2 +-
 arch/nios2/kernel/process.c  | 2 +-
 arch/openrisc/kernel/process.c   | 4 ++--
 arch/parisc/kernel/process.c | 2 +-
 arch/powerpc/kernel/process.c| 2 +-
 arch/riscv/kernel/process.c  | 2 +-
 arch/s390/kernel/process.c   | 2 +-
 arch/sh/kernel/process_32.c  | 2 +-
 arch/sparc/kernel/process.c  | 6 +++---
 arch/sparc/kernel/process_32.c   | 2 +-
 arch/sparc/kernel/process_64.c   | 2 +-
 arch/um/kernel/process.c | 2 +-
 arch/unicore32/kernel/process.c  | 2 +-
 arch/x86/kernel/process.c| 2 +-
 arch/x86/kernel/unwind_frame.c   | 2 +-
 arch/xtensa/kernel/process.c | 2 +-
 include/linux/sched/task.h   | 2 +-
 kernel/fork.c| 2 +-
 30 files changed, 34 insertions(+), 34 deletions(-)

diff --git a/arch/alpha/kernel/process.c b/arch/alpha/kernel/process.c
index dfdb6b6ba61c..bce96ddaf2fc 100644
--- a/arch/alpha/kernel/process.c
+++ b/arch/alpha/kernel/process.c
@@ -233,7 +233,7 @@ release_thread(struct task_struct *dead_task)
 /*
  * Copy architecture-specific thread state
  */
-int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
+int copy_thread(unsigned long clone_flags, unsigned long usp,
unsigned long kthread_arg, struct task_struct *p,
unsigned long tls)
 {
diff --git a/arch/arc/kernel/process.c b/arch/arc/kernel/process.c
index 8c8e5172fecd..5b6995c823a6 100644
--- a/arch/arc/kernel/process.c
+++ b/arch/arc/kernel/process.c
@@ -173,7 +173,7 @@ asmlinkage void ret_from_fork(void);
  * |user_r25|
  * --  <= END of PAGE
  */
-int copy_thread_tls(unsigned long clone_flags, unsigned long usp,
+int copy_thread(unsigned long clone_flags, unsigned long usp,
unsigned long kthread_arg, struct task_struct *p, unsigned long tls)
 {
struct pt_regs *c_regs;/* child's pt_regs */
diff --git a/arch/arm/kernel/process.c b/arch/arm/kernel/process.c
index 58eaa1f60e16..038669071f9a 100644
--- a/arch/arm/kernel/process.c
+++ b/arch/arm/kernel/process.c
@@ -226,7 +226,7 @@ void release_thread(struct task_struct *dead_task)
 asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
 
 int
-copy_thread_tls(unsigned long clone_flags, unsigned long stack_start,
+copy_thread(unsigned long clone_flags, unsigned long stack_start,
unsigned long stk_sz, struct task_struct *p, unsigned long tls)
 {
struct thread_info *thread = task_thread_info(p

[PATCH 16/17] arch: remove HAVE_COPY_THREAD_TLS

2020-06-22 Thread Christian Brauner

All architectures support copy_thread_tls() now, so remove the legacy
copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
uses the same process creation calling convention based on
copy_thread_tls() and struct kernel_clone_args. This will make it easier to
maintain the core process creation code under kernel/, simplifies the
callpaths and makes the identical for all architectures.

Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Vineet Gupta 
Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Mark Salter 
Cc: Aurelien Jacquiot 
Cc: Guo Ren 
Cc: Yoshinori Sato 
Cc: Brian Cain 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Geert Uytterhoeven 
Cc: Michal Simek 
Cc: Thomas Bogendoerfer 
Cc: Nick Hu 
Cc: Greentime Hu 
Cc: Vincent Chen 
Cc: Ley Foon Tan 
Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Paul Walmsley 
Cc: Palmer Dabbelt 
Cc: Albert Ou 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: Jeff Dike 
Cc: Richard Weinberger 
Cc: Anton Ivanov 
Cc: Guan Xuetao 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
Cc: Peter Zijlstra 
Cc: Kees Cook 
Cc: Mike Rapoport 
Cc: "Matthew Wilcox
Cc: Al Viro 
Cc: linux-ker...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-snps-...@lists.infradead.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-c6x-...@linux-c6x.org
Cc: linux-c...@vger.kernel.org
Cc: uclinux-h8-de...@lists.sourceforge.jp
Cc: linux-hexa...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: openr...@lists.librecores.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-ri...@lists.infradead.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux...@lists.infradead.org
Cc: linux-xte...@linux-xtensa.org
Signed-off-by: Christian Brauner 
---
 arch/Kconfig   |  7 ---
 arch/alpha/Kconfig |  1 -
 arch/arc/Kconfig   |  1 -
 arch/arm/Kconfig   |  1 -
 arch/arm64/Kconfig |  1 -
 arch/c6x/Kconfig   |  1 -
 arch/csky/Kconfig  |  1 -
 arch/h8300/Kconfig |  1 -
 arch/hexagon/Kconfig   |  1 -
 arch/ia64/Kconfig  |  1 -
 arch/m68k/Kconfig  |  1 -
 arch/microblaze/Kconfig|  1 -
 arch/mips/Kconfig  |  1 -
 arch/nds32/Kconfig |  1 -
 arch/nios2/Kconfig |  1 -
 arch/openrisc/Kconfig  |  1 -
 arch/parisc/Kconfig|  1 -
 arch/powerpc/Kconfig   |  1 -
 arch/riscv/Kconfig |  1 -
 arch/s390/Kconfig  |  1 -
 arch/sh/Kconfig|  1 -
 arch/sparc/Kconfig |  1 -
 arch/um/Kconfig|  1 -
 arch/unicore32/Kconfig |  1 -
 arch/x86/Kconfig   |  1 -
 arch/xtensa/Kconfig|  1 -
 include/linux/sched/task.h | 15 +--
 kernel/fork.c  |  9 -
 28 files changed, 1 insertion(+), 55 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 8cc35dc556c7..943aac2f3ebe 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -754,13 +754,6 @@ config ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
depends on MMU
select ARCH_HAS_ELF_RANDOMIZE
 
-config HAVE_COPY_THREAD_TLS
-   bool
-   help
- Architecture provides copy_thread_tls to accept tls argument via
- normal C parameter passing, rather than extracting the syscall
- argument from pt_regs.
-
 config HAVE_STACK_VALIDATION
bool
help
diff --git a/arch/alpha/Kconfig b/arch/alpha/Kconfig
index b01515c6b2ed..10862c5a8c76 100644
--- a/arch/alpha/Kconfig
+++ b/arch/alpha/Kconfig
@@ -38,7 +38,6 @@ config ALPHA
select OLD_SIGSUSPEND
select CPU_NO_EFFICIENT_FFS if !ALPHA_EV67
select MMU_GATHER_NO_RANGE
-   select HAVE_COPY_THREAD_TLS
help
  The Alpha is a 64-bit general-purpose processor designed and
  marketed by the Digital Equipment Corporation of blessed memory,
diff --git a/arch/arc/Kconfig b/arch/arc/Kconfig
index fddc70029727..1fa0b98ed9ce 100644
--- a/arch/arc/Kconfig
+++ b/arch/arc/Kconfig
@@ -29,7 +29,6 @@ config ARC
select GENERIC_SMP_IDLE_THREAD
select HAVE_ARCH_KGDB
select HAVE_ARCH_TRACEHOOK
-   select HAVE_COPY_THREAD_TLS
select HAVE_DEBUG_STACKOVERFLOW
select HAVE_DEBUG_KMEMLEAK
select HAVE_FUTEX_CMPXCHG if FUTEX
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 2ac74904a3ce..445b5ed693f0 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -72,7 +72,6 @@ config ARM
select HAVE_ARM_SMCCC if CPU_V7
select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
select HAVE_CONTEXT_TRACKING
-   select HAVE_COPY_THREAD_TLS
select HAVE_C_RECORDMCOUNT
select HAVE_DEBU

Re: linux-next: manual merge of the pidfd tree with the powerpc-fixes tree

2020-06-19 Thread Christian Brauner

On Fri, Jun 19, 2020 at 09:17:30PM +1000, Michael Ellerman wrote:
> Stephen Rothwell  writes:
> > Hi all,
> >
> > Today's linux-next merge of the pidfd tree got a conflict in:
> >
> >   arch/powerpc/kernel/syscalls/syscall.tbl
> >
> > between commit:
> >
> >   35e32a6cb5f6 ("powerpc/syscalls: Split SPU-ness out of ABI")
> >
> > from the powerpc-fixes tree and commit:
> >
> >   9b4feb630e8e ("arch: wire-up close_range()")
> >
> > from the pidfd tree.
> >
> > I fixed it up (see below) and can carry the fix as necessary. This
> > is now fixed as far as linux-next is concerned, but any non trivial
> > conflicts should be mentioned to your upstream maintainer when your tree
> > is submitted for merging.  You may also want to consider cooperating
> > with the maintainer of the conflicting tree to minimise any particularly
> > complex conflicts.
> 
> Thanks.
> 
> I thought the week between rc1 and rc2 would be a safe time to do that
> conversion of the syscall table, but I guess I was wrong :)

:)

> 
> I'm planning to send those changes to Linus for rc2, so the conflict
> will then be vs mainline. But I guess it's pretty trivial so it doesn't
> really matter.

close_range() is targeted for the v5.9 merge window. I always do
test-merges with mainline at the time I'm creating a pr and I'll just
mention to Linus that there's conflict with ppc. :)

Thanks!
Christian

[PATCH v5 2/3] arch: wire-up close_range()

2020-06-02 Thread Christian Brauner

This wires up the close_range() syscall into all arches at once.

Suggested-by: Arnd Bergmann 
Signed-off-by: Christian Brauner 
Reviewed-by: Oleg Nesterov 
Acked-by: Arnd Bergmann 
Acked-by: Michael Ellerman  (powerpc)
Cc: Jann Horn 
Cc: David Howells 
Cc: Dmitry V. Levin 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Florian Weimer 
Cc: linux-...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux-a...@vger.kernel.org
Cc: x...@kernel.org
---
/* v2 */
not present

/* v3 */
not present

/* v4 */
introduced
- Arnd Bergmann :
  - split into two patches:
1. add close_range()
2. add syscall to all arches at once
  - bump __NR_compat_syscalls in arch/arm64/include/asm/unistd.h

/* v5 */
unchanged
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h   | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/uapi/asm-generic/unistd.h   | 4 +++-
 19 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 36d42da7466a..67ef02ead4da 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,5 +475,6 @@
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
 # 545 reserved for clone3
+546common  close_range sys_close_range
 547common  openat2 sys_openat2
 548common  pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 4d1cf74a2caa..13c5652137fb 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,5 +449,6 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 435common  clone3  sys_clone3
+436common  close_range sys_close_range
 437common  openat2 sys_openat2
 438common  pidfd_getfd sys_pidfd_getfd
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 803039d504de..3b859596840d 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   439
+#define __NR_compat_syscalls   440
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index c1c61635f89c..902bfb136002 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_close_range 436
+__SYSCALL(__NR_close_range, sys_close_range)
 #define __NR_openat2 437
 __SYSCALL(__NR_openat2, sys_openat2)
 #define __NR_pidfd_getfd 438
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index 042911e670b8..df2e14da6a29 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,5 +356,6 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 # 435 reserved for clone3
+436common  close_range sys_close_range
 437common  openat2 sys_openat2
 438common  pidfd_getfd sys_pidfd_getfd
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index f4f49fcb76d0..553b8858e667 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b

Re: [PATCH] selftests: pidfd: Add pidfd_fdinfo_test in .gitignore

2020-02-28 Thread Christian Brauner

On Fri, Feb 28, 2020 at 01:18:44AM +0100, Christian Brauner wrote:
> On February 28, 2020 1:00:08 AM GMT+01:00, Christophe Leroy 
>  wrote:
> >The commit identified below added pidfd_fdinfo_test
> >but failed to add it to .gitignore
> >
> >Fixes: 2def297ec7fb ("pidfd: add tests for NSpid info in fdinfo")
> >Cc: sta...@vger.kernel.org
> >Signed-off-by: Christophe Leroy 
> >---
> > tools/testing/selftests/pidfd/.gitignore | 1 +
> > 1 file changed, 1 insertion(+)
> >
> >diff --git a/tools/testing/selftests/pidfd/.gitignore
> >b/tools/testing/selftests/pidfd/.gitignore
> >index 3a779c084d96..39559d723c41 100644
> >--- a/tools/testing/selftests/pidfd/.gitignore
> >+++ b/tools/testing/selftests/pidfd/.gitignore
> >@@ -2,4 +2,5 @@ pidfd_open_test
> > pidfd_poll_test
> > pidfd_test
> > pidfd_wait
> >+pidfd_fdinfo_test
> > pidfd_getfd_test
> 
> Thanks for spotting this.
> I'll pick this up along with other fixes I have waiting.
> 
> Acked-by: Christian Brauner 

Applied, thanks!
Christian

Re: [PATCH] selftests: pidfd: Add pidfd_fdinfo_test in .gitignore

2020-02-27 Thread Christian Brauner

On February 28, 2020 1:00:08 AM GMT+01:00, Christophe Leroy 
 wrote:
>The commit identified below added pidfd_fdinfo_test
>but failed to add it to .gitignore
>
>Fixes: 2def297ec7fb ("pidfd: add tests for NSpid info in fdinfo")
>Cc: sta...@vger.kernel.org
>Signed-off-by: Christophe Leroy 
>---
> tools/testing/selftests/pidfd/.gitignore | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/tools/testing/selftests/pidfd/.gitignore
>b/tools/testing/selftests/pidfd/.gitignore
>index 3a779c084d96..39559d723c41 100644
>--- a/tools/testing/selftests/pidfd/.gitignore
>+++ b/tools/testing/selftests/pidfd/.gitignore
>@@ -2,4 +2,5 @@ pidfd_open_test
> pidfd_poll_test
> pidfd_test
> pidfd_wait
>+pidfd_fdinfo_test
> pidfd_getfd_test

Thanks for spotting this.
I'll pick this up along with other fixes I have waiting.

Acked-by: Christian Brauner

Re: [PATCH v15 0/9] open: introduce openat2(2) syscall

2019-11-12 Thread Christian Brauner

On Tue, Nov 12, 2019 at 03:01:26PM -0800, Kees Cook wrote:
> On Tue, Nov 12, 2019 at 12:24:04AM +1100, Aleksa Sarai wrote:
> > On 2019-11-05, Aleksa Sarai  wrote:
> > > This patchset is being developed here:
> > >   
> > > 
> > > Patch changelog:
> > >  v15:
> > >   * Fix code style for LOOKUP_IN_ROOT handling in path_init(). [Linus 
> > > Torvalds]
> > >   * Split out patches for each individual LOOKUP flag.
> > >   * Reword commit messages to give more background information about the
> > > series, as well as mention the semantics of each flag in more detail.
> > > [...]
> > 
> > Ping -- this patch hasn't been touched for a week. Thanks.
> 
> If I've been following correctly, everyone is happy with this series.
> (i.e. Linus's comment appear to have been addressed.)
> 
> Perhaps the next question is should this go via a pull request by you to
> Linus directly during the v5.5 merge window, via akpm, via akpm, via
> Christian, or some other path? Besides Linus, it's not been clear who
> should "claim" this series. :)

I like this series and the same with the copy_struct_from_user() part of
it I've taken I'm happy to stuff this into a dedicated branch, merge it
into my for-next and send it for v5.5.
Though I'd _much_ rather see Al pick this up or have him give his
blessing first.

Christian

Re: [PATCH 12/23] y2038: syscalls: change remaining timeval to __kernel_old_timeval

2019-11-11 Thread Christian Brauner

On Fri, Nov 08, 2019 at 10:12:11PM +0100, Arnd Bergmann wrote:
> All of the remaining syscalls that pass a timeval (gettimeofday, utime,
> futimesat) can trivially be changed to pass a __kernel_old_timeval
> instead, which has a compatible layout, but avoids ambiguity with
> the timeval type in user space.
> 
> Signed-off-by: Arnd Bergmann 

Seems reasonable.
Acked-by: Christian Brauner

[REVIEW PATCH v5 2/3] arch: wire-up close_range()

2019-10-25 Thread Christian Brauner

This wires up the close_range() syscall into all arches at once.

Suggested-by: Arnd Bergmann 
Signed-off-by: Christian Brauner 
Reviewed-by: Oleg Nesterov 
Acked-by: Arnd Bergmann 
Acked-by: Michael Ellerman  (powerpc)
Cc: Jann Horn 
Cc: David Howells 
Cc: Dmitry V. Levin 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Florian Weimer 
Cc: linux-...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux-a...@vger.kernel.org
Cc: x...@kernel.org
---
/* v2 */
not present

/* v3 */
not present

/* v4 */
introduced
- Arnd Bergmann :
  - split into two patches:
1. add close_range()
2. add syscall to all arches at once
  - bump __NR_compat_syscalls in arch/arm64/include/asm/unistd.h

/* v5 */
- Christian Brauner :
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h   | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/uapi/asm-generic/unistd.h   | 4 +++-
 19 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..f08906efa5ff 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
 # 545 reserved for clone3
+546common  close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..f25716576d13 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 435common  clone3  sys_clone3
+436common  close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..368761302768 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   436
+#define __NR_compat_syscalls   437
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..c1309f52c8ac 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -879,6 +879,8 @@ __SYSCALL(__NR_fspick, sys_fspick)
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
 #define __NR_clone3 435
 __SYSCALL(__NR_clone3, sys_clone3)
+#define __NR_close_range 436
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..151f4fd234be 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 # 435 reserved for clone3
+436common  close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..adff06c08d2f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 # 435 reserved for clone3
+436common  close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl 
b/arch/microblaze/kernel/syscalls/syscall.tbl
index

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro  wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> > 
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> > 
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)?  Why push the random policy into the place where
> > it doesn't belong?
> > 
> > Seriously, what's the point?  If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
> 
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P

I think we should just do a really dumb, easy to understand minimal
thing for now. It could even just be what every caller is doing right
now anyway with the get_user() loop.
The only way to settle memchr_inv() vs all the other clever ways
suggested here is to benchmark it and see if it matters *for the current
users* of this helper. If it does, great we can do it. If it doesn't why
bother having that argument right now?
Once we somehow end up in a possible world where we apparently have
decided it's a great idea to copy 2GB argument structures for a syscall
into or from the kernel we can start optimizing the hell out of this.
Before that and especially with current callers I honestly doubt it
matters whether we use memchr_inv() or while() {get_user()} loops.

Christian

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 07:28:01PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> 
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
> 
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)?  Why push the random policy into the place where
> it doesn't belong?

Ah, the "not in the helper part" makes sense.
As long as leave the check for the callers themselves.

> 
> Seriously, what's the point?  If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 07:07:50PM +0100, Al Viro wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +   const char zeros[BUFFER_SIZE] = {};
> > +   while (s > 0) {
> > +   size_t n = min(s, sizeof(zeros));
> > +
> > +   if (__copy_to_user(p, zeros, n))
> > +   return -EFAULT;
> > +
> > +   p += n;
> > +   s -= n;
> > +   }
> > +   return 0;
> > +}
> 
> That's called clear_user().
> 
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > +   const void *src, size_t ksize)
> > +{
> > +   size_t size = min(ksize, usize);
> > +   size_t rest = abs(ksize - usize);
> > +
> > +   if (unlikely(usize > PAGE_SIZE))
> > +   return -EFAULT;
> 
> Why?

Because every caller of that function right now has that limit set
anyway iirc. So we can either remove it from here and place it back for
the individual callers or leave it in the helper.
Also, I'm really asking, why not? Is it unreasonable to have an upper
bound on the size (for a long time probably) or are you disagreeing with
PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
bpf, and clone3 and in a few other places.

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 09:27:18PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Christian Brauner  wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > A common pattern for syscall extensions is increasing the size of a
> > > struct passed from userspace, such that the zero-value of the new fields
> > > result in the old kernel behaviour (allowing for a mix of userspace and
> > > kernel vintages to operate on one another in most cases). This is done
> > > in both directions -- hence two helpers -- though it's more common to
> > > have to copy user space structs into kernel space.
> > > 
> > > Previously there was no common lib/ function that implemented
> > > the necessary extension-checking semantics (and different syscalls
> > > implemented them slightly differently or incompletely[1]). A future
> > > patch replaces all of the common uses of this pattern to use the new
> > > copy_struct_{to,from}_user() helpers.
> > > 
> > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > >  always rejects differently-sized struct arguments.
> > > 
> > > Suggested-by: Rasmus Villemoes 
> > > Signed-off-by: Aleksa Sarai 
> > 
> > I would probably split this out into a separate patchset. It can very
> > well go in before openat2(). Thoughts?
> 
> Yeah, I'll split this and the related patches out -- though I will admit
> I'm not sure how you're supposed to deal with multiple independent
> patchsets that depend on each other. How will folks reviewing openat2(2)
> know to include the lib/struct_user.c changes?

The way I usually deal with this is to make two branches. One with the
changes the other depends on and then merge this branch into the other
and put the changes on top. Then you can provide a complete branch that
people can test when you send the patchset out by just linking to it in
the cover letter.
(But if it's too much hazzle just leave it.)

> 
> Also, whose tree should it go through?

If people think splitting it out makes sense and we can settle the
technical details I can take it and let it stew in linux-next at least
for a little while.
I have changes to clone3() in there that touch
copy_clone_args_from_user() anyway and there are tests for clone3()
struct copying so we'd catch regressions (for clone3() at least) pretty
quickly.
If we don't see any major issues in the next two weeks it might even be
ok to send for 5.4.

Christian

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 01:17:38PM +0200, Rasmus Villemoes wrote:
> On 05/09/2019 13.05, Christian Brauner wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> 
> >> +  if (unlikely(!access_ok(dst, usize)))
> >> +  return -EFAULT;
> >> +
> >> +  /* Deal with trailing bytes. */
> >> +  if (usize < ksize) {
> >> +  if (memchr_inv(src + size, 0, rest))
> >> +  return -EFBIG;
> >> +  } else if (usize > ksize) {
> >> +  if (__memzero_user(dst + size, rest))
> >> +  return -EFAULT;
> > 
> > Is zeroing that memory really our job? Seems to me we should just check
> > it is zeroed.
> 
> Of course it is, otherwise you'd require userspace to clear the output
> buffer it gives us, which in the majority of cases is wasted work. It's
> much easier to reason about if we just say "the kernel populates [uaddr,
> uaddr + usize)".

I don't really mind either way so sure. :)

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
> 
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
> 
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
>  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
>  always rejects differently-sized struct arguments.
> 
> Suggested-by: Rasmus Villemoes 
> Signed-off-by: Aleksa Sarai 

I would probably split this out into a separate patchset. It can very
well go in before openat2(). Thoughts?

Christian

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> result in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
> 
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
> 
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
>  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
>  always rejects differently-sized struct arguments.
> 
> Suggested-by: Rasmus Villemoes 
> Signed-off-by: Aleksa Sarai 
> ---
>  include/linux/uaccess.h |   5 ++
>  lib/Makefile|   2 +-
>  lib/struct_user.c   | 182 
>  3 files changed, 188 insertions(+), 1 deletion(-)
>  create mode 100644 lib/struct_user.c
> 
> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> index 34a038563d97..0ad9544a1aee 100644
> --- a/include/linux/uaccess.h
> +++ b/include/linux/uaccess.h
> @@ -230,6 +230,11 @@ static inline unsigned long 
> __copy_from_user_inatomic_nocache(void *to,
>  
>  #endif   /* ARCH_HAS_NOCACHE_UACCESS */
>  
> +extern int copy_struct_to_user(void __user *dst, size_t usize,
> +const void *src, size_t ksize);
> +extern int copy_struct_from_user(void *dst, size_t ksize,
> +  const void __user *src, size_t usize);
> +
>  /*
>   * probe_kernel_read(): safely attempt to read from a location
>   * @dst: pointer to the buffer that shall take the data
> diff --git a/lib/Makefile b/lib/Makefile
> index 29c02a924973..d86c71feaf0a 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -28,7 +28,7 @@ endif
>  CFLAGS_string.o := $(call cc-option, -fno-stack-protector)
>  endif
>  
> -lib-y := ctype.o string.o vsprintf.o cmdline.o \
> +lib-y := ctype.o string.o struct_user.o vsprintf.o cmdline.o \
>rbtree.o radix-tree.o timerqueue.o xarray.o \
>idr.o extable.o \
>sha1.o chacha.o irq_regs.o argv_split.o \
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index ..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
> +
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst:   Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src:   Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> space.
> + * The recommended usage is something like the following:
> + *
> + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + *   {
> + *  int err;
> + *  struct foo karg = {};
> + *
> + *  // do something with karg
> + *
> + *  err = copy_struct_to_user(uarg, usize, , sizeof(karg));
> + *  if (err)
> + *return err;
> + *
> + *  // ...
> + *   }
> + *
> + * There are three cases to consider:
> + *  * If @usize == @ksize, then it's copied verbatim.
> + *  * If @usize < @ksize, then kernel space is "returning" a newer struct to 
> an
> + *older user space. In order to avoid user space getting incomplete
> + *information (new fields might be important), all trailing bytes in @src
> + *(@ksize - @usize) must be zerored, otherwise -EFBIG is returned.
> + *  * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + *newer

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-07 Thread Christian Brauner

On Thu, Sep 05, 2019 at 07:50:26PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Rasmus Villemoes  wrote:
> > On 04/09/2019 22.19, Aleksa Sarai wrote:
> > > A common pattern for syscall extensions is increasing the size of a
> > > struct passed from userspace, such that the zero-value of the new fields
> > > result in the old kernel behaviour (allowing for a mix of userspace and
> > > kernel vintages to operate on one another in most cases). This is done
> > > in both directions -- hence two helpers -- though it's more common to
> > > have to copy user space structs into kernel space.
> > > 
> > > Previously there was no common lib/ function that implemented
> > > the necessary extension-checking semantics (and different syscalls
> > > implemented them slightly differently or incompletely[1]). A future
> > > patch replaces all of the common uses of this pattern to use the new
> > > copy_struct_{to,from}_user() helpers.
> > > 
> > > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do do
> > >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> > >  always rejects differently-sized struct arguments.
> > > 
> > > Suggested-by: Rasmus Villemoes 
> > > Signed-off-by: Aleksa Sarai 
> > > ---
> > > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > > new file mode 100644
> > > index ..7301ab1bbe98
> > > --- /dev/null
> > > +++ b/lib/struct_user.c
> > > @@ -0,0 +1,182 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * Copyright (C) 2019 SUSE LLC
> > > + * Copyright (C) 2019 Aleksa Sarai 
> > > + */
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +#define BUFFER_SIZE 64
> > > +
> > > +/*
> > > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > > already
> > > + * checked access_ok(p, size).
> > > + */
> > 
> > Isn't this __clear_user() exactly (perhaps except for the return value)?
> > Perhaps not every arch has that?
> 
> I didn't know about clear_user() -- I will switch to it.
> 
> > > +static int __memzero_user(void __user *p, size_t s)
> > > +{
> > > + const char zeros[BUFFER_SIZE] = {};
> > > + while (s > 0) {
> > > + size_t n = min(s, sizeof(zeros));
> > > +
> > > + if (__copy_to_user(p, zeros, n))
> > > + return -EFAULT;
> > > +
> > > + p += n;
> > > + s -= n;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +/**
> > > + * copy_struct_to_user: copy a struct to user space
> > > + * @dst:   Destination address, in user space.
> > > + * @usize: Size of @dst struct.
> > > + * @src:   Source address, in kernel space.
> > > + * @ksize: Size of @src struct.
> > > + *
> > > + * Returns (in all cases, some data may have been copied):
> > > + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes 
> > > in @src.
> > > + *  * -EFAULT: access to user space failed.
> > > + */
> > > +int copy_struct_to_user(void __user *dst, size_t usize,
> > > + const void *src, size_t ksize)
> > > +{
> > > + size_t size = min(ksize, usize);
> > > + size_t rest = abs(ksize - usize);
> > 
> > Eh, I'd avoid abs() here due to the funkiness of the implicit type
> > conversions - ksize-usize has type size_t, then that's coerced to an int
> > (or a long maybe?), the abs is applied which return an int/long (or
> > unsigned versions?). Something like "rest = max(ksize, usize) - size;"
> > is more obviously correct and doesn't fall into any
> > narrowing/widening/sign extending traps.
> 
> Yeah, I originally used "max(ksize, usize) - size" for that reason but
> was worried it looked too funky (and some quick tests showed that abs()
> gives the right results in most cases -- though I just realised it would
> probably not give the right results around SIZE_MAX). I'll switch back.
> 
> > > + if (unlikely(usize > PAGE_SIZE))
> > > + return -EFAULT;
> > 
> > Please don't. That is a restriction on all future extensions - once a
> > kernel is shipped with a syscall using this helper with that arbitrary
> > restriction in place, that syscall is forever prevented from extending
> > its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
> > it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
> > weren't enough for everybody?
> >
> > This is only for future compatibility, and if someone runs an app
> > compiled against 7.3 headers on a 5.4 kernel, they probably don't care
> > about performance, but they would like their app to run.
> 
> I'm not sure I agree that the limit is in place *forever* -- it's
> generally not a break in compatibility to convert an error into a
> success (though, there are counterexamples such as mknod(2) -- but that
> was a very specific case).
> 
> You're right that it would mean that some very new code won't run on
> very ancient kernels (assuming we ever pass around structs that
> massive), but there should be a reasonable trade-off here IMHO.

[PATCH 3/5] arch: wire-up pidfd_wait()

2019-07-24 Thread Christian Brauner

This wires up the pidfd_wait() syscall into all arches at once.

Signed-off-by: Christian Brauner 
Cc: Arnd Bergmann 
Cc: "Eric W. Biederman" 
Cc: Kees Cook 
Cc: Joel Fernandes (Google) 
Cc: Thomas Gleixner 
Cc: Jann Horn 
Cc: David Howells 
Cc: Andy Lutomirsky 
Cc: Andrew Morton 
Cc: Oleg Nesterov 
Cc: Aleksa Sarai 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: linux-...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux-a...@vger.kernel.org
Cc: x...@kernel.org
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h   | 4 +++-
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/linux/syscalls.h| 4 
 include/uapi/asm-generic/unistd.h   | 4 +++-
 20 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 728fe028c02c..ca3e593f0c7a 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -475,3 +475,4 @@
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
 # 545 reserved for clone3
+548common  pidfd_wait  sys_pidfd_wait
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 6da7dc4d79cc..5e448d915b2f 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -449,3 +449,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 435common  clone3  sys_clone3
+438common  pidfd_wait  sys_pidfd_wait
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 2629a68b8724..b722e47377a5 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   436
+#define __NR_compat_syscalls   439
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index 94ab29cf4f00..ca77c9d4f7a1 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -877,7 +877,9 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
 __SYSCALL(__NR_fspick, sys_fspick)
 #define __NR_pidfd_open 434
 __SYSCALL(__NR_pidfd_open, sys_pidfd_open)
-#define __NR_clone3 435
+#define __NR_pidfd_wait 438
+__SYSCALL(__NR_pidfd_wait, sys_pidfd_wait)
+#define __NR_clone3 439
 __SYSCALL(__NR_clone3, sys_clone3)
 
 /*
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index 36d5faf4c86c..f038afaced9b 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -356,3 +356,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 # 435 reserved for clone3
+438common  pidfd_wait  sys_pidfd_wait
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index a88a285a0e5f..51f86f7b4cec 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -435,3 +435,4 @@
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
 # 435 reserved for clone3
+438common  pidfd_wait  sys_pidfd_wait
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl 
b/arch/microblaze/kernel/syscalls/syscall.tbl
index 09b0cd7dab0a..24f912ac5dfa 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 433comm

Re: [PATCH v2] powerpc: Wire up clone3 syscall

2019-07-24 Thread Christian Brauner

On Thu, Jul 25, 2019 at 12:02:59AM +1000, Michael Ellerman wrote:
> Wire up the new clone3 syscall added in commit 7f192e3cd316 ("fork:
> add clone3").
> 
> This requires a ppc_clone3 wrapper, in order to save the non-volatile
> GPRs before calling into the generic syscall code. Otherwise we hit
> the BUG_ON in CHECK_FULL_REGS in copy_thread().
> 
> Lightly tested using Christian's test code on a Power8 LE VM.
> 
> Signed-off-by: Michael Ellerman 

Acked-by: Christian Brauner 

> ---
>  arch/powerpc/include/asm/unistd.h| 1 +
>  arch/powerpc/kernel/entry_32.S   | 8 
>  arch/powerpc/kernel/entry_64.S   | 5 +
>  arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
>  4 files changed, 15 insertions(+)
> 
> v2: Add the wrapper for 32-bit as well, don't allow SPU programs to call
> clone3 (switch ABI to nospu).
> 
> v1: https://lore.kernel.org/r/20190722132231.10169-1-...@ellerman.id.au
> 
> diff --git a/arch/powerpc/include/asm/unistd.h 
> b/arch/powerpc/include/asm/unistd.h
> index 68473c3c471c..b0720c7c3fcf 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -49,6 +49,7 @@
>  #define __ARCH_WANT_SYS_FORK
>  #define __ARCH_WANT_SYS_VFORK
>  #define __ARCH_WANT_SYS_CLONE
> +#define __ARCH_WANT_SYS_CLONE3
>  
>  #endif   /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_UNISTD_H_ */
> diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
> index 85fdb6d879f1..54fab22c9a43 100644
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -597,6 +597,14 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
>   stw r0,_TRAP(r1)/* register set saved */
>   b   sys_clone
>  
> + .globl  ppc_clone3
> +ppc_clone3:
> + SAVE_NVGPRS(r1)
> + lwz r0,_TRAP(r1)
> + rlwinm  r0,r0,0,0,30/* clear LSB to indicate full */
> + stw r0,_TRAP(r1)/* register set saved */
> + b   sys_clone3
> +
>   .globl  ppc_swapcontext
>  ppc_swapcontext:
>   SAVE_NVGPRS(r1)
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index d9105fcf4021..0a0b5310f54a 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -487,6 +487,11 @@ _GLOBAL(ppc_clone)
>   bl  sys_clone
>   b   .Lsyscall_exit
>  
> +_GLOBAL(ppc_clone3)
> +   bl  save_nvgprs
> +   bl  sys_clone3
> +   b   .Lsyscall_exit
> +
>  _GLOBAL(ppc32_swapcontext)
>   bl  save_nvgprs
>   bl  compat_sys_swapcontext
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
> b/arch/powerpc/kernel/syscalls/syscall.tbl
> index f2c3bda2d39f..43f736ed47f2 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -516,3 +516,4 @@
>  432  common  fsmount sys_fsmount
>  433  common  fspick  sys_fspick
>  434  common  pidfd_open  sys_pidfd_open
> +435  nospu   clone3  ppc_clone3
> -- 
> 2.20.1
>

Re: [PATCH] powerpc: Wire up clone3 syscall

2019-07-24 Thread Christian Brauner

On Wed, Jul 24, 2019 at 12:25:14PM +0700, Arseny Solokha wrote:
> Hi,
> 
> may I also ask to provide ppc_clone3 symbol also for 32-bit powerpc? Otherwise
> Michael's patch breaks build for me:

Makes sense. Michael, are you planning on picking this up? :)

Christian

> 
>   powerpc-e500v2-linux-gnuspe-ld: arch/powerpc/kernel/systbl.o: in function 
> `sys_call_table':
>   (.rodata+0x6cc): undefined reference to `ppc_clone3'
>   make: *** [Makefile:1060: vmlinux] Error 1
> 
> The patch was tested using Christian's program on a real e500 machine.
> 
> --- a/arch/powerpc/kernel/entry_32.S
> +++ b/arch/powerpc/kernel/entry_32.S
> @@ -597,6 +597,14 @@ ppc_clone:
>   stw r0,_TRAP(r1)/* register set saved */
>   b   sys_clone
> 
> + .globl  ppc_clone3
> +ppc_clone3:
> + SAVE_NVGPRS(r1)
> + lwz r0,_TRAP(r1)
> + rlwinm  r0,r0,0,0,30/* clear LSB to indicate full */
> + stw r0,_TRAP(r1)/* register set saved */
> + b   sys_clone3
> +
>   .globl  ppc_swapcontext
>  ppc_swapcontext:
>   SAVE_NVGPRS(r1)
> 
> I don't think this trivial hunk deserves a separate patch submission.
> 
> Thanks,
> Arseny

Re: [PATCH] powerpc: Wire up clone3 syscall

2019-07-22 Thread Christian Brauner

On Mon, Jul 22, 2019 at 11:22:31PM +1000, Michael Ellerman wrote:
> Wire up the new clone3 syscall added in commit 7f192e3cd316 ("fork:
> add clone3").
> 
> This requires a ppc_clone3 wrapper, in order to save the non-volatile
> GPRs before calling into the generic syscall code. Otherwise we hit
> the BUG_ON in CHECK_FULL_REGS in copy_thread().
> 
> Lightly tested using Christian's test code on a Power8 LE VM.
> 
> Signed-off-by: Michael Ellerman 

Thank you, Michael!

One comment below, otherwise:

Acked-by: Christian Brauner 

> ---
>  arch/powerpc/include/asm/unistd.h| 1 +
>  arch/powerpc/kernel/entry_64.S   | 5 +
>  arch/powerpc/kernel/syscalls/syscall.tbl | 1 +
>  3 files changed, 7 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/unistd.h 
> b/arch/powerpc/include/asm/unistd.h
> index 68473c3c471c..b0720c7c3fcf 100644
> --- a/arch/powerpc/include/asm/unistd.h
> +++ b/arch/powerpc/include/asm/unistd.h
> @@ -49,6 +49,7 @@
>  #define __ARCH_WANT_SYS_FORK
>  #define __ARCH_WANT_SYS_VFORK
>  #define __ARCH_WANT_SYS_CLONE
> +#define __ARCH_WANT_SYS_CLONE3
>  
>  #endif   /* __ASSEMBLY__ */
>  #endif /* _ASM_POWERPC_UNISTD_H_ */
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index d9105fcf4021..0a0b5310f54a 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -487,6 +487,11 @@ _GLOBAL(ppc_clone)
>   bl  sys_clone
>   b   .Lsyscall_exit
>  
> +_GLOBAL(ppc_clone3)
> +   bl  save_nvgprs
> +   bl  sys_clone3
> +   b   .Lsyscall_exit
> +
>  _GLOBAL(ppc32_swapcontext)
>   bl  save_nvgprs
>   bl  compat_sys_swapcontext
> diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
> b/arch/powerpc/kernel/syscalls/syscall.tbl
> index f2c3bda2d39f..6886ecb590d5 100644
> --- a/arch/powerpc/kernel/syscalls/syscall.tbl
> +++ b/arch/powerpc/kernel/syscalls/syscall.tbl
> @@ -516,3 +516,4 @@
>  432  common  fsmount sys_fsmount
>  433  common  fspick  sys_fspick
>  434  common  pidfd_open  sys_pidfd_open
> +435  common  clone3  ppc_clone3

Note that in v5.3-rc1 there's now a comment that 435 is reserved in
there. So this will likely cause a merge conflict. You might want to
base your change off of v5.3-rc1 instead to avoid that. :)

So basically:

>From 10b2e4176d712e45c7cb22af4aed4ce09818785c Mon Sep 17 00:00:00 2001
From: Michael Ellerman 
Date: Mon, 22 Jul 2019 23:22:31 +1000
Subject: [PATCH] powerpc: Wire up clone3 syscall

Wire up the new clone3 syscall added in commit 7f192e3cd316 ("fork:
add clone3").

This requires a ppc_clone3 wrapper, in order to save the non-volatile
GPRs before calling into the generic syscall code. Otherwise we hit
the BUG_ON in CHECK_FULL_REGS in copy_thread().

Lightly tested using Christian's test code on a Power8 LE VM.

Signed-off-by: Michael Ellerman 
Acked-by: Christian Brauner 
Link: https://lore.kernel.org/r/20190722132231.10169-1-...@ellerman.id.au
---
 arch/powerpc/include/asm/unistd.h| 1 +
 arch/powerpc/kernel/entry_64.S   | 5 +
 arch/powerpc/kernel/syscalls/syscall.tbl | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/unistd.h 
b/arch/powerpc/include/asm/unistd.h
index 68473c3c471c..b0720c7c3fcf 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -49,6 +49,7 @@
 #define __ARCH_WANT_SYS_FORK
 #define __ARCH_WANT_SYS_VFORK
 #define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_SYS_CLONE3
 
 #endif /* __ASSEMBLY__ */
 #endif /* _ASM_POWERPC_UNISTD_H_ */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index d9105fcf4021..0a0b5310f54a 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -487,6 +487,11 @@ _GLOBAL(ppc_clone)
bl  sys_clone
b   .Lsyscall_exit
 
+_GLOBAL(ppc_clone3)
+   bl  save_nvgprs
+   bl  sys_clone3
+   b   .Lsyscall_exit
+
 _GLOBAL(ppc32_swapcontext)
bl  save_nvgprs
bl  compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
b/arch/powerpc/kernel/syscalls/syscall.tbl
index 3331749aab20..6886ecb590d5 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -516,4 +516,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
-# 435 reserved for clone3
+435common  clone3  ppc_clone3
-- 
2.22.0

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 09:13:16PM +1000, Michael Ellerman wrote:
> Christian Brauner  writes:
> > On Fri, Jul 19, 2019 at 08:18:02PM +1000, Michael Ellerman wrote:
> >> Christian Brauner  writes:
> >> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> >> >> I think Vasily already has a clone3 patch for s390x with 435. 
> >> >
> >> > A quick follow-up on this. Helge and Michael have asked whether there
> >> > are any tests for clone3. Yes, there will be and I try to have them
> >> > ready by the end of the this or next week for review. In the meantime I
> >> > hope the following minimalistic test program that just verifies very
> >> > very basic functionality (It's not pretty.) will help you test:
> >> 
> >> Hi Christian,
> >> 
> >> Thanks for the test.
> >> 
> >> This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
> >> in process.c around line 1633:
> >> 
> >>} else {
> >>/* user thread */
> >>struct pt_regs *regs = current_pt_regs();
> >>CHECK_FULL_REGS(regs);
> >>*childregs = *regs;
> >>if (usp)
> >> 
> >> 
> >> So I'll have to dig into how we fix that before we wire up clone3.
> >> 
> >> Turns out testing is good! :)
> >
> > Indeed. I have a test-suite for clone3 in mind and I hope to have it
> > ready by the end of next week. It's just always the finding the time
> > part that is annoying. :)
> 
> I know the feeling!
> 
> > Thanks for digging into this, Michael!
> 
> No worries, happy to help where I can.
> 
> In the intervening five minutes I remembered how we handle this, we just
> need a little wrapper to save the non-volatile regs:
> 
> _GLOBAL(ppc_clone3)
>   bl  save_nvgprs
>   bl  sys_clone3
>   b   .Lsyscall_exit

Sounds good.

> 
> 
> A while back I meant to make it generate those automatically based on a
> flag in the syscall.tbl but of course haven't got around to it :)
> 
> So with the above it seems all good:
> 
> $ ./clone3 ; echo $?
> Parent process received child's pid 4204 as return value
> Parent process received child's pidfd 3
> Parent process received child's pid 4204 as return argument
> Child process with pid 4204
> 0
> 
> I'll send a patch to wire it up on Monday.

Excellent! Thank you!
Christian

Re: [PATCH v9 08/10] open: openat2(2) syscall

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 05:12:18AM +0300, Dmitry V. Levin wrote:
> On Thu, Jul 18, 2019 at 11:29:50PM +0200, Arnd Bergmann wrote:
> [...]
> > 5. you get the same problem with seccomp and strace that
> >clone3() has -- these and others only track the register
> >arguments by default.
> 
> Just for the record, this is definitely not the case for strace:
> it decodes arrays, structures, netlink messages, and so on by default.

There sure is value in trying to design syscalls that can be handled
nicely by seccomp but that shouldn't become a burden on designing
extensible syscalls.
I suggested a session for Ksummit where we can discuss if and how we can
make seccomp more compatible with pointer-args in syscalls.

Christian

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-19 Thread Christian Brauner

On Fri, Jul 19, 2019 at 08:18:02PM +1000, Michael Ellerman wrote:
> Christian Brauner  writes:
> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> >> I think Vasily already has a clone3 patch for s390x with 435. 
> >
> > A quick follow-up on this. Helge and Michael have asked whether there
> > are any tests for clone3. Yes, there will be and I try to have them
> > ready by the end of the this or next week for review. In the meantime I
> > hope the following minimalistic test program that just verifies very
> > very basic functionality (It's not pretty.) will help you test:
> 
> Hi Christian,
> 
> Thanks for the test.
> 
> This actually oopses on powerpc, it hits the BUG_ON in CHECK_FULL_REGS
> in process.c around line 1633:
> 
>   } else {
>   /* user thread */
>   struct pt_regs *regs = current_pt_regs();
>   CHECK_FULL_REGS(regs);
>   *childregs = *regs;
>   if (usp)
> 
> 
> So I'll have to dig into how we fix that before we wire up clone3.
> 
> Turns out testing is good! :)

Indeed. I have a test-suite for clone3 in mind and I hope to have it
ready by the end of next week. It's just always the finding the time
part that is annoying. :)

Thanks for digging into this, Michael!
Christian

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Christian Brauner

On Tue, Jul 16, 2019 at 08:53:10PM +0200, Sven Schnelle wrote:
> Hi,
> 
> [Adding Helge to CC list]
> 
> On Tue, Jul 16, 2019 at 03:06:33PM +0200, Christian Brauner wrote:
> > On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> > > I think Vasily already has a clone3 patch for s390x with 435. 
> > 
> > A quick follow-up on this. Helge and Michael have asked whether there
> > are any tests for clone3. Yes, there will be and I try to have them
> > ready by the end of the this or next week for review. In the meantime I
> > hope the following minimalistic test program that just verifies very
> > very basic functionality (It's not pretty.) will help you test:
> > [..]
> 
> On PA-RISC this seems to work fine with Helge's patch to wire up the
> clone3 syscall.

I think I already responded to Helge before and yes, I think that parisc
doesn't do anything special for fork, vfork, clone, and by extension
also probably doesn't need to for clone3.
It should only be a problem for arches that require mucking explicitly
with arguments of clone-like syscalls.
In any case, I saw Helge's patch and I think I might've missed to add an
Acked-by but feel free to add it.

Thanks for testing it and sorry that I couldn't test!
Christian

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-16 Thread Christian Brauner

On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> I think Vasily already has a clone3 patch for s390x with 435. 

A quick follow-up on this. Helge and Michael have asked whether there
are any tests for clone3. Yes, there will be and I try to have them
ready by the end of the this or next week for review. In the meantime I
hope the following minimalistic test program that just verifies very
very basic functionality (It's not pretty.) will help you test:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x1000
#endif

#ifndef __NR_clone3
#define __NR_clone3 -1
#endif

static pid_t sys_clone3(struct clone_args *args)
{
return syscall(__NR_clone3, args, sizeof(struct clone_args));
}

static int wait_for_pid(pid_t pid)
{
int status, ret;

again:
ret = waitpid(pid, , 0);
if (ret == -1) {
if (errno == EINTR)
goto again;

return -1;
}

if (ret != pid)
goto again;

if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
return -1;

return 0;
}

#define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr)))

int main(int argc, char *argv[])
{
int pidfd = -1;
pid_t parent_tid = -1, pid = -1;
struct clone_args args = {0};

args.parent_tid = ptr_to_u64(_tid); /* CLONE_PARENT_SETTID */
args.pidfd = ptr_to_u64(); /* CLONE_PIDFD */
args.flags = CLONE_PIDFD | CLONE_PARENT_SETTID;
args.exit_signal = SIGCHLD;

pid = sys_clone3();
if (pid < 0) {
fprintf(stderr, "%s - Failed to create new process\n", 
strerror(errno));
exit(EXIT_FAILURE);
}

if (pid == 0) {
printf("Child process with pid %d\n", getpid());
exit(EXIT_SUCCESS);
}

printf("Parent process received child's pid %d as return value\n", pid);
printf("Parent process received child's pidfd %d\n", *(int 
*)args.pidfd);
printf("Parent process received child's pid %d as return argument\n",
   *(pid_t *)args.parent_tid);

if (wait_for_pid(pid))
exit(EXIT_FAILURE);

if (pid != *(pid_t *)args.parent_tid)
exit(EXIT_FAILURE);

close(pidfd);

return 0;
}

Re: [PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-15 Thread Christian Brauner

On Mon, Jul 15, 2019 at 03:56:04PM +0200, Christian Borntraeger wrote:
> I think Vasily already has a clone3 patch for s390x with 435. 

Excellent. I'll leave the # 435 reserved for clone3 on s390x in until
this patch has landed. It shouldn't be a merge conflict and if so it
should be trivial.

Christian

[PATCH 1/2] arch: mark syscall number 435 reserved for clone3

2019-07-14 Thread Christian Brauner

A while ago Arnd made it possible to give new system calls the same
syscall number on all architectures (except alpha). To not break this
nice new feature let's mark 435 for clone3 as reserved on all
architectures that do not yet implement it.
Even if an architecture does not plan to implement it this ensures that
new system calls coming after clone3 will have the same number on all
architectures.

Signed-off-by: Christian Brauner 
Cc: Arnd Bergmann 
Cc: linux-a...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
---
 arch/alpha/kernel/syscalls/syscall.tbl| 1 +
 arch/ia64/kernel/syscalls/syscall.tbl | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl   | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl  | 1 +
 arch/s390/kernel/syscalls/syscall.tbl | 1 +
 arch/sh/kernel/syscalls/syscall.tbl   | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl| 1 +
 11 files changed, 11 insertions(+)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 1db9bbcfb84e..728fe028c02c 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -474,3 +474,4 @@
 542common  fsmount sys_fsmount
 543common  fspick  sys_fspick
 544common  pidfd_open  sys_pidfd_open
+# 545 reserved for clone3
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index ecc44926737b..36d5faf4c86c 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -355,3 +355,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index 9a3eb2558568..a88a285a0e5f 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -434,3 +434,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 97035e19ad03..c9c879ec9b6d 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -373,3 +373,4 @@
 432n32 fsmount sys_fsmount
 433n32 fspick  sys_fspick
 434n32 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl 
b/arch/mips/kernel/syscalls/syscall_n64.tbl
index d7292722d3b0..bbce9159caa1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -349,3 +349,4 @@
 432n64 fsmount sys_fsmount
 433n64 fspick  sys_fspick
 434n64 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl 
b/arch/mips/kernel/syscalls/syscall_o32.tbl
index dba084c92f14..9653591428ec 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -422,3 +422,4 @@
 432o32 fsmount sys_fsmount
 433o32 fspick  sys_fspick
 434o32 pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl 
b/arch/parisc/kernel/syscalls/syscall.tbl
index 5022b9e179c2..c7aadfef5386 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -431,3 +431,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open  sys_pidfd_open
+# 435 reserved for clone3
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl 
b/arch/powerpc/kernel/syscalls/syscall.tbl
index f2c3bda2d39f..3331749aab20 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -516,3 +516,4 @@
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
 434common  pidfd_open

Re: [PATCH v2 2/2] tests: add close_range() tests

2019-05-28 Thread Christian Brauner

On Tue, May 28, 2019 at 12:33:41PM +1000, Michael Ellerman wrote:
> Christian Brauner  writes:
> > This adds basic tests for the new close_range() syscall.
> > - test that no invalid flags can be passed
> > - test that a range of file descriptors is correctly closed
> > - test that a range of file descriptors is correctly closed if there there
> >   are already closed file descriptors in the range
> > - test that max_fd is correctly capped to the current fdtable maximum
> >
> > Signed-off-by: Christian Brauner 
> > Cc: Arnd Bergmann 
> > Cc: Jann Horn 
> > Cc: David Howells 
> > Cc: Dmitry V. Levin 
> > Cc: Oleg Nesterov 
> > Cc: Linus Torvalds 
> > Cc: Florian Weimer 
> > Cc: linux-...@vger.kernel.org
> > ---
> > v1: unchanged
> > v2:
> > - Christian Brauner :
> >   - verify that close_range() correctly closes a single file descriptor
> > ---
> >  tools/testing/selftests/Makefile  |   1 +
> >  tools/testing/selftests/core/.gitignore   |   1 +
> >  tools/testing/selftests/core/Makefile |   6 +
> >  .../testing/selftests/core/close_range_test.c | 142 ++
> >  4 files changed, 150 insertions(+)
> >  create mode 100644 tools/testing/selftests/core/.gitignore
> >  create mode 100644 tools/testing/selftests/core/Makefile
> >  create mode 100644 tools/testing/selftests/core/close_range_test.c
> >
> > diff --git a/tools/testing/selftests/core/.gitignore 
> > b/tools/testing/selftests/core/.gitignore
> > new file mode 100644
> > index ..6e6712ce5817
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/.gitignore
> > @@ -0,0 +1 @@
> > +close_range_test
> > diff --git a/tools/testing/selftests/core/Makefile 
> > b/tools/testing/selftests/core/Makefile
> > new file mode 100644
> > index ..de3ae68aa345
> > --- /dev/null
> > +++ b/tools/testing/selftests/core/Makefile
> > @@ -0,0 +1,6 @@
> > +CFLAGS += -g -I../../../../usr/include/ -I../../../../include
> 
> Your second -I pulls the unexported kernel headers in, userspace
> programs shouldn't include unexported kernel headers.
> 
> It breaks the build on powerpc with eg:
> 
>   powerpc64le-linux-gnu-gcc -g -I../../../../usr/include/ 
> -I../../../../includeclose_range_test.c  -o 
> /output/kselftest/core/close_range_test
>   In file included from 
> /usr/powerpc64le-linux-gnu/include/bits/fcntl-linux.h:346,
>from /usr/powerpc64le-linux-gnu/include/bits/fcntl.h:62,
>from /usr/powerpc64le-linux-gnu/include/fcntl.h:35,
>from close_range_test.c:5:
>   ../../../../include/linux/falloc.h:13:2: error: unknown type name '__s16'
> __s16  l_type;
> ^
> 
> 
> Did you do that on purpose or just copy it from one of the other
> Makefiles? :)

I originally did that on purpose because checkpatch was yammering on
about me not having used ARRAY_SIZE(). But that include can go, you are
right.

Christian

[PATCH v3 2/3] arch: wire-up close_range()

2019-05-24 Thread Christian Brauner

This wires up the close_range() syscall into all arches at once.

Suggested-by: Arnd Bergmann 
Signed-off-by: Christian Brauner 
Reviewed-by: Oleg Nesterov 
Acked-by: Arnd Bergmann 
Cc: Jann Horn 
Cc: David Howells 
Cc: Dmitry V. Levin 
Cc: Linus Torvalds 
Cc: Al Viro 
Cc: Florian Weimer 
Cc: linux-...@vger.kernel.org
Cc: linux-al...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-i...@vger.kernel.org
Cc: linux-m...@lists.linux-m68k.org
Cc: linux-m...@vger.kernel.org
Cc: linux-par...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s...@vger.kernel.org
Cc: linux...@vger.kernel.org
Cc: sparcli...@vger.kernel.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux-a...@vger.kernel.org
Cc: x...@kernel.org
---
v1:
v2:
v3: added
- Arnd Bergmann :
  - split into two patches:
1. add close_range()
2. add syscall to all arches at once
  - bump __NR_compat_syscalls in arch/arm64/include/asm/unistd.h
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h   | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/uapi/asm-generic/unistd.h   | 4 +++-
 19 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541common  fsconfigsys_fsconfig
 542common  fsmount sys_fsmount
 543common  fspick  sys_fspick
+545common  close_range sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431common  fsconfigsys_fsconfig
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
+435common  close_range sys_close_range
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 70e6882853c0..d04eb26cfaeb 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   434
+#define __NR_compat_syscalls   436
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431common  fsconfigsys_fsconfig
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
+435common  close_range sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431common  fsconfigsys_fsconfig
 432common  fsmount sys_fsmount
 433common  fspick  sys_fspick
+435common  close_range sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl 
b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76

Re: [PATCH v3 1/2] pid: add pidfd_open()

2019-05-24 Thread Christian Brauner

On Tue, May 21, 2019 at 04:32:20PM +0200, Christian Brauner wrote:
> On Mon, May 20, 2019 at 05:56:29PM +0200, Christian Brauner wrote:
> > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > process that is created via traditional fork()/clone() calls that is only
> > referenced by a PID:
> > 
> > int pidfd = pidfd_open(1234, 0);
> > ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
> > 
> > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > created pidfds at process creation time.
> > However, a lot of processes get created with traditional PID-based calls
> > such as fork() or clone() (without CLONE_PIDFD). For these processes a
> > caller can currently not create a pollable pidfd. This is a problem for
> > Android's low memory killer (LMK) and service managers such as systemd.
> > Both are examples of tools that want to make use of pidfds to get reliable
> > notification of process exit for non-parents (pidfd polling) and race-free
> > signal sending (pidfd_send_signal()). They intend to switch to this API for
> > process supervision/management as soon as possible. Having no way to get
> > pollable pidfds from PID-only processes is one of the biggest blockers for
> > them in adopting this api. With pidfd_open() making it possible to retrieve
> > pidfds for PID-based processes we enable them to adopt this api.
> > 
> > In line with Arnd's recent changes to consolidate syscall numbers across
> > architectures, I have added the pidfd_open() syscall to all architectures
> > at the same time.
> > 
> > Signed-off-by: Christian Brauner 
> > Reviewed-by: Oleg Nesterov 
> 
> This now also carries a Reviewed-by from David.
> 
> > Acked-by: Arnd Bergmann 
> > Cc: "Eric W. Biederman" 
> > Cc: Kees Cook 
> > Cc: Joel Fernandes (Google) 
> > Cc: Thomas Gleixner 
> > Cc: Jann Horn 
> > Cc: David Howells 
> > Cc: Andy Lutomirsky 
> > Cc: Andrew Morton 
> > Cc: Aleksa Sarai 
> > Cc: Linus Torvalds 
> > Cc: Al Viro 
> > Cc: linux-...@vger.kernel.org
> 
> I've moved pidfd_open() into my for-next branch together with Joel's
> pidfd polling changes. Everything is based on v5.2-rc1.
> 
> The chosen syscall number for now is 434. David is going to send out
> another pile of mount api related syscalls. I'll coordinate with him
> accordingly prior to the 5.3 merge window.

After talking to Arnd, I split the syscall addition and the per-arch
wiring-up of pidfd_open() into two patches. There are no functional
changes and everything is still sitting in for-next.

Thanks!
Christian

Re: [PATCH v1 1/2] open: add close_range()

2019-05-24 Thread Christian Brauner

On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:> This adds the close_range() 
> syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >  /* that exec is sensitive */
> >  unshare(CLONE_FILES);
> >  /* we don't want anything past stderr here */
> >  close_range(3, ~0U);
> >  execve();
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> >   task is multi-threaded and shares the file descriptor table with another
> >   thread in which case two threads could race with one thread allocating
> >   file descriptors and the other one closing them via close_range(). For the
> >   general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc//fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >- Python (cf. [2])
> >- Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >  int dir_fd;
> >  DIR *dir;
> >  struct dirent *direntp;
> >
> >  dir = opendir("/proc/self/fd");
> >  if (!dir)
> >  return -1;
> >  dir_fd = dirfd(dir);
> >  while ((direntp = readdir(dir))) {
> >  int fd;
> >  if (strcmp(direntp->d_name, ".") == 0)
> >  continue;
> >  if (strcmp(direntp->d_name, "..") == 0)
> >  continue;
> >  fd = atoi(direntp->d_name);
> >  if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >  continue;
> >  close(fd);
> >  }
> >  closedir(dir);
> >  return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> > - close_all_fds(): ~280 us
> > - close_range():~24 us
> >
> > 2. closing 1000 open files:
> > - close_all_fds(): ~5000 us
> > - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> &

1 2 >

1 - 100 of 140 matches

Mail list logo