Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-01 Thread Willem de Bruijn
Maciej Fijalkowski wrote:
> On Wed, Feb 28, 2024 at 07:05:56PM +0800, Yunjian Wang wrote:
> > This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
> > which can significantly reduce CPU utilization for XDP programs.
> 
> Why no Rx ZC support though? What will happen if I try rxdrop xdpsock
> against tun with this patch? You clearly allow for that.

This is AF_XDP receive zerocopy, right?

The naming is always confusing with tun, but even though from a tun
PoV this happens on ndo_start_xmit, it is the AF_XDP equivalent to
tun_put_user.

So the implementation is more like other devices' Rx ZC.

I would have preferred that name, but I think Jason asked for this
and, given tun's weird status, there is something to be said for either.



Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-24 Thread Willem de Bruijn
Yunjian Wang wrote:
> Now the zero-copy feature of AF_XDP socket is supported by some
> drivers, which can reduce CPU utilization on the xdp program.
> This patch set allows tun to support AF_XDP Rx zero-copy feature.
> 
> This patch tries to address this by:
> - Use peek_len to consume an xsk->desc and get the xsk->desc length.
> - When the tun supports AF_XDP Rx zero-copy, the vq's array may be empty.
> So add a check for empty vq's array in vhost_net_buf_produce().
> - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> - add tun_put_user_desc function to copy the Rx data to VM
> 
> Signed-off-by: Yunjian Wang 

I don't fully understand the higher level design of this feature yet.

But some initial comments at the code level.

> ---
>  drivers/net/tun.c   | 165 +++-
>  drivers/vhost/net.c |  18 +++--
>  2 files changed, 176 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index afa5497f7c35..248b0f8e07d1 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -77,6 +77,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -145,6 +146,10 @@ struct tun_file {
>   struct tun_struct *detached;
>   struct ptr_ring tx_ring;
>   struct xdp_rxq_info xdp_rxq;
> + struct xdp_desc desc;
> + /* protects xsk pool */
> + spinlock_t pool_lock;
> + struct xsk_buff_pool *pool;
>  };
>  
>  struct tun_page {
> @@ -208,6 +213,8 @@ struct tun_struct {
>   struct bpf_prog __rcu *xdp_prog;
>   struct tun_prog __rcu *steering_prog;
>   struct tun_prog __rcu *filter_prog;
> + /* tracks AF_XDP ZC enabled queues */
> + unsigned long *af_xdp_zc_qps;
>   struct ethtool_link_ksettings link_ksettings;
>   /* init args */
>   struct file *file;
> @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun, struct file 
> *file,
>  
>   tfile->queue_index = tun->numqueues;
>   tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
> + tfile->desc.len = 0;
> + tfile->pool = NULL;
>  
>   if (tfile->detached) {
>   /* Re-attach detached tfile, updating XDP queue_index */
> @@ -989,6 +998,13 @@ static int tun_net_init(struct net_device *dev)
>   return err;
>   }
>  
> + tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, GFP_KERNEL);
> + if (!tun->af_xdp_zc_qps) {
> + security_tun_dev_free_security(tun->security);
> + free_percpu(dev->tstats);
> + return -ENOMEM;
> + }
> +
>   tun_flow_init(tun);
>  
>   dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> @@ -1009,6 +1025,7 @@ static int tun_net_init(struct net_device *dev)
>   tun_flow_uninit(tun);
>   security_tun_dev_free_security(tun->security);
>   free_percpu(dev->tstats);
> + bitmap_free(tun->af_xdp_zc_qps);

Please release state in inverse order of acquire.

>   return err;
>   }
>   return 0;
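For illustration, given the allocation order in this function (tstats,
security, the zc bitmap, then flow state), the unwind would read:

        tun_flow_uninit(tun);
        bitmap_free(tun->af_xdp_zc_qps);
        security_tun_dev_free_security(tun->security);
        free_percpu(dev->tstats);
        return err;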
> @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, struct 
> bpf_prog *prog,
>   return 0;
>  }
>  
> +static int tun_xsk_pool_enable(struct net_device *netdev,
> +struct xsk_buff_pool *pool,
> +u16 qid)
> +{
> + struct tun_struct *tun = netdev_priv(netdev);
> + struct tun_file *tfile;
> + unsigned long flags;
> +
> + rcu_read_lock();
> + tfile = rtnl_dereference(tun->tfiles[qid]);
> + if (!tfile) {
> + rcu_read_unlock();
> + return -ENODEV;
> + }

No need for rcu_read_lock with rtnl_dereference.

Consider ASSERT_RTNL() if unsure whether this patch could be reached
without the rtnl held.
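A minimal sketch of that shape (assuming this is only reachable via
ndo_bpf, which runs under the rtnl):

        static int tun_xsk_pool_enable(struct net_device *netdev,
                                       struct xsk_buff_pool *pool, u16 qid)
        {
                struct tun_struct *tun = netdev_priv(netdev);
                struct tun_file *tfile;
                unsigned long flags;

                ASSERT_RTNL();  /* documents the locking assumption */

                tfile = rtnl_dereference(tun->tfiles[qid]);
                if (!tfile)
                        return -ENODEV;

                spin_lock_irqsave(&tfile->pool_lock, flags);
                xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
                tfile->pool = pool;
                spin_unlock_irqrestore(&tfile->pool_lock, flags);

                set_bit(qid, tun->af_xdp_zc_qps);
                return 0;
        }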

> +
> + spin_lock_irqsave(&tfile->pool_lock, flags);
> + xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
> + tfile->pool = pool;
> + spin_unlock_irqrestore(&tfile->pool_lock, flags);
> +
> + rcu_read_unlock();
> + set_bit(qid, tun->af_xdp_zc_qps);

What are the concurrency semantics: there's a spinlock to make
the update to xdp_rxq and pool a critical section, but the bitmap
is not part of this? Please also then document why the irqsave.
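If the bitmap is meant to be part of the same critical section, one
option (a sketch, not a requirement) is to publish both together:

        spin_lock_irqsave(&tfile->pool_lock, flags);
        xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
        tfile->pool = pool;
        set_bit(qid, tun->af_xdp_zc_qps);  /* visible together with the pool */
        spin_unlock_irqrestore(&tfile->pool_lock, flags);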

> +
> + return 0;
> +}
> +
> +static int tun_xsk_pool_disable(struct net_device *netdev, u16 qid)
> +{
> + struct tun_struct *tun = netdev_priv(netdev);
> + struct tun_file *tfile;
> + unsigned long flags;
> +
> + if (!test_bit(qid, tun->af_xdp_zc_qps))
> + return 0;
> +
> + clear_bit(qid, tun->af_xdp_zc_qps);

Time of check to time of use race between test and clear? Or is
there no race because anything that clears will hold the RTNL? If so,
please add a comment.
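If the rtnl is what serializes enable against disable, an atomic
test-and-clear (or a comment) would make that explicit, e.g.:

        /* enable/disable are serialized by the rtnl via ndo_bpf */
        if (!test_and_clear_bit(qid, tun->af_xdp_zc_qps))
                return 0;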

> +
> + rcu_read_lock();
> + tfile = rtnl_dereference(tun->tfiles[qid]);
> + if (!tfile) {
> + rcu_read_unlock();
> + return 0;
> + }
> +
> + spin_lock_irqsave(&tfile->pool_lock, flags);
> + if (tfile->desc.len) {
> + 

Re: [PATCH net-next v3 3/3] net: add netmem_ref to skb_frag_t

2023-12-21 Thread Willem de Bruijn
Mina Almasry wrote:
> Use netmem_ref instead of page in skb_frag_t. Currently netmem_ref
> is always a struct page underneath, but the abstraction allows efforts
> to add support for skb frags not backed by pages.
> 
> There is unfortunately 1 instance where the skb_frag_t is assumed to be
> a bio_vec in kcm. For this case, add a debug assert that the skb frag is
> indeed backed by a page, and do a cast.
> 
> Add skb[_frag]_fill_netmem_*() and skb_add_rx_frag_netmem() helpers so
> that the API can be used to create netmem skbs.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v3:
> - Renamed the fields in skb_frag_t.
> 
> v2:
> - Add skb frag filling helpers.
> 
> ---
>  include/linux/skbuff.h | 92 +-
>  net/core/skbuff.c  | 22 +++---
>  net/kcm/kcmsock.c  | 10 -
>  3 files changed, 89 insertions(+), 35 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 7ce38874dbd1..729c95e97be1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -37,6 +37,7 @@
>  #endif
>  #include 
>  #include 
> +#include 
>  
>  /**
>   * DOC: skb checksums
> @@ -359,7 +360,11 @@ extern int sysctl_max_skb_frags;
>   */
>  #define GSO_BY_FRAGS 0xFFFF
>  
> -typedef struct bio_vec skb_frag_t;
> +typedef struct skb_frag {
> + netmem_ref netmem;
> + unsigned int len;
> + unsigned int offset;
> +} skb_frag_t;
>  
>  /**
>   * skb_frag_size() - Returns the size of a skb fragment
> @@ -367,7 +372,7 @@ typedef struct bio_vec skb_frag_t;
>   */
>  static inline unsigned int skb_frag_size(const skb_frag_t *frag)
>  {
> - return frag->bv_len;
> + return frag->len;
>  }
>  
>  /**
> @@ -377,7 +382,7 @@ static inline unsigned int skb_frag_size(const skb_frag_t 
> *frag)
>   */
>  static inline void skb_frag_size_set(skb_frag_t *frag, unsigned int size)
>  {
> - frag->bv_len = size;
> + frag->len = size;
>  }
>  
>  /**
> @@ -387,7 +392,7 @@ static inline void skb_frag_size_set(skb_frag_t *frag, 
> unsigned int size)
>   */
>  static inline void skb_frag_size_add(skb_frag_t *frag, int delta)
>  {
> - frag->bv_len += delta;
> + frag->len += delta;
>  }
>  
>  /**
> @@ -397,7 +402,7 @@ static inline void skb_frag_size_add(skb_frag_t *frag, 
> int delta)
>   */
>  static inline void skb_frag_size_sub(skb_frag_t *frag, int delta)
>  {
> - frag->bv_len -= delta;
> + frag->len -= delta;
>  }
>  
>  /**
> @@ -417,7 +422,7 @@ static inline bool skb_frag_must_loop(struct page *p)
>   *   skb_frag_foreach_page - loop over pages in a fragment
>   *
>   *   @f: skb frag to operate on
> - *   @f_off: offset from start of f->bv_page
> + *   @f_off: offset from start of f->netmem
>   *   @f_len: length from f_off to loop over
>   *   @p: (temp var) current page
>   *   @p_off: (temp var) offset from start of current page,
> @@ -2431,22 +2436,37 @@ static inline unsigned int skb_pagelen(const struct 
> sk_buff *skb)
>   return skb_headlen(skb) + __skb_pagelen(skb);
>  }
>  
> +static inline void skb_frag_fill_netmem_desc(skb_frag_t *frag,
> +  netmem_ref netmem, int off,
> +  int size)
> +{
> + frag->netmem = netmem;
> + frag->offset = off;
> + skb_frag_size_set(frag, size);
> +}
> +
>  static inline void skb_frag_fill_page_desc(skb_frag_t *frag,
>  struct page *page,
>  int off, int size)
>  {
> - frag->bv_page = page;
> - frag->bv_offset = off;
> - skb_frag_size_set(frag, size);
> + skb_frag_fill_netmem_desc(frag, page_to_netmem(page), off, size);
> +}
> +
> +static inline void __skb_fill_netmem_desc_noacc(struct skb_shared_info 
> *shinfo,
> + int i, netmem_ref netmem,
> + int off, int size)
> +{
> + skb_frag_t *frag = &shinfo->frags[i];
> +
> + skb_frag_fill_netmem_desc(frag, netmem, off, size);
>  }
>  
>  static inline void __skb_fill_page_desc_noacc(struct skb_shared_info *shinfo,
> int i, struct page *page,
> int off, int size)
>  {
> - skb_frag_t *frag = &shinfo->frags[i];
> -
> - skb_frag_fill_page_desc(frag, page, off, size);
> + __skb_fill_netmem_desc_noacc(shinfo, i, page_to_netmem(page), off,
> +  size);
>  }
>  
>  /**
> @@ -2462,10 +2482,10 @@ static inline void skb_len_add(struct sk_buff *skb, 
> int delta)
>  }
>  
>  /**
> - * __skb_fill_page_desc - initialise a paged fragment in an skb
> + * __skb_fill_netmem_desc - initialise a fragment in an skb
>   * @skb: buffer containing fragment to be initialised
> - * @i: paged fragment index to initialise
> - * @page: the page to use for this fragment
> + * @i: fragment index to initialise
> + * 

Re: [PATCH net-next v3 1/3] vsock/virtio: use skb_frag_*() helpers

2023-12-21 Thread Willem de Bruijn
Mina Almasry wrote:
> Minor fix for virtio: code wanting to access the fields inside an skb
> frag should use the skb_frag_*() helpers, instead of accessing the
> fields directly. This allows for extensions where the underlying
> memory is not a page.
> 
> Signed-off-by: Mina Almasry 
> 
> ---
> 
> v2:
> 
> - Also fix skb_frag_off() + skb_frag_size() (David)
> - Did not apply the reviewed-by from Stefano since the patch changed
> relatively much.
> 
> ---
>  net/vmw_vsock/virtio_transport.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/vmw_vsock/virtio_transport.c 
> b/net/vmw_vsock/virtio_transport.c
> index f495b9e5186b..1748268e0694 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -153,10 +153,10 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>* 'virt_to_phys()' later to fill the buffer 
> descriptor.
>* We don't touch memory at "virtual" address 
> of this page.
>*/
> - va = page_to_virt(skb_frag->bv_page);
> + va = page_to_virt(skb_frag_page(skb_frag));
>   sg_init_one(sgs[out_sg],
> - va + skb_frag->bv_offset,
> - skb_frag->bv_len);
> + va + skb_frag_off(skb_frag),
> + skb_frag_size(skb_frag));
>   out_sg++;
>   }
>   }

If there are requests for further revision in the series, this virtio
cleanup can be sent on its own to get it off the stack.

> -- 
> 2.43.0.472.g3155946c3a-goog
> 





Re: [PATCH RFC 0/4] virtio-net: add tx-hash, rx-tstamp, tx-tstamp and tx-time

2023-12-19 Thread Willem de Bruijn
Jason Wang wrote:
> On Tue, Dec 19, 2023 at 12:36 AM Willem de Bruijn
>  wrote:
> >
> > Steffen Trumtrar wrote:
> > > This series tries to pick up the work on the virtio-net timestamping
> > > feature from Willem de Bruijn.
> > >
> > > Original series
> > > Message-Id: 20210208185558.995292-1-willemdebruijn.ker...@gmail.com
> > > Subject: [PATCH RFC v2 0/4] virtio-net: add tx-hash, rx-tstamp,
> > > tx-tstamp and tx-time
> > > From: Willem de Bruijn 
> > >
> > > RFC for four new features to the virtio network device:
> > >
> > > 1. pass tx flow state to host, for routing + telemetry
> > > 2. pass rx tstamp to guest, for better RTT estimation
> > > 3. pass tx tstamp to guest, idem
> > > 3. pass tx delivery time to host, for accurate pacing
> > >
> > > All would introduce an extension to the virtio spec.
> > >
> > > The original series consisted of a hack around the DMA API, which should
> > > be fixed in this series.
> > >
> > > The changes in this series are to the driver side. For the changes to 
> > > qemu see:
> > > https://github.com/strumtrar/qemu/tree/v8.1.1/virtio-net-ptp
> > >
> > > Currently only virtio-net is supported. The original series used
> > > vhost-net as backend. However, the path through tun via sendmsg doesn't
> > > allow us to write data back to the driver side without any hacks.
> > > Therefore use the way via plain virtio-net without vhost albeit better
> > > performance.
> > >
> > > Signed-off-by: Steffen Trumtrar 
> >
> > Thanks for picking this back up, Steffen. Nice to see that the code still
> > applies mostly cleanly.
> >
> > For context: I dropped the work only because I had no real device
> > implementation. The referenced patch series to qemu changes that.
> >
> > I suppose the main issue is the virtio API changes that this introduces,
> > which will have to be accepted to the spec.
> >
> > One small comment to patch 4: there I just assumed the virtual device
> > time is CLOCK_TAI. There is a concurrent feature under review for HW
> > pacing offload with AF_XDP sockets. The clock issue comes up a bit. In
> > general, for hardware we cannot assume a clock.
> 
> Any reason for this? E.g. some modern NICs have PTP support.

I meant that we cannot assume a specific clock, if aiming to offload
existing pacing (or "launch time") methods.

The issue discussed in the AF_XDP thread is whether to use CLOCK_TAI
or CLOCK_MONOTONIC. Both of which are already in use in software
pacing offload, in the ETF and FQ qdiscs, respectively.

But for virtio it may be acceptable to restrict to one clock, such as
CLOCK_REALTIME or CLOCK_TAI.

CLOCK_MONOTONIC, being boot-based, is almost certain to have an offset,
even if the clocks' rates are synchronized with phc2sys.

> > For virtio, perhaps
> > assuming the same monotonic hardware clock in guest and host can be
> > assumed.
> 
> Note that virtio can be implemented in hardware now. So we can assume
> things like the kvm ptp clock.
> 
> > But this clock alignment needs some thought.
> >
> 
> Thanks
> 





Re: [PATCH RFC 0/4] virtio-net: add tx-hash, rx-tstamp, tx-tstamp and tx-time

2023-12-18 Thread Willem de Bruijn
Steffen Trumtrar wrote:
> This series tries to pick up the work on the virtio-net timestamping
> feature from Willem de Bruijn.
> 
> Original series
> Message-Id: 20210208185558.995292-1-willemdebruijn.ker...@gmail.com
> Subject: [PATCH RFC v2 0/4] virtio-net: add tx-hash, rx-tstamp,
> tx-tstamp and tx-time
>     From: Willem de Bruijn 
> 
> RFC for four new features to the virtio network device:
> 
> 1. pass tx flow state to host, for routing + telemetry
> 2. pass rx tstamp to guest, for better RTT estimation
> 3. pass tx tstamp to guest, idem
> 3. pass tx delivery time to host, for accurate pacing
> 
> All would introduce an extension to the virtio spec.
> 
> The original series consisted of a hack around the DMA API, which should
> be fixed in this series.
> 
> The changes in this series are to the driver side. For the changes to qemu 
> see:
> https://github.com/strumtrar/qemu/tree/v8.1.1/virtio-net-ptp
> 
> Currently only virtio-net is supported. The original series used
> vhost-net as backend. However, the path through tun via sendmsg doesn't
> allow us to write data back to the driver side without any hacks.
> Therefore go via plain virtio-net without vhost, albeit giving up vhost's
> better performance.
> 
> Signed-off-by: Steffen Trumtrar 

Thanks for picking this back up, Steffen. Nice to see that the code still
applies mostly cleanly.

For context: I dropped the work only because I had no real device
implementation. The referenced patch series to qemu changes that.

I suppose the main issue is the virtio API changes that this introduces,
which will have to be accepted to the spec.

One small comment to patch 4: there I just assumed the virtual device
time is CLOCK_TAI. There is a concurrent feature under review for HW
pacing offload with AF_XDP sockets. The clock issue comes up a bit. In
general, for hardware we cannot assume a clock. For virtio, perhaps
assuming the same monotonic hardware clock in guest and host can be
assumed. But this clock alignment needs some thought.




Re: [PATCH RFC v2 1/4] virtio: fix up virtio_disable_cb

2021-04-13 Thread Willem de Bruijn
> > >
> > >
> > > but even yours is also fixed I think.
> > >
> > > The common point is that a single spurious interrupt is not a problem.
> > > The problem only exists if there are tons of spurious interrupts with no
> > > real ones. For this to trigger, we keep polling the ring and while we do
> > > device keeps firing interrupts. So just disable interrupts while we
> > > poll.
> >
> > But the main change in this patch is to turn some virtqueue_disable_cb
> > calls into no-ops.
>
> Well this was not the design. This is the main change:
>
>
> @@ -739,7 +742,10 @@ static void virtqueue_disable_cb_split(struct virtqueue 
> *_vq)
>
> if (!(vq->split.avail_flags_shadow & VRING_AVAIL_F_NO_INTERRUPT)) {
> vq->split.avail_flags_shadow |= VRING_AVAIL_F_NO_INTERRUPT;
> -   if (!vq->event)
> +   if (vq->event)
> +   /* TODO: this is a hack. Figure out a cleaner value 
> to write. */
> +   vring_used_event(&vq->split.vring) = 0x0;
> +   else
> vq->split.vring.avail->flags =
> cpu_to_virtio16(_vq->vdev,
> vq->split.avail_flags_shadow);
>
>
> IIUC previously when event index was enabled (vq->event) 
> virtqueue_disable_cb_split
> was a nop. Now it sets index to 0x0 (which is a hack, but good enough
> for testing I think).

So now tx interrupts will really be suppressed even in event-idx mode.

And what is the purpose of suppressing this operation if
event_triggered, i.e., after an interrupt occurred? You mention " if
using event index with a packed ring, and if being called from a
callback, we actually do disable interrupts which is unnecessary." Can
you elaborate? Also, even if unnecessary, does it matter? The
operation itself seems fairly cheap.

These should probably be two separate patches.

There is also a third case, split ring without event index. That
behaves more like packed ring, I suppose.


> > I don't understand how that helps reduce spurious
> > interrupts, as if anything, it keeps interrupts enabled for longer.


Re: [PATCH RFC v2 1/4] virtio: fix up virtio_disable_cb

2021-04-13 Thread Willem de Bruijn
On Tue, Apr 13, 2021 at 3:54 PM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 13, 2021 at 10:01:11AM -0400, Willem de Bruijn wrote:
> > On Tue, Apr 13, 2021 at 1:47 AM Michael S. Tsirkin  wrote:
> > >
> > > virtio_disable_cb is currently a nop for split ring with event index.
> > > This is because it used to be always called from a callback when we know
> > > device won't trigger more events until we update the index.  However,
> > > now that we run with interrupts enabled a lot we also poll without a
> > > callback so that is different: disabling callbacks will help reduce the
> > > number of spurious interrupts.
> >
> > The device may poll for transmit completions as a result of an interrupt
> > from virtnet_poll_tx.
> >
> > As well as asynchronously to this transmit interrupt, from start_xmit or
> > from virtnet_poll_cleantx as a result of a receive interrupt.
> >
> > As of napi-tx, transmit interrupts are left enabled to operate in standard
> > napi mode. While previously they would be left disabled for most of the
time, enabling only when the queue was low on descriptors.
> >
> > (in practice, for the at the time common case of split ring with event 
> > index,
> > little changed, as that mode does not actually enable/disable the interrupt,
> > but looks at the consumer index in the ring to decide whether to interrupt)
> >
> > Combined, this may cause the following:
> >
> > 1. device sends a packet and fires transmit interrupt
> > 2. driver cleans interrupts using virtnet_poll_cleantx
> > 3. driver handles transmit interrupt using vring_interrupt,
> > detects that the vring is empty: !more_used(vq),
> > and records a spurious interrupt.
> >
> > I don't quite follow how suppressing interrupt suppression, i.e.,
> > skipping disable_cb, helps avoid this.
> > I'm probably missing something. Is this solving a subtly different
> > problem from the one as I understand it?
>
> I was thinking of this one:
>
>  1. device is sending packets
>  2. driver cleans them at the same time using virtnet_poll_cleantx
>  3. device fires transmit interrupts
>  4. driver handles transmit interrupts using vring_interrupt,
>  detects that the vring is empty: !more_used(vq),
>  and records spurious interrupts.

I think that's the same scenario

>
>
> but even yours is also fixed I think.
>
> The common point is that a single spurious interrupt is not a problem.
> The problem only exists if there are tons of spurious interrupts with no
> real ones. For this to trigger, we keep polling the ring and while we do
> device keeps firing interrupts. So just disable interrupts while we
> poll.

But the main change in this patch is to turn some virtqueue_disable_cb
calls into no-ops. I don't understand how that helps reduce spurious
interrupts, as if anything, it keeps interrupts enabled for longer.

Another patch in the series disables callbacks* before starting to
clean the descriptors from the rx interrupt. That I do understand will
suppress additional tx interrupts that might see no work to be done. I
just don't entirely follow this patch on its own.

*(I use interrupt and callback as a synonym in this context, correct
me if I'm glossing over something essential)


Re: [PATCH RFC v2 3/4] virtio_net: move tx vq operation under tx queue lock

2021-04-13 Thread Willem de Bruijn
On Tue, Apr 13, 2021 at 10:03 AM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 13, 2021 at 04:54:42PM +0800, Jason Wang wrote:
> >
> > On 2021/4/13 1:47 PM, Michael S. Tsirkin wrote:
> > > It's unsafe to operate a vq from multiple threads.
> > > Unfortunately this is exactly what we do when invoking
> > > clean tx poll from rx napi.

Actually, the issue goes back to the napi-tx even without the
opportunistic cleaning from the receive interrupt, I think? That races
with processing the vq in start_xmit.

> > > As a fix move everything that deals with the vq to under tx lock.
> > >

If the above is correct:

Fixes: b92f1e6751a6 ("virtio-net: transmit napi")

> > > Signed-off-by: Michael S. Tsirkin 
> > > ---
> > >   drivers/net/virtio_net.c | 22 +-
> > >   1 file changed, 21 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 16d5abed582c..460ccdbb840e 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1505,6 +1505,8 @@ static int virtnet_poll_tx(struct napi_struct 
> > > *napi, int budget)
> > > struct virtnet_info *vi = sq->vq->vdev->priv;
> > > unsigned int index = vq2txq(sq->vq);
> > > struct netdev_queue *txq;
> > > +   int opaque;

nit: virtqueue_napi_complete also stores as int opaque, but
virtqueue_enable_cb_prepare actually returns, and virtqueue_poll
expects, an unsigned int. In the end, conversion works correctly. But
cleaner to use the real type.
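That is, just (trivial sketch):

        unsigned int opaque;    /* matches virtqueue_enable_cb_prepare() */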

> > > +   bool done;
> > > if (unlikely(is_xdp_raw_buffer_queue(vi, index))) {
> > > /* We don't need to enable cb for XDP */
> > > @@ -1514,10 +1516,28 @@ static int virtnet_poll_tx(struct napi_struct 
> > > *napi, int budget)
> > > txq = netdev_get_tx_queue(vi->dev, index);
> > > __netif_tx_lock(txq, raw_smp_processor_id());
> > > +   virtqueue_disable_cb(sq->vq);
> > > free_old_xmit_skbs(sq, true);
> > > +
> > > +   opaque = virtqueue_enable_cb_prepare(sq->vq);
> > > +
> > > +   done = napi_complete_done(napi, 0);
> > > +
> > > +   if (!done)
> > > +   virtqueue_disable_cb(sq->vq);
> > > +
> > > __netif_tx_unlock(txq);
> > > -   virtqueue_napi_complete(napi, sq->vq, 0);
> >
> >
> > So I wonder why not simply move __netif_tx_unlock() after
> > virtqueue_napi_complete()?
> >
> > Thanks
> >
>
>
> Because that calls tx poll which also takes tx lock internally ...

which tx poll?


Re: [PATCH RFC v2 2/4] virtio_net: disable cb aggressively

2021-04-13 Thread Willem de Bruijn
On Tue, Apr 13, 2021 at 4:53 AM Jason Wang  wrote:
>
>
> On 2021/4/13 1:47 PM, Michael S. Tsirkin wrote:
> > There are currently two cases where we poll TX vq not in response to a
> > callback: start xmit and rx napi.  We currently do this with callbacks
> > enabled which can cause extra interrupts from the card.  Used not to be
> > a big issue as we run with interrupts disabled but that is no longer the
> > case, and in some cases the rate of spurious interrupts is so high
> > linux detects this and actually kills the interrupt.
> >
> > Fix up by disabling the callbacks before polling the tx vq.
> >
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >   drivers/net/virtio_net.c | 2 ++
> >   1 file changed, 2 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 82e520d2cb12..16d5abed582c 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1429,6 +1429,7 @@ static void virtnet_poll_cleantx(struct receive_queue 
> > *rq)
> >   return;
> >
> >   if (__netif_tx_trylock(txq)) {
> > + virtqueue_disable_cb(sq->vq);
> >   free_old_xmit_skbs(sq, true);
> >   __netif_tx_unlock(txq);
>
>
> Any reason that we don't need to enable the cb here?

This is an opportunistic clean outside the normal tx-napi path, so if
disabling the tx interrupt here, it won't be reenabled based on
napi_complete_done.

I think that means that it stays disabled until the following start_xmit:

if (use_napi && kick)
virtqueue_enable_cb_delayed(sq->vq);

But that seems sufficient.


Re: [PATCH RFC v2 1/4] virtio: fix up virtio_disable_cb

2021-04-13 Thread Willem de Bruijn
On Tue, Apr 13, 2021 at 1:47 AM Michael S. Tsirkin  wrote:
>
> virtio_disable_cb is currently a nop for split ring with event index.
> This is because it used to be always called from a callback when we know
> device won't trigger more events until we update the index.  However,
> now that we run with interrupts enabled a lot we also poll without a
> callback so that is different: disabling callbacks will help reduce the
> number of spurious interrupts.

The device may poll for transmit completions as a result of an interrupt
from virtnet_poll_tx.

As well as asynchronously to this transmit interrupt, from start_xmit or
from virtnet_poll_cleantx as a result of a receive interrupt.

As of napi-tx, transmit interrupts are left enabled to operate in standard
napi mode. While previously they would be left disabled for most of the
time, enabling only when the queue was low on descriptors.

(in practice, for the at the time common case of split ring with event index,
little changed, as that mode does not actually enable/disable the interrupt,
but looks at the consumer index in the ring to decide whether to interrupt)

Combined, this may cause the following:

1. device sends a packet and fires transmit interrupt
2. driver cleans interrupts using virtnet_poll_cleantx
3. driver handles transmit interrupt using vring_interrupt,
detects that the vring is empty: !more_used(vq),
and records a spurious interrupt.

I don't quite follow how suppressing interrupt suppression, i.e.,
skipping disable_cb, helps avoid this.

I'm probably missing something. Is this solving a subtly different
problem from the one as I understand it?

> Further, if using event index with a packed ring, and if being called
> from a callback, we actually do disable interrupts which is unnecessary.
>
> Fix both issues by tracking whenever we get a callback. If that is
> the case disabling interrupts with event index can be a nop.
> If not the case disable interrupts. Note: with a split ring
> there's no explicit "no interrupts" value. For now we write
> a fixed value so our chance of triggering an interrupt
> is 1/ring size. It's probably better to write something
> related to the last used index there to reduce the chance
> even further. For now I'm keeping it simple.
>
> Signed-off-by: Michael S. Tsirkin 


Re: BUG: unable to handle kernel paging request in __build_skb

2021-04-11 Thread Willem de Bruijn
On Sun, Apr 11, 2021 at 9:31 PM Hao Sun  wrote:
>
> Hi
>
> When using Healer(https://github.com/SunHao-0/healer/tree/dev) to fuzz
> the Linux kernel, I found the following bug report, but I'm not sure
> about this.
> Sorry, I do not have a reproducing program for this bug.
> I hope that the stack trace information in the crash log can help you
> locate the problem.
>
> Here is the details:
> commit:   4ebaab5fb428374552175aa39832abf5cedb916a
> version:   linux 5.12
> git tree:kmsan
>
> ==
> RAX: ffda RBX: 0059c080 RCX: 0047338d
> RDX: 0010 RSI: 20002400 RDI: 0003
> RBP: 7fb6512c2c90 R08:  R09: 
> R10:  R11: 0246 R12: 0005
> R13: 7fffbb36285f R14: 7fffbb362a00 R15: 7fb6512c2dc0
> BUG: unable to handle page fault for address: a73d01c96a40
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 1810067 P4D 1810067 PUD 1915067 PMD 4b84067 PTE 0
> Oops: 0002 [#1] SMP
> CPU: 0 PID: 6273 Comm: syz-executor Not tainted 5.12.0-rc6+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6
> f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1  aa
> 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> RSP: 0018:9f3d01c9b930 EFLAGS: 00010082
> RAX: a73d01c96a00 RBX: 0020 RCX: 0020
> RDX: 0020 RSI:  RDI: a73d01c96a40
> RBP: 9f3d01c9b960 R08: c239000f R09: a73d01c96a40
> R10: 7dee4e6b R11: b2000782 R12: 
> R13: 0020 R14:  R15: 9f3d01c96a40
> FS:  7fb6512c3700() GS:97407fa0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: a73d01c96a40 CR3: 30087005 CR4: 00770ef0
> PKRU: 5554
> Call Trace:
>  kmsan_internal_unpoison_shadow+0x1d/0x70 mm/kmsan/kmsan.c:110
>  __msan_memset+0x64/0xb0 mm/kmsan/kmsan_instr.c:130
>  __build_skb_around net/core/skbuff.c:209 [inline]
>  __build_skb+0x34b/0x520 net/core/skbuff.c:243
>  netlink_alloc_large_skb net/netlink/af_netlink.c:1193 [inline]
>  netlink_sendmsg+0xdc1/0x14d0 net/netlink/af_netlink.c:1902
>  sock_sendmsg_nosec net/socket.c:654 [inline]
>  sock_sendmsg net/socket.c:674 [inline]

I don't have an idea what might be up, but some context:

This happens in __build_skb_around at

memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));

on vmalloc'd memory in netlink_alloc_large_skb:

data = vmalloc(size);
if (data == NULL)
return NULL;

skb = __build_skb(data, size);


Re: [RFC net] net: skbuff: fix stack variable out of bounds access

2021-03-23 Thread Willem de Bruijn
On Tue, Mar 23, 2021 at 12:30 PM Arnd Bergmann  wrote:
>
> On Tue, Mar 23, 2021 at 3:42 PM Willem de Bruijn
>  wrote:
> >
> > On Tue, Mar 23, 2021 at 8:52 AM Arnd Bergmann  wrote:
> > >>
> > A similar fix already landed in 5.12-rc3: commit b228c9b05876 ("net:
> > expand textsearch ts_state to fit skb_seq_state").
>
> Ah nice, even the same BUILD_BUG_ON() ;-)

Indeed :) Sorry that your work ended up essentially reproducing that.

> Too bad it had to be found through runtime testing when it could have been
> found by the compiler warning.

Definitely useful. Had I enabled it, it would have saved me a lot of debug time.


Re: [RFC net] net: skbuff: fix stack variable out of bounds access

2021-03-23 Thread Willem de Bruijn
On Tue, Mar 23, 2021 at 8:52 AM Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> gcc-11 warns that the TS_SKB_CB() cast in skb_find_text()
> leads to an out-of-bounds access in skb_prepare_seq_read() after
> the addition of a new struct member made skb_seq_state longer
> than ts_state:
>
> net/core/skbuff.c: In function ‘skb_find_text’:
> net/core/skbuff.c:3498:26: error: array subscript ‘struct skb_seq_state[0]’ 
> is partly outside array bounds of ‘struct ts_state[1]’ [-Werror=array-bounds]
>  3498 | st->lower_offset = from;
>   | ~^~
> net/core/skbuff.c:3659:25: note: while referencing ‘state’
>  3659 | struct ts_state state;
>   | ^
>
> The warning is currently disabled globally, but I found this
> instance during experimental build testing, and it seems
> legitimate.
>
> Make the textsearch buffer longer and add a compile-time check to
> ensure the two remain the same length.
>
> Fixes: 97550f6fa592 ("net: compound page support in skb_seq_read")
> Signed-off-by: Arnd Bergmann 
> ---
>  include/linux/textsearch.h | 2 +-
>  net/core/skbuff.c  | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/textsearch.h b/include/linux/textsearch.h
> index 13770cfe33ad..6673e4d4ac2e 100644
> --- a/include/linux/textsearch.h
> +++ b/include/linux/textsearch.h
> @@ -23,7 +23,7 @@ struct ts_config;
>  struct ts_state
>  {
> unsigned int offset;
> -   char cb[40];
> +   char cb[48];
>  };
>
>  /**
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 545a472273a5..dd10d4c5f4bf 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3633,6 +3633,7 @@ static unsigned int skb_ts_get_next_block(unsigned int 
> offset, const u8 **text,
>   struct ts_config *conf,
>   struct ts_state *state)
>  {
> +   BUILD_BUG_ON(sizeof(struct skb_seq_state) > sizeof(state->cb));
> return skb_seq_read(offset, text, TS_SKB_CB(state));
>  }
>
> --
> 2.29.2
>

Thanks for addressing this.

A similar fix already landed in 5.12-rc3: commit b228c9b05876 ("net:
expand textsearch ts_state to fit skb_seq_state").


Re: [PATCH net v3 1/2] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-03-09 Thread Willem de Bruijn
On Tue, Mar 9, 2021 at 6:32 AM Balazs Nemeth  wrote:
>
> For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
> set) based on the type in the virtio net hdr, but the skb could contain
> anything since it could come from packet_snd through a raw socket. If
> there is a mismatch between what virtio_net_hdr_set_proto sets and
> the actual protocol, then the skb could be handled incorrectly later
> on.
>
> An example where this poses an issue is with the subsequent call to
> skb_flow_dissect_flow_keys_basic which relies on skb->protocol being set
> correctly. A specially crafted packet could fool
> skb_flow_dissect_flow_keys_basic preventing EINVAL to be returned.
>
> Avoid blindly trusting the information provided by the virtio net header
> by checking that the protocol in the packet actually matches the
> protocol set by virtio_net_hdr_set_proto. Note that since the protocol
> is only checked if skb->dev implements header_ops->parse_protocol,
> packets from devices without the implementation are not checked at this
> stage.
>
> Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets")
> Signed-off-by: Balazs Nemeth 

Acked-by: Willem de Bruijn 

This still relies entirely on data from the untrusted process. But it
adds the constraint that the otherwise untrusted data at least has to
be consistent, closing one loophole.

As responded in v2, we may want to look at the (few) callers and make
sure that they initialize skb->protocol before the call to
virtio_net_hdr_to_skb where possible. That will avoid this entire
branch.


Re: [PATCH net v3 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-09 Thread Willem de Bruijn
On Tue, Mar 9, 2021 at 6:32 AM Balazs Nemeth  wrote:
>
> A packet with skb_inner_network_header(skb) == skb_network_header(skb)
> and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers
> from the packet. Subsequently, the call to skb_mac_gso_segment will
> again call mpls_gso_segment with the same packet leading to an infinite
> loop. In addition, ensure that the header length is a multiple of four,
> which should hold irrespective of the number of stacked labels.
>
> Signed-off-by: Balazs Nemeth 

Acked-by: Willem de Bruijn 

The compiler will convert that modulo into a cheap & (MPLS_HLEN - 1)
test for this constant.
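Since MPLS_HLEN is 4, a power of two, the check compiles down to a mask
(sketch of the emitted logic):

        if (mpls_hlen % MPLS_HLEN)  /* emitted as: mpls_hlen & (MPLS_HLEN - 1) */
                goto out;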


Re: [PATCH v2 1/2] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-03-09 Thread Willem de Bruijn
On Tue, Mar 9, 2021 at 6:26 AM Michael S. Tsirkin  wrote:
>
> On Mon, Mar 08, 2021 at 11:31:25AM +0100, Balazs Nemeth wrote:
> > For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
> > set) based on the type in the virtio net hdr, but the skb could contain
> > anything since it could come from packet_snd through a raw socket. If
> > there is a mismatch between what virtio_net_hdr_set_proto sets and
> > the actual protocol, then the skb could be handled incorrectly later
> > on.
> >
> > An example where this poses an issue is with the subsequent call to
> > skb_flow_dissect_flow_keys_basic which relies on skb->protocol being set
> > correctly. A specially crafted packet could fool
> > skb_flow_dissect_flow_keys_basic preventing EINVAL to be returned.
> >
> > Avoid blindly trusting the information provided by the virtio net header
> > by checking that the protocol in the packet actually matches the
> > protocol set by virtio_net_hdr_set_proto. Note that since the protocol
> > is only checked if skb->dev implements header_ops->parse_protocol,
> > packets from devices without the implementation are not checked at this
> > stage.
> >
> > Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets")
> > Signed-off-by: Balazs Nemeth 
> > ---
> >  include/linux/virtio_net.h | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> > index e8a924eeea3d..6c478eee0452 100644
> > --- a/include/linux/virtio_net.h
> > +++ b/include/linux/virtio_net.h
> > @@ -79,8 +79,14 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff 
> > *skb,
> >   if (gso_type && skb->network_header) {
> >   struct flow_keys_basic keys;
> >
> > - if (!skb->protocol)
> > + if (!skb->protocol) {
> > + const struct ethhdr *eth = skb_eth_hdr(skb);
> > + __be16 etype = dev_parse_header_protocol(skb);
> > +
> >   virtio_net_hdr_set_proto(skb, hdr);
> > + if (etype && etype != skb->protocol)
> > + return -EINVAL;
> > + }
>
>
> Well the protocol in the header is an attempt at an optimization to
> remove need to parse the packet ... any data on whether this
> affecs performance?

This adds a branch and a read of a cacheline that is inevitably read
not much later. It shouldn't be significant.

And this branch is only taken if skb->protocol is not set. So the cost
can easily be avoided by passing the information.

But you raise a good point, because TUNTAP does set it, but only after
the call to virtio_net_hdr_to_skb.

That should perhaps be inverted (in a separate net-next patch).
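A rough sketch of that inversion in a caller (hypothetical ordering;
today tun assigns skb->protocol only after this call):

        /* derive the link-layer protocol from the frame first ... */
        skb->protocol = dev_parse_header_protocol(skb);

        /* ... so virtio_net_hdr_to_skb() can validate instead of guessing */
        if (virtio_net_hdr_to_skb(skb, &gso, tun_is_little_endian(tun))) {
                kfree_skb(skb);
                return -EINVAL;
        }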


Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-08 Thread Willem de Bruijn
On Mon, Mar 8, 2021 at 11:43 AM David Ahern  wrote:
>
> On 3/8/21 9:26 AM, Balazs Nemeth wrote:
> > On Mon, 2021-03-08 at 09:17 -0700, David Ahern wrote:
> >> On 3/8/21 9:07 AM, Willem de Bruijn wrote:
> >>>> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
> >>>> index b1690149b6fa..cc1b6457fc93 100644
> >>>> --- a/net/mpls/mpls_gso.c
> >>>> +++ b/net/mpls/mpls_gso.c
> >>>> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct
> >>>> sk_buff *skb,
> >>>>
> >>>> skb_reset_network_header(skb);
> >>>> mpls_hlen = skb_inner_network_header(skb) -
> >>>> skb_network_header(skb);
> >>>> -   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
> >>>> +   if (unlikely(!mpls_hlen || !pskb_may_pull(skb,
> >>>> mpls_hlen)))
> >>>> goto out;
> >>>
> >>> Good catch. Besides length zero, this can be more strict: a label
> >>> is
> >>> 4B, so mpls_hlen needs to be >= 4B.
> >>>
> >>> Perhaps even aligned to 4B, too, but not if there may be other
> >>> encap on top.

On second thought, since mpls_gso_segment pulls all these headers, it
is correct to require it to be a multiple of MPLS_HLEN.


Re: [PATCH v2 2/2] net: avoid infinite loop in mpls_gso_segment when mpls_hlen == 0

2021-03-08 Thread Willem de Bruijn
On Mon, Mar 8, 2021 at 5:32 AM Balazs Nemeth  wrote:
>
> A packet with skb_inner_network_header(skb) == skb_network_header(skb)
> and ETH_P_MPLS_UC will prevent mpls_gso_segment from pulling any headers
> from the packet. Subsequently, the call to skb_mac_gso_segment will
> again call mpls_gso_segment with the same packet leading to an infinite
> loop.
>
> Signed-off-by: Balazs Nemeth 
> ---
>  net/mpls/mpls_gso.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
> index b1690149b6fa..cc1b6457fc93 100644
> --- a/net/mpls/mpls_gso.c
> +++ b/net/mpls/mpls_gso.c
> @@ -27,7 +27,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
>
> skb_reset_network_header(skb);
> mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
> -   if (unlikely(!pskb_may_pull(skb, mpls_hlen)))
> +   if (unlikely(!mpls_hlen || !pskb_may_pull(skb, mpls_hlen)))
> goto out;

Good catch. Besides length zero, this can be more strict: a label is
4B, so mpls_hlen needs to be >= 4B.

Perhaps even aligned to 4B, too, but not if there may be other encap on top.

Unfortunately there is no struct or type definition that we can use a
sizeof instead of open coding the raw constant.


Re: [PATCH v2 1/2] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-03-08 Thread Willem de Bruijn
On Mon, Mar 8, 2021 at 5:32 AM Balazs Nemeth  wrote:
>
> For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
> set) based on the type in the virtio net hdr, but the skb could contain
> anything since it could come from packet_snd through a raw socket. If
> there is a mismatch between what virtio_net_hdr_set_proto sets and
> the actual protocol, then the skb could be handled incorrectly later
> on.
>
> An example where this poses an issue is with the subsequent call to
> skb_flow_dissect_flow_keys_basic which relies on skb->protocol being set
> correctly. A specially crafted packet could fool
> skb_flow_dissect_flow_keys_basic preventing EINVAL to be returned.
>
> Avoid blindly trusting the information provided by the virtio net header
> by checking that the protocol in the packet actually matches the
> protocol set by virtio_net_hdr_set_proto. Note that since the protocol
> is only checked if skb->dev implements header_ops->parse_protocol,
> packets from devices without the implementation are not checked at this
> stage.
>
> Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets")
> Signed-off-by: Balazs Nemeth 

Going forward, please mark your patch as targeting the net tree
using [PATCH net]

> ---
>  include/linux/virtio_net.h | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index e8a924eeea3d..6c478eee0452 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -79,8 +79,14 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff 
> *skb,
> if (gso_type && skb->network_header) {
> struct flow_keys_basic keys;
>
> -   if (!skb->protocol)
> +   if (!skb->protocol) {
> +   const struct ethhdr *eth = skb_eth_hdr(skb);

eth is no longer used.

> +   __be16 etype = dev_parse_header_protocol(skb);

nit: customary to call this protocol. etype, I guess short for
EtherType, makes sense, but is not commonly used in the kernel.

> +
> virtio_net_hdr_set_proto(skb, hdr);
> +   if (etype && etype != skb->protocol)
> +   return -EINVAL;
> +   }
>  retry:
> if (!skb_flow_dissect_flow_keys_basic(NULL, skb, 
> &keys,
>   NULL, 0, 0, 0,
> --
> 2.29.2
>


Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan

2021-03-05 Thread Willem de Bruijn
On Fri, Mar 5, 2021 at 11:10 AM Daniel Borkmann  wrote:
>
> On 3/5/21 4:08 PM, Willem de Bruijn wrote:
> > On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang  wrote:
> >>
> >> From: Xuesen Huang 
> >>
> >> Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which
> >> encapsulates the ethernet as the inner l2 header.
> >>
> >> Update a vxlan encapsulation test case.
> >>
> >> Signed-off-by: Xuesen Huang 
> >> Signed-off-by: Li Wang 
> >> Signed-off-by: Willem de Bruijn 
> >
> > Please don't add my signed off by without asking.
>
> Agree, I can remove it if you prefer while applying and only keep the
> ack instead.

That would be great. Thanks, Daniel!


Re: [PATCH bpf-next] selftests_bpf: extend test_tc_tunnel test with vxlan

2021-03-05 Thread Willem de Bruijn
On Fri, Mar 5, 2021 at 7:34 AM Xuesen Huang  wrote:
>
> From: Xuesen Huang 
>
> Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which
> encapsulates the ethernet as the inner l2 header.
>
> Update a vxlan encapsulation test case.
>
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Li Wang 
> Signed-off-by: Willem de Bruijn 


Please don't add my signed off by without asking.

That said,

Acked-by: Willem de Bruijn 


Re: [PATCH] selftests_bpf: extend test_tc_tunnel test with vxlan

2021-03-04 Thread Willem de Bruijn
On Thu, Mar 4, 2021 at 1:42 AM Xuesen Huang  wrote:
>
> From: Xuesen Huang 
>
> Add BPF_F_ADJ_ROOM_ENCAP_L2_ETH flag to the existing tests which
> encapsulates the ethernet as the inner l2 header.
>
> Update a vxlan encapsulation test case.
>
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Li Wang 
> Signed-off-by: Willem de Bruijn 

Please mark patch target: [PATCH bpf-next]

> ---
>  tools/testing/selftests/bpf/progs/test_tc_tunnel.c | 113 
> ++---
>  tools/testing/selftests/bpf/test_tc_tunnel.sh  |  15 ++-
>  2 files changed, 111 insertions(+), 17 deletions(-)


> -static __always_inline int encap_ipv4(struct __sk_buff *skb, __u8 
> encap_proto,
> - __u16 l2_proto)
> +static __always_inline int __encap_ipv4(struct __sk_buff *skb, __u8 
> encap_proto,
> +   __u16 l2_proto, __u16 ext_proto)
>  {
> __u16 udp_dst = UDP_PORT;
> struct iphdr iph_inner;
> struct v4hdr h_outer;
> struct tcphdr tcph;
> int olen, l2_len;
> +   __u8 *l2_hdr = NULL;
> int tcp_off;
> __u64 flags;
>
> @@ -141,7 +157,11 @@ static __always_inline int encap_ipv4(struct __sk_buff 
> *skb, __u8 encap_proto,
> break;
> case ETH_P_TEB:
> l2_len = ETH_HLEN;
> -   udp_dst = ETH_OVER_UDP_PORT;
> +   if (ext_proto & EXTPROTO_VXLAN) {
> +   udp_dst = VXLAN_UDP_PORT;
> +   l2_len += sizeof(struct vxlanhdr);
> +   } else
> +   udp_dst = ETH_OVER_UDP_PORT;
> break;
> }
> flags |= BPF_F_ADJ_ROOM_ENCAP_L2(l2_len);
> @@ -171,14 +191,26 @@ static __always_inline int encap_ipv4(struct __sk_buff 
> *skb, __u8 encap_proto,
> }
>
> /* add L2 encap (if specified) */
> +   l2_hdr = (__u8 *)&h_outer + olen;
> switch (l2_proto) {
> case ETH_P_MPLS_UC:
> -   *((__u32 *)((__u8 *)&h_outer + olen)) = mpls_label;
> +   *(__u32 *)l2_hdr = mpls_label;
> break;
> case ETH_P_TEB:
> -   if (bpf_skb_load_bytes(skb, 0, (__u8 *)&h_outer + olen,
> -  ETH_HLEN))
> +   flags |= BPF_F_ADJ_ROOM_ENCAP_L2_ETH;
> +
> +   if (ext_proto & EXTPROTO_VXLAN) {
> +   struct vxlanhdr *vxlan_hdr = (struct vxlanhdr 
> *)l2_hdr;
> +
> +   vxlan_hdr->vx_flags = VXLAN_FLAGS;
> +   vxlan_hdr->vx_vni = bpf_htonl((VXLAN_VNI & 
> VXLAN_VNI_MASK) << 8);
> +
> +   l2_hdr += sizeof(struct vxlanhdr);

should this be l2_len? (here and ipv6 below)

> +SEC("encap_vxlan_eth")
> +int __encap_vxlan_eth(struct __sk_buff *skb)
> +{
> +   if (skb->protocol == __bpf_constant_htons(ETH_P_IP))
> +   return __encap_ipv4(skb, IPPROTO_UDP,
> +   ETH_P_TEB,
> +   EXTPROTO_VXLAN);

non-standard indentation: align with the opening parenthesis. (here
and ipv6 below)


Re: [PATCH/v5] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-03-04 Thread Willem de Bruijn
On Thu, Mar 4, 2021 at 1:41 AM Xuesen Huang  wrote:
>
> From: Xuesen Huang 
>
> bpf_skb_adjust_room sets the inner_protocol as skb->protocol for packets
> encapsulation. But that is not appropriate when pushing Ethernet header.
>
> Add an option to further specify encap L2 type and set the inner_protocol
> as ETH_P_TEB.
>
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Zhiyong Cheng 
> Signed-off-by: Li Wang 

Acked-by: Willem de Bruijn 


Re: [PATCH/v4] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-03-03 Thread Willem de Bruijn
> > Instead of untyped macros, I'd define encap_ipv4 as a function that
> > calls __encap_ipv4.
> >
> > And no need for encap_ipv4_with_ext_proto equivalent to __encap_ipv4.
> >
> I defined these macros to try to keep the existing invocations of
> encap_ipv4/6 the same; if we define this as a function, all invocations
> would have to be modified?

You can leave the existing invocations the same and make the new
callers call __encap_ipv4 directly, which takes one extra argument?
Adding a __ prefixed variant with extra args is a common pattern.
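In code, the pattern looks like (sketch):

        static __always_inline int __encap_ipv4(struct __sk_buff *skb,
                                                __u8 encap_proto,
                                                __u16 l2_proto, __u16 ext_proto);

        /* thin wrapper keeps all existing call sites unchanged */
        static __always_inline int encap_ipv4(struct __sk_buff *skb,
                                              __u8 encap_proto, __u16 l2_proto)
        {
                return __encap_ipv4(skb, encap_proto, l2_proto, 0);
        }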

> >>/* add L2 encap (if specified) */
> >> +   l2_hdr = (__u8 *)&h_outer + olen;
> >>switch (l2_proto) {
> >>case ETH_P_MPLS_UC:
> >> -   *((__u32 *)((__u8 *)&h_outer + olen)) = mpls_label;
> >> +   *(__u32 *)l2_hdr = mpls_label;
> >>break;
> >>case ETH_P_TEB:
> >> -   if (bpf_skb_load_bytes(skb, 0, (__u8 *)&h_outer + olen,
> >> -  ETH_HLEN))
> >
> > This is non-standard indentation? Here and elsewhere.
> I thinks it’s a previous issue.

Ah right. Bad example. How about in __encap_vxlan_eth

+   return encap_ipv4_with_ext_proto(skb, IPPROTO_UDP,
+   ETH_P_TEB, EXTPROTO_VXLAN);

> >> @@ -278,13 +321,24 @@ static __always_inline int encap_ipv6(struct 
> >> __sk_buff *skb, __u8 encap_proto,
> >>}
> >>
> >>/* add L2 encap (if specified) */
> >> +   l2_hdr = (__u8 *)&h_outer + olen;
> >>switch (l2_proto) {
> >>case ETH_P_MPLS_UC:
> >> -   *((__u32 *)((__u8 *)&h_outer + olen)) = mpls_label;
> >> +   *(__u32 *)l2_hdr = mpls_label;
> >>break;
> >>case ETH_P_TEB:
> >> -   if (bpf_skb_load_bytes(skb, 0, (__u8 *)&h_outer + olen,
> >> -  ETH_HLEN))
> >> +   flags |= BPF_F_ADJ_ROOM_ENCAP_L2_ETH;
> >
> > This is a change also for the existing case. Correctly so, I imagine.
> > But the test used to pass with the wrong protocol?
> Yes, all tests pass. I'm not sure whether we should add this flag for the
> existing tests which encap eth as the l2 header, or only for the vxlan test?

It is correct in both cases. If it does not break anything, I would do both.

Thanks,

  Willem


Re: [PATCH/v4] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-03-03 Thread Willem de Bruijn
On Wed, Mar 3, 2021 at 7:33 AM Xuesen Huang  wrote:
>
> From: Xuesen Huang 
>
> bpf_skb_adjust_room sets the inner_protocol as skb->protocol for packets
> encapsulation. But that is not appropriate when pushing Ethernet header.
>
> Add an option to further specify encap L2 type and set the inner_protocol
> as ETH_P_TEB.
>
> Update test_tc_tunnel to verify adding vxlan encapsulation works with
> this flag.
>
> Suggested-by: Willem de Bruijn 
> Signed-off-by: Xuesen Huang 
> Signed-off-by: Zhiyong Cheng 
> Signed-off-by: Li Wang 

Thanks for adding the test. Perhaps that is better in a separate patch?

Overall looks great to me.

The patch has not (yet?) arrived on patchwork.

>  enum {
> diff --git a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c 
> b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
> index 37bce7a..6e144db 100644
> --- a/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
> +++ b/tools/testing/selftests/bpf/progs/test_tc_tunnel.c
> @@ -20,6 +20,14 @@
>  #include 
>  #include 
>
> +#define encap_ipv4(...) __encap_ipv4(__VA_ARGS__, 0)
> +
> +#define encap_ipv4_with_ext_proto(...) __encap_ipv4(__VA_ARGS__)
> +
> +#define encap_ipv6(...) __encap_ipv6(__VA_ARGS__, 0)
> +
> +#define encap_ipv6_with_ext_proto(...) __encap_ipv6(__VA_ARGS__)
> +

Instead of untyped macros, I'd define encap_ipv4 as a function that
calls __encap_ipv4.

And no need for encap_ipv4_with_ext_proto equivalent to __encap_ipv4.

>  static const int cfg_port = 8000;
>
>  static const int cfg_udp_src = 2;
> @@ -27,11 +35,24 @@
>  #defineUDP_PORT
>  #defineMPLS_OVER_UDP_PORT  6635
>  #defineETH_OVER_UDP_PORT   
> +#defineVXLAN_UDP_PORT  8472
> +
> +#defineEXTPROTO_VXLAN  0x1
> +
> +#defineVXLAN_N_VID (1u << 24)
> +#defineVXLAN_VNI_MASK  bpf_htonl((VXLAN_N_VID - 1) << 8)
> +#defineVXLAN_FLAGS 0x8
> +#defineVXLAN_VNI   1
>
>  /* MPLS label 1000 with S bit (last label) set and ttl of 255. */
>  static const __u32 mpls_label = __bpf_constant_htonl(1000 << 12 |
>  MPLS_LS_S_MASK | 0xff);
>
> +struct vxlanhdr {
> +   __be32 vx_flags;
> +   __be32 vx_vni;
> +} __attribute__((packed));
> +
>  struct gre_hdr {
> __be16 flags;
> __be16 protocol;
> @@ -45,13 +66,13 @@ struct gre_hdr {
>  struct v4hdr {
> struct iphdr ip;
> union l4hdr l4hdr;
> -   __u8 pad[16];   /* enough space for L2 header */
> +   __u8 pad[24];   /* space for L2 header / vxlan header 
> ... */

could we use something like sizeof(..) instead of a constant?

> @@ -171,14 +197,26 @@ static __always_inline int encap_ipv4(struct __sk_buff 
> *skb, __u8 encap_proto,
> }
>
> /* add L2 encap (if specified) */
> +   l2_hdr = (__u8 *)&h_outer + olen;
> switch (l2_proto) {
> case ETH_P_MPLS_UC:
> -   *((__u32 *)((__u8 *)&h_outer + olen)) = mpls_label;
> +   *(__u32 *)l2_hdr = mpls_label;
> break;
> case ETH_P_TEB:
> -   if (bpf_skb_load_bytes(skb, 0, (__u8 *)&h_outer + olen,
> -  ETH_HLEN))

This is non-standard indentation? Here and elsewhere.

> @@ -249,7 +288,11 @@ static __always_inline int encap_ipv6(struct __sk_buff 
> *skb, __u8 encap_proto,
> break;
> case ETH_P_TEB:
> l2_len = ETH_HLEN;
> -   udp_dst = ETH_OVER_UDP_PORT;
> +   if (ext_proto & EXTPROTO_VXLAN) {
> +   udp_dst = VXLAN_UDP_PORT;
> +   l2_len += sizeof(struct vxlanhdr);
> +   } else
> +   udp_dst = ETH_OVER_UDP_PORT;
> break;
> }
> flags |= BPF_F_ADJ_ROOM_ENCAP_L2(l2_len);
> @@ -267,7 +310,7 @@ static __always_inline int encap_ipv6(struct __sk_buff 
> *skb, __u8 encap_proto,
> h_outer.l4hdr.udp.source = __bpf_constant_htons(cfg_udp_src);
> h_outer.l4hdr.udp.dest = bpf_htons(udp_dst);
> tot_len = bpf_ntohs(iph_inner.payload_len) + 
> sizeof(iph_inner) +
> - sizeof(h_outer.l4hdr.udp);
> + sizeof(h_outer.l4hdr.udp) + l2_len;

Was this a bug previously?

> h_outer.l4hdr.udp.check = 0;
> h_outer.l4hdr.udp.len = bpf_htons(tot_len);
> break;
> @@ -278,13 +321,24 @@ static __always_inline int encap_ipv6(struct __sk_buff 
> *skb, __u8 encap_proto,
> }
>
>   

Re: [PATCH/v3] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-02-26 Thread Willem de Bruijn
On Fri, Feb 26, 2021 at 3:15 PM Cong Wang  wrote:
>
> On Thu, Feb 25, 2021 at 7:59 PM Xuesen Huang  wrote:
> > v3:
> > - Fix the code format.
> >
> > v2:
> > Suggested-by: Willem de Bruijn 
> > - Add a new flag to specify the type of the inner packet.
>
> These need to be moved after '---', otherwise it would be merged
> into the final git log.
>
> >
> > Suggested-by: Willem de Bruijn 
> > Signed-off-by: Xuesen Huang 
> > Signed-off-by: Zhiyong Cheng 
> > Signed-off-by: Li Wang 
> > ---
> >  include/uapi/linux/bpf.h   |  5 +
> >  net/core/filter.c  | 11 ++-
> >  tools/include/uapi/linux/bpf.h |  5 +
> >  3 files changed, 20 insertions(+), 1 deletion(-)
>
> As a good practice, please add a test case for this in
> tools/testing/selftests/bpf/progs/test_tc_tunnel.c.

That's a great idea. This function covers a lot of cases. The added
coverage can guard against regressions.

With that caveat, looks great to me, thanks.


Re: [PATCH] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-02-23 Thread Willem de Bruijn
On Tue, Feb 23, 2021 at 8:48 AM Balazs Nemeth  wrote:
>
> On Mon, 2021-02-22 at 11:39 +0800, Jason Wang wrote:
> >
> > On 2021/2/19 10:55 PM, Willem de Bruijn wrote:
> > > On Fri, Feb 19, 2021 at 3:53 AM Jason Wang 
> > > wrote:
> > > >
> > > > On 2021/2/18 11:50 PM, Willem de Bruijn wrote:
> > > > > On Thu, Feb 18, 2021 at 10:01 AM Balazs Nemeth <
> > > > > bnem...@redhat.com> wrote:
> > > > > > For gso packets, virtio_net_hdr_set_proto sets the protocol
> > > > > > (if it isn't
> > > > > > set) based on the type in the virtio net hdr, but the skb
> > > > > > could contain
> > > > > > anything since it could come from packet_snd through a raw
> > > > > > socket. If
> > > > > > there is a mismatch between what virtio_net_hdr_set_proto
> > > > > > sets and
> > > > > > the actual protocol, then the skb could be handled
> > > > > > incorrectly later
> > > > > > on by gso.
> > > > > >
> > > > > > The network header of gso packets starts at 14 bytes, but a
> > > > > > specially
> > > > > > crafted packet could fool the call to
> > > > > > skb_flow_dissect_flow_keys_basic
> > > > > > as the network header offset in the skb could be incorrect.
> > > > > > Consequently, EINVAL is not returned.
> > > > > >
> > > > > > There are even packets that can cause an infinite loop. For
> > > > > > example, a
> > > > > > packet with ethernet type ETH_P_MPLS_UC (which is unnoticed
> > > > > > by
> > > > > > virtio_net_hdr_to_skb) that is sent to a geneve interface
> > > > > > will be
> > > > > > handled by geneve_build_skb. In turn, it calls
> > > > > > udp_tunnel_handle_offloads which then calls
> > > > > > skb_reset_inner_headers.
> > > > > > After that, the packet gets passed to mpls_gso_segment. That
> > > > > > function
> > > > > > calculates the mpls header length by taking the difference
> > > > > > between
> > > > > > network_header and inner_network_header. Since the two are
> > > > > > equal
> > > > > > (due to the earlier call to skb_reset_inner_headers), it will
> > > > > > calculate
> > > > > > a header of length 0, and it will not pull any headers. Then,
> > > > > > it will
> > > > > > call skb_mac_gso_segment which will again call
> > > > > > mpls_gso_segment, etc...
> > > > > > This leads to the infinite loop.
> > > >
> > > > I remember kernel will validate dodgy gso packets in gso ops. I
> > > > wonder
> > > > why not do the check there? The reason is that virtio/TUN is not
> > > > the
> > > > only source for those packets.
> > > It is? All other GSO packets are generated by the stack itself,
> > > either
> > > locally or through GRO.
> >
> >
> > Something like what has been done in tcp_tso_segment()?
> >
> >  if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
> >  /* Packet is from an untrusted source, reset
> > gso_segs. */
> >
> >  skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);
> >
> >  segs = NULL;
> >  goto out;
> >  }
> >
> > My understanding of the header check logic is that it tries to delay
> > the check as much as possible, so for a device that has GSO_ROBUST,
> > there's even no need to do that.
> >
> >
> > >
> > > But indeed some checks are better performed in the GSO layer. Such
> > > as
> > > likely the 0-byte mpls header length.
> > >
> > > If we cannot trust virtio_net_hdr.gso_type passed from userspace,
> > > then
> > > we can also not trust the eth.h_proto coming from the same source.
> >
> >
> > I agree.
> >
> I'll add a check in the GSO layer as well.
> >
> > > But
> > > it makes sense to require them to be consistent. There is a
> > > dev_parse_header_protocol that may return the link layer type in a
> > > more generic fashion than casting to skb_eth_hdr.
> > >
> > > Question remains what to do for the link layer types that do not
> > > implement
> > > header_ops->parse_protocol, and so we cannot validate the packet's
> > > network protocol. Drop will cause false positives, accepts will
> > > leave a
> > > potential path, just closes it for Ethernet.
> > >
> > > This might call for multiple fixes, both on first ingest and inside
> > > the stack?
> >
> Given that this is related to dodgy packets and that we can't trust
> eth.h_proto, wouldn't it make sense to always drop packets (with
> potential false positives), erring on the side of caution, if
> header_ops->parse_protocol isn't implemented for the dev in question?

Unfortunately, that might break applications somewhere out there.


Re: [PATCH] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-02-19 Thread Willem de Bruijn
On Fri, Feb 19, 2021 at 3:53 AM Jason Wang  wrote:
>
>
> On 2021/2/18 11:50 PM, Willem de Bruijn wrote:
> > On Thu, Feb 18, 2021 at 10:01 AM Balazs Nemeth  wrote:
> >> For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
> >> set) based on the type in the virtio net hdr, but the skb could contain
> >> anything since it could come from packet_snd through a raw socket. If
> >> there is a mismatch between what virtio_net_hdr_set_proto sets and
> >> the actual protocol, then the skb could be handled incorrectly later
> >> on by gso.
> >>
> >> The network header of gso packets starts at 14 bytes, but a specially
> >> crafted packet could fool the call to skb_flow_dissect_flow_keys_basic
> >> as the network header offset in the skb could be incorrect.
> >> Consequently, EINVAL is not returned.
> >>
> >> There are even packets that can cause an infinite loop. For example, a
> >> packet with ethernet type ETH_P_MPLS_UC (which is unnoticed by
> >> virtio_net_hdr_to_skb) that is sent to a geneve interface will be
> >> handled by geneve_build_skb. In turn, it calls
> >> udp_tunnel_handle_offloads which then calls skb_reset_inner_headers.
> >> After that, the packet gets passed to mpls_gso_segment. That function
> >> calculates the mpls header length by taking the difference between
> >> network_header and inner_network_header. Since the two are equal
> >> (due to the earlier call to skb_reset_inner_headers), it will calculate
> >> a header of length 0, and it will not pull any headers. Then, it will
> >> call skb_mac_gso_segment which will again call mpls_gso_segment, etc...
> >> This leads to the infinite loop.
>
>
> I remember kernel will validate dodgy gso packets in gso ops. I wonder
> why not do the check there? The reason is that virtio/TUN is not the
> only source for those packets.

It is? All other GSO packets are generated by the stack itself, either
locally or through GRO.

But indeed some checks are better performed in the GSO layer. Such as
likely the 0-byte mpls header length.
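For illustration, a guard in mpls_gso_segment along these lines would
close the 0-byte header loop. A minimal sketch, not a tested patch:

	/* sketch: a 0-byte or misaligned MPLS header means nothing
	 * was actually pulled; refuse to segment instead of looping
	 */
	mpls_hlen = skb_inner_network_header(skb) - skb_network_header(skb);
	if (unlikely(!mpls_hlen || mpls_hlen % MPLS_HLEN))
		goto out;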

If we cannot trust virtio_net_hdr.gso_type passed from userspace, then
we can also not trust the eth.h_proto coming from the same source. But
it makes sense to require them to be consistent. There is a
dev_parse_header_protocol that may return the link layer type in a
more generic fashion than casting to skb_eth_hdr.
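Concretely, the consistency check being discussed could look like this
inside virtio_net_hdr_to_skb. A sketch of the direction, not the final
patch:

	if (!skb->protocol) {
		__be16 protocol = dev_parse_header_protocol(skb);

		virtio_net_hdr_set_proto(skb, hdr);
		/* a zero return means the link layer cannot tell us;
		 * see below for what to do in that case
		 */
		if (protocol && protocol != skb->protocol)
			return -EINVAL;
	}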

Question remains what to do for the link layer types that do not implement
header_ops->parse_protocol, and so we cannot validate the packet's
network protocol. Drop will cause false positives, accepts will leave a
potential path, just closes it for Ethernet.

This might call for multiple fixes, both on first ingest and inside the stack?


Re: [PATCH] net: check if protocol extracted by virtio_net_hdr_set_proto is correct

2021-02-18 Thread Willem de Bruijn
On Thu, Feb 18, 2021 at 10:01 AM Balazs Nemeth  wrote:
>
> For gso packets, virtio_net_hdr_set_proto sets the protocol (if it isn't
> set) based on the type in the virtio net hdr, but the skb could contain
> anything since it could come from packet_snd through a raw socket. If
> there is a mismatch between what virtio_net_hdr_set_proto sets and
> the actual protocol, then the skb could be handled incorrectly later
> on by gso.
>
> The network header of gso packets starts at 14 bytes, but a specially
> crafted packet could fool the call to skb_flow_dissect_flow_keys_basic
> as the network header offset in the skb could be incorrect.
> Consequently, EINVAL is not returned.
>
> There are even packets that can cause an infinite loop. For example, a
> packet with ethernet type ETH_P_MPLS_UC (which is unnoticed by
> virtio_net_hdr_to_skb) that is sent to a geneve interface will be
> handled by geneve_build_skb. In turn, it calls
> udp_tunnel_handle_offloads which then calls skb_reset_inner_headers.
> After that, the packet gets passed to mpls_gso_segment. That function
> calculates the mpls header length by taking the difference between
> network_header and inner_network_header. Since the two are equal
> (due to the earlier call to skb_reset_inner_headers), it will calculate
> a header of length 0, and it will not pull any headers. Then, it will
> call skb_mac_gso_segment which will again call mpls_gso_segment, etc...
> This leads to the infinite loop.
>
> For that reason, address the root cause of the issue: don't blindly
> trust the information provided by the virtio net header. Instead,
> check if the protocol in the packet actually matches the protocol set by
> virtio_net_hdr_set_proto.
>
> Fixes: 9274124f023b ("net: stricter validation of untrusted gso packets")
> Signed-off-by: Balazs Nemeth 
> ---
>  include/linux/virtio_net.h | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
> index e8a924eeea3d..cf2c53563f22 100644
> --- a/include/linux/virtio_net.h
> +++ b/include/linux/virtio_net.h
> @@ -79,8 +79,13 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff 
> *skb,
> if (gso_type && skb->network_header) {
> struct flow_keys_basic keys;
>
> -   if (!skb->protocol)
> +   if (!skb->protocol) {
> +   const struct ethhdr *eth = skb_eth_hdr(skb);
> +

Unfortunately, cannot assume that the device type is ARPHRD_ETHER.

The underlying approach is sound: packets that have a gso type set in
the virtio_net_hdr have to be IP packets.
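A guard based only on that observation might be as small as the below,
illustrative and untested:

	/* gso packets must carry IP; refuse anything else outright */
	if (skb->protocol != htons(ETH_P_IP) &&
	    skb->protocol != htons(ETH_P_IPV6))
		return -EINVAL;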

> virtio_net_hdr_set_proto(skb, hdr);
> +   if (skb->protocol != eth->h_proto)
> +   return -EINVAL;
> +   }
>  retry:
> if (!skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
>   NULL, 0, 0, 0,
> --
> 2.29.2
>


Re: possible stack corruption in icmp_send (__stack_chk_fail)

2021-02-17 Thread Willem de Bruijn
On Wed, Feb 17, 2021 at 6:18 PM Jason A. Donenfeld  wrote:
>
> On 2/18/21, Willem de Bruijn  wrote:
> > On Wed, Feb 17, 2021 at 5:56 PM Jason A. Donenfeld  wrote:
> >>
> >> Hi Willem,
> >>
> >> On Wed, Feb 17, 2021 at 11:27 PM Willem de Bruijn
> >>  wrote:
> >> > A vmlinux image might help. I couldn't find one for this kernel.
> >>
> >> https://data.zx2c4.com/icmp_send-crash-e03b4a42-706a-43bf-bc40-1f15966b3216.tar.xz
> >> has .debs with vmlinuz in there, which you can extract to vmlinux, as
> >> well as my own vmlinux elf construction with the symbols added back in
> >> by extracting them from kallsyms. That's the best I've been able to
> >> do, as all of this is coming from somebody random emailing me.
> >>
> >> > But could it be
> >> > that the forwarded packet is not sensible IPv4? The skb->protocol is
> >> > inferred in wg_packet_consume_data_done->ip_tunnel_parse_protocol.
> >>
> >> The wg calls to icmp_ndo_send are gated by checking skb->protocol:
> >>
> >> if (skb->protocol == htons(ETH_P_IP))
> >>icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH,
> >> 0);
> >>else if (skb->protocol == htons(ETH_P_IPV6))
> >>icmpv6_ndo_send(skb, ICMPV6_DEST_UNREACH,
> >> ICMPV6_ADDR_UNREACH, 0);
> >>
> >> On the other hand, that code is hit on an error path when
> >> wg_check_packet_protocol returns false:
> >>
> >> static inline bool wg_check_packet_protocol(struct sk_buff *skb)
> >> {
> >>__be16 real_protocol = ip_tunnel_parse_protocol(skb);
> >>return real_protocol && skb->protocol == real_protocol;
> >> }
> >>
> >> So that means, at least in theory, icmp_ndo_send could be called with
> >> skb->protocol != ip_tunnel_parse_protocol(skb). I guess I can address
> >> that. But... is it actually a problem?
> >
> > For this forwarded packet that arrived on a wireguard tunnel,
> > skb->protocol was originally also set by ip_tunnel_parse_protocol.
> > So likely not.
> >
> > The other issue seems more like a real bug. wg_xmit calling
> > icmp_ndo_send without clearing IPCB first.
> >
>
> Bingo! Nice eye! I confirmed the crash by just memsetting 0x41 to cb
> before the call. Clearly this should be zeroed by icmp_ndo_send. Will
> send a patch for icmp_ndo_send momentarily and will CC you.

Great, let's hope that's it.

gtp_build_skb_ip4 zeroes before calling. The fix will be most
obviously correct if wg_xmit does the same.

But it is quite likely that the other callers, xfrmi_xmit2 and
sunvnet_start_xmit_common should zero, too. If so, then icmp_ndo_send
is the more robust location to do this. Then the Fixes tag will likely
go quite a bit farther back, too.

Whichever variant of the patch you prefer.
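For reference, the icmp_ndo_send-side variant could be as small as the
sketch below; the actual patch may well differ:

	static inline void icmp_ndo_send(struct sk_buff *skb_in, int type,
					 int code, __be32 info)
	{
		/* skb->cb may still hold qdisc/tunnel state at this
		 * point; zero it so that __ip_options_echo() does not
		 * misread it as a valid inet_skb_parm (IPCB).
		 */
		memset(skb_in->cb, 0, sizeof(skb_in->cb));
		icmp_send(skb_in, type, code, info);
	}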


Re: possible stack corruption in icmp_send (__stack_chk_fail)

2021-02-17 Thread Willem de Bruijn
On Wed, Feb 17, 2021 at 5:56 PM Jason A. Donenfeld  wrote:
>
> Hi Willem,
>
> On Wed, Feb 17, 2021 at 11:27 PM Willem de Bruijn
>  wrote:
> > A vmlinux image might help. I couldn't find one for this kernel.
>
> https://data.zx2c4.com/icmp_send-crash-e03b4a42-706a-43bf-bc40-1f15966b3216.tar.xz
> has .debs with vmlinuz in there, which you can extract to vmlinux, as
> well as my own vmlinux elf construction with the symbols added back in
> by extracting them from kallsyms. That's the best I've been able to
> do, as all of this is coming from somebody random emailing me.
>
> > But could it be
> > that the forwarded packet is not sensible IPv4? The skb->protocol is
> > inferred in wg_packet_consume_data_done->ip_tunnel_parse_protocol.
>
> The wg calls to icmp_ndo_send are gated by checking skb->protocol:
>
> if (skb->protocol == htons(ETH_P_IP))
>icmp_ndo_send(skb, ICMP_DEST_UNREACH, ICMP_HOST_UNREACH, 0);
>else if (skb->protocol == htons(ETH_P_IPV6))
>icmpv6_ndo_send(skb, ICMPV6_DEST_UNREACH,
> ICMPV6_ADDR_UNREACH, 0);
>
> On the other hand, that code is hit on an error path when
> wg_check_packet_protocol returns false:
>
> static inline bool wg_check_packet_protocol(struct sk_buff *skb)
> {
>__be16 real_protocol = ip_tunnel_parse_protocol(skb);
>return real_protocol && skb->protocol == real_protocol;
> }
>
> So that means, at least in theory, icmp_ndo_send could be called with
> skb->protocol != ip_tunnel_parse_protocol(skb). I guess I can address
> that. But... is it actually a problem?

For this forwarded packet that arrived on a wireguard tunnel,
skb->protocol was originally also set by ip_tunnel_parse_protocol.
So likely not.

The other issue seems more like a real bug. wg_xmit calling
icmp_ndo_send without clearing IPCB first.


Re: possible stack corruption in icmp_send (__stack_chk_fail)

2021-02-17 Thread Willem de Bruijn
On Wed, Feb 17, 2021 at 1:12 PM Jason A. Donenfeld  wrote:
>
> Hi Netdev & Willem,
>
> I've received a report of stack corruption -- via the stack protector
> check -- in icmp_send. I was sent a vmcore, and was able to extract
> the OOPS from there. However, I've been unable to produce the bug and
> I don't see where it'd be in the code. That might point to a more
> sinister problem, or I'm simply just not seeing it. Apparently the
> reporter reproduces it every 40 or so minutes, and has seen it happen
> since at least ~5.10. Willem - I'm emailing you because it seems like
> you were making a lot of changes to the icmp code around then, and
> perhaps you have an intuition. For example, some of the error handling
> code takes a pointer to a stack buffer (&_objh and such), and maybe
> that's problematic? I'm not quite sure. The vmcore, along with the
> various kernel binaries I hunted down are here:
> https://data.zx2c4.com/icmp_send-crash-e03b4a42-706a-43bf-bc40-1f15966b3216.tar.xz
> . The extracted dmesg follows below, in case you or anyone has a
> pointer. I've been staring at this for a while and don't see it.
>
> Jason

Sorry, I also don't immediately see a cause.

The &_objh is a fairly standard approach to accessing skb data with
skb_header_pointer. More importantly, that codepath is in the icmp
receive path and then guarded by a socket option
(inet_sk(sk)->recverr_rfc4884), so unlikely to be exercised here.
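The pattern in question is the usual skb_header_pointer one. A generic
sketch, with off standing in for the extension offset:

	struct icmp_extobj_hdr objh, *objp;

	/* copies into the on-stack buffer only if the skb data
	 * is not linear at that offset; never writes past objh
	 */
	objp = skb_header_pointer(skb, off, sizeof(objh), &objh);
	if (!objp)
		return;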

This is an icmp send in response to a forwarded packet (assuming
__qdisc_run dequeued the packet that triggered it). The icmp send code
is quite robust against, e.g., undersized packets. But could it be
that the forwarded packet is not sensible IPv4? The skb->protocol is
inferred in wg_packet_consume_data_done->ip_tunnel_parse_protocol.

As for on-stack variable overflow, __ip_options_echo parses the
(untrusted) input to write into stack allocated icmp_param. But that
is fairly well tested, rarely touched code by now. Perhaps relevant,
though: the opt passed is in skb->cb[], which at that point should
probably not be interpreted as inet_skb_parm (IPCB).

   static inline void icmp_send(struct sk_buff *skb_in, int type,
                                int code, __be32 info)
   {
        __icmp_send(skb_in, type, code, info, &IPCB(skb_in)->opt);
   }


A vmlinux image might help. I couldn't find one for this kernel.

Or if the kernel can be modified and this path is rarely taken,
logging the packet, e.g., with skb_dump.
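For instance, something like the below just before the icmp_send call;
illustrative only, and skb_dump is available since v5.3:

	/* one-off instrumentation: dump the triggering packet */
	skb_dump(KERN_WARNING, skb_in, true);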


Re: [PATCH/v2] bpf: add bpf_skb_adjust_room flag BPF_F_ADJ_ROOM_ENCAP_L2_ETH

2021-02-10 Thread Willem de Bruijn
On Wed, Feb 10, 2021 at 1:59 AM huangxuesen  wrote:
>
> From: huangxuesen 
>
> bpf_skb_adjust_room sets the inner_protocol as skb->protocol for packet
> encapsulation. But that is not appropriate when pushing an Ethernet header.
>
> Add an option to further specify encap L2 type and set the inner_protocol
> as ETH_P_TEB.
>
> Suggested-by: Willem de Bruijn 
> Signed-off-by: huangxuesen 
> Signed-off-by: chengzhiyong 
> Signed-off-by: wangli 

Thanks, this is exactly what I meant.

Acked-by: Willem de Bruijn 

One small point regarding Signed-off-by: It is customary to capitalize
family and given names.


Re: [PATCH] bpf: in bpf_skb_adjust_room correct inner protocol for vxlan

2021-02-09 Thread Willem de Bruijn
On Tue, Feb 9, 2021 at 5:41 AM 黄学森  wrote:
>
> Appreciate for your reply Willem!
>
> The original intention of this commit is that when we use bpf_skb_adjust_room
> to encapsulate Vxlan packets, we find some powerful device features disabled.
>
> Setting the inner_protocol directly as skb->protocol is the root cause.
>
> I understand that it’s not easy to handle all tunnel protocol in one bpf 
> helper function. But for my
> immature idea, when pushing Ethernet header, setting the inner_protocol as 
> ETH_P_TEB may
> be better.
>
> Now the flag BPF_F_ADJ_ROOM_ENCAP_L4_UDP covers many udp tunnel types (e.g.
> udp+mpls, geneve, vxlan). Adding an independent flag to represent Vxlan
> looks a little redundant. What's your suggestion?

Agreed. I don't mean to add a vxlan specific flag.

Instead, a way to identify that the encapsulation includes a mac
header. To a certain extent, that already exists as of commit
58dfc900faff ("bpf: add layer 2 encap support to
bpf_skb_adjust_room"). That computes an inner_maclen. It makes sense
that inner_protocol needs to be updated if inner_maclen indicates a
mac header.

I would just not infer it based on some imprecise measure, such as
inner_maclen being 14. But add a new explicit flag
BPF_F_ADJ_ROOM_ENCAP_L2_ETH. Update inner protocol if the flag is
passed and inner_maclen >= ETH_HLEN. Fail the operation if the flag is
passed and inner_maclen is too short.
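In bpf_skb_net_grow that would amount to roughly the following sketch:

	if (flags & BPF_F_ADJ_ROOM_ENCAP_L2_ETH) {
		/* caller promises the pushed L2 header is Ethernet */
		if (inner_mac_len < ETH_HLEN)
			return -EINVAL;
		skb_set_inner_protocol(skb, htons(ETH_P_TEB));
	} else {
		skb_set_inner_protocol(skb, skb->protocol);
	}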

> Thanks again for your reply!
>
>
>
> > On Feb 8, 2021, at 9:06 PM, Willem de Bruijn  wrote:
> >
> > On Mon, Feb 8, 2021 at 7:16 AM huangxuesen  wrote:
> >>
> >> From: huangxuesen 
> >>
> >> When pushing vxlan tunnel header, set inner protocol as ETH_P_TEB in skb
> >> to avoid HW device disabling udp tunnel segmentation offload, just like
> >> vxlan_build_skb does.
> >>
> >> Drivers for NIC may invoke vxlan_features_check to check the
> >> inner_protocol in skb for vxlan packets to decide whether to disable
> >> NETIF_F_GSO_MASK. Currently it sets inner_protocol as the original
> >> skb->protocol, that will make mlx5_core disable TSO and lead to huge
> >> performance degradation.
> >>
> >> Signed-off-by: huangxuesen 
> >> Signed-off-by: chengzhiyong 
> >> Signed-off-by: wangli 
> >> ---
> >> net/core/filter.c | 7 ++-
> >> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/filter.c b/net/core/filter.c
> >> index 255aeee72402..f8d3ba3fe10f 100644
> >> --- a/net/core/filter.c
> >> +++ b/net/core/filter.c
> >> @@ -3466,7 +3466,12 @@ static int bpf_skb_net_grow(struct sk_buff *skb, 
> >> u32 off, u32 len_diff,
> >>skb->inner_mac_header = inner_net - inner_mac_len;
> >>skb->inner_network_header = inner_net;
> >>skb->inner_transport_header = inner_trans;
> >> -   skb_set_inner_protocol(skb, skb->protocol);
> >> +
> >> +   if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP &&
> >> +   inner_mac_len == ETH_HLEN)
> >> +   skb_set_inner_protocol(skb, htons(ETH_P_TEB));
> >
> > This may be used by vxlan, but it does not imply it.
> >
> > Adding ETH_HLEN bytes likely means pushing an Ethernet header, but same 
> > point.
> >
> > Conversely, pushing an Ethernet header is not limited to UDP encap.
> >
> > This probably needs a new explicit BPF_F_ADJ_ROOM_.. flag, rather than
> > trying to infer from imprecise heuristics.
>


Re: [PATCH] bpf: in bpf_skb_adjust_room correct inner protocol for vxlan

2021-02-08 Thread Willem de Bruijn
On Mon, Feb 8, 2021 at 7:16 AM huangxuesen  wrote:
>
> From: huangxuesen 
>
> When pushing vxlan tunnel header, set inner protocol as ETH_P_TEB in skb
> to avoid HW device disabling udp tunnel segmentation offload, just like
> vxlan_build_skb does.
>
> Drivers for NIC may invoke vxlan_features_check to check the
> inner_protocol in skb for vxlan packets to decide whether to disable
> NETIF_F_GSO_MASK. Currently it sets inner_protocol as the original
> skb->protocol, that will make mlx5_core disable TSO and lead to huge
> performance degradation.
>
> Signed-off-by: huangxuesen 
> Signed-off-by: chengzhiyong 
> Signed-off-by: wangli 
> ---
>  net/core/filter.c | 7 ++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 255aeee72402..f8d3ba3fe10f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3466,7 +3466,12 @@ static int bpf_skb_net_grow(struct sk_buff *skb, u32 
> off, u32 len_diff,
> skb->inner_mac_header = inner_net - inner_mac_len;
> skb->inner_network_header = inner_net;
> skb->inner_transport_header = inner_trans;
> -   skb_set_inner_protocol(skb, skb->protocol);
> +
> +   if (flags & BPF_F_ADJ_ROOM_ENCAP_L4_UDP &&
> +   inner_mac_len == ETH_HLEN)
> +   skb_set_inner_protocol(skb, htons(ETH_P_TEB));

This may be used by vxlan, but it does not imply it.

Adding ETH_HLEN bytes likely means pushing an Ethernet header, but same point.

Conversely, pushing an Ethernet header is not limited to UDP encap.

This probably needs a new explicit BPF_F_ADJ_ROOM_.. flag, rather than
trying to infer from imprecise heuristics.


Re: [PATCH net-next] net/packet: Improve the comment about LL header visibility criteria

2021-02-05 Thread Willem de Bruijn
On Fri, Feb 5, 2021 at 5:42 PM Xie He  wrote:
>
> The "dev_has_header" function, recently added in
> commit d549699048b4 ("net/packet: fix packet receive on L3 devices
> without visible hard header"),
> is more accurate as criteria for determining whether a device exposes
> the LL header to upper layers, because in addition to dev->header_ops,
> it also checks for dev->header_ops->create.
>
> When transmitting an skb on a device, dev_hard_header can be called to
> generate an LL header. dev_hard_header will only generate a header if
> dev->header_ops->create is present.
>
> Signed-off-by: Xie He 

Acked-by: Willem de Bruijn 

Indeed, existence of dev->header_ops->create is the deciding factor. Thanks Xie.
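For reference, the helper added by that commit reads simply:

	static inline bool dev_has_header(const struct net_device *dev)
	{
		return dev->header_ops && dev->header_ops->create;
	}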


Re: [PATCH net-next v2 0/7] net: ipa: don't disable NAPI in suspend

2021-02-01 Thread Willem de Bruijn
On Mon, Feb 1, 2021 at 12:28 PM Alex Elder  wrote:
>
> This is version 2 of a series that reworks the order in which things
> happen during channel stop and suspend (and start and resume), in
> order to address a hang that has been observed during suspend.
> The introductory message on the first version of the series gave
> some history which is omitted here.
>
> The end result of this series is that we only enable NAPI and the
> I/O completion interrupt on a channel when we start the channel for
> the first time.  And we only disable them when stopping the channel
> "for good."  In other words, NAPI and the completion interrupt
> remain enabled while a channel is stopped for suspend.
>
> One comment on version 1 of the series suggested *not* returning
> early on success in a function, instead having both success and
> error paths return from the same point at the end of the function
> block.  This has been addressed in this version.
>
> In addition, this version consolidates things a little bit, but the
> net result of the series is exactly the same as version 1 (with the
> exception of the return fix mentioned above).
>
> First, patch 6 in the first version was a small step to make patch 7
> easier to understand.  The two have been combined now.
>
> Second, previous version moved (and for suspend/resume, eliminated)
> I/O completion interrupt and NAPI disable/enable control in separate
> steps (patches).  Now both are moved around together in patch 5 and
> 6, which eliminates the need for the final (NAPI-only) patch.
>
> I won't repeat the patch summaries provided in v1:
>   https://lore.kernel.org/netdev/20210129202019.2099259-1-el...@linaro.org/
>
> Many thanks to Willem de Bruijn for his thoughtful input.
>
> -Alex
>
> Alex Elder (7):
>   net: ipa: don't thaw channel if error starting
>   net: ipa: introduce gsi_channel_stop_retry()
>   net: ipa: introduce __gsi_channel_start()
>   net: ipa: kill gsi_channel_freeze() and gsi_channel_thaw()
>   net: ipa: disable interrupt and NAPI after channel stop
>   net: ipa: don't disable interrupt on suspend
>   net: ipa: expand last transaction check
>
>  drivers/net/ipa/gsi.c | 138 ++
>  1 file changed, 85 insertions(+), 53 deletions(-)

Acked-by: Willem de Bruijn 


Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

2021-02-01 Thread Willem de Bruijn
On Mon, Feb 1, 2021 at 9:35 AM Alex Elder  wrote:
>
> On 1/31/21 7:36 PM, Willem de Bruijn wrote:
> > On Sun, Jan 31, 2021 at 10:32 AM Alex Elder  wrote:
> >>
> >> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> >>> On Sat, Jan 30, 2021 at 11:29 PM Alex Elder  wrote:
> >>>>
> >>>> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> >>>>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder  wrote:
> >>>>>>
> >>>>>> The channel stop and suspend paths both call __gsi_channel_stop(),
> >>>>>> which quiesces channel activity, disables NAPI, and (on other than
> >>>>>> SDM845) stops the channel.  Similarly, the start and resume paths
> >>>>>> share __gsi_channel_start(), which starts the channel and re-enables
> >>>>>> NAPI again.
> >>>>>>
> >>>>>> Disabling NAPI should be done when stopping a channel, but this
> >>>>>> should *not* be done when suspending.  It's not necessary in the
> >>>>>> suspend path anyway, because the stopped channel (or suspended
> >>>>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >>>>>> and gsi_channel_trans_quiesce() won't return until there are no
> >>>>>> more transactions to process in the NAPI polling loop.
> >>>>>
> >>>>> But why is it incorrect to do so?
> >>>>
> >>>> Maybe it's not; I also thought it was fine before, but...
>
> . . .
>
> >> The "hang" occurs on an RX endpoint, and in particular it
> >> occurs on an endpoint that we *know* will be receiving a
> >> packet as part of the suspend process (when clearing the
> >> hardware pipeline).  I can go into that further but won't'
> >> unless asked.
> >>
> >>>> A stopped channel won't interrupt,
> >>>> so we don't bother disabling the completion interrupt,
> >>>> with no interrupts, NAPI won't be scheduled, so there's
> >>>> no need to disable NAPI either.
> >>>
> >>> That sounds plausible. But it doesn't explain why napi_disable "should
> >>> *not* be done when suspending" as the commit states.
> >>>
> >>> Arguably, leaving that won't have much effect either way, and is in
> >>> line with other drivers.
> >>
> >> Understood and agreed.  In fact, if the hang occurrs in
> >> napi_disable() when waiting for NAPI_STATE_SCHED to clear,
> >> it would occur in napi_synchronize() as well.
> >
> > Agreed.
> >
> > Since you have an environment to test a patch in, it might be worthwhile
> > to test essentially the same logic reordering as in this patch set,
> > but while still disabling napi.
>
> What is the purpose of this test?  Just to guarantee
> that the NAPI hang goes away?  Because you agree that
> the napi_schedule() call would *also* hang if that
> problem exists, right?
>
> Anyway, what you're suggesting is to simply test with
> this last patch removed.  I can do that but I really
> don't expect it to change anything.  I will start that
> test later today when I'm turning my attention to
> something else for a while.
>
> > The disappearing race may be due to another change rather than
> > napi_disable vs napi_synchronize. A smaller, more targeted patch could
> > also be a net (instead of net-next) candidate.
>
> I am certain it is.
>
> I can tell you that we have seen a hang (after I think 2500+
> suspend/resume cycles) with the IPA code that is currently
> upstream.
>
> But with this latest series of 9, there is no hang after
> 10,000+ cycles.  That gives me a bisect window, but I really
> don't want to go through a full bisect of even those 9,
> because it's 4 tests, each of which takes days to complete.
>
> Looking at the 9 patches, I think this one is the most
> likely culprit:
>net: ipa: disable IEOB interrupt after channel stop
>
> I think the race involves the I/O completion handler
> interacting with NAPI in an unwanted way, but I have
> not come up with the exact sequence that would lead
> to getting stuck in napi_disable().
>
> Here are some possible events that could occur on an
> RX channel in *some* order, prior to that patch.  And
> in the order I show there's at least a problem of a
> receive not being processed immediately.
>
> . . . (suspend initiated)
>
> replenish_stop()
> quiesce()
> IRQ fires (r

Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

2021-01-31 Thread Willem de Bruijn
On Sun, Jan 31, 2021 at 10:32 AM Alex Elder  wrote:
>
> On 1/31/21 8:52 AM, Willem de Bruijn wrote:
> > On Sat, Jan 30, 2021 at 11:29 PM Alex Elder  wrote:
> >>
> >> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> >>> On Fri, Jan 29, 2021 at 3:29 PM Alex Elder  wrote:
> >>>>
> >>>> The channel stop and suspend paths both call __gsi_channel_stop(),
> >>>> which quiesces channel activity, disables NAPI, and (on other than
> >>>> SDM845) stops the channel.  Similarly, the start and resume paths
> >>>> share __gsi_channel_start(), which starts the channel and re-enables
> >>>> NAPI again.
> >>>>
> >>>> Disabling NAPI should be done when stopping a channel, but this
> >>>> should *not* be done when suspending.  It's not necessary in the
> >>>> suspend path anyway, because the stopped channel (or suspended
> >>>> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >>>> and gsi_channel_trans_quiesce() won't return until there are no
> >>>> more transactions to process in the NAPI polling loop.
> >>>
> >>> But why is it incorrect to do so?
> >>
> >> Maybe it's not; I also thought it was fine before, but...
> >>
> >> Someone at Qualcomm asked me why I thought NAPI needed
> >> to be disabled on suspend.  My response was basically
> >> that it was a lightweight operation, and it shouldn't
> >> really be a problem to do so.
> >>
> >> Then, when I posted two patches last month, Jakub's
> >> response told me he didn't understand why I was doing
> >> what I was doing, and I stepped back to reconsider
> >> the details of what was happening at suspend time.
> >>
> >> https://lore.kernel.org/netdev/20210107183803.47308...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
> >>
> >> Four things were happening to suspend a channel:
> >> quiesce activity; disable interrupt; disable NAPI;
> >> and stop the channel.  It occurred to me that a
> >> stopped channel would not generate interrupts, so if
> >> the channel was stopped earlier there would be no need
> >> to disable the interrupt.  Similarly there would be
> >> (essentially) no need to disable NAPI once a channel
> >> was stopped.
> >>
> >> Underlying all of this is that I started chasing a
> >> hang that was occurring on suspend over a month ago.
> >> It was hard to reproduce (hundreds or thousands of
> >> suspend/resume cycles without hitting it), and one
> >> of the few times I actually hit the problem it was
> >> stuck in napi_disable(), apparently waiting for
> >> NAPI_STATE_SCHED to get cleared by napi_complete().
> >
> > This is important information.
> >
> > What exactly do you mean by hang?
>
> Yes it's important!  Unfortunately I was not able to
> gather details about the problem in all the cases where
> it occurred.  But in at least one case I *did* confirm
> it was in the situation described above.
>
> What I mean by "hang" is that the system simply stopped
> on its way down, and the IPA ->suspend callback never
> completed (stuck in napi_disable).  So I expect that
> the SCHED flag was never going to get cleared (because
> of a race, presumably).
>
> >> My best guess about how this could occur was if there
> >> were a race of some kind between the interrupt handler
> >> (scheduling NAPI) and the poll function (completing
> >> it).  I found a number of problems while looking
> >> at this, and in the past few weeks I've posted some
> >> fixes to improve things.  Still, even with some of
> >> these fixes in place we have seen a hang (but now
> >> even more rarely).
> >>
> >> So this grand rework of suspending/stopping channels
> >> is an attempt to resolve this hang on suspend.
> >
> > Do you have any data that this patchset resolves the issue, or is it
> > too hard to reproduce to say anything?
>
> The data I have is that I have been running for weeks
> with tens of thousands of iterations with this patch
> (and the rest of them) without any hang.  Unfortunately
> that doesn't guarantee anything.  I contemplated trying
> to "catch" the problem and report that it *would have*
> occurred had the fix not been in place, but I haven't
> tried that (in part because it might not be easy).
>
> So...  Too hard to reproduce, but I have evidence that
> my testing so far has never reproduced the hang.
>

Re: [Patch v3 net-next 0/7] ethtool support for fec and link configuration

2021-01-31 Thread Willem de Bruijn
On Sun, Jan 31, 2021 at 8:11 AM Hariprasad Kelam  wrote:
>
> This series of patches adds support for forward error correction (fec) and
> physical link configuration. Patches 1 & 2 add the necessary mbox handlers for the
> fec mode configuration request and for fetching stats. Patch 3 registers driver
> callbacks for fec mode configuration and display. Patches 4 & 5 add mbox
> handler support for configuring link parameters like speed/duplex and autoneg etc.
> Patches 6 & 7 register driver callbacks for physical link configuration.
>
> Change-log:
> v2:
> - Fixed review comments
> - Corrected indentation issues
> - Return -ENOMEM in case of mbox allocation failure
> - added validation for input fecparams bitmask values
> - added more comments
>
> V3:
> - Removed inline functions
> - Make use of ethtool helpers APIs to display supported
>   advertised modes
> - corrected indentation issues
> - code changes such that return early in case of failure
>   to aid branch prediction

This addresses my comments to the previous patch series, thanks.

It seems that patchwork unfortunately picked up only patch 6/7:
https://patchwork.kernel.org/project/netdevbpf/list/?series=424969

>
> Christina Jacob (6):
>   octeontx2-af: forward error correction configuration
>   octeontx2-pf: ethtool fec mode support
>   octeontx2-af: Physical link configuration support
>   octeontx2-af: advertised link modes support on cgx
>   octeontx2-pf: ethtool physical link status
>   octeontx2-pf: ethtool physical link configuration
>
> Felix Manlunas (1):
>   octeontx2-af: Add new CGX_CMD to get PHY FEC statistics
>
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 258 -
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  10 +
>  .../net/ethernet/marvell/octeontx2/af/cgx_fw_if.h  |  70 +++-
>  drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  87 -
>  drivers/net/ethernet/marvell/octeontx2/af/rvu.h|   4 +
>  .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c|  80 +
>  .../ethernet/marvell/octeontx2/nic/otx2_common.c   |  20 ++
>  .../ethernet/marvell/octeontx2/nic/otx2_common.h   |   6 +
>  .../ethernet/marvell/octeontx2/nic/otx2_ethtool.c  | 399 -
>  .../net/ethernet/marvell/octeontx2/nic/otx2_pf.c   |   3 +
>  10 files changed, 930 insertions(+), 7 deletions(-)
>
> --
> 2.7.4


Re: [net-next 00/14] Add Marvell CN10K support

2021-01-31 Thread Willem de Bruijn
On Sat, Jan 30, 2021 at 12:04 PM Geetha sowjanya  wrote:
>
> The current admin function (AF) driver and the netdev driver support
> OcteonTx2 silicon variants. The same OcteonTx2 Resource Virtualization Unit (RVU)
> is carried forward to the next-gen silicon, i.e. OcteonTx3, with some changes
> and feature enhancements.
>
> This patch set adds support for OcteonTx3 (CN10K) silicon and gets the drivers
> to the same level as OcteonTx2. No new OcteonTx3 specific features are added.
> Changes cover below HW level differences
> - PCIe BAR address changes wrt shared mailbox memory region
> - Receive buffer freeing to HW
> - Transmit packet's descriptor submission to HW
> - Programmable HW interface identifiers (channels)
> - Increased MTU support
> - A Serdes MAC block (RPM) configuration
>
> Geetha sowjanya (6):
>   octeontx2-af: cn10k: Update NIX/NPA context structure
>   octeontx2-af: cn10k: Update NIX and NPA context in debugfs
>   octeontx2-pf: cn10k: Initialise NIX context
>   octeontx2-pf: cn10k: Map LMTST region
>   octeontx2-pf: cn10k: Use LMTST lines for NPA/NIX operations
>
> Hariprasad Kelam (5):
>   octeontx2-af: cn10k: Add RPM MAC support
>   octeontx2-af: cn10K: Add MTU configuration
>   octeontx2-pf: cn10k: Get max mtu supported from admin function
>   octeontx2-af: cn10k: Add RPM Rx/Tx stats support
>   octeontx2-af: cn10k: MAC internal loopback support
>
> Rakesh Babu (1):
>   octeontx2-af: cn10k: Add RPM LMAC pause frame support
>
> Subbaraya Sundeep (2):
>   octeontx2-af: cn10k: Add mbox support for CN10K platform
>   octeontx2-pf: cn10k: Add mbox support for CN10K
>   octeontx2-af: cn10k: Add support for programmable channels
>
>  drivers/net/ethernet/marvell/octeontx2/af/Makefile |   2 +-
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 315 ---
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  15 +-
>  .../net/ethernet/marvell/octeontx2/af/cgx_fw_if.h  |   1 +
>  drivers/net/ethernet/marvell/octeontx2/af/common.h |   5 +
>  .../ethernet/marvell/octeontx2/af/lmac_common.h| 129 +
>  drivers/net/ethernet/marvell/octeontx2/af/mbox.c   |  59 +-
>  drivers/net/ethernet/marvell/octeontx2/af/mbox.h   |  70 ++-
>  drivers/net/ethernet/marvell/octeontx2/af/rpm.c| 272 ++
>  drivers/net/ethernet/marvell/octeontx2/af/rpm.h|  57 ++
>  drivers/net/ethernet/marvell/octeontx2/af/rvu.c| 157 +-
>  drivers/net/ethernet/marvell/octeontx2/af/rvu.h|  70 +++
>  .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 135 -
>  .../net/ethernet/marvell/octeontx2/af/rvu_cn10k.c  | 261 +
>  .../ethernet/marvell/octeontx2/af/rvu_debugfs.c| 339 +++-
>  .../net/ethernet/marvell/octeontx2/af/rvu_nix.c| 112 +++-
>  .../net/ethernet/marvell/octeontx2/af/rvu_npc.c|   4 +-
>  .../net/ethernet/marvell/octeontx2/af/rvu_reg.h|  24 +
>  .../net/ethernet/marvell/octeontx2/af/rvu_struct.h | 604 ++---
>  .../net/ethernet/marvell/octeontx2/nic/Makefile|   2 +-
>  drivers/net/ethernet/marvell/octeontx2/nic/cn10k.c | 182 +++
>  drivers/net/ethernet/marvell/octeontx2/nic/cn10k.h |  17 +
>  .../ethernet/marvell/octeontx2/nic/otx2_common.c   | 144 +++--
>  .../ethernet/marvell/octeontx2/nic/otx2_common.h   | 105 +++-
>  .../net/ethernet/marvell/octeontx2/nic/otx2_pf.c   |  67 ++-
>  .../net/ethernet/marvell/octeontx2/nic/otx2_reg.h  |   4 +
>  .../ethernet/marvell/octeontx2/nic/otx2_struct.h   |  10 +-
>  .../net/ethernet/marvell/octeontx2/nic/otx2_txrx.c |  70 ++-
>  .../net/ethernet/marvell/octeontx2/nic/otx2_txrx.h |   8 +-
>  .../net/ethernet/marvell/octeontx2/nic/otx2_vf.c   |  52 +-
>  include/linux/soc/marvell/octeontx2/asm.h  |   8 +
>  31 files changed, 2573 insertions(+), 727 deletions(-)
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/lmac_common.h
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rpm.c
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rpm.h
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/af/rvu_cn10k.c
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/nic/cn10k.c
>  create mode 100644 drivers/net/ethernet/marvell/octeontx2/nic/cn10k.h
>

FYI, patchwork shows a number of checkpatch and build warnings to fix up

https://patchwork.kernel.org/project/netdevbpf/list/?series=424847


Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

2021-01-31 Thread Willem de Bruijn
On Sat, Jan 30, 2021 at 11:29 PM Alex Elder  wrote:
>
> On 1/30/21 9:25 AM, Willem de Bruijn wrote:
> > On Fri, Jan 29, 2021 at 3:29 PM Alex Elder  wrote:
> >>
> >> The channel stop and suspend paths both call __gsi_channel_stop(),
> >> which quiesces channel activity, disables NAPI, and (on other than
> >> SDM845) stops the channel.  Similarly, the start and resume paths
> >> share __gsi_channel_start(), which starts the channel and re-enables
> >> NAPI again.
> >>
> >> Disabling NAPI should be done when stopping a channel, but this
> >> should *not* be done when suspending.  It's not necessary in the
> >> suspend path anyway, because the stopped channel (or suspended
> >> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> >> and gsi_channel_trans_quiesce() won't return until there are no
> >> more transactions to process in the NAPI polling loop.
> >
> > But why is it incorrect to do so?
>
> Maybe it's not; I also thought it was fine before, but...
>
> Someone at Qualcomm asked me why I thought NAPI needed
> to be disabled on suspend.  My response was basically
> that it was a lightweight operation, and it shouldn't
> really be a problem to do so.
>
> Then, when I posted two patches last month, Jakub's
> response told me he didn't understand why I was doing
> what I was doing, and I stepped back to reconsider
> the details of what was happening at suspend time.
>
> https://lore.kernel.org/netdev/20210107183803.47308...@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com/
>
> Four things were happening to suspend a channel:
> quiesce activity; disable interrupt; disable NAPI;
> and stop the channel.  It occurred to me that a
> stopped channel would not generate interrupts, so if
> the channel was stopped earlier there would be no need
> to disable the interrupt.  Similarly there would be
> (essentially) no need to disable NAPI once a channel
> was stopped.
>
> Underlying all of this is that I started chasing a
> hang that was occurring on suspend over a month ago.
> It was hard to reproduce (hundreds or thousands of
> suspend/resume cycles without hitting it), and one
> of the few times I actually hit the problem it was
> stuck in napi_disable(), apparently waiting for
> NAPI_STATE_SCHED to get cleared by napi_complete().

This is important information.

What exactly do you mean by hang?

>
> My best guess about how this could occur was if there
> were a race of some kind between the interrupt handler
> (scheduling NAPI) and the poll function (completing
> it).  I found a number of problems while looking
> at this, and in the past few weeks I've posted some
> fixes to improve things.  Still, even with some of
> these fixes in place we have seen a hang (but now
> even more rarely).
>
> So this grand rework of suspending/stopping channels
> is an attempt to resolve this hang on suspend.

Do you have any data that this patchset resolves the issue, or is it
too hard to reproduce to say anything?

> The channel is now stopped early, and once stopped,
> everything that completed prior to the channel being
> stopped is polled before considering the suspend
> function done.

Does the call to gsi_channel_trans_quiesce before
gsi_channel_stop_retry leave a race where new transactions may occur
until state GSI_CHANNEL_STATE_STOPPED is reached? Asking without truly
knowing the details of this device.

> A stopped channel won't interrupt,
> so we don't bother disabling the completion interrupt,
> with no interrupts, NAPI won't be scheduled, so there's
> no need to disable NAPI either.

That sounds plausible. But it doesn't explain why napi_disable "should
*not* be done when suspending" as the commit states.

Arguably, leaving that won't have much effect either way, and is in
line with other drivers.

Your previous patchset mentions "When stopping a channel, the IPA
driver currently disables NAPI before disabling the interrupt." That
would no longer be the case.

> The net result is simpler, and seems logical, and
> should preclude any possible race between the interrupt
> handler and poll function.  I'm trying to solve the
> hang problem analytically, because it takes *so* long
> to reproduce.
>
> I'm open to other suggestions.
>
> -Alex
>
> >  From a quick look, virtio-net disables on both remove and freeze, for 
> > instance.
> >
> >> Instead, enable NAPI in gsi_channel_start(), when the completion
> >> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> >> when finally disabling the interrupt.
> >>
> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> NAPI polling is done before moving on.

Re: [PATCH net-next 9/9] net: ipa: don't disable NAPI in suspend

2021-01-30 Thread Willem de Bruijn
On Fri, Jan 29, 2021 at 3:29 PM Alex Elder  wrote:
>
> The channel stop and suspend paths both call __gsi_channel_stop(),
> which quiesces channel activity, disables NAPI, and (on other than
> SDM845) stops the channel.  Similarly, the start and resume paths
> share __gsi_channel_start(), which starts the channel and re-enables
> NAPI again.
>
> Disabling NAPI should be done when stopping a channel, but this
> should *not* be done when suspending.  It's not necessary in the
> suspend path anyway, because the stopped channel (or suspended
> endpoint on SDM845) will not cause interrupts to schedule NAPI,
> and gsi_channel_trans_quiesce() won't return until there are no
> more transactions to process in the NAPI polling loop.

But why is it incorrect to do so?

From a quick look, virtio-net disables on both remove and freeze, for instance.

> Instead, enable NAPI in gsi_channel_start(), when the completion
> interrupt is first enabled.  Disable it again in gsi_channel_stop(),
> when finally disabling the interrupt.
>
> Add a call to napi_synchronize() to __gsi_channel_stop(), to ensure
> NAPI polling is done before moving on.
>
> Signed-off-by: Alex Elder 
> ---
> @@ -894,12 +894,16 @@ int gsi_channel_start(struct gsi *gsi, u32 channel_id)
> struct gsi_channel *channel = &gsi->channel[channel_id];
> int ret;
>
> -   /* Enable the completion interrupt */
> +   /* Enable NAPI and the completion interrupt */
> +   napi_enable(&channel->napi);
> gsi_irq_ieob_enable_one(gsi, channel->evt_ring_id);
>
> ret = __gsi_channel_start(channel, true);
> -   if (ret)
> -   gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> +   if (!ret)
> +   return 0;
> +
> +   gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
> +   napi_disable(&channel->napi);
>
> return ret;
>  }

subjective, but easier to parse when the normal control flow is linear
and the error path takes a branch (or goto, if reused).
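That is, something like:

	ret = __gsi_channel_start(channel, true);
	if (ret) {
		/* error path branches; success falls straight through */
		gsi_irq_ieob_disable_one(gsi, channel->evt_ring_id);
		napi_disable(&channel->napi);
	}

	return ret;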


Re: [Patch v2 net-next 2/7] octeontx2-af: Add new CGX_CMD to get PHY FEC statistics

2021-01-30 Thread Willem de Bruijn
On Sat, Jan 30, 2021 at 4:53 AM Hariprasad Kelam  wrote:
>
> Hi Willem,
>
> > -Original Message-
> > From: Willem de Bruijn 
> > Sent: Thursday, January 28, 2021 1:50 AM
> > To: Hariprasad Kelam 
> > Cc: Network Development ; LKML  > ker...@vger.kernel.org>; David Miller ; Jakub
> > Kicinski ; Sunil Kovvuri Goutham
> > ; Linu Cherian ;
> > Geethasowjanya Akula ; Jerin Jacob Kollanukkaran
> > ; Subbaraya Sundeep Bhatta ;
> > Felix Manlunas ; Christina Jacob
> > ; Sunil Kovvuri Goutham
> > 
> > Subject: [EXT] Re: [Patch v2 net-next 2/7] octeontx2-af: Add new CGX_CMD
> > to get PHY FEC statistics
> >
> > On Wed, Jan 27, 2021 at 4:04 AM Hariprasad Kelam 
> > wrote:
> > >
> > > From: Felix Manlunas 
> > >
> > > This patch adds support to fetch fec stats from PHY. The stats are put
> > > in the shared data struct fwdata.  A PHY driver indicates that it has
> > > FEC stats by setting the flag fwdata.phy.misc.has_fec_stats
> > >
> > > Besides CGX_CMD_GET_PHY_FEC_STATS, also add CGX_CMD_PRBS and
> > > CGX_CMD_DISPLAY_EYE to enum cgx_cmd_id so that Linux's enum list is in
> > > sync with firmware's enum list.
> > >
> > > Signed-off-by: Felix Manlunas 
> > > Signed-off-by: Christina Jacob 
> > > Signed-off-by: Sunil Kovvuri Goutham 
> > > Signed-off-by: Hariprasad Kelam 
> >
> >
> > > +struct phy_s {
> > > +   struct {
> > > +   u64 can_change_mod_type : 1;
> > > +   u64 mod_type: 1;
> > > +   u64 has_fec_stats   : 1;
> >
> > this style is not customary
>
> > These structures are shared with firmware and stored in shared memory. Any
> > change in the size of the structures will break compatibility. To avoid frequent
> > compatibility issues between new and old firmware we have left room wherever we
> > see that more fields could be added in the future.
> > So changing this to u8 can have an impact in the future.

My comment was intended much simpler: don't add whitespace between the
bit-field variable name and its size expression.

  u64 mod_type:1;

not

  u64 mod_type : 1;

At least, I have not seen that style anywhere else in the kernel.


Re: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-01-29 Thread Willem de Bruijn
On Fri, Jan 29, 2021 at 6:03 PM Sven Van Asbroeck  wrote:
>
> Hoi Willem, thanks a lot for reviewing this patch, much appreciated !!
>
> On Fri, Jan 29, 2021 at 5:11 PM Willem de Bruijn
>  wrote:
> >
> > > +static struct sk_buff *
> > > +lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
> > > +{
> > > +   if (skb_linearize(skb)) {
> >
> > Is this needed? That will be quite expensive
>
> The skb will only be non-linear when it's created from a multi-buffer frame.
> Multi-buffer frames are only generated right after a mtu change - fewer than
> 32 frames will be non-linear after an mtu increase. So as long as people don't
> change the mtu in a tight loop, skb_linearize is just a single comparison,
> 99.99+% of the time.

Ah. I had missed the temporary state of this until the buffers are
reinitialized. Yes, then there is no reason to worry. Same for the
frag_list vs frags comment I made.

> >
> > Is it possible to avoid the large indentation change, or else do that
> > in a separate patch? It makes it harder to follow the functional
> > change.
>
> It's not immediately obvious, but I have replaced the whole function
> with slightly different logic, and the replacement content has a much
> flatter indentation structure, and should be easier to follow.
>
> Or perhaps I am misinterpreting your question?

Okay. I found it a bit hard to parse how much true code change was
mixed in with just reindenting existing code. If a lot, then no need
to split off the code refactor.

>
> > > +
> > > +   /* add buffers to skb via skb->frag_list */
> > > +   if (is_first) {
> > > +   skb_reserve(skb, RX_HEAD_PADDING);
> > > +   skb_put(skb, buffer_length - RX_HEAD_PADDING);
> > > +   if (rx->skb_head)
> > > +   dev_kfree_skb_irq(rx->skb_head);
> > > +   rx->skb_head = skb;
> > > +   } else if (rx->skb_head) {
> > > +   skb_put(skb, buffer_length);
> > > +   if (skb_shinfo(rx->skb_head)->frag_list)
> > > +   rx->skb_tail->next = skb;
> > > +   else
> > > +   skb_shinfo(rx->skb_head)->frag_list = skb;
> >
> > Instead of chaining skbs into frag_list, you could perhaps delay skb
> > alloc until after reception, allocate buffers stand-alone, and link
> > them into the skb as skb_frags? That might avoid a few skb alloc +
> > frees. Though a bit change, not sure how feasible.
>
> The problem here is this (copypasta from somewhere else in this patch):
>
> /* Only the last buffer in a multi-buffer frame contains the total frame
> * length. All other buffers have a zero frame length. The chip
> * occasionally sends more buffers than strictly required to reach the
> * total frame length.
> * Handle this by adding all buffers to the skb in their entirety.
> * Once the real frame length is known, trim the skb.
> */
>
> In other words, the chip sometimes sends more buffers than strictly needed to
> fit the frame. linearize + trim deals with this thorny issue perfectly.
>
> If the skb weren't linearized, we would run into trouble when trying to trim
> (remove from the end) a chunk bigger than the last skb fragment.
>
> > > +process_extension:
> > > +   if (extension_index >= 0) {
> > > +   u32 ts_sec;
> > > +   u32 ts_nsec;
> > > +
> > > +   ts_sec = le32_to_cpu(desc_ext->data1);
> > > +   ts_nsec = (le32_to_cpu(desc_ext->data2) &
> > > + RX_DESC_DATA2_TS_NS_MASK_);
> > > +   if (rx->skb_head) {
> > > +   hwtstamps = skb_hwtstamps(rx->skb_head);
> > > +   if (hwtstamps)
> >
> > This is always true.
> >
> > You can just call skb_hwtstamps(skb)->hwtstamp = ktime_set(ts_sec, ts_nsec);
>
> Thank you, will do !
>
> >
> > Though I see that this is existing code just moved due to
> > aforementioned indentation change.
>
> True, but I can make the change anyway.


Re: [PATCH net-next v1 2/6] lan743x: support rx multi-buffer packets

2021-01-29 Thread Willem de Bruijn
On Fri, Jan 29, 2021 at 2:56 PM Sven Van Asbroeck  wrote:
>
> From: Sven Van Asbroeck 
>
> Multi-buffer packets enable us to use rx ring buffers smaller than
> the mtu. This will allow us to change the mtu on-the-fly, without
> having to stop the network interface in order to re-size the rx
> ring buffers.
>
> This is a big change touching a key driver function (process_packet),
> so care has been taken to test this extensively:
>
> Tests with debug logging enabled (add #define DEBUG).
>
> 1. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>Ping to chip, verify correct packet size is sent to OS.
>Ping large packets to chip (ping -s 1400), verify correct
>  packet size is sent to OS.
>Ping using packets around the buffer size, verify number of
>  buffers is changing, verify correct packet size is sent
>  to OS:
>  $ ping -s 472
>  $ ping -s 473
>  $ ping -s 992
>  $ ping -s 993
>Verify that each packet is followed by extension processing.
>
> 2. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>Run iperf3 -s on chip, verify that packets come in 3 buffers
>  at a time.
>Verify that packet size is equal to mtu.
>Verify that each packet is followed by extension processing.
>
> 3. Set chip and host mtu to 2000.
>Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
>Run iperf3 -s on chip, verify that packets come in 4 buffers
>  at a time.
>Verify that packet size is equal to mtu.
>Verify that each packet is followed by extension processing.
>
> Tests with debug logging DISabled (remove #define DEBUG).
>
> 4. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>Run iperf3 -s on chip, note sustained rx speed.
>Set chip and host mtu to 2000, so mtu takes 4 buffers.
>Run iperf3 -s on chip, note sustained rx speed.
>Verify no packets are dropped in both cases.
>
> Tests with DEBUG_KMEMLEAK on:
>  $ mount -t debugfs nodev /sys/kernel/debug/
>  $ echo scan > /sys/kernel/debug/kmemleak
>
> 5. Limit rx buffer size to 500, so mtu (1500) takes 3 buffers.
>Run the following tests concurrently for at least one hour:
>- iperf3 -s on chip
>- ping -> chip
>Monitor reported memory leaks.
>
> 6. Set chip and host mtu to 2000.
>Limit rx buffer size to 500, so mtu (2000) takes 4 buffers.
>Run the following tests concurrently for at least one hour:
>- iperf3 -s on chip
>- ping -> chip
>Monitor reported memory leaks.
>
> 7. Simulate low-memory in lan743x_rx_allocate_skb(): fail every
>  100 allocations.
>Repeat (5) and (6).
>Monitor reported memory leaks.
>
> 8. Simulate  low-memory in lan743x_rx_allocate_skb(): fail 10
>  allocations in a row in every 100.
>Repeat (5) and (6).
>Monitor reported memory leaks.
>
> 9. Simulate  low-memory in lan743x_rx_trim_skb(): fail 1 allocation
>  in every 100.
>Repeat (5) and (6).
>Monitor reported memory leaks.
>
> Signed-off-by: Sven Van Asbroeck 
> ---
>
> Tree: git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git # 46eb3c108fe1
>
> To: Bryan Whitehead 
> To: unglinuxdri...@microchip.com
> To: "David S. Miller" 
> To: Jakub Kicinski 
> Cc: Andrew Lunn 
> Cc: Alexey Denisov 
> Cc: Sergej Bauer 
> Cc: Tim Harvey 
> Cc: Anders Rønningen 
> Cc: net...@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org (open list)
>
>  drivers/net/ethernet/microchip/lan743x_main.c | 321 --
>  drivers/net/ethernet/microchip/lan743x_main.h |   2 +
>  2 files changed, 143 insertions(+), 180 deletions(-)


> +static struct sk_buff *
> +lan743x_rx_trim_skb(struct sk_buff *skb, int frame_length)
> +{
> +   if (skb_linearize(skb)) {

Is this needed? That will be quite expensive

> +   dev_kfree_skb_irq(skb);
> +   return NULL;
> +   }
> +   frame_length = max_t(int, 0, frame_length - RX_HEAD_PADDING - 2);
> +   if (skb->len > frame_length) {
> +   skb->tail -= skb->len - frame_length;
> +   skb->len = frame_length;
> +   }
> +   return skb;
> +}
> +
>  static int lan743x_rx_process_packet(struct lan743x_rx *rx)
>  {
> -   struct skb_shared_hwtstamps *hwtstamps = NULL;
> +   struct lan743x_rx_descriptor *descriptor, *desc_ext;
> int result = RX_PROCESS_RESULT_NOTHING_TO_DO;
> int current_head_index = le32_to_cpu(*rx->head_cpu_ptr);
> struct lan743x_rx_buffer_info *buffer_info;
> -   struct lan743x_rx_descriptor *descriptor;
> +   struct skb_shared_hwtstamps *hwtstamps;
> +   int frame_length, buffer_length;
> +   struct sk_buff *skb;
> int extension_index = -1;
> -   int first_index = -1;
> -   int last_index = -1;
> +   bool is_last, is_first;
>
> if (current_head_index < 0 || current_head_index >= rx->ring_size)
> goto done;
> @@ -2068,170 +2075,126 @@ static int lan743x_rx_process_packet(struct 
> lan743x_rx *rx)
> if 

Re: [PATCH] rtlwifi: rtl8192se: remove redundant initialization of variable rtstatus

2021-01-28 Thread Willem de Bruijn
On Thu, Jan 28, 2021 at 12:15 PM Colin King  wrote:
>
> From: Colin Ian King 
>
> The variable rtstatus is being initialized with a value that is never
> read and it is being updated later with a new value.  The initialization
> is redundant and can be removed.
>
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

(for netdrv)

Acked-by: Willem de Bruijn 


Re: [Patch v2 net-next 4/7] octeontx2-af: Physical link configuration support

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 4:02 AM Hariprasad Kelam  wrote:
>
> From: Christina Jacob 
>
> CGX LMAC, the physical interface support link configuration parameters
> like speed, auto negotiation, duplex  etc. Firmware saves these into
> memory region shared between firmware and this driver.
>
> This patch adds mailbox handler set_link_mode, fw_data_get to
> configure and read these parameters.
>
> Signed-off-by: Christina Jacob 
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Hariprasad Kelam 

> +int rvu_mbox_handler_cgx_set_link_mode(struct rvu *rvu,
> +  struct cgx_set_link_mode_req *req,
> +  struct cgx_set_link_mode_rsp *rsp)
> +{
> +   int pf = rvu_get_pf(req->hdr.pcifunc);
> +   u8 cgx_idx, lmac;
> +   void *cgxd;
> +
> +   if (!is_cgx_config_permitted(rvu, req->hdr.pcifunc))
> +   return -EPERM;
> +
> +   rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_idx, &lmac);
> +   cgxd = rvu_cgx_pdata(cgx_idx, rvu);
> +   rsp->status =  cgx_set_link_mode(cgxd, req->args, cgx_idx, lmac);

nit: two spaces after assignment operator.

On the point of no new inlines: do also check the status in patchwork,
which also flags such issues.


Re: [Patch v2 net-next 3/7] octeontx2-pf: ethtool fec mode support

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 4:03 AM Hariprasad Kelam  wrote:
>
> From: Christina Jacob 
>
> Add ethtool support to configure fec modes baser/rs and
> support to fetch FEC stats from CGX as well as PHY.
>
> Configure fec mode
> - ethtool --set-fec eth0 encoding rs/baser/off/auto
> Query fec mode
> - ethtool --show-fec eth0
>
> Signed-off-by: Christina Jacob 
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Hariprasad Kelam 
> ---
>  .../ethernet/marvell/octeontx2/nic/otx2_common.c   |  23 +++
>  .../ethernet/marvell/octeontx2/nic/otx2_common.h   |   6 +
>  .../ethernet/marvell/octeontx2/nic/otx2_ethtool.c  | 181 
> -
>  .../net/ethernet/marvell/octeontx2/nic/otx2_pf.c   |   3 +
>  4 files changed, 211 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c 
> b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> index bdfa2e2..f7e5450 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/nic/otx2_common.c
> @@ -60,6 +60,22 @@ void otx2_update_lmac_stats(struct otx2_nic *pfvf)
> mutex_unlock(&pfvf->mbox.lock);
>  }
>
> +void otx2_update_lmac_fec_stats(struct otx2_nic *pfvf)
> +{
> +   struct msg_req *req;
> +
> +   if (!netif_running(pfvf->netdev))
> +   return;
> +   mutex_lock(&pfvf->mbox.lock);
> +   req = otx2_mbox_alloc_msg_cgx_fec_stats(&pfvf->mbox);
> +   if (!req) {
> +   mutex_unlock(&pfvf->mbox.lock);
> +   return;
> +   }
> +   otx2_sync_mbox_msg(&pfvf->mbox);

Perhaps simpler to have a single exit from the critical section:

  if (req)
          otx2_sync_mbox_msg(&pfvf->mbox);
  mutex_unlock(&pfvf->mbox.lock);

> +   mutex_unlock(&pfvf->mbox.lock);
> +}

Also, should this function return an error on failure? The caller
returns errors in other cases.
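
Something like this, perhaps (a sketch; the exact error codes are my
choice, not taken from the driver):

	int otx2_update_lmac_fec_stats(struct otx2_nic *pfvf)
	{
		struct msg_req *req;
		int err = -ENOMEM;

		if (!netif_running(pfvf->netdev))
			return -ENODEV;

		mutex_lock(&pfvf->mbox.lock);
		req = otx2_mbox_alloc_msg_cgx_fec_stats(&pfvf->mbox);
		if (req)
			err = otx2_sync_mbox_msg(&pfvf->mbox);
		mutex_unlock(&pfvf->mbox.lock);

		return err;
	}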


Re: [Patch v2 net-next 2/7] octeontx2-af: Add new CGX_CMD to get PHY FEC statistics

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 4:04 AM Hariprasad Kelam  wrote:
>
> From: Felix Manlunas 
>
> This patch adds support to fetch fec stats from PHY. The stats are
> put in the shared data struct fwdata.  A PHY driver indicates
> that it has FEC stats by setting the flag fwdata.phy.misc.has_fec_stats
>
> Besides CGX_CMD_GET_PHY_FEC_STATS, also add CGX_CMD_PRBS and
> CGX_CMD_DISPLAY_EYE to enum cgx_cmd_id so that Linux's enum list is in sync
> with firmware's enum list.
>
> Signed-off-by: Felix Manlunas 
> Signed-off-by: Christina Jacob 
> Signed-off-by: Sunil Kovvuri Goutham 
> Signed-off-by: Hariprasad Kelam 


> +struct phy_s {
> +   struct {
> +   u64 can_change_mod_type : 1;
> +   u64 mod_type: 1;
> +   u64 has_fec_stats   : 1;

this style is not customary

> +   } misc;
> +   struct fec_stats_s {
> +   u32 rsfec_corr_cws;
> +   u32 rsfec_uncorr_cws;
> +   u32 brfec_corr_blks;
> +   u32 brfec_uncorr_blks;
> +   } fec_stats;
> +};
> +
> +struct cgx_lmac_fwdata_s {
> +   u16 rw_valid;
> +   u64 supported_fec;
> +   u64 supported_an;

are these intended to be individual u64's?

> +   u64 supported_link_modes;
> +   /* only applicable if AN is supported */
> +   u64 advertised_fec;
> +   u64 advertised_link_modes;
> +   /* Only applicable if SFP/QSFP slot is present */
> +   struct sfp_eeprom_s sfp_eeprom;
> +   struct phy_s phy;
> +#define LMAC_FWDATA_RESERVED_MEM 1021
> +   u64 reserved[LMAC_FWDATA_RESERVED_MEM];
> +};
> +
> +struct cgx_fw_data {
> +   struct mbox_msghdr hdr;
> +   struct cgx_lmac_fwdata_s fwdata;
> +};
> +
>  /* NPA mbox message formats */
>
>  /* NPA mailbox error codes
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h 
> b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
> index b1a6ecf..c824f1e 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu.h
> @@ -350,6 +350,10 @@ struct rvu_fwdata {
> u64 msixtr_base;
>  #define FWDATA_RESERVED_MEM 1023
> u64 reserved[FWDATA_RESERVED_MEM];
> +   /* Do not add new fields below this line */
> +#define CGX_MAX 5
> +#define CGX_LMACS_MAX   4
> +   struct cgx_lmac_fwdata_s cgx_fw_data[CGX_MAX][CGX_LMACS_MAX];

Probably want to move the comment below the field.
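
i.e. something like:

	u64 reserved[FWDATA_RESERVED_MEM];
#define CGX_MAX 5
#define CGX_LMACS_MAX   4
	struct cgx_lmac_fwdata_s cgx_fw_data[CGX_MAX][CGX_LMACS_MAX];
	/* Do not add new fields below this line */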
>  };
>
>  struct ptp;
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c 
> b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
> index 74f494b..7fac9ab 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/rvu_cgx.c
> @@ -694,6 +694,19 @@ int rvu_mbox_handler_cgx_cfg_pause_frm(struct rvu *rvu,
> return 0;
>  }
>
> +int rvu_mbox_handler_cgx_get_phy_fec_stats(struct rvu *rvu, struct msg_req 
> *req,
> +  struct msg_rsp *rsp)
> +{
> +   int pf = rvu_get_pf(req->hdr.pcifunc);
> +   u8 cgx_id, lmac_id;
> +
> +   if (!is_pf_cgxmapped(rvu, pf))
> +   return -EPERM;
> +
> +   rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
> +   return cgx_get_phy_fec_stats(rvu_cgx_pdata(cgx_id, rvu), lmac_id);
> +}
> +
>  /* Finds cumulative status of NIX rx/tx counters from LF of a PF and those
>   * from its VFs as well. ie. NIX rx/tx counters at the CGX port level
>   */
> @@ -800,3 +813,22 @@ int rvu_mbox_handler_cgx_set_fec_param(struct rvu *rvu,
> rsp->fec = cgx_set_fec(req->fec, cgx_id, lmac_id);
> return 0;
>  }
> +
> +int rvu_mbox_handler_cgx_get_aux_link_info(struct rvu *rvu, struct msg_req 
> *req,
> +  struct cgx_fw_data *rsp)
> +{
> +   int pf = rvu_get_pf(req->hdr.pcifunc);
> +   u8 cgx_id, lmac_id;
> +
> +   if (!rvu->fwdata)
> +   return -ENXIO;
> +
> +   if (!is_pf_cgxmapped(rvu, pf))
> +   return -EPERM;
> +
> +   rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_id, &lmac_id);
> +
> +   memcpy(&rsp->fwdata, &rvu->fwdata->cgx_fw_data[cgx_id][lmac_id],
> +  sizeof(struct cgx_lmac_fwdata_s));
> +   return 0;
> +}
> --
> 2.7.4
>


Re: [Patch v2 net-next 1/7] octeontx2-af: forward error correction configuration

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 4:05 AM Hariprasad Kelam  wrote:
>
> From: Christina Jacob 
>
> CGX block supports forward error correction modes baseR
> and RS. This patch adds support to set encoding mode
> and to read corrected/uncorrected block counters
>
> Adds new mailbox handlers set_fec to configure encoding modes
> and fec_stats to read counters and also increase mbox timeout
> to accommodate firmware command response timeout.
>
> Along with new CGX_CMD_SET_FEC command add other commands to
> sync with kernel enum list with firmware.
>
> Signed-off-by: Christina Jacob 
> Signed-off-by: Sunil Goutham 
> Signed-off-by: Hariprasad Kelam 
> ---
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.c| 74 
> ++
>  drivers/net/ethernet/marvell/octeontx2/af/cgx.h|  7 ++
>  .../net/ethernet/marvell/octeontx2/af/cgx_fw_if.h  | 17 -
>  drivers/net/ethernet/marvell/octeontx2/af/mbox.h   | 22 ++-
>  .../net/ethernet/marvell/octeontx2/af/rvu_cgx.c| 33 ++
>  5 files changed, 151 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c 
> b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> index 84a9123..5489dab 100644
> --- a/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> +++ b/drivers/net/ethernet/marvell/octeontx2/af/cgx.c
> @@ -340,6 +340,58 @@ int cgx_get_tx_stats(void *cgxd, int lmac_id, int idx, 
> u64 *tx_stat)
> return 0;
>  }
>
> +static int cgx_set_fec_stats_count(struct cgx_link_user_info *linfo)
> +{
> +   if (linfo->fec) {
> +   switch (linfo->lmac_type_id) {
> +   case LMAC_MODE_SGMII:
> +   case LMAC_MODE_XAUI:
> +   case LMAC_MODE_RXAUI:
> +   case LMAC_MODE_QSGMII:
> +   return 0;
> +   case LMAC_MODE_10G_R:
> +   case LMAC_MODE_25G_R:
> +   case LMAC_MODE_100G_R:
> +   case LMAC_MODE_USXGMII:
> +   return 1;
> +   case LMAC_MODE_40G_R:
> +   return 4;
> +   case LMAC_MODE_50G_R:
> +   if (linfo->fec == OTX2_FEC_BASER)
> +   return 2;
> +   else
> +   return 1;
> +   }
> +   }
> +   return 0;

may consider inverting the condition, to remove one level of indentation.
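
i.e., roughly:

	if (!linfo->fec)
		return 0;

	switch (linfo->lmac_type_id) {
	case LMAC_MODE_SGMII:
	case LMAC_MODE_XAUI:
	case LMAC_MODE_RXAUI:
	case LMAC_MODE_QSGMII:
		return 0;
	case LMAC_MODE_10G_R:
	case LMAC_MODE_25G_R:
	case LMAC_MODE_100G_R:
	case LMAC_MODE_USXGMII:
		return 1;
	case LMAC_MODE_40G_R:
		return 4;
	case LMAC_MODE_50G_R:
		return linfo->fec == OTX2_FEC_BASER ? 2 : 1;
	default:
		return 0;
	}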

> +int cgx_set_fec(u64 fec, int cgx_id, int lmac_id)
> +{
> +   u64 req = 0, resp;
> +   struct cgx *cgx;
> +   int err = 0;
> +
> +   cgx = cgx_get_pdata(cgx_id);
> +   if (!cgx)
> +   return -ENXIO;
> +
> +   req = FIELD_SET(CMDREG_ID, CGX_CMD_SET_FEC, req);
> +   req = FIELD_SET(CMDSETFEC, fec, req);
> +   err = cgx_fwi_cmd_generic(req, &resp, cgx, lmac_id);
> +   if (!err) {
> +   cgx->lmac_idmap[lmac_id]->link_info.fec =
> +   FIELD_GET(RESP_LINKSTAT_FEC, resp);
> +   return cgx->lmac_idmap[lmac_id]->link_info.fec;
> +   }
> +   return err;

Prefer keeping the success path linear and return early if (err) in
explicit branch. This also aids branch prediction.
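
i.e.:

	err = cgx_fwi_cmd_generic(req, &resp, cgx, lmac_id);
	if (err)
		return err;

	cgx->lmac_idmap[lmac_id]->link_info.fec =
		FIELD_GET(RESP_LINKSTAT_FEC, resp);
	return cgx->lmac_idmap[lmac_id]->link_info.fec;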

> +int rvu_mbox_handler_cgx_fec_stats(struct rvu *rvu,
> +  struct msg_req *req,
> +  struct cgx_fec_stats_rsp *rsp)
> +{
> +   int pf = rvu_get_pf(req->hdr.pcifunc);
> +   u8 cgx_idx, lmac;
> +   int err = 0;
> +   void *cgxd;
> +
> +   if (!is_cgx_config_permitted(rvu, req->hdr.pcifunc))
> +   return -EPERM;
> +   rvu_get_cgx_lmac_id(rvu->pf2cgxlmac_map[pf], &cgx_idx, &lmac);
> +
> +   cgxd = rvu_cgx_pdata(cgx_idx, rvu);
> +   err = cgx_get_fec_stats(cgxd, lmac, rsp);
> +   return err;

no need for the variable err
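
i.e., just:

	return cgx_get_fec_stats(cgxd, lmac, rsp);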


Re: [PATCH v4 net-next 00/19] net: mvpp2: Add TX Flow Control support

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 6:50 AM  wrote:
>
> From: Stefan Chulski 
>
> Armada hardware has a pause generation mechanism in GOP (MAC).
> The GOP generates flow control frames based on an indication programmed in
> the Ports Control 0 Register. There is a bit per port.
> However, assertion of the PortX Pause bits in the Ports Control 0 register
> only sends a one-time pause.
> To complement the function the GOP has a mechanism to periodically send pause 
> control messages based on periodic counters.
> This mechanism ensures that the pause is effective as long as the Appropriate 
> PortX Pause is asserted.
>
> The problem is that the Packet Processor, which can actually drop packets
> due to lack of resources, is not connected to the GOP flow control
> generation mechanism.
> To solve this issue Armada has firmware running on CM3 CPU dedicated for Flow 
> Control support.
> Firmware monitors Packet Processor resources and asserts XON/XOFF by writing 
> to Ports Control 0 Register.
>
> MSS shared SRAM memory used to communicate between CM3 firmware and PP2 
> driver.
> During init PP2 driver informs firmware about used BM pools, RXQs, congestion 
> and depletion thresholds.
>
> The pause frames are generated whenever congestion or depletion in resources 
> is detected.
> The back pressure is stopped when the resource reaches a sufficient level.
> So the congestion/depletion and sufficient level implement a hysteresis that 
> reduces the XON/XOFF toggle frequency.
>
> Packet Processor v23 hardware introduces support for RX FIFO fill level 
> monitor.
> Patch "add PPv23 version definition" to differ between v23 and v22 hardware.
> Patch "add TX FC firmware check" verifies that CM3 firmware supports Flow 
> Control monitoring.
>
> v3 --> v4
> - Remove RFC tag
>
> v2 --> v3
> - Remove inline functions
> - Add PPv2.3 description into marvell-pp2.txt
> - Improve mvpp2_interrupts_mask/unmask procedure
> - Improve FC enable/disable procedure
> - Add priv->sram_pool check
> - Remove gen_pool_destroy call
> - Reduce Flow Control timer to x100 faster
>
> v1 --> v2
> - Add memory requirements information
> - Add EPROBE_DEFER if of_gen_pool_get return NULL
> - Move Flow control configuration to mvpp2_mac_link_up callback
> - Add firmware version info with Flow control support
>
> Konstantin Porotchkin (1):
>   dts: marvell: add CM3 SRAM memory to cp115 ethernet device tree
>
> Stefan Chulski (18):
>   doc: marvell: add cm3-mem device tree bindings description
>   net: mvpp2: add CM3 SRAM memory map
>   doc: marvell: add PPv2.3 description to marvell-pp2.txt
>   net: mvpp2: add PPv23 version definition
>   net: mvpp2: always compare hw-version vs MVPP21
>   net: mvpp2: increase BM pool size to 2048 buffers
>   net: mvpp2: increase RXQ size to 1024 descriptors
>   net: mvpp2: add FCA periodic timer configurations
>   net: mvpp2: add FCA RXQ non occupied descriptor threshold
>   net: mvpp2: add spinlock for FW FCA configuration path
>   net: mvpp2: enable global flow control
>   net: mvpp2: add RXQ flow control configurations
>   net: mvpp2: add ethtool flow control configuration support
>   net: mvpp2: add BM protection underrun feature support
>   net: mvpp2: add PPv23 RX FIFO flow control
>   net: mvpp2: set 802.3x GoP Flow Control mode
>   net: mvpp2: limit minimum ring size to 1024 descriptors
>   net: mvpp2: add TX FC firmware check
>
>  Documentation/devicetree/bindings/net/marvell-pp2.txt |   4 +-
>  arch/arm64/boot/dts/marvell/armada-cp11x.dtsi |  10 +
>  drivers/net/ethernet/marvell/mvpp2/mvpp2.h| 130 -
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c   | 564 
> ++--
>  4 files changed, 658 insertions(+), 50 deletions(-)

Besides the per-patch comments, see also the patchwork state for the
patches. Patches 3 and 12 seem to introduce new build warnings or
errors. And one patch is missing the author's sign-off matching the
From line.


Re: [PATCH v4 net-next 11/19] net: mvpp2: add spinlock for FW FCA configuration path

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 7:19 AM  wrote:
>
> From: Stefan Chulski 
>
> Spinlock added to MSS shared memory configuration space.
>
> Signed-off-by: Stefan Chulski 
> ---
>  drivers/net/ethernet/marvell/mvpp2/mvpp2.h  | 5 +
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 3 +++
>  2 files changed, 8 insertions(+)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> index 9d8993f..f34e260 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> @@ -1021,6 +1021,11 @@ struct mvpp2 {
>
> /* CM3 SRAM pool */
> struct gen_pool *sram_pool;
> +
> +   bool custom_dma_mask;
> +
> +   /* Spinlocks for CM3 shared memory configuration */
> +   spinlock_t mss_spinlock;

Does this need to be a stand-alone patch? This introduces a spinlock,
but does not use it.

Also, is the introduction of custom_dma_mask in this commit on purpose?


Re: [PATCH v4 net-next 10/19] net: mvpp2: add FCA RXQ non occupied descriptor threshold

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 7:26 AM  wrote:
>
> From: Stefan Chulski 
>
> RXQ non occupied descriptor threshold would be used by
> Flow Control Firmware feature to move to the XOFF mode.
> RXQ non occupied threshold would change interrupt cause
> that polled by CM3 Firmware.
> Actual non occupied interrupt masked and won't trigger interrupt.

Does this mean that this change enables a feature, but it is unused
due to a masked interrupt?

>
> Signed-off-by: Stefan Chulski 
> ---
>  drivers/net/ethernet/marvell/mvpp2/mvpp2.h  |  3 ++
>  drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c | 46 +---
>  2 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> index 73f087c..9d8993f 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2.h
> @@ -295,6 +295,8 @@
>  #define MVPP2_PON_CAUSE_TXP_OCCUP_DESC_ALL_MASK	0x3fc0
>  #define MVPP2_PON_CAUSE_MISC_SUM_MASK  BIT(31)
>  #define MVPP2_ISR_MISC_CAUSE_REG   0x55b0
> +#define MVPP2_ISR_RX_ERR_CAUSE_REG(port)   (0x5520 + 4 * (port))
> +#define	MVPP2_ISR_RX_ERR_CAUSE_NONOCC_MASK  0x00ff

The indentation in this file is inconsistent. Here even between the
two newly introduced lines.

>  /* Buffer Manager registers */
>  #define MVPP2_BM_POOL_BASE_REG(pool)   (0x6000 + ((pool) * 4))
> @@ -764,6 +766,7 @@
>  #define MSS_SRAM_SIZE  0x800
>  #define FC_QUANTA  0x
>  #define FC_CLK_DIVIDER 100
> +#define MSS_THRESHOLD_STOP 768
>
>  /* RX buffer constants */
>  #define MVPP2_SKB_SHINFO_SIZE \
> diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c 
> b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> index 8f40293a..a4933c4 100644
> --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> @@ -1144,14 +1144,19 @@ static inline void 
> mvpp2_qvec_interrupt_disable(struct mvpp2_queue_vector *qvec)
>  static void mvpp2_interrupts_mask(void *arg)
>  {
> struct mvpp2_port *port = arg;
> +   int cpu = smp_processor_id();
> +   u32 thread;
>
> /* If the thread isn't used, don't do anything */
> -   if (smp_processor_id() > port->priv->nthreads)
> +   if (cpu >= port->priv->nthreads)
> return;

Here and below, the change from greater-than to greater-than-or-equal
is really a (standalone) fix?

> -   mvpp2_thread_write(port->priv,
> -  mvpp2_cpu_to_thread(port->priv, 
> smp_processor_id()),
> +   thread = mvpp2_cpu_to_thread(port->priv, cpu);
> +
> +   mvpp2_thread_write(port->priv, thread,
>MVPP2_ISR_RX_TX_MASK_REG(port->id), 0);
> +   mvpp2_thread_write(port->priv, thread,
> +  MVPP2_ISR_RX_ERR_CAUSE_REG(port->id), 0);
>  }
>
>  /* Unmask the current thread's Rx/Tx interrupts.
> @@ -1161,20 +1166,25 @@ static void mvpp2_interrupts_mask(void *arg)
>  static void mvpp2_interrupts_unmask(void *arg)
>  {
> struct mvpp2_port *port = arg;
> -   u32 val;
> +   int cpu = smp_processor_id();
> +   u32 val, thread;
>
> /* If the thread isn't used, don't do anything */
> -   if (smp_processor_id() > port->priv->nthreads)
> +   if (cpu >= port->priv->nthreads)
> return;
>
> +   thread = mvpp2_cpu_to_thread(port->priv, cpu);
> +
> val = MVPP2_CAUSE_MISC_SUM_MASK |
> MVPP2_CAUSE_RXQ_OCCUP_DESC_ALL_MASK(port->priv->hw_version);
> if (port->has_tx_irqs)
> val |= MVPP2_CAUSE_TXQ_OCCUP_DESC_ALL_MASK;
>
> -   mvpp2_thread_write(port->priv,
> -  mvpp2_cpu_to_thread(port->priv, 
> smp_processor_id()),
> +   mvpp2_thread_write(port->priv, thread,
>MVPP2_ISR_RX_TX_MASK_REG(port->id), val);
> +   mvpp2_thread_write(port->priv, thread,
> +  MVPP2_ISR_RX_ERR_CAUSE_REG(port->id),
> +  MVPP2_ISR_RX_ERR_CAUSE_NONOCC_MASK);
>  }
>
>  static void
> @@ -1199,6 +1209,9 @@ static void mvpp2_interrupts_unmask(void *arg)
>
> mvpp2_thread_write(port->priv, v->sw_thread_id,
>MVPP2_ISR_RX_TX_MASK_REG(port->id), val);
> +   mvpp2_thread_write(port->priv, v->sw_thread_id,
> +  MVPP2_ISR_RX_ERR_CAUSE_REG(port->id),
> +  MVPP2_ISR_RX_ERR_CAUSE_NONOCC_MASK);
> }
>  }
>
> @@ -2404,6 +2417,22 @@ static void mvpp2_txp_max_tx_size_set(struct 
> mvpp2_port *port)
> }
>  }
>
> +/* Routine set the number of non-occupied descriptors threshold that change
> + * interrupt error cause polled by FW Flow Control
> + */

nit: no need for "Routine". Also, does "change .. cause" mean "that
triggers an interrupt"?


Re: [PATCH] rtlwifi: use tasklet_setup to initialize rx_work_tasklet

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 5:23 AM Emil Renner Berthing  wrote:
>
> In commit d3ccc14dfe95 most of the tasklets in this driver were
> updated to the new API. However for the rx_work_tasklet only the
> type of the callback was changed from
>   void _rtl_rx_work(unsigned long data)
> to
>   void _rtl_rx_work(struct tasklet_struct *t).
>
> The initialization of rx_work_tasklet was still open-coded and the
> function pointer just cast into the old type, and hence nothing sets
> rx_work_tasklet.use_callback = true and the callback was still called as
>
>   t->func(t->data);
>
> with uninitialized/zero t->data.
>
> Commit 6b8c7574a5f8 changed the casting of _rtl_rx_work a bit and
> initialized t->data to a pointer to the tasklet cast to an unsigned
> long.
>
> This way calling t->func(t->data) might actually work through all the
> casting, but it still doesn't update the code to use the new tasklet
> API.
>
> Let's use the new tasklet_setup to initialize rx_work_tasklet properly
> and set rx_work_tasklet.use_callback = true so that the callback is
> called as
>
>   t->callback(t);
>
> without all the casting.
>
> Fixes: 6b8c7574a5f8 ("rtlwifi: fix build warning")
> Fixes: d3ccc14dfe95 ("rtlwifi/rtw88: convert tasklets to use new 
> tasklet_setup() API")
> Signed-off-by: Emil Renner Berthing 

Since the current code works, this could target net-next without Fixes tags.

Acked-by: Willem de Bruijn 


Re: [PATCH net-next v2 0/6] net: ipa: hardware pipeline cleanup fixes

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 5:04 AM Alex Elder  wrote:
>
> Version 2 of this series fixes a "restricted __le16 degrades to
> integer" warning from sparse in the third patch.  The normal host
> architecture is little-endian, so the problem did not produce
> incorrect behavior, but the code was wrong not to perform the
> endianness conversion.  The updated patch uses le16_get_bits() to
> properly extract the value of the field we're interested in.
>
> Everything else remains the same.  Below is the original description.
>
> -Alex
>
> There is a procedure currently referred to as a "tag process" that
> is performed to clear the IPA hardware pipeline--either at the time
> of a modem crash, or when suspending modem GSI channels.
>
> One thing done in this procedure is issuing a command that sends a
> data packet originating from the AP->command TX endpoint, destined
> for the AP<-LAN RX (default) endpoint.  And although we currently
> wait for the send to complete, we do *not* wait for the packet to be
> received.  But the pipeline can't be assumed clear until we have
> actually received this packet.
>
> This series addresses this by detecting when the pipeline-clearing
> packet has been received, and using a completion to allow a waiter
> to know when that has happened.  This uses the IPA status capability
> (which sends an extra status buffer for certain packets).  It also
> uses the ability to supply a "tag" with a packet, which will be
> delivered with the packet's status buffer.  We tag the data packet
> that's sent to clear the pipeline, and use the receipt of a status
> buffer associated with a tagged packet to determine when that packet
> has arrived.
>
> "Tag status" just desribes one aspect of this procedure, so some
> symbols are renamed to be more like "pipeline clear" so they better
> describe the larger purpose.  Finally, two functions used in this
> code don't use their arguments, so those arguments are removed.
>
> -Alex
>
> Alex Elder (6):
>   net: ipa: rename "tag status" symbols
>   net: ipa: minor update to handling of packet with status
>   net: ipa: drop packet if status has valid tag
>   net: ipa: signal when tag transfer completes
>   net: ipa: don't pass tag value to ipa_cmd_ip_tag_status_add()
>   net: ipa: don't pass size to ipa_cmd_transfer_add()
>
>  drivers/net/ipa/ipa.h  |  2 +
>  drivers/net/ipa/ipa_cmd.c  | 45 +--
>  drivers/net/ipa/ipa_cmd.h      | 24 ++-
>  drivers/net/ipa/ipa_endpoint.c | 79 ++
>  drivers/net/ipa/ipa_main.c |  1 +
>  5 files changed, 109 insertions(+), 42 deletions(-)

For netdrv

Acked-by: Willem de Bruijn 


Re: [PATCH] rtlwifi: halbtc8723b2ant: Remove redundant code

2021-01-27 Thread Willem de Bruijn
On Wed, Jan 27, 2021 at 6:05 AM Abaci Team
 wrote:
>
> Fix the following coccicheck warnings:
>
> ./drivers/net/wireless/realtek/rtlwifi/btcoexist/halbtc8723b2ant.c:
> 1876:11-13: WARNING: possible condition with no effect (if == else).
>
> Reported-by: Abaci Robot 
> Suggested-by: Jiapeng Zhong 
> Signed-off-by: Abaci Team 

Signed-off-by lines need to have a real name. See
Documentation/process/submitting-patches.rst

With that change

Acked-by: Willem de Bruijn 


Re: [PATCH net-next 3/4] bridge: mrp: Extend br_mrp_switchdev to detect better the errors

2021-01-25 Thread Willem de Bruijn
On Sat, Jan 23, 2021 at 11:23 AM Horatiu Vultur
 wrote:
>
> This patch extends the br_mrp_switchdev functions to be able to have a
> better understanding what cause the issue and if the SW needs to be used
> as a backup.
>
> There are the following cases:
> - when the code is compiled without CONFIG_NET_SWITCHDEV. In this case
>   return success so the SW can continue with the protocol. Depending on
>   the function it returns 0 or BR_MRP_SW.
> - when code is compiled with CONFIG_NET_SWITCHDEV and the driver doesn't
>   implement any MRP callbacks, then the HW can't run MRP so it just
>   returns -EOPNOTSUPP. So the SW will stop further to configure the
>   node.
> - when code is compiled with CONFIG_NET_SWITCHDEV and the driver fully
>   supports any MRP functionality then the SW doesn't need to do
>   anything.  The functions will return 0 or BR_MRP_HW.
> - when code is compiled with CONFIG_NET_SWITCHDEV and the HW can't run
>   completely the protocol but it can help the SW to run it.  For
>   example, the HW can't support completely MRM role(can't detect when it
>   stops receiving MRP Test frames) but it can redirect these frames to
>   CPU. In this case it is possible to have a SW fallback. The SW will
>   try initially to call the driver with sw_backup set to false, meaning
>   that the HW can implement completely the role. If the driver returns
>   -EOPNOTSUPP, the SW will try again with sw_backup set to true,
>   meaning that the SW will detect when it stops receiving the frames. In
>   case the driver returns 0 then the SW will continue to configure the
>   node accordingly.
>
> In this way is more clear when the SW needs to stop configuring the
> node, or when the SW is used as a backup or the HW can implement the
> functionality.
>
> Signed-off-by: Horatiu Vultur 


> -int br_mrp_switchdev_set_ring_role(struct net_bridge *br,
> -  struct br_mrp *mrp,
> -  enum br_mrp_ring_role_type role)
> +enum br_mrp_hw_support
> +br_mrp_switchdev_set_ring_role(struct net_bridge *br, struct br_mrp *mrp,
> +  enum br_mrp_ring_role_type role)
>  {
> struct switchdev_obj_ring_role_mrp mrp_role = {
> .obj.orig_dev = br->dev,
> .obj.id = SWITCHDEV_OBJ_ID_RING_ROLE_MRP,
> .ring_role = role,
> .ring_id = mrp->ring_id,
> +   .sw_backup = false,
> };
> int err;
>
> +   /* If switchdev is not enabled then just run in SW */
> +   if (!IS_ENABLED(CONFIG_NET_SWITCHDEV))
> +   return BR_MRP_SW;
> +
> +   /* First try to see if HW can implement comptletly the role in HW */

typo: completely

> if (role == BR_MRP_RING_ROLE_DISABLED)
> err = switchdev_port_obj_del(br->dev, &mrp_role.obj);
> else
> err = switchdev_port_obj_add(br->dev, &mrp_role.obj, NULL);
>
> -   return err;
> +   /* In case of success then just return and notify the SW that doesn't
> +* need to do anything
> +*/
> +   if (!err)
> +   return BR_MRP_HW;
> +
> +   /* There was some issue then is not possible at all to have this role 
> so
> +* just return failire

typo: failure

> +*/
> +   if (err != -EOPNOTSUPP)
> +   return BR_MRP_NONE;
> +
> +   /* In case the HW can't run complety in HW the protocol, we try again

typo: completely. Please proofread your comments closely. I saw at
least one typo in the commit messages too.

More in general comments that say what the code does can generally be eschewed.

> +* and this time to allow the SW to help, but the HW needs to redirect
> +* the frames to CPU.
> +*/
> +   mrp_role.sw_backup = true;
> +   err = switchdev_port_obj_add(br->dev, &mrp_role.obj, NULL);

This calls the same function. I did not see code that changes behavior
based on sw_backup. Will this not give the same result?

Also, this lacks the role test (add or del). Is that because, when
falling back onto SW mode during add, this code does not get called at
all on delete?

> +
> +   /* In case of success then notify the SW that it needs to help with 
> the
> +* protocol
> +*/
> +   if (!err)
> +   return BR_MRP_SW;
> +
> +   return BR_MRP_NONE;
>  }
>
> -int br_mrp_switchdev_send_ring_test(struct net_bridge *br,
> -   struct br_mrp *mrp, u32 interval,
> -   u8 max_miss, u32 period,
> -   bool monitor)
> +enum br_mrp_hw_support
> +br_mrp_switchdev_send_ring_test(struct net_bridge *br, struct br_mrp *mrp,
> +   u32 interval, u8 max_miss, u32 period,
> +   bool monitor)
>  {
> struct switchdev_obj_ring_test_mrp test = {
> .obj.orig_dev = br->dev,
> @@ -79,12 +106,29 @@ int 

Re: [PATCH v4 net-next 2/2] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

2021-01-22 Thread Willem de Bruijn
On Fri, Jan 22, 2021 at 1:20 PM Alexander Lobakin  wrote:
>
> Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
> not only added a support for fraglisted UDP GRO, but also tweaked
> some logics the way that non-fraglisted UDP GRO started to work for
> forwarding too.
> Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the
> list of software-backed types") added GSO UDP L4 to the list of
> software GSO to allow virtual netdevs to forward them as is up to
> the real drivers.
>
> Tests showed that currently forwarding and NATing of plain UDP GRO
> packets are performed fully correctly, regardless if the target
> netdevice has a support for hardware/driver GSO UDP L4 or not.
> Add the last element and allow to form plain UDP GRO packets if
> we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is
> enabled on a receiving netdevice.
>
> If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
> fraglisted GRO takes precedence. This keeps the current behaviour
> and is generally more optimal for now, as the number of NICs with
> hardware USO offload is relatively small.
>
> Signed-off-by: Alexander Lobakin 

Acked-by: Willem de Bruijn 


Re: [PATCH net-next 2/2] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

2021-01-22 Thread Willem de Bruijn
On Fri, Jan 22, 2021 at 6:25 AM Alexander Lobakin  wrote:
>
> From: Willem de Bruijn 
> Date: Thu, 21 Jan 2021 21:47:47 -0500
>
> > On Mon, Jan 18, 2021 at 2:33 PM Alexander Lobakin  wrote:
> > >
> > > Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
> > > not only added a support for fraglisted UDP GRO, but also tweaked
> > > some logics the way that non-fraglisted UDP GRO started to work for
> > > forwarding too.
> > > Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the
> > > list of software-backed types") added GSO UDP L4 to the list of
> > > software GSO to allow virtual netdevs to forward them as is up to
> > > the real drivers.
> > >
> > > Tests showed that currently forwarding and NATing of plain UDP GRO
> > > packets are performed fully correctly, regardless if the target
> > > netdevice has a support for hardware/driver GSO UDP L4 or not.
> > > Plain UDP GRO forwarding even shows better performance than fraglisted
> > > UDP GRO in some cases due to not wasting one skbuff_head per every
> > > segment.
> >
> > That is surprising. The choice for fraglist based forwarding was made
> > on the assumption that it is cheaper if software segmentation is needed.
> >
> > Do you have a more specific definition of the relevant cases?
>
> "Classic" UDP GRO shows better performance when forwarding to a NIC
> that supports GSO UDP L4 (i.e. no software segmentation occurs), like
> the one that I test kernel on.
> I don't have much info about performance without UDP GSO offload
> as I usually test NAT, and fraglisted UDP GRO currently fails on
> this [0].
>
> > There currently is no option to enable GRO for forwarding, without
> > fraglist if to a device with h/w udp segmentation offload. This would
> > add that option too.
>
> Yes, that's exactly what I want. I want to maximize UDP
> forwarding/NATing performance when NIC is capable of UDP GSO offload,
> as I said above, non-fraglisted UDP GRO is better for that case.

That makes sense. Better to make explicit that that is the case
targeted here, rather than "some cases".

> > Though under admin control, which may make it a rarely exercised option.
> > Assuming most hosts to have single or homogeneous NICs, the OS should
> > be able to choose the preferred option in most cases (e.g.,: use fraglist
> > unless all devices support h/w gro).
>
> I thought about some sort of auto-selection, but at the moment of
> receiving we can't know which interface this skb will be forwarded
> to.
> Also, as Paolo Abeni said in a comment to v2, UDP GRO may cause
> noticeable delays, which may be unacceptable in some environments.
> That's why we have to use a sockopt and netdev features to explicitly
> enable UDP GRO.

I suspect that such fine-grained toggles end up broadly unused.

Agreed that it is not always possible to predict the destination NIC,
but that is why I suggested a very low bar that I believe captures the
majority of installed systems: where all NICs support the feature.
Anyway, that can always be added later -- as long as having this flag
off is not interpreted as demanding fraglist on forwarding.

> Regarding all this, I introduced NETIF_F_UDP_GRO to have the
> following choices:
>  - both NETIF_F_UDP_GRO and NETIF_F_GRO_FRAGLIST are off - no UDP GRO;
>  - NETIF_F_UDP_GRO is on, NETIF_F_GRO_FRAGLIST is off - classic GRO;
>  - both NETIF_F_UDP_GRO and NETIF_F_GRO_FRAGLIST are on - fraglisted
>UDP GRO.
>
> > > Add the last element and allow to form plain UDP GRO packets if
> > > there is no socket -> we are on forwarding path, and the new
> > > NETIF_F_GRO_UDP is enabled on a receiving netdevice.
> > > Note that fraglisted UDP GRO now also depends on this feature, as
> >
> > That may cause a regression for applications that currently enable
> > that device feature.
>
> Thought about this one too. Not sure if it would be better to leave
> it as it is for now or how it's done in this series. The problem
> that we may have in future is that in some day we may get fraglisted
> TCP GRO, and then NETIF_F_GRO_FRAGLIST will affect both TCP and UDP,
> which is not desirable as for me. So I decided to guard this possible
> case.
>
> > > NETIF_F_GRO_FRAGLIST isn't tied to any particular L4 protocol.

As its name implies. I think it makes more sense to see it as an
explicit request to use fraglist for any protocol that supports it.

> > >
> > > Signed-off-by: Alexander Lobakin 
> > > ---
> > >  net/ipv4/udp_offload.c | 16 +++-
> > >  1 file c

Re: [PATCH net-next 2/2] udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

2021-01-21 Thread Willem de Bruijn
On Mon, Jan 18, 2021 at 2:33 PM Alexander Lobakin  wrote:
>
> Commit 9fd1ff5d2ac7 ("udp: Support UDP fraglist GRO/GSO.") actually
> not only added a support for fraglisted UDP GRO, but also tweaked
> some logics the way that non-fraglisted UDP GRO started to work for
> forwarding too.
> Commit 2e4ef10f5850 ("net: add GSO UDP L4 and GSO fraglists to the
> list of software-backed types") added GSO UDP L4 to the list of
> software GSO to allow virtual netdevs to forward them as is up to
> the real drivers.
>
> Tests showed that currently forwarding and NATing of plain UDP GRO
> packets are performed fully correctly, regardless if the target
> netdevice has a support for hardware/driver GSO UDP L4 or not.
> Plain UDP GRO forwarding even shows better performance than fraglisted
> UDP GRO in some cases due to not wasting one skbuff_head per every
> segment.

That is surprising. The choice for fraglist based forwarding was made
on the assumption that it is cheaper if software segmentation is needed.

Do you have a more specific definition of the relevant cases?

There currently is no option to enable GRO for forwarding, without
fraglist if to a device with h/w udp segmentation offload. This would
add that option too.

Though under admin control, which may make it a rarely exercised option.
Assuming most hosts to have single or homogeneous NICs, the OS should
be able to choose the preferred option in most cases (e.g.,: use fraglist
unless all devices support h/w gro).

> Add the last element and allow to form plain UDP GRO packets if
> there is no socket -> we are on forwarding path, and the new
> NETIF_F_GRO_UDP is enabled on a receiving netdevice.
> Note that fraglisted UDP GRO now also depends on this feature, as

That may cause a regression for applications that currently enable
that device feature.

> NETIF_F_GRO_FRAGLIST isn't tied to any particular L4 protocol.
>
> Signed-off-by: Alexander Lobakin 
> ---
>  net/ipv4/udp_offload.c | 16 +++-
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index ff39e94781bf..781a035de5a9 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -454,13 +454,19 @@ struct sk_buff *udp_gro_receive(struct list_head *head, 
> struct sk_buff *skb,
> struct sk_buff *p;
> struct udphdr *uh2;
> unsigned int off = skb_gro_offset(skb);
> -   int flush = 1;
> +   int flist = 0, flush = 1;
> +   bool gro_by_feat = false;

What is this variable shorthand for? By feature? Perhaps
gro_forwarding is more descriptive.

>
> -   NAPI_GRO_CB(skb)->is_flist = 0;
> -   if (skb->dev->features & NETIF_F_GRO_FRAGLIST)
> -   NAPI_GRO_CB(skb)->is_flist = sk ? !udp_sk(sk)->gro_enabled: 1;
> +   if (skb->dev->features & NETIF_F_GRO_UDP) {
> +   if (skb->dev->features & NETIF_F_GRO_FRAGLIST)
> +   flist = !sk || !udp_sk(sk)->gro_enabled;
>
> -   if ((sk && udp_sk(sk)->gro_enabled) || NAPI_GRO_CB(skb)->is_flist) {

I would almost rename NETIF_F_GRO_FRAGLIST to NETIF_F_UDP_GRO_FWD.
Then this could be a !NETIF_F_UDP_GRO_FWD_FRAGLIST toggle on top of
that. If it wasn't for this fraglist option also enabling UDP GRO to
local sockets if set.

That is, if the performance difference is significant enough to
require supporting both types of forwarding, under admin control.

Perhaps the simplest alternative is to add the new feature without
making fraglist dependent on it:

  if ((sk && udp_sk(sk)->gro_enabled) ||
      (skb->dev->features & NETIF_F_GRO_FRAGLIST) ||
      (!sk && (skb->dev->features & NETIF_F_GRO_UDP_FWD)))






> +   gro_by_feat = !sk || flist;
> +   }
> +
> +   NAPI_GRO_CB(skb)->is_flist = flist;
> +
> +   if (gro_by_feat || (sk && udp_sk(sk)->gro_enabled)) {
> pp = call_gro_receive(udp_gro_receive_segment, head, skb);
> return pp;
> }
> --
> 2.30.0
>
>


Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-18 Thread Willem de Bruijn
> > > What it does not give is a type indication, such as
> > > VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
> > > In datapaths where the NIC has already computed the four-tuple hash
> > > and stored it in skb->hash --the common case for servers--, that type
> > > field is the only reason to have to compute again.
> >  The problem is there's no guarantee that the packet comes from the NIC,
> >  it could be a simple VM2VM or host2VM packet.
> > 
> >  And even if the packet is coming from the NIC that calculates the hash
> >  there's no guarantee that it's the has that guest want (guest may use
> >  different RSS keys).
> > >>> Ah yes, of course.
> > >>>
> > >>> I would still revisit the need to store a detailed hash_type along with
> > >>> the hash, as as far I can tell that conveys no actionable information
> > >>> to the guest.
> > >>
> > >> Yes, need to figure out its usage. According to [1], it only mention
> > >> that storing has type is a charge of driver. Maybe Yuri can answer this.
> > >>
> > > For the case of Windows VM we can't know how exactly the network stack
> > > uses provided hash data (including hash type). But: different releases
> > > of Windows
> > > enable different hash types (for example UDP hash is enabled only on
> > > Server 2016 and up).
> > >
> > > Indeed the Windows requires a little more from the network adapter/driver
> > > than Linux does.
> > >
> > > The addition of RSS support to virtio specification takes into account
> > > the widest set of
> > > requirements (i.e. Windows one), our initial impression is that this
> > > should be enough also for Linux.
> > >
> > > The NDIS specification, in the part covering RSS, is _mandatory_ and there are
> > > certification tests
> > > that check that the driver provides the hash data as expected. All the
> > > high-performance
> > > network adapters have such RSS functionality in the hardware.

Thanks for the context.

If Windows requires the driver to pass the hash-type along with the
hash data, then indeed this will be needed.

If it only requires the device to support a subset of of the possible
types, chosen at init, that would be different and it would be cheaper
for the driver to pass this config to the device one time.

> > > With pre-RSS QEMU (i.e. where the virtio-net device does not indicate
> > > the RSS support)
> > > the virtio-net driver for Windows does all the job related to RSS:
> > > - hash calculation
> > > - hash/hash_type delivery
> > > - reporting each packet on the correct CPU according to RSS settings
> > >
> > > With RSS support in QEMU all the packets always come on a proper CPU and
> > > the driver never needs to reschedule them. The driver still need to
> > > calculate the
> > > hash and report it to Windows. In this case we do the same job twice: the 
> > > device
> > > (QEMU or eBPF) does calculate the hash and get proper queue/CPU to deliver
> > > the packet. But the hash is not delivered by the device, so the driver 
> > > needs to
> > > recalculate it and report it to Windows.
> > >
> > > If we add HASH_REPORT support (current set of patches) and the device
> > > indicates this
> > > feature we can avoid hash recalculation in the driver assuming we
> > > receive the correct hash
> > > value and hash type. Otherwise the driver can't know exactly which
> > > hash the device has calculated.
> > >
> > > Please let me know if I did not answer the question.
> >
> >
> > I think I get you. The hash type is also a kind of classification (e.g.
> > TCP or UDP). Any possibility that it can be deduced from the driver? (Or
> > it could be too expensive to do that).
> >
> The driver does it today (when the device does not offer any features)
> and of course can continue doing it.
> IMO if the device can't report the data according to the spec it
> should not indicate support for the respective feature (or fallback to
> vhost=off).
> Again, IMO if Linux does not need the exact hash_type we can use (for
> Linux) the way that Willem de Bruijn suggested in his patchset:
> - just add VIRTIO_NET_HASH_REPORT_L4 to the spec
> - Linux can use MQ + hash delivery (and use VIRTIO_NET_HASH_REPORT_L4)
> - Linux can use (if makes sense) RSS with VIRTIO_NET_HASH_REPORT_L4 and eBPF
> - Windows gets what it needs + eBPF
> So, everyone has what they need at the respective cost.
>
> Regarding use of skb->cb for hash type:
> Currently, if I'm not mistaken, there are 2 bytes free at the end of
> skb->cb: skb->cb is a 48-byte array, and skb_gso_cb (14 bytes) sits at
> offset SKB_GSO_CB_OFFSET(32).
> Is it possible to use one of these 2 bytes for hash_type?
> If yes, shall we extend the skb_gso_cb and place the 1-byte hash_type
> in it, or just emit a compilation error if skb_gso_cb grows beyond 15
> bytes?

Good catch on segmentation taking place between .ndo_select_queue and
.ndo_start_xmit.

That also means that whatever field in the skb is used has to be
copied to all segments during segmentation.

Re: arch/arm64/include/asm/syscall_wrapper.h:41:25: warning: no previous prototype for '__arm64_compat_sys_epoll_pwait2'

2021-01-14 Thread Willem de Bruijn
On Thu, Jan 14, 2021 at 3:05 AM kernel test robot  wrote:
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
> master
> head:   65f0d2414b7079556fbbcc070b3d1c9f9587606d
> commit: b0a0c2615f6f199a656ed8549d7dce625d77aa77 epoll: wire up syscall 
> epoll_pwait2
> date:   4 weeks ago
> config: arm64-randconfig-r005-20210113 (attached as .config)
> compiler: aarch64-linux-gcc (GCC) 9.3.0
> reproduce (this is a W=1 build):
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b0a0c2615f6f199a656ed8549d7dce625d77aa77
> git remote add linus 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> git fetch --no-tags linus master
> git checkout b0a0c2615f6f199a656ed8549d7dce625d77aa77
> # save the attached .config to linux build tree
> COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
> ARCH=arm64
>
> If you fix the issue, kindly add following tag as appropriate
> Reported-by: kernel test robot 
>
> All warnings (new ones prefixed by >>):
>
>  | ^~~~
>kernel/sys_ni.c:45:1: note: in expansion of macro 'COND_SYSCALL'
>   45 | COND_SYSCALL(io_getevents_time32);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_getevents' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:46:1: note: in expansion of macro 'COND_SYSCALL'
>   46 | COND_SYSCALL(io_getevents);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_pgetevents_time32' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:47:1: note: in expansion of macro 'COND_SYSCALL'
>   47 | COND_SYSCALL(io_pgetevents_time32);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_pgetevents' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:48:1: note: in expansion of macro 'COND_SYSCALL'
>   48 | COND_SYSCALL(io_pgetevents);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:41:25: warning: no previous 
> prototype for '__arm64_compat_sys_io_pgetevents_time32' [-Wmissing-prototypes]
>   41 |  asmlinkage long __weak __arm64_compat_sys_##name(const struct 
> pt_regs *regs) \
>  | ^~~
>kernel/sys_ni.c:49:1: note: in expansion of macro 'COND_SYSCALL_COMPAT'
>   49 | COND_SYSCALL_COMPAT(io_pgetevents_time32);
>  | ^~~
>arch/arm64/include/asm/syscall_wrapper.h:41:25: warning: no previous 
> prototype for '__arm64_compat_sys_io_pgetevents' [-Wmissing-prototypes]
>   41 |  asmlinkage long __weak __arm64_compat_sys_##name(const struct 
> pt_regs *regs) \
>  | ^~~
>kernel/sys_ni.c:50:1: note: in expansion of macro 'COND_SYSCALL_COMPAT'
>   50 | COND_SYSCALL_COMPAT(io_pgetevents);
>  | ^~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_uring_setup' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:51:1: note: in expansion of macro 'COND_SYSCALL'
>   51 | COND_SYSCALL(io_uring_setup);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_uring_enter' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:52:1: note: in expansion of macro 'COND_SYSCALL'
>   52 | COND_SYSCALL(io_uring_enter);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for '__arm64_sys_io_uring_register' [-Wmissing-prototypes]
>   76 |  asmlinkage long __weak __arm64_sys_##name(const struct pt_regs 
> *regs) \
>  | ^~~~
>kernel/sys_ni.c:53:1: note: in expansion of macro 'COND_SYSCALL'
>   53 | COND_SYSCALL(io_uring_register);
>  | ^~~~
>arch/arm64/include/asm/syscall_wrapper.h:76:25: warning: no previous 
> prototype for 

Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-13 Thread Willem de Bruijn
On Tue, Jan 12, 2021 at 11:11 PM Jason Wang  wrote:
>
>
On 2021/1/13 7:47 AM, Willem de Bruijn wrote:
> > On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
> >  wrote:
> >> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
> >>  wrote:
> >>> On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
> >>>  wrote:
> >>>> Existing TUN module is able to use provided "steering eBPF" to
> >>>> calculate per-packet hash and derive the destination queue to
> >>>> place the packet to. The eBPF uses mapped configuration data
> >>>> containing a key for hash calculation and indirection table
> >>>> with array of queues' indices.
> >>>>
> >>>> This series of patches adds support for virtio-net hash reporting
> >>>> feature as defined in virtio specification. It extends the TUN module
> >>>> and the "steering eBPF" as follows:
> >>>>
> >>>> Extended steering eBPF calculates the hash value and hash type, keeps
> >>>> hash value in the skb->hash and returns index of destination virtqueue
> >>>> and the type of the hash. TUN module keeps returned hash type in
> >>>> (currently unused) field of the skb.
> >>>> skb->__unused renamed to 'hash_report_type'.
> >>>>
> >>>> When TUN module is called later to allocate and fill the virtio-net
> >>>> header and push it to destination virtqueue it populates the hash
> >>>> and the hash type into virtio-net header.
> >>>>
> >>>> VHOST driver is made aware of respective virtio-net feature that
> >>>> extends the virtio-net header to report the hash value and hash report
> >>>> type.
> >>> Comment from Willem de Bruijn:
> >>>
> >>> Skbuff fields are in short supply. I don't think we need to add one
> >>> just for this narrow path entirely internal to the tun device.
> >>>
> >> We understand that and try to minimize the impact by using an already
> >> existing unused field of skb.
> > Not anymore. It was repurposed as a flags field very recently.
> >
> > This use case is also very narrow in scope. And a very short path from
> > data producer to consumer. So I don't think it needs to claim scarce
> > bits in the skb.
> >
> > tun_ebpf_select_queue stores the field, tun_put_user reads it and
> > converts it to the virtio_net_hdr in the descriptor.
> >
> > tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
> > field in skb->cb is fragile, as in theory some code could overwrite
> > that between field between ndo_select_queue and
> > ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
> > control again. But in practice, I don't believe anything does.
> >
> > Alternatively an existing skb field that is used only on disjoint
> > datapaths, such as ingress-only, could be viable.
>
>
> A question here. We had metadata support in XDP for cooperation between
> eBPF programs. Do we have something similar in the skb?
>
> E.g. in the RSS, if we want to pass some metadata information between
> eBPF program and the logic that generates the vnet header (either hard
> logic in the kernel or another eBPF program). Is there any way that can
> avoid the possible conflicts of qdiscs?

Not that I am aware of. The closest thing is cb[].

It'll have to alias a field like that, one that is known to be unused for the given path.

One other approach that has been used within linear call stacks is out
of band. Like percpu variables softnet_data.xmit.more and
mirred_rec_level. But that is perhaps a bit overwrought for this use
case.
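
For illustration only, that pattern would look roughly like this
(hypothetical names; it leans on select_queue and the transmit path
running back-to-back on the same CPU, which is exactly the fragile
part):

	struct tun_xmit_meta {
		u32 hash;
		u8 hash_report_type;
	};
	static DEFINE_PER_CPU(struct tun_xmit_meta, tun_xmit_meta);

	/* in .ndo_select_queue */
	this_cpu_write(tun_xmit_meta.hash_report_type, htype);

	/* later, in tun_put_user */
	htype = this_cpu_read(tun_xmit_meta.hash_report_type);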

> >
> >>> Instead, you could just run the flow_dissector in tun_put_user if the
> >>> feature is negotiated. Indeed, the flow dissector seems more apt to me
> >>> than BPF here. Note that the flow dissector internally can be
> >>> overridden by a BPF program if the admin so chooses.
> >>>
> >> When this set of patches is related to hash delivery in the virtio-net
> >> packet in general,
> >> it was prepared in context of RSS feature implementation as defined in
> >> virtio spec [1]
> >> In case of RSS it is not enough to run the flow_dissector in tun_put_user:
> >> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
> >> hash type and queue index
> >> according to the (mapped) parameters (key, hash types, indirection
> >> table) received from the guest.
> > TUNSETSTEERINGEBPF was added to support more diverse queue selection

Re: [RFC PATCH 0/7] Support for virtio-net hash reporting

2021-01-12 Thread Willem de Bruijn
On Tue, Jan 12, 2021 at 3:29 PM Yuri Benditovich
 wrote:
>
> On Tue, Jan 12, 2021 at 9:49 PM Yuri Benditovich
>  wrote:
> >
> > On Tue, Jan 12, 2021 at 9:41 PM Yuri Benditovich
> >  wrote:
> > >
> > > Existing TUN module is able to use provided "steering eBPF" to
> > > calculate per-packet hash and derive the destination queue to
> > > place the packet to. The eBPF uses mapped configuration data
> > > containing a key for hash calculation and indirection table
> > > with array of queues' indices.
> > >
> > > This series of patches adds support for virtio-net hash reporting
> > > feature as defined in virtio specification. It extends the TUN module
> > > and the "steering eBPF" as follows:
> > >
> > > Extended steering eBPF calculates the hash value and hash type, keeps
> > > hash value in the skb->hash and returns index of destination virtqueue
> > > and the type of the hash. TUN module keeps returned hash type in
> > > (currently unused) field of the skb.
> > > skb->__unused renamed to 'hash_report_type'.
> > >
> > > When TUN module is called later to allocate and fill the virtio-net
> > > header and push it to destination virtqueue it populates the hash
> > > and the hash type into virtio-net header.
> > >
> > > VHOST driver is made aware of respective virtio-net feature that
> > > extends the virtio-net header to report the hash value and hash report
> > > type.
> >
> > Comment from Willem de Bruijn:
> >
> > Skbuff fields are in short supply. I don't think we need to add one
> > just for this narrow path entirely internal to the tun device.
> >
>
> We understand that and try to minimize the impact by using an already
> existing unused field of skb.

Not anymore. It was repurposed as a flags field very recently.

This use case is also very narrow in scope. And a very short path from
data producer to consumer. So I don't think it needs to claim scarce
bits in the skb.

tun_ebpf_select_queue stores the field, tun_put_user reads it and
converts it to the virtio_net_hdr in the descriptor.

tun_ebpf_select_queue is called from .ndo_select_queue.  Storing the
field in skb->cb is fragile, as in theory some code could overwrite
that between field between ndo_select_queue and
ndo_start_xmit/tun_net_xmit, from which point it is fully under tun
control again. But in practice, I don't believe anything does.

Alternatively an existing skb field that is used only on disjoint
datapaths, such as ingress-only, could be viable.

> > Instead, you could just run the flow_dissector in tun_put_user if the
> > feature is negotiated. Indeed, the flow dissector seems more apt to me
> > than BPF here. Note that the flow dissector internally can be
> > overridden by a BPF program if the admin so chooses.
> >
> When this set of patches is related to hash delivery in the virtio-net
> packet in general,
> it was prepared in context of RSS feature implementation as defined in
> virtio spec [1]
> In case of RSS it is not enough to run the flow_dissector in tun_put_user:
> in tun_ebpf_select_queue the TUN calls eBPF to calculate the hash,
> hash type and queue index
> according to the (mapped) parameters (key, hash types, indirection
> table) received from the guest.

TUNSETSTEERINGEBPF was added to support more diverse queue selection
than the default in case of multiqueue tun. Not sure what the exact
use cases are.

But RSS is exactly the purpose of the flow dissector. It is used for
that purpose in the software variant RPS. The flow dissector
implements a superset of the RSS spec, and certainly computes a
four-tuple for TCP/IPv6. In the case of RPS, it is skipped if the NIC
has already computed a 4-tuple hash.

What it does not give is a type indication, such as
VIRTIO_NET_HASH_TYPE_TCPv6. I don't understand how this would be used.
In datapaths where the NIC has already computed the four-tuple hash
and stored it in skb->hash (the common case for servers), that type
field is the only reason to have to compute it again.
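
As a sketch of what that could look like in tun_put_user, assuming the
flow dissector route (illustrative and untested; the hash-type mapping
shown is a stand-in, not the full spec mapping): skb_get_hash() reuses
a valid NIC-provided skb->hash and only falls back to software
dissection, while the basic-keys dissection below is the extra work
needed solely for the type field:

static u32 example_rx_hash(struct sk_buff *skb, u8 *hash_type)
{
	struct flow_keys_basic keys = {};

	*hash_type = 0;	/* stand-in for VIRTIO_NET_HASH_REPORT_NONE */

	if (skb_flow_dissect_flow_keys_basic(NULL, skb, &keys,
					     NULL, 0, 0, 0, 0) &&
	    keys.basic.n_proto == htons(ETH_P_IPV6) &&
	    keys.basic.ip_proto == IPPROTO_TCP)
		*hash_type = 4;	/* stand-in for VIRTIO_NET_HASH_REPORT_TCPv6 */

	return skb_get_hash(skb);
}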

> Our intention is to keep the hash and hash type in the skb to populate them
> into a virtio-net header later in tun_put_user.
> Note that in this case the type of calculated hash is selected not
> only from flow dissections
> but also from limitations provided by the guest.
>
> This is already implemented in qemu (for case of vhost=off), see [2]
> (virtio_net_process_rss)
> For case of vhost=on there are WIP for qemu to load eBPF and attach it to TUN.

> Note that exact way of selecting rx virtqueue depends on the guest,
> it could be automatic steering (typical for Linux VM), RSS (typical
> for Windows

Re: [PATCH 0/6] fs: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
On Mon, Jan 11, 2021 at 7:58 PM Al Viro  wrote:
>
> On Mon, Jan 11, 2021 at 07:30:11PM -0500, Willem de Bruijn wrote:
> > From: Willem de Bruijn 
> >
> > Use in_compat_syscall() to differentiate compat handling exactly
> > where needed, including in nested function calls. Then remove
> > duplicated code in callers.
>
> IMO it's a bad idea.  Use of in_compat_syscall() is hard to avoid
> in some cases, but let's not use it without a good reason.  It
> makes the code harder to reason about.

In the specific cases of select, poll and epoll, this removes quite a
bit of duplicate code that may diverge over time. Indeed, for select
it already has. Reduction of duplication may also make subsequent changes
more feasible. We discussed avoiding in epoll an unnecessary
ktime_get_ts64 in select_estimate_accuracy, which requires plumbing a
variable through these intermediate helpers.

I also personally find the code simpler to understand without the
various near duplicates. The change exposes their differences
more clearly. select is the best example of this.

The last two patches I added based on earlier comments. Perhaps
the helper in 5 adds more churn than it's worth.


[PATCH 1/6] selftests/filesystems: add initial select and poll selftest

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Add initial code coverage for select, pselect, poll and ppoll.

Open a socketpair and wait for a read event.
1. run with data waiting
2. run to timeout, if a (short) timeout is specified.

Also optionally pass sigset to pselect and ppoll, to exercise
all datapaths. Build with -m32, -mx32 and -m64 to cover all the
various compat and 32/64-bit time syscall implementations.

Signed-off-by: Willem de Bruijn 
---
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   2 +-
 .../selftests/filesystems/selectpoll.c| 207 ++
 3 files changed, 209 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/filesystems/selectpoll.c

diff --git a/tools/testing/selftests/filesystems/.gitignore b/tools/testing/selftests/filesystems/.gitignore
index f0c0ff20d6cf..d4a2e50475ea 100644
--- a/tools/testing/selftests/filesystems/.gitignore
+++ b/tools/testing/selftests/filesystems/.gitignore
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0-only
 dnotify_test
 devpts_pts
+selectpoll
diff --git a/tools/testing/selftests/filesystems/Makefile b/tools/testing/selftests/filesystems/Makefile
index 129880fb42d3..8de184865fa4 100644
--- a/tools/testing/selftests/filesystems/Makefile
+++ b/tools/testing/selftests/filesystems/Makefile
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0
 
 CFLAGS += -I../../../../usr/include/
-TEST_GEN_PROGS := devpts_pts
+TEST_GEN_PROGS := devpts_pts selectpoll
 TEST_GEN_PROGS_EXTENDED := dnotify_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/filesystems/selectpoll.c b/tools/testing/selftests/filesystems/selectpoll.c
new file mode 100644
index ..315da0786a6c
--- /dev/null
+++ b/tools/testing/selftests/filesystems/selectpoll.c
@@ -0,0 +1,207 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <poll.h>
+#include <signal.h>
+#include <stdio.h>
+#include <sys/select.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+#include "../kselftest_harness.h"
+
+const unsigned long timeout_us = 5UL * 1000;
+const unsigned long timeout_ns = timeout_us * 1000;
+
+/* (p)select: basic invocation, optionally with data waiting */
+
+FIXTURE(select_basic)
+{
+   fd_set readfds;
+   int sfd[2];
+};
+
+FIXTURE_SETUP(select_basic)
+{
+   ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, self->sfd), 0);
+
+   FD_ZERO(&self->readfds);
+   FD_SET(self->sfd[0], &self->readfds);
+   FD_SET(self->sfd[1], &self->readfds);
+}
+
+FIXTURE_TEARDOWN(select_basic)
+{
+   /* FD_ISSET(self->sfd[0]) tested in TEST_F: depends on timeout */
+   ASSERT_EQ(FD_ISSET(self->sfd[1], &self->readfds), 0);
+
+   EXPECT_EQ(close(self->sfd[0]), 0);
+   EXPECT_EQ(close(self->sfd[1]), 0);
+}
+
+TEST_F(select_basic, select)
+{
+   ASSERT_EQ(write(self->sfd[1], "w", 1), 1);
+   ASSERT_EQ(select(self->sfd[1] + 1, &self->readfds,
+NULL, NULL, NULL), 1);
+   ASSERT_NE(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, select_with_timeout)
+{
+   struct timeval tv = { .tv_usec = timeout_us };
+
+   ASSERT_EQ(write(self->sfd[1], "w", 1), 1);
+   ASSERT_EQ(select(self->sfd[1] + 1, &self->readfds,
+NULL, NULL, &tv), 1);
+   ASSERT_GE(tv.tv_usec, 1000);
+   ASSERT_NE(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, select_timeout)
+{
+   struct timeval tv = { .tv_usec = timeout_us };
+
+   ASSERT_EQ(select(self->sfd[1] + 1, &self->readfds,
+NULL, NULL, &tv), 0);
+   ASSERT_EQ(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, pselect)
+{
+   ASSERT_EQ(write(self->sfd[1], "w", 1), 1);
+   ASSERT_EQ(pselect(self->sfd[1] + 1, &self->readfds,
+ NULL, NULL, NULL, NULL), 1);
+   ASSERT_NE(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, pselect_with_timeout)
+{
+   struct timespec ts = { .tv_nsec = timeout_ns };
+
+   ASSERT_EQ(write(self->sfd[1], "w", 1), 1);
+   ASSERT_EQ(pselect(self->sfd[1] + 1, &self->readfds,
+ NULL, NULL, &ts, NULL), 1);
+   ASSERT_GE(ts.tv_nsec, 1000);
+   ASSERT_NE(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, pselect_timeout)
+{
+   struct timespec ts = { .tv_nsec = timeout_ns };
+
+   ASSERT_EQ(pselect(self->sfd[1] + 1, &self->readfds,
+ NULL, NULL, &ts, NULL), 0);
+   ASSERT_EQ(FD_ISSET(self->sfd[0], &self->readfds), 0);
+}
+
+TEST_F(select_basic, pselect_sigset_with_timeout)
+{
+   struct timespec ts = { .tv_nsec = timeout_ns };
+   sigset_t sigmask;
+
+   sigemptyset(&sigmask);
+   sigaddset(&sigmask, SIGUSR1);
+   sigprocmask(SIG_SETMASK, &sigmask, NULL);
+   sigemptyset(&sigmask);
+
+   ASSERT_EQ(write(self->sfd[1], "w", 1), 1);
+   ASSERT_EQ(pselect(self->sfd[1] + 1, &

[PATCH 3/6] ppoll: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Apply the same compat deduplication strategy to ppoll that was
previously applied to select and pselect.

Like pselect, ppoll has timespec and sigmask arguments, which have
compat variants. poll has neither, so is not modified.

Convert the ppoll syscall to a do_ppoll() helper that branches on
timespec and sigmask variants internally.

This allows calling the same implementation for all syscall variants:
standard, time32, compat, compat + time32.

Signed-off-by: Willem de Bruijn 
---
 fs/select.c | 91 ++---
 1 file changed, 30 insertions(+), 61 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index dee7dfc5217b..27567795a892 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -1120,28 +1120,48 @@ SYSCALL_DEFINE3(poll, struct pollfd __user *, ufds, unsigned int, nfds,
 return ret;
 }
 
-SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
-   struct __kernel_timespec __user *, tsp, const sigset_t __user *, sigmask,
-   size_t, sigsetsize)
+static int do_ppoll(struct pollfd __user *ufds, unsigned int nfds,
+   void __user *tsp, const void __user *sigmask,
+   size_t sigsetsize, enum poll_time_type type)
 {
 struct timespec64 ts, end_time, *to = NULL;
 int ret;
 
 if (tsp) {
-   if (get_timespec64(&ts, tsp))
-   return -EFAULT;
+   switch (type) {
+   case PT_TIMESPEC:
+   if (get_timespec64(&ts, tsp))
+   return -EFAULT;
+   break;
+   case PT_OLD_TIMESPEC:
+   if (get_old_timespec32(&ts, tsp))
+   return -EFAULT;
+   break;
+   default:
+   BUG();
+   }
 
 to = &end_time;
 if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
 return -EINVAL;
 }
 
-   ret = set_user_sigmask(sigmask, sigsetsize);
+   if (!in_compat_syscall())
+   ret = set_user_sigmask(sigmask, sigsetsize);
+   else
+   ret = set_compat_user_sigmask(sigmask, sigsetsize);
 if (ret)
 return ret;
 
 ret = do_sys_poll(ufds, nfds, to);
-   return poll_select_finish(&end_time, tsp, PT_TIMESPEC, ret);
+   return poll_select_finish(&end_time, tsp, type, ret);
+}
+
+SYSCALL_DEFINE5(ppoll, struct pollfd __user *, ufds, unsigned int, nfds,
+   struct __kernel_timespec __user *, tsp, const sigset_t __user *, sigmask,
+   size_t, sigsetsize)
+{
+   return do_ppoll(ufds, nfds, tsp, sigmask, sigsetsize, PT_TIMESPEC);
 }
 
 #if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT)
@@ -1150,24 +1170,7 @@ SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds, unsigned int, nfds,
 struct old_timespec32 __user *, tsp, const sigset_t __user *, sigmask,
 size_t, sigsetsize)
 {
-   struct timespec64 ts, end_time, *to = NULL;
-   int ret;
-
-   if (tsp) {
-   if (get_old_timespec32(&ts, tsp))
-   return -EFAULT;
-
-   to = &end_time;
-   if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
-   return -EINVAL;
-   }
-
-   ret = set_user_sigmask(sigmask, sigsetsize);
-   if (ret)
-   return ret;
-
-   ret = do_sys_poll(ufds, nfds, to);
-   return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret);
+   return do_ppoll(ufds, nfds, tsp, sigmask, sigsetsize, PT_OLD_TIMESPEC);
 }
 #endif
 
@@ -1258,24 +1261,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time32, struct pollfd __user *, ufds,
 unsigned int,  nfds, struct old_timespec32 __user *, tsp,
 const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize)
 {
-   struct timespec64 ts, end_time, *to = NULL;
-   int ret;
-
-   if (tsp) {
-   if (get_old_timespec32(&ts, tsp))
-   return -EFAULT;
-
-   to = &end_time;
-   if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
-   return -EINVAL;
-   }
-
-   ret = set_compat_user_sigmask(sigmask, sigsetsize);
-   if (ret)
-   return ret;
-
-   ret = do_sys_poll(ufds, nfds, to);
-   return poll_select_finish(&end_time, tsp, PT_OLD_TIMESPEC, ret);
+   return do_ppoll(ufds, nfds, tsp, sigmask, sigsetsize, PT_OLD_TIMESPEC);
 }
 #endif
 
@@ -1284,24 +1270,7 @@ COMPAT_SYSCALL_DEFINE5(ppoll_time64, struct pollfd __user *, ufds,
unsigned int,  nfds, struct __kernel_timespec __user *, tsp,
const compat_sigset_t __user *, sigmask, compat_size_t, sigsetsize)
 {
-   struct timespec64 ts, end_time, *to = NULL;
-   int ret;
-
-   if (tsp) {
-   if (get_timespec64(, tsp))
- 

[PATCH 2/6] select: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Select and pselect have multiple syscall implementations to handle
compat and 32-bit time variants.

Deduplicate core logic, which can cause divergence over time as
changes may not be applied consistently. See vmalloc support in
select, for one example.

Handle compat differences using in_compat_syscall() where needed.
Specifically, fd_set and sigmask may be compat variants. Handle
the !in_compat_syscall() case first, for branch prediction.

Handle timeval/timespec differences by passing along the type to
where the pointer is used.

Compat variants of select and old_select can now call standard
kern_select, removing all callers to do_compat_select.

Compat variants of pselect6 (time32 and time64) can now call standard
do_pselect, removing all callers to do_compat_pselect.

That removes both callers to compat_core_sys_select. And with that
callers to compat_[gs]et_fd_set.

Also move up zero_fd_set, to avoid one open-coded variant.

Signed-off-by: Willem de Bruijn 
---
 fs/select.c | 254 
 1 file changed, 57 insertions(+), 197 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 37aaa8317f3a..dee7dfc5217b 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -382,32 +382,39 @@ typedef struct {
 #define FDS_LONGS(nr)  (((nr)+FDS_BITPERLONG-1)/FDS_BITPERLONG)
 #define FDS_BYTES(nr)  (FDS_LONGS(nr)*sizeof(long))
 
+static inline
+void zero_fd_set(unsigned long nr, unsigned long *fdset)
+{
+   memset(fdset, 0, FDS_BYTES(nr));
+}
+
 /*
  * Use "unsigned long" accesses to let user-mode fd_set's be long-aligned.
  */
 static inline
 int get_fd_set(unsigned long nr, void __user *ufdset, unsigned long *fdset)
 {
-   nr = FDS_BYTES(nr);
-   if (ufdset)
-   return copy_from_user(fdset, ufdset, nr) ? -EFAULT : 0;
+   if (!ufdset) {
+   zero_fd_set(nr, fdset);
+   return 0;
+   }
 
-   memset(fdset, 0, nr);
-   return 0;
+   if (!in_compat_syscall())
+   return copy_from_user(fdset, ufdset, FDS_BYTES(nr)) ? -EFAULT : 
0;
+   else
+   return compat_get_bitmap(fdset, ufdset, nr);
 }
 
 static inline unsigned long __must_check
 set_fd_set(unsigned long nr, void __user *ufdset, unsigned long *fdset)
 {
-   if (ufdset)
-   return __copy_to_user(ufdset, fdset, FDS_BYTES(nr));
-   return 0;
-}
+   if (!ufdset)
+   return 0;
 
-static inline
-void zero_fd_set(unsigned long nr, unsigned long *fdset)
-{
-   memset(fdset, 0, FDS_BYTES(nr));
+   if (!in_compat_syscall())
+   return __copy_to_user(ufdset, fdset, FDS_BYTES(nr));
+   else
+   return compat_put_bitmap(ufdset, fdset, nr);
 }
 
 #define FDS_IN(fds, n) (fds->in + n)
@@ -698,15 +705,29 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 }
 
 static int kern_select(int n, fd_set __user *inp, fd_set __user *outp,
-  fd_set __user *exp, struct __kernel_old_timeval __user *tvp)
+  fd_set __user *exp, void __user *tvp,
+  enum poll_time_type type)
 {
struct timespec64 end_time, *to = NULL;
struct __kernel_old_timeval tv;
+   struct old_timeval32 otv;
int ret;
 
if (tvp) {
-   if (copy_from_user(&tv, tvp, sizeof(tv)))
-   return -EFAULT;
+   switch (type) {
+   case PT_TIMEVAL:
+   if (copy_from_user(&tv, tvp, sizeof(tv)))
+   return -EFAULT;
+   break;
+   case PT_OLD_TIMEVAL:
+   if (copy_from_user(&otv, tvp, sizeof(otv)))
+   return -EFAULT;
+   tv.tv_sec = otv.tv_sec;
+   tv.tv_usec = otv.tv_usec;
+   break;
+   default:
+   BUG();
+   }
 
 to = &end_time;
 if (poll_select_set_timeout(to,
@@ -716,18 +737,18 @@ static int kern_select(int n, fd_set __user *inp, fd_set __user *outp,
 }
 
 ret = core_sys_select(n, inp, outp, exp, to);
-   return poll_select_finish(&end_time, tvp, PT_TIMEVAL, ret);
+   return poll_select_finish(&end_time, tvp, type, ret);
 }
 
 SYSCALL_DEFINE5(select, int, n, fd_set __user *, inp, fd_set __user *, outp,
fd_set __user *, exp, struct __kernel_old_timeval __user *, tvp)
 {
-   return kern_select(n, inp, outp, exp, tvp);
+   return kern_select(n, inp, outp, exp, tvp, PT_TIMEVAL);
 }
 
 static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp,
   fd_set __user *exp, void __user *tsp,
-  const sigset_t __user *sigmask, size_t sigsetsize,
+  const void __user *sigmask, size_t sigsetsize,
   enum poll_time_type type)
 {
struct timespec64

[PATCH 5/6] compat: add set_maybe_compat_user_sigmask helper

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Deduplicate the open coded branch on sigmask compat handling.

Signed-off-by: Willem de Bruijn 
Cc: Jens Axboe 
---
 fs/eventpoll.c |  5 +
 fs/io_uring.c  |  9 +
 fs/select.c| 10 ++
 include/linux/compat.h | 10 ++
 4 files changed, 14 insertions(+), 20 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index c9dcffba2da1..c011327c8402 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2247,10 +2247,7 @@ static int do_epoll_pwait(int epfd, struct epoll_event __user *events,
 * If the caller wants a certain signal mask to be set during the wait,
 * we apply it here.
 */
-   if (!in_compat_syscall())
-   error = set_user_sigmask(sigmask, sigsetsize);
-   else
-   error = set_compat_user_sigmask(sigmask, sigsetsize);
+   error = set_maybe_compat_user_sigmask(sigmask, sigsetsize);
if (error)
return error;
 
diff --git a/fs/io_uring.c b/fs/io_uring.c
index fdc923e53873..abc88bc738ce 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -7190,14 +7190,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
} while (1);
 
if (sig) {
-#ifdef CONFIG_COMPAT
-   if (in_compat_syscall())
-   ret = set_compat_user_sigmask((const compat_sigset_t __user *)sig,
- sigsz);
-   else
-#endif
-   ret = set_user_sigmask(sig, sigsz);
-
+   ret = set_maybe_compat_user_sigmask(sig, sigsz);
if (ret)
return ret;
}
diff --git a/fs/select.c b/fs/select.c
index 27567795a892..c013662bbf51 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -773,10 +773,7 @@ static long do_pselect(int n, fd_set __user *inp, fd_set __user *outp,
return -EINVAL;
}
 
-   if (!in_compat_syscall())
-   ret = set_user_sigmask(sigmask, sigsetsize);
-   else
-   ret = set_compat_user_sigmask(sigmask, sigsetsize);
+   ret = set_maybe_compat_user_sigmask(sigmask, sigsetsize);
if (ret)
return ret;
 
@@ -1146,10 +1143,7 @@ static int do_ppoll(struct pollfd __user *ufds, unsigned int nfds,
return -EINVAL;
}
 
-   if (!in_compat_syscall())
-   ret = set_user_sigmask(sigmask, sigsetsize);
-   else
-   ret = set_compat_user_sigmask(sigmask, sigsetsize);
+   ret = set_maybe_compat_user_sigmask(sigmask, sigsetsize);
if (ret)
return ret;
 
diff --git a/include/linux/compat.h b/include/linux/compat.h
index 6e65be753603..4a9b740496b4 100644
--- a/include/linux/compat.h
+++ b/include/linux/compat.h
@@ -18,6 +18,7 @@
 #include <linux/aio_abi.h>	/* for aio_context_t */
 #include <linux/uaccess.h>
 #include <linux/unistd.h>
+#include <linux/sched/signal.h>
 
 #include <asm/compat.h>
 
@@ -942,6 +943,15 @@ static inline bool in_compat_syscall(void) { return false; }
 
 #endif /* CONFIG_COMPAT */
 
+static inline int set_maybe_compat_user_sigmask(const void __user *umask,
+   size_t sigsetsize)
+{
+   if (!in_compat_syscall())
+   return set_user_sigmask(umask, sigsetsize);
+   else
+   return set_compat_user_sigmask(umask, sigsetsize);
+}
+
 /*
  * Some legacy ABIs like the i386 one use less than natural alignment for 
64-bit
  * types, and will need special compat treatment for that.  Most architectures
-- 
2.30.0.284.gd98b1dd5eaa7-goog



[PATCH 6/6] io_pgetevents: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

io_pgetevents has four variants, including compat variants of both
timespec and sigmask.

With set_maybe_compat_user_sigmask helper, the latter can be
deduplicated. Move the shared logic to new do_io_pgetevents,
analogous to do_io_getevents.

Signed-off-by: Willem de Bruijn 
Cc: Benjamin LaHaise 
---
 fs/aio.c | 94 ++--
 1 file changed, 37 insertions(+), 57 deletions(-)

diff --git a/fs/aio.c b/fs/aio.c
index d213be7b8a7e..56460ab47d64 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -2101,6 +2101,31 @@ struct __aio_sigset {
size_t  sigsetsize;
 };
 
+static long do_io_pgetevents(aio_context_t ctx_id,
+   long min_nr,
+   long nr,
+   struct io_event __user *events,
+   struct timespec64 *ts,
+   const void __user *umask,
+   size_t sigsetsize)
+{
+   bool interrupted;
+   int ret;
+
+   ret = set_maybe_compat_user_sigmask(umask, sigsetsize);
+   if (ret)
+   return ret;
+
+   ret = do_io_getevents(ctx_id, min_nr, nr, events, ts);
+
+   interrupted = signal_pending(current);
+   restore_saved_sigmask_unless(interrupted);
+   if (interrupted && !ret)
+   ret = -ERESTARTNOHAND;
+
+   return ret;
+}
+
 SYSCALL_DEFINE6(io_pgetevents,
aio_context_t, ctx_id,
long, min_nr,
@@ -2111,8 +2136,6 @@ SYSCALL_DEFINE6(io_pgetevents,
 {
struct __aio_sigset ksig = { NULL, };
struct timespec64   ts;
-   bool interrupted;
-   int ret;
 
 if (timeout && unlikely(get_timespec64(&ts, timeout)))
 return -EFAULT;
@@ -2120,18 +2143,9 @@ SYSCALL_DEFINE6(io_pgetevents,
 if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
 return -EFAULT;
 
-   ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize);
-   if (ret)
-   return ret;
-
-   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
-
-   interrupted = signal_pending(current);
-   restore_saved_sigmask_unless(interrupted);
-   if (interrupted && !ret)
-   ret = -ERESTARTNOHAND;
-
-   return ret;
+   return do_io_pgetevents(ctx_id, min_nr, nr, events,
+   timeout ? &ts : NULL,
+   ksig.sigmask, ksig.sigsetsize);
 }
 
 #if defined(CONFIG_COMPAT_32BIT_TIME) && !defined(CONFIG_64BIT)
@@ -2146,8 +2160,6 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
 {
struct __aio_sigset ksig = { NULL, };
struct timespec64   ts;
-   bool interrupted;
-   int ret;
 
 if (timeout && unlikely(get_old_timespec32(&ts, timeout)))
 return -EFAULT;
@@ -2155,19 +2167,9 @@ SYSCALL_DEFINE6(io_pgetevents_time32,
 if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
 return -EFAULT;
 
-
-   ret = set_user_sigmask(ksig.sigmask, ksig.sigsetsize);
-   if (ret)
-   return ret;
-
-   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &ts : NULL);
-
-   interrupted = signal_pending(current);
-   restore_saved_sigmask_unless(interrupted);
-   if (interrupted && !ret)
-   ret = -ERESTARTNOHAND;
-
-   return ret;
+   return do_io_pgetevents(ctx_id, min_nr, nr, events,
+   timeout ? &ts : NULL,
+   ksig.sigmask, ksig.sigsetsize);
 }
 
 #endif
@@ -2213,8 +2215,6 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
 {
struct __compat_aio_sigset ksig = { 0, };
struct timespec64 t;
-   bool interrupted;
-   int ret;
 
 if (timeout && get_old_timespec32(&t, timeout))
 return -EFAULT;
@@ -,18 +,9 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents,
 if (usig && copy_from_user(&ksig, usig, sizeof(ksig)))
 return -EFAULT;
 
-   ret = set_compat_user_sigmask(compat_ptr(ksig.sigmask), ksig.sigsetsize);
-   if (ret)
-   return ret;
-
-   ret = do_io_getevents(ctx_id, min_nr, nr, events, timeout ? &t : NULL);
-
-   interrupted = signal_pending(current);
-   restore_saved_sigmask_unless(interrupted);
-   if (interrupted && !ret)
-   ret = -ERESTARTNOHAND;
-
-   return ret;
+   return do_io_pgetevents(ctx_id, min_nr, nr, events,
+   timeout ? &t : NULL,
+   compat_ptr(ksig.sigmask), ksig.sigsetsize);
 }
 
 #endif
@@ -2248,8 +2239,6 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
 {
struct __compat_aio_sigset ksig = { 0, };
struct timespec64 t;
-   bool interrupted;
-   int ret;
 
 if (timeout && get_timespec64(&t, timeout))
return -EFAULT;
@@ -2257,17 +2246,8 @@ COMPAT_SYSCALL_DEFINE6(io_pgetevents_time64,
   

[PATCH 0/6] fs: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Use in_compat_syscall() to differentiate compat handling exactly
where needed, including in nested function calls. Then remove
duplicated code in callers.

Changes
  RFC[1]->v1
  - remove kselftest dependency on variant support in teardown
(patch is out for review, not available in linux-next/akpm yet)
  - add patch 5: deduplicate set_user_sigmask compat handling
  - add patch 6: deduplicate io_pgetevents sigmask compat handling

[1] RFC: https://github.com/wdebruij/linux-next-mirror/tree/select-compat-1

Willem de Bruijn (6):
  selftests/filesystems: add initial select and poll selftest
  select: deduplicate compat logic
  ppoll: deduplicate compat logic
  epoll: deduplicate compat logic
  compat: add set_maybe_compat_user_sigmask helper
  io_pgetevents: deduplicate compat logic

 fs/aio.c  |  94 ++---
 fs/eventpoll.c|  35 +-
 fs/io_uring.c |   9 +-
 fs/select.c   | 339 +-
 include/linux/compat.h|  10 +
 .../testing/selftests/filesystems/.gitignore  |   1 +
 tools/testing/selftests/filesystems/Makefile  |   2 +-
 .../selftests/filesystems/selectpoll.c| 207 +++
 8 files changed, 344 insertions(+), 353 deletions(-)
 create mode 100644 tools/testing/selftests/filesystems/selectpoll.c

-- 
2.30.0.284.gd98b1dd5eaa7-goog



[PATCH 4/6] epoll: deduplicate compat logic

2021-01-11 Thread Willem de Bruijn
From: Willem de Bruijn 

Apply the same compat deduplication strategy to epoll that was
previously applied to (p)select and ppoll.

Make do_epoll_wait handle both variants of sigmask. This removes
the need for near duplicate do_compat_epoll_pwait.

Signed-off-by: Willem de Bruijn 
---
 fs/eventpoll.c | 38 +-
 1 file changed, 9 insertions(+), 29 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index a829af074eb5..c9dcffba2da1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2239,7 +2239,7 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
  */
 static int do_epoll_pwait(int epfd, struct epoll_event __user *events,
  int maxevents, struct timespec64 *to,
- const sigset_t __user *sigmask, size_t sigsetsize)
+ const void __user *sigmask, size_t sigsetsize)
 {
int error;
 
@@ -2247,7 +2247,10 @@ static int do_epoll_pwait(int epfd, struct epoll_event __user *events,
 * If the caller wants a certain signal mask to be set during the wait,
 * we apply it here.
 */
-   error = set_user_sigmask(sigmask, sigsetsize);
+   if (!in_compat_syscall())
+   error = set_user_sigmask(sigmask, sigsetsize);
+   else
+   error = set_compat_user_sigmask(sigmask, sigsetsize);
if (error)
return error;
 
@@ -2288,28 +2291,6 @@ SYSCALL_DEFINE6(epoll_pwait2, int, epfd, struct epoll_event __user *, events,
 }
 
 #ifdef CONFIG_COMPAT
-static int do_compat_epoll_pwait(int epfd, struct epoll_event __user *events,
-int maxevents, struct timespec64 *timeout,
-const compat_sigset_t __user *sigmask,
-compat_size_t sigsetsize)
-{
-   long err;
-
-   /*
-* If the caller wants a certain signal mask to be set during the wait,
-* we apply it here.
-*/
-   err = set_compat_user_sigmask(sigmask, sigsetsize);
-   if (err)
-   return err;
-
-   err = do_epoll_wait(epfd, events, maxevents, timeout);
-
-   restore_saved_sigmask_unless(err == -EINTR);
-
-   return err;
-}
-
 COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
   struct epoll_event __user *, events,
   int, maxevents, int, timeout,
@@ -2318,9 +2299,9 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
 {
struct timespec64 to;
 
-   return do_compat_epoll_pwait(epfd, events, maxevents,
-ep_timeout_to_timespec(&to, timeout),
-sigmask, sigsetsize);
+   return do_epoll_pwait(epfd, events, maxevents,
+ ep_timeout_to_timespec(&to, timeout),
+ sigmask, sigsetsize);
 }
 
 COMPAT_SYSCALL_DEFINE6(epoll_pwait2, int, epfd,
@@ -2340,8 +2321,7 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait2, int, epfd,
return -EINVAL;
}
 
-   return do_compat_epoll_pwait(epfd, events, maxevents, to,
-sigmask, sigsetsize);
+   return do_epoll_pwait(epfd, events, maxevents, to, sigmask, sigsetsize);
 }
 
 #endif
-- 
2.30.0.284.gd98b1dd5eaa7-goog



Re: [PATCH v3 1/2] epoll: add nsec timeout support with epoll_pwait2

2021-01-11 Thread Willem de Bruijn
On Thu, Dec 10, 2020 at 5:59 PM Willem de Bruijn
 wrote:
>
> On Thu, Dec 10, 2020 at 3:34 PM Arnd Bergmann  wrote:
> >
> > On Thu, Dec 10, 2020 at 6:33 PM Willem de Bruijn
> >  wrote:
> > > On Sat, Nov 21, 2020 at 4:27 AM Arnd Bergmann  wrote:
> > > > On Fri, Nov 20, 2020 at 11:28 PM Willem de Bruijn 
> > > >  wrote:
> > > > I would imagine this can be done like the way I proposed
> > > > for get_bitmap() in sys_migrate_pages:
> > > >
> > > > https://lore.kernel.org/lkml/20201102123151.2860165-4-a...@kernel.org/
> > >
> > > Coming back to this. Current patchset includes new select and poll
> > > selftests to verify the changes. I need to send a small kselftest
> > > patch for that first.
> > >
> > > Assuming there's no time pressure, I will finish up and send the main
> > > changes after the merge window, for the next release then.
> > >
> > > Current state against linux-next at
> > > https://github.com/wdebruij/linux-next-mirror/tree/select-compat-1
> >
> > Ok, sounds good to me. I've had a (very brief) look and have one
> > suggestion: instead of open-coding the compat vs native mode
> > in multiple places like
> >
> > if (!in_compat_syscall())
> >  return copy_from_user(fdset, ufdset, FDS_BYTES(nr)) ? -EFAULT : 0;
> > else
> >  return compat_get_bitmap(fdset, ufdset, nr);
> >
> > maybe move this into a separate function and call that where needed.
> >
> > I've done this for the get_bitmap() function in my series at
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/commit/?h=compat-alloc-user-space-7=b1b23ebb12b635654a2060df49455167a142c5d2
> >
> > The definition is slightly differrent for cpumask, nodemask and fd_set,
> > so we'd need to try out the best way to structure the code to end
> > up with the most readable version, but it should be possible when
> > there are only three callers (and duplicating the function would
> > be the end of the world either)
>
> For fd_set there is only a single caller for each direction. Do you
> prefer helpers even so?
>
> For sigmask, with three callers, something along the lines of this?
>
>   @@ -1138,10 +1135,7 @@ static int do_ppoll(struct pollfd __user
> *ufds, unsigned int nfds,
>   return -EINVAL;
>   }
>
>   -   if (!in_compat_syscall())
>   -   ret = set_user_sigmask(sigmask, sigsetsize);
>   -   else
>   -   ret = set_compat_user_sigmask(sigmask, sigsetsize);
>   +   ret = set_maybe_compat_user_sigmask(sigmask, sigsetsize);
>   if (ret)
>   return ret;
>
>   --- a/include/linux/compat.h
>   +++ b/include/linux/compat.h
>   @@ -942,6 +942,17 @@ static inline bool in_compat_syscall(void) {
> return false; }
>
>   +static inline int set_maybe_compat_user_sigmask(const void __user *sigmask,
>   +   size_t sigsetsize)
>   +{
>   +#if defined CONFIG_COMPAT
>   +   if (unlikely(in_compat_syscall()))
>   +   return set_compat_user_sigmask(sigmask, sigsetsize);
>   +#endif
>   +
>   +   return set_user_sigmask(sigmask, sigsetsize);
>   +}

set_user_sigmask is the only open-coded variant that is used more than once.

Because it is used in both select.c and eventpoll.c, a helper would
have to live in compat.h. This then needs a new dependency on
sched_signal.h.

So given that this is a simple branch, it might just make logic more
complex, instead of less. I can add this change in a separate patch on
top of the original three, to judge whether it is worthwhile.


Re: [BUG] from x86: Support kmap_local() forced debugging

2021-01-07 Thread Willem de Bruijn
On Thu, Jan 7, 2021 at 3:53 PM Steven Rostedt  wrote:
>
> On Thu, 7 Jan 2021 11:47:02 -0800
> Linus Torvalds  wrote:
>
> > On Wed, Jan 6, 2021 at 8:45 PM Willem de Bruijn  wrote:
> > >
> > > But there are three other kmap_atomic callers under net/ that do not
> > > loop at all, so assume non-compound pages. In esp_output_head,
> > > esp6_output_head and skb_seq_read. The first two directly use
> > > skb_page_frag_refill, which can allocate compound (but not
> > > __GFP_HIGHMEM) pages, and the third can be inserted with
> > > netfilter xt_string in the path of tcp transmit skbs, which can also
> > > have compound pages. I think that these could similarly access
> > > data beyond the end of the kmap_atomic mapped page. I'll take
> > > a closer look.
> >
> > Thanks.
> >
> > Note that I have flushed my random one-liner patch from my system, and
> > expect to get a proper fix through the normal networking pulls.
> >
> > And _if_ the networking people feel that my one-liner was the proper
> > fix, you can use it and add my sign-off if you want to, but it really
> > was more of a "this is the quick ugly fix for testing" rather than
> > anything else.

I do think it is the proper fix as is. If no one else has comments, I
can submit it through the net tree.

It won't address the other issues that became apparent only as a
result of this. I'm preparing separate patches for those.

> Please add:
>
>   Link: 
> https://lore.kernel.org/linux-mm/20210106180132.41dc2...@gandalf.local.home/
>   Reported-by: Steven Rostedt (VMware) 
>
> And if you take Linus's patch, please add my:
>
>   Tested-by: Steven Rostedt (VMware) 
>
> and if you come up with another patch, please send it to me for testing.
>
> Thanks!

Will do, thanks.


Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-07 Thread Willem de Bruijn
On Thu, Jan 7, 2021 at 8:33 AM Daniel Borkmann  wrote:
>
> On 1/7/21 2:05 PM, Willem de Bruijn wrote:
> > On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann  wrote:
> >> On 1/7/21 12:40 PM, Dongseok Yi wrote:
> >>> On 2021-01-07 20:05, Daniel Borkmann wrote:
> >>>> On 1/7/21 1:39 AM, Dongseok Yi wrote:
> >>>>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
> >>>>> triggers skb_ensure_writable -> pskb_expand_head ->
> >>>>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
> >>>>>
> >>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> >>>>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> >>>>> chain made by skb_segment_list.
> >>>>>
> >>>>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> >>>>> multiple ptypes can see this. The skb could be released by ptypes and
> >>>>> it causes use-after-free.
> >>>>>
> >>>>> [ 4443.426215] [ cut here ]
> >>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>>>> refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> >>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>>>> [ 4443.426808] Call trace:
> >>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>>>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>>>
> >>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>>>> Signed-off-by: Dongseok Yi 
> >>>>> Acked-by: Willem de Bruijn 
> >>>>> ---
> >>>>> net/core/skbuff.c | 20 +++-
> >>>>> 1 file changed, 19 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> v2: Expand the commit message to clarify a BPF filter loaded
> >>>>>
> >>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>>> index f62cae3..1dcbda8 100644
> >>>>> --- a/net/core/skbuff.c
> >>>>> +++ b/net/core/skbuff.c
> >>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> >>>>> *skb,
> >>>>>  unsigned int delta_truesize = 0;
> >>>>>  unsigned int delta_len = 0;
> >>>>>  struct sk_buff *tail = NULL;
> >>>>> -   struct sk_buff *nskb;
> >>>>> +   struct sk_buff *nskb, *tmp;
> >>>>> +   int err;
> >>>>>
> >>>>>  skb_push(skb, -skb_network_offset(skb) + offset);
> >>>>>
> >>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> >>>>> *skb,
> >>>>>  nskb = list_skb;
> >>>>>  list_skb = list_skb->next;
> >>>>>
> >>>>> +   err = 0;
> >>>>> +   if (skb_shared(nskb)) {
> >>>>> +   tmp = skb_clone(nskb, GFP_ATOMIC);
> >>>>> +   if (tmp) {
> >>>>> +   kfree_skb(nskb);
> >>>>
> >>>> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when 
> >>>> looking
> >>>> for drops in the stack.
> >>>
> >>> I will use to consume_skb() on the next version.
> >>>
> >>>>> +   nskb = tmp;
> >>>>> +   err = skb_unclone(nskb, GFP_ATOMIC);

Re: [PATCH net v2] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-07 Thread Willem de Bruijn
On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann  wrote:
>
> On 1/7/21 12:40 PM, Dongseok Yi wrote:
> > On 2021-01-07 20:05, Daniel Borkmann wrote:
> >> On 1/7/21 1:39 AM, Dongseok Yi wrote:
> >>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
> >>> triggers skb_ensure_writable -> pskb_expand_head ->
> >>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
> >>>
> >>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> >>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> >>> chain made by skb_segment_list.
> >>>
> >>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> >>> multiple ptypes can see this. The skb could be released by ptypes and
> >>> it causes use-after-free.
> >>>
> >>> [ 4443.426215] [ cut here ]
> >>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>> refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> >>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>> [ 4443.426808] Call trace:
> >>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>
> >>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>> Signed-off-by: Dongseok Yi 
> >>> Acked-by: Willem de Bruijn 
> >>> ---
> >>>net/core/skbuff.c | 20 +++-
> >>>1 file changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> v2: Expand the commit message to clarify a BPF filter loaded
> >>>
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index f62cae3..1dcbda8 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> >>> *skb,
> >>> unsigned int delta_truesize = 0;
> >>> unsigned int delta_len = 0;
> >>> struct sk_buff *tail = NULL;
> >>> -   struct sk_buff *nskb;
> >>> +   struct sk_buff *nskb, *tmp;
> >>> +   int err;
> >>>
> >>> skb_push(skb, -skb_network_offset(skb) + offset);
> >>>
> >>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> >>> *skb,
> >>> nskb = list_skb;
> >>> list_skb = list_skb->next;
> >>>
> >>> +   err = 0;
> >>> +   if (skb_shared(nskb)) {
> >>> +   tmp = skb_clone(nskb, GFP_ATOMIC);
> >>> +       if (tmp) {
> >>> +   kfree_skb(nskb);
> >>
> >> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when 
> >> looking
> >> for drops in the stack.
> >
> > I will use to consume_skb() on the next version.
> >
> >>> +   nskb = tmp;
> >>> +   err = skb_unclone(nskb, GFP_ATOMIC);
> >>
> >> Could you elaborate why you also need to unclone? This looks odd here. tc 
> >> layer
> >> (independent of BPF) from ingress & egress side generally assumes unshared 
> >> skb,
> >> so above clone + dropping ref of nskb looks okay to make the main skb 
> >> struct private
> >> for mangling attributes (e.g. mark) & should suffice. What is the exact 
> >> purpose of
> >> the additional skb_unclone() in this context?
> >
> > Willem de Bruijn said:
> > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > Now all the

Re: [BUG] from x86: Support kmap_local() forced debugging

2021-01-06 Thread Willem de Bruijn
On Wed, Jan 6, 2021 at 9:11 PM Willem de Bruijn  wrote:
>
> On Wed, Jan 6, 2021 at 8:49 PM Jakub Kicinski  wrote:
> >
> > On Wed, 6 Jan 2021 17:03:48 -0800 Linus Torvalds wrote:
> > > I wonder whether there is other code that "knows" about kmap() only
> > > affecting PageHighmem() pages thing that is no longer true.
> > >
> > > Looking at some other code, skb_gro_reset_offset() looks suspiciously
> > > like it also thinks highmem pages are special.
> > >
> > > Adding the networking people involved in this area to the cc too.

But there are three other kmap_atomic callers under net/ that do not
loop at all, so assume non-compound pages. In esp_output_head,
esp6_output_head and skb_seq_read. The first two directly use
skb_page_frag_refill, which can allocate compound (but not
__GFP_HIGHMEM) pages, and the third can be inserted with
netfilter xt_string in the path of tcp transmit skbs, which can also
have compound pages. I think that these could similarly access
data beyond the end of the kmap_atomic mapped page. I'll take
a closer look.
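
For reference, a sketch of the safe pattern (modeled on what
skb_copy_bits() does with skb_frag_foreach_page; illustrative, not the
exact patch): walk a possibly compound fragment one page at a time, so
each kmap_atomic() mapping is only dereferenced within its own page:

static void example_copy_frag(const skb_frag_t *frag, u32 off, u32 len,
			      u8 *to)
{
	u32 p_off, p_len, copied;
	struct page *p;
	u8 *vaddr;

	skb_frag_foreach_page(frag, skb_frag_off(frag) + off, len,
			      p, p_off, p_len, copied) {
		/* map and copy one page of the fragment at a time */
		vaddr = kmap_atomic(p);
		memcpy(to + copied, vaddr + p_off, p_len);
		kunmap_atomic(vaddr);
	}
}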


Re: [BUG] from x86: Support kmap_local() forced debugging

2021-01-06 Thread Willem de Bruijn
On Wed, Jan 6, 2021 at 8:49 PM Jakub Kicinski  wrote:
>
> On Wed, 6 Jan 2021 17:03:48 -0800 Linus Torvalds wrote:
> > I wonder whether there is other code that "knows" about kmap() only
> > affecting PageHighmem() pages thing that is no longer true.
> >
> > Looking at some other code, skb_gro_reset_offset() looks suspiciously
> > like it also thinks highmem pages are special.
> >
> > Adding the networking people involved in this area to the cc too.
>
> Thanks for the detailed analysis! skb_gro_reset_offset() checks if
> kernel can read data in the fragments directly as an optimization,
> in case the entire header is in a fragment.
>
> IIUC DEBUG_KMAP_LOCAL_FORCE_MAP only affects the mappings from
> explicit kmap calls, which GRO won't make - it will fall back to
> pulling the header out of the fragment and end up in skb_copy_bits(),
> i.e. the loop you fixed. So GRO should be good. I think..

Agreed. That code in skb_gro_reset_offset skips the GRO frag0
optimization in various cases, including if the first fragment is in
high mem.

That specific check goes back to the introduction of the frag0
optimization in commit 86911732d399 ("gro: Avoid copying headers of
unmerged packets"), at the time in helper skb_gro_header().

Very glad to hear that the fix addresses the crash in
skb_frag_foreach_page. Thanks!


Re: [PATCH net] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-06 Thread Willem de Bruijn
On Tue, Jan 5, 2021 at 10:32 PM Dongseok Yi  wrote:
>
> On 2021-01-06 12:07, Willem de Bruijn wrote:
> >
> > On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi  wrote:
> > >
> > > On 2021-01-05 06:03, Willem de Bruijn wrote:
> > > >
> > > > On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi  wrote:
> > > > >
> > > > > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> > > >
> > > > Can you elaborate on the BPF connection?
> > >
> > > With the following registered ptypes,
> > >
> > > /proc/net # cat ptype
> > > Type Device  Function
> > > ALL   tpacket_rcv
> > > 0800  ip_rcv.cfi_jt
> > > 0011  llc_rcv.cfi_jt
> > > 0004  llc_rcv.cfi_jt
> > > 0806  arp_rcv
> > > 86dd  ipv6_rcv.cfi_jt
> > >
> > > BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> > > (or ipv6_rcv). And it calls pskb_expand_head.
> > >
> > > [  132.051228] pskb_expand_head+0x360/0x378
> > > [  132.051237] skb_ensure_writable+0xa0/0xc4
> > > [  132.051249] bpf_skb_pull_data+0x28/0x60
> > > [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> > > [  132.051273] cls_bpf_classify+0x254/0x348
> > > [  132.051284] tcf_classify+0xa4/0x180
> >
> > Ah, you have a BPF program loaded at TC. That was not entirely obvious.
> >
> > This program gets called after packet sockets with ptype_all, before
> > those with a specific protocol.
> >
> > Tcpdump will have inserted a program with ptype_all, which cloned the
> > skb. This triggers skb_ensure_writable -> pskb_expand_head ->
> > skb_clone_fraglist -> skb_get.
> >
> > > [  132.051294] __netif_receive_skb_core+0x590/0xd28
> > > [  132.051303] __netif_receive_skb+0x50/0x17c
> > > [  132.051312] process_backlog+0x15c/0x1b8
> > >
> > > >
> > > > > While tcpdump, sk_receive_queue of PF_PACKET has the original 
> > > > > frag_list.
> > > > > But the same frag_list is queued to PF_INET (or PF_INET6) as the 
> > > > > fraglist
> > > > > chain made by skb_segment_list().
> > > > >
> > > > > If the new skb (not frag_list) is queued to one of the 
> > > > > sk_receive_queue,
> > > > > multiple ptypes can see this. The skb could be released by ptypes and
> > > > > it causes use-after-free.
> > > >
> > > > If I understand correctly, a udp-gro-list skb makes it up the receive
> > > > path with one or more active packet sockets.
> > > >
> > > > The packet socket will call skb_clone after accepting the filter. This
> > > > replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> > > >
> > > > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > > > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > > > Now all the frags are fully fledged packets, with headers pushed
> > > > before the payload. This does not change their refcount anymore than
> > > > the skb_clone in pf_packet did. This should be 1.
> > > >
> > > > Eventually udp_recvmsg will call skb_consume_udp on each packet.
> > > >
> > > > The packet socket eventually also frees its cloned head_skb, which 
> > > > triggers
> > > >
> > > >   kfree_skb_list(shinfo->frag_list)
> > > >     kfree_skb
> > > >       skb_unref
> > > >         refcount_dec_and_test(&skb->users)
> > >
> > > Every your understanding is right, but
> > >
> > > >
> > > > >
> > > > > [ 4443.426215] [ cut here ]
> > > > > [ 4443.426222] refcount_t: underflow; use-after-free.
> > > > > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > > > > refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> > > > > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > > > > [ 4443.426808] Call trace:
> > > > > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426823]  skb_release_data+0x144/0x264
> > > > > [ 4443.426828]  kfree_skb+0x58/0xc4
> >

Re: [PATCH net] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-05 Thread Willem de Bruijn
On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi  wrote:
>
> On 2021-01-05 06:03, Willem de Bruijn wrote:
> >
> > On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi  wrote:
> > >
> > > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> >
> > Can you elaborate on the BPF connection?
>
> With the following registered ptypes,
>
> /proc/net # cat ptype
> Type Device  Function
> ALL   tpacket_rcv
> 0800  ip_rcv.cfi_jt
> 0011  llc_rcv.cfi_jt
> 0004  llc_rcv.cfi_jt
> 0806  arp_rcv
> 86dd  ipv6_rcv.cfi_jt
>
> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> (or ipv6_rcv). And it calls pskb_expand_head.
>
> [  132.051228] pskb_expand_head+0x360/0x378
> [  132.051237] skb_ensure_writable+0xa0/0xc4
> [  132.051249] bpf_skb_pull_data+0x28/0x60
> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> [  132.051273] cls_bpf_classify+0x254/0x348
> [  132.051284] tcf_classify+0xa4/0x180

Ah, you have a BPF program loaded at TC. That was not entirely obvious.

This program gets called after packet sockets with ptype_all, before
those with a specific protocol.

Tcpdump will have inserted a program with ptype_all, which cloned the
skb. This triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get.

> [  132.051294] __netif_receive_skb_core+0x590/0xd28
> [  132.051303] __netif_receive_skb+0x50/0x17c
> [  132.051312] process_backlog+0x15c/0x1b8
>
> >
> > > While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> > > But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> > > chain made by skb_segment_list().
> > >
> > > If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> > > multiple ptypes can see this. The skb could be released by ptypes and
> > > it causes use-after-free.
> >
> > If I understand correctly, a udp-gro-list skb makes it up the receive
> > path with one or more active packet sockets.
> >
> > The packet socket will call skb_clone after accepting the filter. This
> > replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> >
> > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > Now all the frags are fully fledged packets, with headers pushed
> > before the payload. This does not change their refcount anymore than
> > the skb_clone in pf_packet did. This should be 1.
> >
> > Eventually udp_recvmsg will call skb_consume_udp on each packet.
> >
> > The packet socket eventually also frees its cloned head_skb, which triggers
> >
> >   kfree_skb_list(shinfo->frag_list)
> >     kfree_skb
> >       skb_unref
> >         refcount_dec_and_test(&skb->users)
>
> Every your understanding is right, but
>
> >
> > >
> > > [ 4443.426215] [ cut here ]
> > > [ 4443.426222] refcount_t: underflow; use-after-free.
> > > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > > refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> > > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > > [ 4443.426808] Call trace:
> > > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426823]  skb_release_data+0x144/0x264
> > > [ 4443.426828]  kfree_skb+0x58/0xc4
> > > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > > [ 4443.426880]  el0_svc+0x8/0xc
> > >
> > > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > > Signed-off-by: Dongseok Yi 
> > > ---
> > >  net/core/skbuff.c | 20 +++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index f62cae3..1dcbda8 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> > > *skb,
> > > unsigned int delta_truesize = 0;
>

Re: [PATCH net] net: fix use-after-free when UDP GRO with shared fraglist

2021-01-04 Thread Willem de Bruijn
On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi  wrote:
>
> skbs in frag_list could be shared by pskb_expand_head() from BPF.

Can you elaborate on the BPF connection?

> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> chain made by skb_segment_list().
>
> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> multiple ptypes can see this. The skb could be released by ptypes and
> it causes use-after-free.

If I understand correctly, a udp-gro-list skb makes it up the receive
path with one or more active packet sockets.

The packet socket will call skb_clone after accepting the filter. This
replaces the head_skb, but shares the skb_shinfo and thus frag_list.

udp_rcv_segment later converts the udp-gro-list skb to a list of
regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
Now all the frags are fully fledged packets, with headers pushed
before the payload. This does not change their refcount anymore than
the skb_clone in pf_packet did. This should be 1.

Eventually udp_recvmsg will call skb_consume_udp on each packet.

The packet socket eventually also frees its cloned head_skb, which triggers

  kfree_skb_list(shinfo->frag_list)
    kfree_skb
      skb_unref
        refcount_dec_and_test(&skb->users)

>
> [ 4443.426215] [ cut here ]
> [ 4443.426222] refcount_t: underflow; use-after-free.
> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> [ 4443.426808] Call trace:
> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426823]  skb_release_data+0x144/0x264
> [ 4443.426828]  kfree_skb+0x58/0xc4
> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> [ 4443.426873]  el0_svc_handler+0x74/0x98
> [ 4443.426880]  el0_svc+0x8/0xc
>
> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> Signed-off-by: Dongseok Yi 
> ---
>  net/core/skbuff.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f62cae3..1dcbda8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> unsigned int delta_truesize = 0;
> unsigned int delta_len = 0;
> struct sk_buff *tail = NULL;
> -   struct sk_buff *nskb;
> +   struct sk_buff *nskb, *tmp;
> +   int err;
>
> skb_push(skb, -skb_network_offset(skb) + offset);
>
> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> nskb = list_skb;
> list_skb = list_skb->next;
>
> +   err = 0;
> +   if (skb_shared(nskb)) {

I must be missing something still. This does not square with my
understanding that the two sockets are operating on clones, with each
frag_list skb having skb->users == 1.

Unless the packet socket path previously also triggered an
skb_unclone/pskb_expand_head, as that calls skb_clone_fraglist, which
calls skb_get on each frag_list skb.
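
For context, skb_clone_fraglist() in net/core/skbuff.c is essentially
just this reference-taking walk, which is what would bump skb->users on
every frag_list skb:

static void skb_clone_fraglist(struct sk_buff *skb)
{
	struct sk_buff *list;

	skb_walk_frags(skb, list)
		skb_get(list);
}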


> +   tmp = skb_clone(nskb, GFP_ATOMIC);
> +   if (tmp) {
> +   kfree_skb(nskb);
> +   nskb = tmp;
> +   err = skb_unclone(nskb, GFP_ATOMIC);
> +   } else {
> +   err = -ENOMEM;
> +   }
> +   }
> +
> if (!tail)
> skb->next = nskb;
> else
> tail->next = nskb;
>
> +   if (unlikely(err)) {
> +   nskb->next = list_skb;
> +   goto err_linearize;
> +   }
> +
> tail = nskb;
>
> delta_len += nskb->len;
> --
> 2.7.4
>


Re: [PATCH] epoll: fix compat syscall wire up of epoll_pwait2

2020-12-20 Thread Willem de Bruijn
On Sun, Dec 20, 2020 at 6:43 AM Arnd Bergmann  wrote:
>
> On Sun, Dec 20, 2020 at 11:00 AM Heiko Carstens  wrote:
> >
> > Commit b0a0c2615f6f ("epoll: wire up syscall epoll_pwait2") wired up
> > the 64 bit syscall instead of the compat variant in a couple of places.
> >
> > Cc: Willem de Bruijn 
> > Cc: Al Viro 
> > Cc: Arnd Bergmann 
> > Cc: Matthew Wilcox (Oracle) 
> > Cc: Catalin Marinas 
> > Cc: Will Deacon 
> > Cc: Thomas Bogendoerfer 
> > Cc: Vasily Gorbik 
> > Cc: Christian Borntraeger 
> > Cc: "David S. Miller" 
> > Fixes: b0a0c2615f6f ("epoll: wire up syscall epoll_pwait2")
> > Signed-off-by: Heiko Carstens 
> > ---
> >  arch/arm64/include/asm/unistd32.h | 2 +-
> >  arch/mips/kernel/syscalls/syscall_n32.tbl | 2 +-
> >  arch/s390/kernel/syscalls/syscall.tbl | 2 +-
> >  arch/sparc/kernel/syscalls/syscall.tbl| 2 +-
> >  4 files changed, 4 insertions(+), 4 deletions(-)
>
> I double-checked all the entries to make sure you caught all
> the missing ones, looks good.
>
> Acked-by: Arnd Bergmann 

Acked-by: Willem de Bruijn 

Thanks a lot. I also arrived at the same list after comparing to
epoll_pwait and signalfd (for sigset) and ppoll_time64 (for timespec64).

Slightly tangential, it's not immediately clear to me why in
arch/x86/entry/syscalls/syscall_32.tbl epoll_pwait does not need a
compat entry, unlike on other architectures and unlike signalfd.


Re: [epoll] fb72873666: WARNING:at_kernel/tracepoint.c:#tracepoint_probe_register_prio

2020-12-14 Thread Willem de Bruijn
On Mon, Dec 14, 2020 at 9:59 AM kernel test robot  wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: fb728736669f7805bcc0fa1c4d578faf991d62a8 ("epoll: wire up syscall 
> epoll_pwait2")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
>
> in testcase: trinity
> version: trinity-x86_64-af355e9-1_2019-12-03
> with following parameters:
>
> runtime: 300s
>
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
>
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 8G
>
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
>
>
> +-----------------------------------------------------------------+------------+------------+
> |                                                                 | e659ea023d | fb72873666 |
> +-----------------------------------------------------------------+------------+------------+
> | boot_successes                                                  | 11         | 0          |
> | boot_failures                                                   | 0          | 12         |
> | WARNING:at_kernel/tracepoint.c:#tracepoint_probe_register_prio  | 0          | 10         |
> | RIP:tracepoint_probe_register_prio                              | 0          | 10         |
> | WARNING:at_kernel/locking/lockdep.c:#__lock_acquire             | 0          | 2          |
> | RIP:__lock_acquire                                              | 0          | 2          |
> | BUG:kernel_NULL_pointer_dereference,address                     | 0          | 2          |
> | Oops:#[##]                                                      | 0          | 2          |
> | Kernel_panic-not_syncing:Fatal_exception_in_interrupt           | 0          | 2          |
> +-----------------------------------------------------------------+------------+------------+
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
>
>
> [  147.820910] WARNING: CPU: 0 PID: 4088 at kernel/tracepoint.c:136 
> tracepoint_probe_register_prio+0x451/0x4c0
> [  147.822045] Modules linked in:
> [  147.822415] CPU: 0 PID: 4088 Comm: trinity-main Not tainted 
> 5.10.0-rc7-13210-gfb728736669f #1
> [  147.823462] RIP: 0010:tracepoint_probe_register_prio+0x451/0x4c0
> [  147.824182] Code: ff ff e8 72 e3 0c 00 44 8b 4c 24 08 49 89 c7 e9 fb fd ff 
> ff 41 bf f4 ff ff ff 45 31 e4 48 c7 c5 f4 ff ff ff e9 af fc ff ff 90 <0f> 0b 
> 90 31 c9 31 d2 be 01 00 00 00 48 c7 c7 b8 b0 0c 84 e8 17 c9
> [  147.826276] RSP: 0018:888160cc3d28 EFLAGS: 00010246
> [  147.826938] RAX: 0001 RBX: 888140c2a008 RCX: 
> 
> [  147.827745] RDX:  RSI: 0001 RDI: 
> 840cb0e8
> [  147.828556] RBP: 888140c2a550 R08: 0001 R09: 
> 
> [  147.829382] R10: 888160cc3d28 R11: 0001 R12: 
> 
> [  147.830209] R13: 84f5b6a0 R14: 8120aea9 R15: 
> 
> [  147.831063] FS:  7fc0af66a740() GS:838b2000() 
> knlGS:
> [  147.831998] CS:  0010 DS:  ES:  CR0: 80050033
> [  147.832655] CR2: 027d4058 CR3: 00016040a000 CR4: 
> 06b0
> [  147.833481] Call Trace:
> [  147.833817]  ? perf_event_alloc+0x489/0x10c0
> [  147.834330]  ? perf_trace_init+0x251/0x2a0
> [  147.834860]  ? perf_tp_event_init+0x1b/0x40
> [  147.835343]  ? perf_try_init_event+0x47/0x140
> [  147.835861]  ? perf_event_alloc+0x46e/0x10c0
> [  147.836357]  ? sched_clock_cpu+0xa0/0xc0
> [  147.836828]  ? __do_sys_perf_event_open+0x127/0x1120
> [  147.837394]  ? sched_clock+0x2b/0x40
> [  147.837824]  ? sched_clock_cpu+0xa0/0xc0
> [  147.838274]  ? do_syscall_64+0x53/0x100
> [  147.838726]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  147.839365] irq event stamp: 2133343
> [  147.839782] hardirqs last  enabled at (2133351): [] 
> console_unlock+0x486/0x5a0
> [  147.840804] hardirqs last disabled at (2133362): [] 
> console_unlock+0x3d0/0x5a0
> [  147.841808] softirqs last  enabled at (2133120): [] 
> __do_softirq+0x386/0x47e
> [  147.842823] softirqs last disabled at (2133109): [] 
> asm_call_irq_on_stack+0xf/0x20
> [  147.843952] ---[ end trace a41e09c8b1793541 ]---
>
>
> To reproduce:
>
> # build kernel
> cd linux
> cp config-5.10.0-rc7-13210-gfb728736669f .config
> make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare 
> modules_prepare bzImage
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k  job-script # job-script is attached in this 
> email
>
>
>
> Thanks,
> Rong Chen

Thanks for the report. I'm running trinity in qemu on this commit, now.

As the failing test is trinity run without any special arguments, as a
general syscall fuzzer, could this be a 

Re: [PATCH 1/3] Add TX sending hardware timestamp.

2020-12-11 Thread Willem de Bruijn
> >>  I did not use "Fair Queue traffic policing".
> >>  As for ETF, it is all about ordering packets from different 
> >>  applications.
> >>  How can we achieve it while skipping queuing?
> >>  Could you elaborate on this point?
> >> >>>
> >> >>> The qdisc can only defer pacing to hardware if hardware can ensure the
> >> >>> same invariants on ordering, of course.
> >> >>
> >> >> Yes, this is why we suggest ETF order packets using the hardware 
> >> >> time-stamp.
> >> >> And pass the packet based on system time.
> >> >> So ETF query the system clock only and not the PHC.
> >> >
> >> > On which note: with this patch set all applications have to agree to
> >> > use h/w time base in etf_enqueue_timesortedlist. In practice that
> >> > makes this h/w mode a qdisc used by a single process?
> >>
> >> A single process theoretically does not need ETF: just set skb->tstamp
> >> and use a pass-through queue.
> >> However the only way now to set TC_SETUP_QDISC_ETF in the driver is using 
> >> ETF.
> >
> > Yes, and I'd like to eventually get rid of this constraint.
> >
>
> I'm interested in these kind of ideas :-)
>
> What would be your end goal? Something like:
>  - Any application is able to set SO_TXTIME;
>  - We would have a best effort support for scheduling packets based on
>  their transmission time enabled by default;
>  - If the hardware supports, there would be a "offload" flag that could
>  be enabled;
>
> More or less this?

Exactly. Pacing is stateless, so relatively amenable to offload.

For applications that offload pacing to the OS with SO_TXTIME, such as
QUIC, offloading further to hardware reduces jitter and timer wake-ups
(and thus cycles).

This holds not only for SO_TXTIME, but also for pacing initiated by the kernel TCP stack.

Initially, in the absence of hardware support, pacing could at least be
offloaded from guest to host OS in virtual environments.
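
For concreteness, the existing SO_TXTIME flow that such applications
already use, and that any hardware offload would have to keep working
unmodified. A minimal sketch, error handling omitted:

#include <linux/net_tstamp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <time.h>

/* opt in once per socket, naming the reference clock */
static int enable_txtime(int fd)
{
	struct sock_txtime cfg = {
		.clockid = CLOCK_TAI,
		.flags = 0,	/* SOF_TXTIME_REPORT_ERRORS etc. optional */
	};

	return setsockopt(fd, SOL_SOCKET, SO_TXTIME, &cfg, sizeof(cfg));
}

/* per packet: attach the delivery time as an SCM_TXTIME cmsg.
 * caller must already point msg->msg_control at a buffer of
 * CMSG_SPACE(sizeof(uint64_t)) bytes with msg_controllen set. */
static void set_txtime(struct msghdr *msg, uint64_t txtime_ns)
{
	struct cmsghdr *cm = CMSG_FIRSTHDR(msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SCM_TXTIME;
	cm->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));
}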


Re: [PATCH 1/3] Add TX sending hardware timestamp.

2020-12-10 Thread Willem de Bruijn
> > If I understand correctly, you are trying to achieve a single delivery time.
> > The need for two separate timestamps passed along is only because the
> > kernel is unable to do the time base conversion.
>
> Yes, a correct point.
>
> >
> > Else, ETF could program the qdisc watchdog in system time and later,
> > on dequeue, convert skb->tstamp to the h/w time base before
> > passing it to the device.
>
> Or skb->tstamp is a HW time-stamp and the ETF converts it to system
> clock base.
>
> >
> > It's still not entirely clear to me why the packet has to be held by
> > ETF initially, if it is held until delivery time by hardware
> > later. But more on that below.
>
> Let me plot a simple scenario.
> App A sends a packet with time-stamp 100.
> Afterwards a second packet arrives from App B with time-stamp 90.
> Without ETF, the second packet has to wait until the interface hardware
> sends the first packet at 100,
> making the second packet late by 10 plus the first packet's send time.
> Obviously other "normal" packets are sent to the non-ETF queue, so they
> do not block ETF packets.
> The ETF delta is a barrier: the application has to send the packet before
> it, to ensure the packet is not tossed.

Got it. The assumption here is that devices are FIFO. That is not
necessarily the case, but I do not know whether it is in practice,
e.g., on the i210.

>
> >
> > So far, the use case sounds a bit narrow and the use of two timestamp
> > fields for a single delivery event a bit of a hack.
>
> The definition of a hack is up to you

Fair enough :) That wasn't very constructive feedback on my part.

> > And one that does impose a cost in the hot path of many workloads
> > by adding a field to the ip cookie and cork, and writing to (possibly cold)
> > skb_shinfo for every packet.
>
> Most packets do not use skb->tstamp either; probably the cost of testing is
> higher than just copying.
> But perhaps if we copy 2 time-stamps we can add a condition for both.
> What do you think?

I'd need to take a closer look at the skb_hwtstamps, which unlike
skb->tstamp lie in the skb_shared_info. If that is an otherwise cold
cacheline, then access would be expensive.

The ipcm and cork are admittedly cheap and not worth a branch. But
still it is good to understand that this situation of unsynchronized
clocks is a common operating condition for the foreseeable future, not
an unfortunate constraint of a single piece of hardware.

An extreme option would be moving everything behind a static_branch as
most hot paths will not have the feature enabled. But I'm not
seriously suggesting that for a few assignments.
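
For illustration only (the key name and the hook are hypothetical), the
static_branch shape meant here:

#include <linux/jump_label.h>
#include <linux/skbuff.h>

static DEFINE_STATIC_KEY_FALSE(txtime_hw_used);

/* flipped on when the first socket enables the h/w tstamp option, so
 * workloads that never use it pay only a patched-out NOP here */
static inline void skb_copy_txtime_hw(struct sk_buff *skb, ktime_t hw_tstamp)
{
	if (static_branch_unlikely(&txtime_hw_used))
		skb_shinfo(skb)->hwtstamps.hwtstamp = hw_tstamp;
}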

> The cookie and the cork are just intermediates from application to SKB; I do
> not think they cost much.
> Both writes of the time stamp to the cookie and the cork are conditional.
>
> >
> > Indeed, we want pacing offload to work for existing applications.
> >
>  As the conversion between the PHC and the system clock is dynamic over time,
>  how do you propose to achieve it?
> >>>
> >>> Can you elaborate on this concern?
> >>
> >> Using a single time stamp has 3 possible solutions:
> >>
> >> 1. Current solution: synchronize the system clock and the PHC.
> >>  The application uses the system clock.
> >>  The ETF can use the system clock for ordering and pass the packet to
> >> the driver on time.
> >>  The network interface hardware compares the time-stamp to the PHC.
> >>
> >> 2. The application converts the PHC time-stamp to system clock base.
> >>   The ETF works as in solution 1.
> >>   The network driver converts the system clock time-stamp back to a
> >> PHC time-stamp.
> >>   This solution needs a new Net-Link flag and modifications to the
> >> relevant network drivers.
> >>   Yet this solution has 2 problems:
> >>   * As applications today are not aware that the system clock and PHC
> >> are not synchronized, and therefore do not perform any conversion,
> >> most of them only use the system clock.
> >>   * The conversion in the network driver happens ~300 - 600
> >> microseconds after the application sends the packet.
> >>  As the PHC and system clock frequencies and offset can change
> >> during this period, the conversion will produce a different PHC
> >> time-stamp from the application's original time-stamp.
> >>  We require a precision of 1 nanosecond for the PHC time-stamp.
> >>
> >> 3. The application uses a PHC time-stamp for skb->tstamp.
> >>  The ETF converts the PHC time-stamp to a system clock time-stamp.
> >>  This solution requires support for reading PHC clocks
> >>  from IRQ/kernel thread context in kernel space.
> >
> > ETF has to release the packet well in advance of the hardware
> > timestamp for the packet to arrive at the device on time. In practice
> > I would expect this delta parameter to be at least at usec timescale.
> > That gives some wiggle room with regard to s/w tstamp, at least.
>
> Yes, 

Re: [PATCH v3 1/2] epoll: add nsec timeout support with epoll_pwait2

2020-12-10 Thread Willem de Bruijn
On Thu, Dec 10, 2020 at 3:34 PM Arnd Bergmann  wrote:
>
> On Thu, Dec 10, 2020 at 6:33 PM Willem de Bruijn
>  wrote:
> > On Sat, Nov 21, 2020 at 4:27 AM Arnd Bergmann  wrote:
> > > On Fri, Nov 20, 2020 at 11:28 PM Willem de Bruijn 
> > >  wrote:
> > > I would imagine this can be done like the way I proposed
> > > for get_bitmap() in sys_migrate_pages:
> > >
> > > https://lore.kernel.org/lkml/20201102123151.2860165-4-a...@kernel.org/
> >
> > Coming back to this. Current patchset includes new select and poll
> > selftests to verify the changes. I need to send a small kselftest
> > patch for that first.
> >
> > Assuming there's no time pressure, I will finish up and send the main
> > changes after the merge window, for the next release then.
> >
> > Current state against linux-next at
> > https://github.com/wdebruij/linux-next-mirror/tree/select-compat-1
>
> Ok, sounds good to me. I've had a (very brief) look and have one
> suggestion: instead of open-coding the compat vs native mode
> in multiple places like
>
> if (!in_compat_syscall())
>  return copy_from_user(fdset, ufdset, FDS_BYTES(nr)) ? -EFAULT : 0;
> else
>  return compat_get_bitmap(fdset, ufdset, nr);
>
> maybe move this into a separate function and call that where needed.
>
> I've done this for the get_bitmap() function in my series at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/commit/?h=compat-alloc-user-space-7=b1b23ebb12b635654a2060df49455167a142c5d2
>
> The definition is slightly different for cpumask, nodemask and fd_set,
> so we'd need to try out the best way to structure the code to end
> up with the most readable version, but it should be possible when
> there are only three callers (and duplicating the function wouldn't
> be the end of the world either)

For fd_set there is only a single caller for each direction. Do you
prefer helpers even so?

For sigmask, with three callers, something along the lines of this?

  @@ -1138,10 +1135,7 @@ static int do_ppoll(struct pollfd __user *ufds, unsigned int nfds,
  return -EINVAL;
  }

  -   if (!in_compat_syscall())
  -   ret = set_user_sigmask(sigmask, sigsetsize);
  -   else
  -   ret = set_compat_user_sigmask(sigmask, sigsetsize);
  +   ret = set_maybe_compat_user_sigmask(sigmask, sigsetsize);
  if (ret)
  return ret;

  --- a/include/linux/compat.h
  +++ b/include/linux/compat.h
  @@ -942,6 +942,17 @@ static inline bool in_compat_syscall(void) { return false; }

  +static inline int set_maybe_compat_user_sigmask(const void __user *sigmask,
  +   size_t sigsetsize)
  +{
  +#if defined CONFIG_COMPAT
  +   if (unlikely(in_compat_syscall()))
  +   return set_compat_user_sigmask(sigmask, sigsetsize);
  +#endif
  +
  +   return set_user_sigmask(sigmask, sigsetsize);
  +}
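
And, mirroring the copy_from_user/compat_get_bitmap branch quoted
earlier, a hypothetical fd_set counterpart (helper name made up):

  static inline int get_maybe_compat_fd_set(unsigned long nr,
					    const void __user *ufdset,
					    unsigned long *fdset)
  {
  #if defined CONFIG_COMPAT
	if (unlikely(in_compat_syscall()))
		return compat_get_bitmap(fdset, ufdset, nr);
  #endif
	return copy_from_user(fdset, ufdset, FDS_BYTES(nr)) ? -EFAULT : 0;
  }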


Re: [PATCH 1/3] Add TX sending hardware timestamp.

2020-12-10 Thread Willem de Bruijn
On Wed, Dec 9, 2020 at 3:18 PM Geva, Erez  wrote:
>
>
> On 09/12/2020 18:37, Willem de Bruijn wrote:
> > On Wed, Dec 9, 2020 at 10:25 AM Geva, Erez  
> > wrote:
> >>
> >>
> >> On 09/12/2020 15:48, Willem de Bruijn wrote:
> >>> On Wed, Dec 9, 2020 at 9:37 AM Erez Geva  
> >>> wrote:
> >>>>
> >>>> Configure and send TX sending hardware timestamp from
> >>>>user space application to the socket layer,
> >>>>to provide to the TC ETC Qdisc, and pass it to
> >>>>the interface network driver.
> >>>>
> >>>>- New flag for the SO_TXTIME socket option.
> >>>>- New access auxiliary data header to pass the
> >>>>  TX sending hardware timestamp.
> >>>>- Add the hardware timestamp to the socket cookie.
> >>>>- Copy the TX sending hardware timestamp to the socket cookie.
> >>>>
> >>>> Signed-off-by: Erez Geva 
> >>>
> >>> Hardware offload of pacing is definitely useful.
> >>>
> >> Thanks for your comment.
> >> I agree, its use is not limited to this case.
> >>
> >>> I don't think this needs a new separate h/w variant of SO_TXTIME.
> >>>
> >> I only extend SO_TXTIME.
> >
> > The patchset passes a separate timestamp from skb->tstamp along
> > through the ip cookie, cork (transmit_hw_time) and with the skb in
> > shinfo.
> >
> > I don't see the need for two timestamps, one tied to software and one
> > to hardware. When would we want to pace twice?
>
> As the Net-Link uses the system clock and the network interface hardware
> uses its own PHC.
> The current ETF depends on synchronizing the system clock and the PHC.

If I understand correctly, you are trying to achieve a single delivery time.
The need for two separate timestamps passed along is only because the
kernel is unable to do the time base conversion.

Else, ETF could program the qdisc watchdog in system time and later,
on dequeue, convert skb->tstamp to the h/w time base before
passing it to the device.

It's still not entirely clear to me why the packet has to be held by
ETF initially, if it is held until delivery time by hardware
later. But more on that below.

So far, the use case sounds a bit narrow and the use of two timestamp
fields for a single delivery event a bit of a hack.

And one that does impose a cost in the hot path of many workloads
by adding a field to the ip cookie and cork, and writing to (possibly cold)
skb_shinfo for every packet.

> >>> Indeed, we want pacing offload to work for existing applications.
> >>>
> >> As the conversion of the PHC and the system clock is dynamic over time,
> >> how do you propose to achieve it?
> >
> > Can you elaborate on this concern?
>
> Using a single time stamp has 3 possible solutions:
>
> 1. Current solution: synchronize the system clock and the PHC.
> The application uses the system clock.
> The ETF can use the system clock for ordering and pass the packet to the
> driver on time.
> The network interface hardware compares the time-stamp to the PHC.
>
> 2. The application converts the PHC time-stamp to system clock base.
>  The ETF works as in solution 1.
>  The network driver converts the system clock time-stamp back to a PHC
> time-stamp.
>  This solution needs a new Net-Link flag and modifications to the relevant
> network drivers.
>  Yet this solution has 2 problems:
>  * As applications today are not aware that the system clock and PHC are
> not synchronized, and therefore do not perform any conversion, most of
> them only use the system clock.
>  * The conversion in the network driver happens ~300 - 600 microseconds
> after the application sends the packet.
> As the PHC and system clock frequencies and offset can change
> during this period, the conversion will produce a different PHC time-stamp
> from the application's original time-stamp.
> We require a precision of 1 nanosecond for the PHC time-stamp.
>
> 3. The application uses a PHC time-stamp for skb->tstamp.
> The ETF converts the PHC time-stamp to a system clock time-stamp.
> This solution requires support for reading PHC clocks
> from IRQ/kernel thread context in kernel space.

ETF has to release the packet well in advance of the hardware
timestamp for the packet to arrive at the device on time. In practice
I would expect this delta parameter to be at least at usec timescale.
That gives some wiggle room with regard to s/w tstamp, at least.

If changes in clock distance are

Re: [PATCH v3 1/2] epoll: add nsec timeout support with epoll_pwait2

2020-12-10 Thread Willem de Bruijn
On Sat, Nov 21, 2020 at 4:27 AM Arnd Bergmann  wrote:
>
> On Fri, Nov 20, 2020 at 11:28 PM Willem de Bruijn
>  wrote:
> > On Fri, Nov 20, 2020 at 2:23 PM Arnd Bergmann  wrote:
> > > On Fri, Nov 20, 2020 at 5:01 PM Willem de Bruijn 
> > >  wrote:
> >
> > I think it'll be better to split the patchsets:
> >
> > epoll: convert internal api to timespec64
> > epoll: add syscall epoll_pwait2
> > epoll: wire up syscall epoll_pwait2
> > selftests/filesystems: expand epoll with epoll_pwait2
> >
> > and
> >
> > select: compute slack based on relative time
> > epoll: compute slack based on relative time
> >
> > and judge the slack conversion on its own merit.
>
> Yes, makes sense.
>
> > I also would rather not tie this up with the compat deduplication.
> > Happy to take a stab at that though. On that note, when combining
> > functions like
> >
> >   int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
> >fd_set __user *exp, struct timespec64 *end_time,
> >u64 slack)
> >
> > and
> >
> >   static int compat_core_sys_select(int n, compat_ulong_t __user *inp,
> > compat_ulong_t __user *outp, compat_ulong_t __user *exp,
> > struct timespec64 *end_time, u64 slack)
> >
> > by branching on in_compat_syscall() inside get_fd_set/set_fd_set and
> > deprecating their compat_.. counterparts, what would the argument
> > pointers look like? Or is that not the approach you have in mind?
>
> In this case, the top-level entry point becomes unified, and you get
> the prototype from core_sys_select() with the native arguments.
>
> I would imagine this can be done like the way I proposed
> for get_bitmap() in sys_migrate_pages:
>
> https://lore.kernel.org/lkml/20201102123151.2860165-4-a...@kernel.org/

Coming back to this. Current patchset includes new select and poll
selftests to verify the changes. I need to send a small kselftest
patch for that first.

Assuming there's no time pressure, I will finish up and send the main
changes after the merge window, for the next release then.

Current state against linux-next at
https://github.com/wdebruij/linux-next-mirror/tree/select-compat-1


Re: [PATCH 1/3] Add TX sending hardware timestamp.

2020-12-09 Thread Willem de Bruijn
On Wed, Dec 9, 2020 at 10:25 AM Geva, Erez  wrote:
>
>
> On 09/12/2020 15:48, Willem de Bruijn wrote:
> > On Wed, Dec 9, 2020 at 9:37 AM Erez Geva  wrote:
> >>
> >> Configure and send TX sending hardware timestamp from
> >>   user space application to the socket layer,
> >>   to provide to the TC ETC Qdisc, and pass it to
> >>   the interface network driver.
> >>
> >>   - New flag for the SO_TXTIME socket option.
> >>   - New access auxiliary data header to pass the
> >> TX sending hardware timestamp.
> >>   - Add the hardware timestamp to the socket cookie.
> >>   - Copy the TX sending hardware timestamp to the socket cookie.
> >>
> >> Signed-off-by: Erez Geva 
> >
> > Hardware offload of pacing is definitely useful.
> >
> Thanks for your comment.
> I agree, its use is not limited to this case.
>
> > I don't think this needs a new separate h/w variant of SO_TXTIME.
> >
> I only extend SO_TXTIME.

The patchset passes a separate timestamp from skb->tstamp along
through the ip cookie, cork (transmit_hw_time) and with the skb in
shinfo.

I don't see the need for two timestamps, one tied to software and one
to hardware. When would we want to pace twice?

> > Indeed, we want pacing offload to work for existing applications.
> >
> As the conversion of the PHC and the system clock is dynamic over time,
> how do you propose to achieve it?

Can you elaborate on this concern?

The simplest solution for offloading pacing would be to interpret
skb->tstamp either for software pacing, or skip software pacing if the
device advertises a NETIF_F hardware pacing feature.

Clockbase is an issue. The device driver may have to convert to
whatever format the device expects when copying skb->tstamp in the
device tx descriptor.
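
Roughly, per driver. Everything in this sketch is hypothetical: the
descriptor layout, the conversion helper and the flag are made up for
illustration only.

/* translate skb->tstamp (system time base) into the device's PHC time
 * base while writing the launch time into the Tx descriptor */
static void foo_set_launch_time(struct foo_tx_desc *desc,
				const struct sk_buff *skb)
{
	if (skb->tstamp) {
		u64 phc_ns = foo_sys_to_phc_ns(ktime_to_ns(skb->tstamp));

		desc->launch_time = cpu_to_le64(phc_ns);
		desc->cmd |= FOO_TXD_LAUNCH_TIME;
	}
}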

>
> > It only requires that pacing qdiscs, both sch_etf and sch_fq,
> > optionally skip queuing in their .enqueue callback and instead allow
> > the skb to pass to the device driver as is, with skb->tstamp set. Only
> > to devices that advertise support for h/w pacing offload.
> >
> I did not use "Fair Queue traffic policing".
> As for ETF, it is all about ordering packets from different applications.
> How can we achieve it while skipping queuing?
> Could you elaborate on this point?

The qdisc can only defer pacing to hardware if hardware can ensure the
same invariants on ordering, of course.

Btw: this is quite a long list of CC:s


Re: [PATCH 1/3] Add TX sending hardware timestamp.

2020-12-09 Thread Willem de Bruijn
On Wed, Dec 9, 2020 at 9:37 AM Erez Geva  wrote:
>
> Configure and send TX sending hardware timestamp from
>  user space application to the socket layer,
>  to provide to the TC ETC Qdisc, and pass it to
>  the interface network driver.
>
>  - New flag for the SO_TXTIME socket option.
>  - New access auxiliary data header to pass the
>TX sending hardware timestamp.
>  - Add the hardware timestamp to the socket cookie.
>  - Copy the TX sending hardware timestamp to the socket cookie.
>
> Signed-off-by: Erez Geva 

Hardware offload of pacing is definitely useful.

I don't think this needs a new separate h/w variant of SO_TXTIME.

Indeed, we want pacing offload to work for existing applications.

It only requires that pacing qdiscs, both sch_etf and sch_fq,
optionally skip queuing in their .enqueue callback and instead allow
the skb to pass to the device driver as is, with skb->tstamp set. Only
to devices that advertise support for h/w pacing offload.
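
In qdisc terms, roughly the following. NETIF_F_HW_PACING is a
hypothetical feature flag and etf_enqueue_sorted() a stand-in name for
the existing time-sorted path:

static int etf_enqueue(struct sk_buff *skb, struct Qdisc *sch,
		       struct sk_buff **to_free)
{
	/* device paces in hardware: skip the software timesorted list
	 * and let skb->tstamp travel to the driver untouched */
	if (qdisc_dev(sch)->features & NETIF_F_HW_PACING)
		return qdisc_enqueue_tail(skb, sch);

	return etf_enqueue_sorted(skb, sch, to_free);
}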


Re: [PATCH net-next] net: switch to storing KCOV handle directly in sk_buff

2020-11-27 Thread Willem de Bruijn
On Fri, Nov 27, 2020 at 7:26 AM Marco Elver  wrote:
>
> On Thu, 26 Nov 2020 at 17:35, Willem de Bruijn
>  wrote:
> > On Thu, Nov 26, 2020 at 3:19 AM Marco Elver  wrote:
> [...]
> > > Will send v2.
> >
> > Does it make more sense to revert the patch that added the extensions
> > and the follow-on fixes and add a separate new patch instead?
>
> That doesn't work, because then we'll end up with a build-broken
> commit in between the reverts and the new version, because mac80211
> uses skb_get_kcov_handle().
>
> > If adding a new field to the skb, even if only in debug builds,
> > please check with pahole how it affects struct layout if you
> > haven't yet.
>
> Without KCOV:
>
> /* size: 224, cachelines: 4, members: 72 */
> /* sum members: 217, holes: 1, sum holes: 2 */
> /* sum bitfield members: 36 bits, bit holes: 2, sum bit holes: 4 bits */
> /* forced alignments: 2 */
> /* last cacheline: 32 bytes */
>
> With KCOV:
>
> /* size: 232, cachelines: 4, members: 73 */
> /* sum members: 225, holes: 1, sum holes: 2 */
> /* sum bitfield members: 36 bits, bit holes: 2, sum bit holes: 4 bits */
> /* forced alignments: 2 */
> /* last cacheline: 40 bytes */

Thanks. defconfig leaves some symbols disabled, but manually enabling
them just fills a hole, so 232 is indeed the worst case allocation.

I recall a firm edict against growing skb, but I don't know of a
hard limit at exactly 224.

There is a limit at 2048 - sizeof(struct skb_shared_info) == 1728B
when using pages for two ETH_FRAME_LEN (1514) allocations.

This would leave 1728 - 1514 == 214B if also squeezing the skb itself
in with the same allocation.

But I have no idea if this is used anywhere. Certainly have no example
ready. And as you show, the previous default already is at 224.

If no one else knows of a hard limit at 224 or below, I suppose the
next technical limit is just 256 for kmem cache purposes.

My understanding was that skb_extensions was supposed to solve this
problem of extending the skb without growing the main structure. Not
for this patch, but I wonder if we can resolve the issues exposed here
and make usable in more conditions.


Re: [PATCH] media: gp8psk: initialize stats at power control logic

2020-11-27 Thread Willem de Bruijn
On Fri, Nov 27, 2020 at 1:46 AM Mauro Carvalho Chehab
 wrote:
>
> As reported on:
> 
> https://lore.kernel.org/linux-media/20190627222020.45909-1-willemdebruijn.ker...@gmail.com/
>
> if gp8psk_usb_in_op() returns an error, the status var is not
> initialized. Yet, this var is used later on, in order to
> identify:
> - if the device was already started;
> - if firmware has loaded;
> - if the LNBf was powered on.
>
> Using status = 0 seems to ensure that everything will be
> properly powered up.
>
> So, instead of the proposed solution, let's just set
> status = 0.
>
> Reported-by: syzbot 
> Reported-by: Willem de Bruijn 
> Signed-off-by: Mauro Carvalho Chehab 
> ---
>  drivers/media/usb/dvb-usb/gp8psk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/media/usb/dvb-usb/gp8psk.c 
> b/drivers/media/usb/dvb-usb/gp8psk.c
> index c07f46f5176e..b4f661bb5648 100644
> --- a/drivers/media/usb/dvb-usb/gp8psk.c
> +++ b/drivers/media/usb/dvb-usb/gp8psk.c
> @@ -182,7 +182,7 @@ static int gp8psk_load_bcm4500fw(struct dvb_usb_device *d)
>
>  static int gp8psk_power_ctrl(struct dvb_usb_device *d, int onoff)
>  {
> -   u8 status, buf;
> +   u8 status = 0, buf;
> int gp_product_id = le16_to_cpu(d->udev->descriptor.idProduct);
>
> if (onoff) {
> --
> 2.28.0


Is it okay to ignore the return value of gp8psk_usb_in_op here?


Re: [PATCH net-next] net: switch to storing KCOV handle directly in sk_buff

2020-11-26 Thread Willem de Bruijn
On Thu, Nov 26, 2020 at 3:19 AM Marco Elver  wrote:
>
> On Wed, 25 Nov 2020 at 21:43, Jakub Kicinski  wrote:
> >
> > On Wed, 25 Nov 2020 18:34:36 +0100 Marco Elver wrote:
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index ffe3dcc0ebea..070b1077d976 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -233,6 +233,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > > gfp_mask,
> > >   skb->end = skb->tail + size;
> > >   skb->mac_header = (typeof(skb->mac_header))~0U;
> > >   skb->transport_header = (typeof(skb->transport_header))~0U;
> > > + skb_set_kcov_handle(skb, kcov_common_handle());
> > >
> > >   /* make sure we initialize shinfo sequentially */
> > >   shinfo = skb_shinfo(skb);
> > > @@ -249,9 +250,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t 
> > > gfp_mask,
> > >
> > >   fclones->skb2.fclone = SKB_FCLONE_CLONE;
> > >   }
> > > -
> > > - skb_set_kcov_handle(skb, kcov_common_handle());
> >
> > Why the move?
>
> v2 of the original series had it above. I frankly don't mind.
>
> 1. Group it with the other fields above?
>
> 2. Leave it at the end here?
>
> > >  out:
> > >   return skb;
> > >  nodata:
> > > @@ -285,8 +283,6 @@ static struct sk_buff *__build_skb_around(struct 
> > > sk_buff *skb,
> > >   memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> > >   atomic_set(&shinfo->dataref, 1);
> > >
> > > - skb_set_kcov_handle(skb, kcov_common_handle());
> > > -
> > >   return skb;
> > >  }
> >
> > And why are we dropping this?
>
> It wasn't here originally.
>
> > If this was omitted in earlier versions it's just a independent bug,
> > I don't think build_skb() will call __alloc_skb(), so we need a to
> > set the handle here.
>
> Correct, that was an original omission.
>
> Will send v2.

Does it make more sense to revert the patch that added the extensions
and the follow-on fixes and add a separate new patch instead?

If adding a new field to the skb, even if only in debug builds,
please check with pahole how it affects struct layout if you
haven't yet.

The skb_extensions idea was mine. Apologies for steering
this into an apparently unsuccessful direction. Adding new fields
to the skb is very rare because it is potentially problematic wrt allocation.


[PATCH v4 4/4] selftests/filesystems: expand epoll with epoll_pwait2

2020-11-21 Thread Willem de Bruijn
From: Willem de Bruijn 

Code coverage for the epoll_pwait2 syscall.

epoll62: Repeat basic test epoll1, but exercising the new syscall.
epoll63: Pass a timespec and exercise the timeout wakeup path.

Changes
  v4:
  - fix sys_epoll_pwait2 to take __kernel_timespec (Arnd).
  - fix sys_epoll_pwait2 to have sigsetsize arg.

Signed-off-by: Willem de Bruijn 
---
 .../filesystems/epoll/epoll_wakeup_test.c | 72 +++
 1 file changed, 72 insertions(+)

diff --git a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c 
b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
index 8f82f99f7748..ad7fabd575f9 100644
--- a/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
+++ b/tools/testing/selftests/filesystems/epoll/epoll_wakeup_test.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 
 #define _GNU_SOURCE
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -21,6 +23,19 @@ struct epoll_mtcontext
pthread_t waiter;
 };
 
+#ifndef __NR_epoll_pwait2
+#define __NR_epoll_pwait2 -1
+#endif
+
+static inline int sys_epoll_pwait2(int fd, struct epoll_event *events,
+  int maxevents,
+  const struct __kernel_timespec *timeout,
+  const sigset_t *sigset, size_t sigsetsize)
+{
+   return syscall(__NR_epoll_pwait2, fd, events, maxevents, timeout,
+  sigset, sigsetsize);
+}
+
 static void signal_handler(int signum)
 {
 }
@@ -3377,4 +3392,61 @@ TEST(epoll61)
close(ctx.evfd);
 }
 
+/* Equivalent to basic test epoll1, but exercising epoll_pwait2. */
+TEST(epoll62)
+{
+   int efd;
+   int sfd[2];
+   struct epoll_event e;
+
+   ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sfd), 0);
+
+   efd = epoll_create(1);
+   ASSERT_GE(efd, 0);
+
+   e.events = EPOLLIN;
+   ASSERT_EQ(epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e), 0);
+
+   ASSERT_EQ(write(sfd[1], "w", 1), 1);
+
+   EXPECT_EQ(sys_epoll_pwait2(efd, &e, 1, NULL, NULL, 0), 1);
+   EXPECT_EQ(sys_epoll_pwait2(efd, &e, 1, NULL, NULL, 0), 1);
+
+   close(efd);
+   close(sfd[0]);
+   close(sfd[1]);
+}
+
+/* Epoll_pwait2 basic timeout test. */
+TEST(epoll63)
+{
+   const int cfg_delay_ms = 10;
+   unsigned long long tdiff;
+   struct __kernel_timespec ts;
+   int efd;
+   int sfd[2];
+   struct epoll_event e;
+
+   ASSERT_EQ(socketpair(AF_UNIX, SOCK_STREAM, 0, sfd), 0);
+
+   efd = epoll_create(1);
+   ASSERT_GE(efd, 0);
+
+   e.events = EPOLLIN;
+   ASSERT_EQ(epoll_ctl(efd, EPOLL_CTL_ADD, sfd[0], &e), 0);
+
+   ts.tv_sec = 0;
+   ts.tv_nsec = cfg_delay_ms * 1000 * 1000;
+
+   tdiff = msecs();
+   EXPECT_EQ(sys_epoll_pwait2(efd, &e, 1, &ts, NULL, 0), 0);
+   tdiff = msecs() - tdiff;
+
+   EXPECT_GE(tdiff, cfg_delay_ms);
+
+   close(efd);
+   close(sfd[0]);
+   close(sfd[1]);
+}
+
 TEST_HARNESS_MAIN
-- 
2.29.2.454.gaff20da3a2-goog
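
The msecs() helper that epoll63 relies on is not in the hunks shown
here; presumably it is defined elsewhere in the test file. A minimal
version might look like:

#include <stdint.h>
#include <time.h>

/* monotonic time in milliseconds */
static uint64_t msecs(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000ULL + ts.tv_nsec / 1000000ULL;
}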



[PATCH v4 3/4] epoll: wire up syscall epoll_pwait2

2020-11-21 Thread Willem de Bruijn
From: Willem de Bruijn 

Split off from prev patch in the series that implements the syscall.

Signed-off-by: Willem de Bruijn 
---
 arch/alpha/kernel/syscalls/syscall.tbl  | 1 +
 arch/arm/tools/syscall.tbl  | 1 +
 arch/arm64/include/asm/unistd.h | 2 +-
 arch/arm64/include/asm/unistd32.h   | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl   | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl   | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl| 1 +
 arch/s390/kernel/syscalls/syscall.tbl   | 1 +
 arch/sh/kernel/syscalls/syscall.tbl | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl  | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl  | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl | 1 +
 include/linux/compat.h  | 6 ++
 include/linux/syscalls.h| 5 +
 include/uapi/asm-generic/unistd.h   | 4 +++-
 kernel/sys_ni.c | 2 ++
 22 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl 
b/arch/alpha/kernel/syscalls/syscall.tbl
index c5cc5bfa2062..506e59a9ff87 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -481,3 +481,4 @@
 549common  faccessat2  sys_faccessat2
 550common  process_madvise sys_process_madvise
 551common  watch_mount sys_watch_mount
+553common  epoll_pwait2sys_epoll_pwait2
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index 47325b3b661a..dbde88a855b6 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -455,3 +455,4 @@
 439common  faccessat2  sys_faccessat2
 440common  process_madvise sys_process_madvise
 441common  watch_mount sys_watch_mount
+443common  epoll_pwait2sys_epoll_pwait2
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 949788f5ba40..d1f7d35f986e 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -38,7 +38,7 @@
 #define __ARM_NR_compat_set_tls(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls   443
+#define __NR_compat_syscalls   444
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h 
b/arch/arm64/include/asm/unistd32.h
index c71c3fe0b6cd..b84e24a7e2c0 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -893,6 +893,8 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
 __SYSCALL(__NR_watch_mount, sys_watch_mount)
 #define __NR_memfd_secret 442
 __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#define __NR_epoll_pwait2 443
+__SYSCALL(__NR_epoll_pwait2, sys_epoll_pwait2)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl 
b/arch/ia64/kernel/syscalls/syscall.tbl
index 033244462350..c8809959636f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -362,3 +362,4 @@
 439common  faccessat2  sys_faccessat2
 440common  process_madvise sys_process_madvise
 441common  watch_mount sys_watch_mount
+443common  epoll_pwait2sys_epoll_pwait2
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl 
b/arch/m68k/kernel/syscalls/syscall.tbl
index efd3ecb3cdfc..dde585616bf8 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -441,3 +441,4 @@
 439common  faccessat2  sys_faccessat2
 440common  process_madvise sys_process_madvise
 441common  watch_mount sys_watch_mount
+443common  epoll_pwait2sys_epoll_pwait2
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl 
b/arch/microblaze/kernel/syscalls/syscall.tbl
index 67ae5a5e4d21..4c09f27fedd0 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -447,3 +447,4 @@
 439common  faccessat2  sys_faccessat2
 440common  process_madvise sys_process_madvise
 441common  watch_mount sys_watch_mount
+443common  epoll_pwait2sys_epoll_pwait2
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl 
b/arch/mips/kernel/syscalls/syscall_n32.tbl
index c59bc6acc47a..00921244242d 100644
--- a/arch

[PATCH v4 1/4] epoll: convert internal api to timespec64

2020-11-21 Thread Willem de Bruijn
From: Willem de Bruijn 

Make epoll more consistent with select/poll: pass along the timeout as
timespec64 pointer.

In anticipation of additional changes affecting all three polling
mechanisms:

- add epoll_pwait2 syscall with timespec semantics,
  and share poll_select_set_timeout implementation.
- compute slack before conversion to absolute time,
  to save one ktime_get_ts64 call.

Signed-off-by: Willem de Bruijn 
---
 fs/eventpoll.c | 57 --
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 297aeb0ee9d1..7082dfbc3166 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -1714,15 +1714,25 @@ static int ep_send_events(struct eventpoll *ep,
return res;
 }
 
-static inline struct timespec64 ep_set_mstimeout(long ms)
+static struct timespec64 *ep_timeout_to_timespec(struct timespec64 *to, long ms)
 {
-   struct timespec64 now, ts = {
-   .tv_sec = ms / MSEC_PER_SEC,
-   .tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC),
-   };
+   struct timespec64 now;
+
+   if (ms < 0)
+   return NULL;
+
+   if (!ms) {
+   to->tv_sec = 0;
+   to->tv_nsec = 0;
+   return to;
+   }
+
+   to->tv_sec = ms / MSEC_PER_SEC;
+   to->tv_nsec = NSEC_PER_MSEC * (ms % MSEC_PER_SEC);
 
ktime_get_ts64(&now);
-   return timespec64_add_safe(now, ts);
+   *to = timespec64_add_safe(now, *to);
+   return to;
 }
 
 /**
@@ -1734,8 +1744,8 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
  *  stored.
  * @maxevents: Size (in terms of number of events) of the caller event buffer.
  * @timeout: Maximum timeout for the ready events fetch operation, in
- *   milliseconds. If the @timeout is zero, the function will not 
block,
- *   while if the @timeout is less than zero, the function will block
+ *   timespec. If the timeout is zero, the function will not block,
+ *   while if the @timeout ptr is NULL, the function will block
  *   until at least one event has been retrieved (or an error
  *   occurred).
  *
@@ -1743,7 +1753,7 @@ static inline struct timespec64 ep_set_mstimeout(long ms)
  *  error code, in case of error.
  */
 static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
-  int maxevents, long timeout)
+  int maxevents, struct timespec64 *timeout)
 {
int res, eavail, timed_out = 0;
u64 slack = 0;
@@ -1752,13 +1762,11 @@ static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events,
 
lockdep_assert_irqs_enabled();
 
-   if (timeout > 0) {
-   struct timespec64 end_time = ep_set_mstimeout(timeout);
-
-   slack = select_estimate_accuracy(&end_time);
+   if (timeout && (timeout->tv_sec | timeout->tv_nsec)) {
+   slack = select_estimate_accuracy(timeout);
to = &expires;
-   *to = timespec64_to_ktime(end_time);
-   } else if (timeout == 0) {
+   *to = timespec64_to_ktime(*timeout);
+   } else if (timeout) {
/*
 * Avoid the unnecessary trip to the wait queue loop, if the
 * caller specified a non blocking operation.
@@ -2177,7 +2185,7 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
  * part of the user space epoll_wait(2).
  */
 static int do_epoll_wait(int epfd, struct epoll_event __user *events,
-int maxevents, int timeout)
+int maxevents, struct timespec64 *to)
 {
int error;
struct fd f;
@@ -2211,7 +2219,7 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
ep = f.file->private_data;
 
/* Time to fish for events ... */
-   error = ep_poll(ep, events, maxevents, timeout);
+   error = ep_poll(ep, events, maxevents, to);
 
 error_fput:
fdput(f);
@@ -2221,7 +2229,10 @@ static int do_epoll_wait(int epfd, struct epoll_event __user *events,
 SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout)
 {
-   return do_epoll_wait(epfd, events, maxevents, timeout);
+   struct timespec64 to;
+
+   return do_epoll_wait(epfd, events, maxevents,
+ep_timeout_to_timespec(&to, timeout));
 }
 
 /*
@@ -2232,6 +2243,7 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
int, maxevents, int, timeout, const sigset_t __user *, sigmask,
size_t, sigsetsize)
 {
+   struct timespec64 to;
int error;
 
/*
@@ -2242,7 +2254,9 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (error)
return error;
 
-   error = do_epoll_wait(epfd, events, maxevents, timeout);

[PATCH v4 2/4] epoll: add syscall epoll_pwait2

2020-11-21 Thread Willem de Bruijn
From: Willem de Bruijn 

Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution
that replaces int timeout with struct timespec. It is equivalent
otherwise.

int epoll_pwait2(int fd, struct epoll_event *events,
 int maxevents,
 const struct timespec *timeout,
 const sigset_t *sigset);

The underlying hrtimer is already programmed with nsec resolution.
pselect and ppoll also set nsec resolution timeout with timespec.

The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs
the same.

For timespec, only support this new interface on 2038 aware platforms
that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.
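
A userspace usage sketch: raw syscall(2), since libc has no wrapper
yet, and assuming a 64-bit platform where struct timespec matches
__kernel_timespec (with __NR_epoll_pwait2 from the patched headers):

#include <sys/epoll.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* wait up to 250 usec: impossible to express with epoll_wait's int msec */
static int wait_250us(int epfd, struct epoll_event *ev)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 250 * 1000 };

	return syscall(__NR_epoll_pwait2, epfd, ev, 1, &ts, NULL, 0);
}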

Changes
  v4:
  - on top of patch that converts eventpoll.c to pass timespec64
  - split off wiring up the syscall
  - fix alpha syscall number (Arnd)
  v3:
  - rewrite: add epoll_pwait2 syscall instead of epoll_create1 flag
  v2:
  - cast to s64: avoid overflow on 32-bit platforms (Shuo Chen)
  - minor commit message rewording

Signed-off-by: Willem de Bruijn 
---
 fs/eventpoll.c | 87 ++
 1 file changed, 73 insertions(+), 14 deletions(-)

diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 7082dfbc3166..c6d0ab3aaff1 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -2239,11 +2239,10 @@ SYSCALL_DEFINE4(epoll_wait, int, epfd, struct epoll_event __user *, events,
  * Implement the event wait interface for the eventpoll file. It is the kernel
  * part of the user space epoll_pwait(2).
  */
-SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
-   int, maxevents, int, timeout, const sigset_t __user *, sigmask,
-   size_t, sigsetsize)
+static int do_epoll_pwait(int epfd, struct epoll_event __user *events,
+ int maxevents, struct timespec64 *to,
+ const sigset_t __user *sigmask, size_t sigsetsize)
 {
-   struct timespec64 to;
int error;
 
/*
@@ -2254,22 +2253,48 @@ SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
if (error)
return error;
 
-   error = do_epoll_wait(epfd, events, maxevents,
- ep_timeout_to_timespec(, timeout));
+   error = do_epoll_wait(epfd, events, maxevents, to);
 
restore_saved_sigmask_unless(error == -EINTR);
 
return error;
 }
 
-#ifdef CONFIG_COMPAT
-COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
-   struct epoll_event __user *, events,
-   int, maxevents, int, timeout,
-   const compat_sigset_t __user *, sigmask,
-   compat_size_t, sigsetsize)
+SYSCALL_DEFINE6(epoll_pwait, int, epfd, struct epoll_event __user *, events,
+   int, maxevents, int, timeout, const sigset_t __user *, sigmask,
+   size_t, sigsetsize)
 {
struct timespec64 to;
+
+   return do_epoll_pwait(epfd, events, maxevents,
+ ep_timeout_to_timespec(&to, timeout),
+ sigmask, sigsetsize);
+}
+
+SYSCALL_DEFINE6(epoll_pwait2, int, epfd, struct epoll_event __user *, events,
+   int, maxevents, const struct __kernel_timespec __user *, timeout,
+   const sigset_t __user *, sigmask, size_t, sigsetsize)
+{
+   struct timespec64 ts, *to = NULL;
+
+   if (timeout) {
+   if (get_timespec64(&ts, timeout))
+   return -EFAULT;
+   to = &ts;
+   if (poll_select_set_timeout(to, ts.tv_sec, ts.tv_nsec))
+   return -EINVAL;
+   }
+
+   return do_epoll_pwait(epfd, events, maxevents, to,
+ sigmask, sigsetsize);
+}
+
+#ifdef CONFIG_COMPAT
+static int do_compat_epoll_pwait(int epfd, struct epoll_event __user *events,
+int maxevents, struct timespec64 *timeout,
+const compat_sigset_t __user *sigmask,
+compat_size_t sigsetsize)
+{
long err;
 
/*
@@ -2280,13 +2305,47 @@ COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
if (err)
return err;
 
-   err = do_epoll_wait(epfd, events, maxevents,
-   ep_timeout_to_timespec(&to, timeout));
+   err = do_epoll_wait(epfd, events, maxevents, timeout);
 
restore_saved_sigmask_unless(err == -EINTR);
 
return err;
 }
+
+COMPAT_SYSCALL_DEFINE6(epoll_pwait, int, epfd,
+  struct epoll_event __user *, events,
+  int, maxevents, int, timeout,
+  const compat_sigset_t __user *, sigmask,
+  compat_size_t, sigsetsize)
+{
+   struct timespec64 to;
+
+   return do_compat_epoll_pwait(epfd, events, maxevents,
+ep_timeout_to_timespec(&to, timeout

[PATCH v4 0/4] add epoll_pwait2 syscall

2020-11-21 Thread Willem de Bruijn
From: Willem de Bruijn 

Enable nanosecond timeouts for epoll.

Analogous to pselect and ppoll, introduce an epoll_wait syscall
variant that takes a struct timespec instead of int timeout.

See patch 2 for more details.

patch 1: pre patch cleanup: convert internal epoll to timespec64
patch 2: add syscall
patch 3: wire up syscall
patch 4: selftest

Applies cleanly to next-20201120
No update to man-pages since v3, see commit
https://lore.kernel.org/patchwork/patch/1341103/

Willem de Bruijn (4):
  epoll: convert internal api to timespec64
  epoll: add syscall epoll_pwait2
  epoll: wire up syscall epoll_pwait2
  selftests/filesystems: expand epoll with epoll_pwait2

 arch/alpha/kernel/syscalls/syscall.tbl|   1 +
 arch/arm/tools/syscall.tbl|   1 +
 arch/arm64/include/asm/unistd.h   |   2 +-
 arch/arm64/include/asm/unistd32.h |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl   |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl  |   1 +
 arch/s390/kernel/syscalls/syscall.tbl |   1 +
 arch/sh/kernel/syscalls/syscall.tbl   |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl|   1 +
 arch/x86/entry/syscalls/syscall_32.tbl|   1 +
 arch/x86/entry/syscalls/syscall_64.tbl|   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl   |   1 +
 fs/eventpoll.c| 130 ++
 include/linux/compat.h|   6 +
 include/linux/syscalls.h  |   5 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c   |   2 +
 .../filesystems/epoll/epoll_wakeup_test.c |  72 ++
 24 files changed, 210 insertions(+), 29 deletions(-)

-- 
2.29.2.454.gaff20da3a2-goog


