Re: [PATCH] uprobes: prevent mutex_lock() under rcu_read_lock()

2024-05-21 Thread Breno Leitao
On Mon, May 20, 2024 at 10:30:17PM -0700, Andrii Nakryiko wrote:
> Recent changes made uprobe_cpu_buffer preparation lazy, and moved it
> deeper into __uprobe_trace_func(). This is problematic because
> __uprobe_trace_func() is called inside rcu_read_lock()/rcu_read_unlock()
> block, which then calls prepare_uprobe_buffer() -> uprobe_buffer_get() ->
> mutex_lock(>mutex), leading to a splat about using mutex under
> non-sleepable RCU:
> 
>   BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:585
>in_atomic(): 0, irqs_disabled(): 0, non_block: 0, pid: 98231, name: 
> stress-ng-sigq
>preempt_count: 0, expected: 0
>RCU nest depth: 1, expected: 0
>...
>Call Trace:
> 
> dump_stack_lvl+0x3d/0xe0
> __might_resched+0x24c/0x270
> ? prepare_uprobe_buffer+0xd5/0x1d0
> __mutex_lock+0x41/0x820
> ? ___perf_sw_event+0x206/0x290
> ? __perf_event_task_sched_in+0x54/0x660
> ? __perf_event_task_sched_in+0x54/0x660
> prepare_uprobe_buffer+0xd5/0x1d0
> __uprobe_trace_func+0x4a/0x140
> uprobe_dispatcher+0x135/0x280
> ? uprobe_dispatcher+0x94/0x280
> uprobe_notify_resume+0x650/0xec0
> ? atomic_notifier_call_chain+0x21/0x110
> ? atomic_notifier_call_chain+0xf8/0x110
> irqentry_exit_to_user_mode+0xe2/0x1e0
> asm_exc_int3+0x35/0x40
>RIP: 0033:0x7f7e1d4da390
>Code: 33 04 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b9 01 00 00 00 e9 b2 fc ff 
> ff 66 90 f3 0f 1e fa 31 c9 e9 a5 fc ff ff 0f 1f 44 00 00  0f 1e fa b8 27 
> 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 6e
>RSP: 002b:7ffd2abc3608 EFLAGS: 0246
>RAX:  RBX: 76d325f1 RCX: 
>RDX: 76d325f1 RSI: 000a RDI: 7ffd2abc3690
>RBP: 000a R08: 00017fb7 R09: 00017fb7
>R10: 00017fb7 R11: 0246 R12: 00017ff2
>R13: 7ffd2abc3610 R14:  R15: 7ffd2abc3780
> 
> 
> Luckily, it's easy to fix by moving prepare_uprobe_buffer() to be called
> slightly earlier: into uprobe_trace_func() and uretprobe_trace_func(), outside
> of RCU locked section. This still keeps this buffer preparation lazy and helps
> avoid the overhead when it's not needed. E.g., if there is only BPF uprobe
> handler installed on a given uprobe, buffer won't be initialized.
> 
> Note, the other user of prepare_uprobe_buffer(), __uprobe_perf_func(), is not
> affected, as it doesn't prepare buffer under RCU read lock.
> 
> Fixes: 1b8f85defbc8 ("uprobes: prepare uprobe args buffer lazily")
> Reported-by: Breno Leitao 
> Signed-off-by: Andrii Nakryiko 

Tested-by: Breno Leitao 



[PATCH net v4] virtio_net: Do not send RSS key if it is not supported

2024-04-03 Thread Breno Leitao
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

  if (!sz) {
  virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

  while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
Reviewed-by: Heng Qi 
---
Changelog:

V2:
  * Moved from creating a valid packet, by rejecting the request
completely.
V3:
  * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
the rejection path.
V4:
  * Added a comment in an "if" clause, as suggested by Michael S. Tsirkin.

---
 drivers/net/virtio_net.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..115c3c5414f2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct netlink_ext_ack *extack)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   bool update = false;
int i;
 
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
@@ -3814,13 +3815,28 @@ static int virtnet_set_rxfh(struct net_device *dev,
return -EOPNOTSUPP;
 
if (rxfh->indir) {
+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
+   update = true;
}
-   if (rxfh->key)
+
+   if (rxfh->key) {
+   /* If either _F_HASH_REPORT or _F_RSS are negotiated, the
+* device provides hash calculation capabilities, that is,
+* hash_key is configured.
+*/
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
+   update = true;
+   }
 
-   virtnet_commit_rss_command(vi);
+   if (update)
+   virtnet_commit_rss_command(vi);
 
return 0;
 }
@@ -4729,13 +4745,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 
-- 
2.43.0




Re: [PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-04-03 Thread Breno Leitao
On Sun, Mar 31, 2024 at 04:20:30PM -0400, Michael S. Tsirkin wrote:
> On Fri, Mar 29, 2024 at 10:16:41AM -0700, Breno Leitao wrote:
> > @@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > return -EOPNOTSUPP;
> >  
> > if (rxfh->indir) {
> > +   if (!vi->has_rss)
> > +   return -EOPNOTSUPP;
> > +
> > for (i = 0; i < vi->rss_indir_table_size; ++i)
> > vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> > +   update = true;
> > }
> > -   if (rxfh->key)
> > +
> > +   if (rxfh->key) {
> > +   if (!vi->has_rss && !vi->has_rss_hash_report)
> > +   return -EOPNOTSUPP;
> 
> 
> What's the logic here? Is it || or &&? A comment can't hurt.

If txfh carries a key, then the device needs to has either has_rss or
has_rss_hash_report "features".

These are basically virtio features VIRTIO_NET_F_HASH_REPORT and
VIRTIO_NET_F_RSS that are set at virtio_probe.

I will add the comment and respin the series.



[PATCH net v3] virtio_net: Do not send RSS key if it is not supported

2024-03-29 Thread Breno Leitao
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

  if (!sz) {
  virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

  while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
---
Changelog:

V2:
  * Moved from creating a valid packet, by rejecting the request
completely
V3:
  * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
the rejection path.

---
 drivers/net/virtio_net.c | 22 ++
 1 file changed, 18 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..c4a21ec51adf 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct netlink_ext_ack *extack)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   bool update = false;
int i;
 
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
@@ -3814,13 +3815,24 @@ static int virtnet_set_rxfh(struct net_device *dev,
return -EOPNOTSUPP;
 
if (rxfh->indir) {
+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
+   update = true;
}
-   if (rxfh->key)
+
+   if (rxfh->key) {
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
+   update = true;
+   }
 
-   virtnet_commit_rss_command(vi);
+   if (update)
+   virtnet_commit_rss_command(vi);
 
return 0;
 }
@@ -4729,13 +4741,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 
-- 
2.43.0




Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-28 Thread Breno Leitao
> > On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote:
> > > On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  
> > > wrote:
> > > > Do not set virtnet_info->rss_indir_table_size if RSS is not available
> > > > for the device.
> > > >
> > > > Currently, rss_indir_table_size is set if either has_rss or
> > > > has_rss_hash_report is available, but, it should only be set if has_rss
> > > > is set.
> > > >
> > > > On the virtnet_set_rxfh(), return an invalid command if the request has
> > > > indirection table set, but virtnet does not support RSS.
> > > >
> > > > Suggested-by: Heng Qi 
> > > > Signed-off-by: Breno Leitao 
> > > > ---
> > > >  drivers/net/virtio_net.c | 9 +++--
> > > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index c22d1118a133..c640fdf28fc5 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device 
> > > > *dev,
> > > > rxfh->hfunc != ETH_RSS_HASH_TOP)
> > > > return -EOPNOTSUPP;
> > > >
> > > > +   if (rxfh->indir && !vi->has_rss)
> > > > +   return -EINVAL;
> > > > +
> > > > if (rxfh->indir) {
> > >
> > > Put !vi->has_rss here?
> >
> > I am not sure I understand the suggestion. Where do you suggest we have
> > !vi->has_rss?
> >
> > If we got this far, we either have:
> >
> > a) rxfh->indir set and vi->has_rss is also set
> > b) rxfh->indir not set. (vi->has_rss could be set or not).
> 
> 
> This function does two tasks.
> 1. update indir table
> 2. update rss key
> 
> #1 only for has_rss
> #2 for has_rss or has_rss_hash_report
> 
> 
> So I would code:
> 
>   bool update = false
> 
>   if (rxfh->indir) {
>   if (!vi->has_rss)
>   return -EINVAL;
> 
>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> 
>   update = true
>   }
> 
>   if (rxfh->key) {
>   if (!vi->has_rss && !vi->has_rss_hash_report)
>   return -EINVAL;
> 
>   memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
>   update = true
>   }
> 
> 
>   if (update)
>   virtnet_commit_rss_command(vi);
> 
> Thanks.

This is a bit different from the previous proposal, but, I like this one
approach better. It makes the code easier to read.

How does the full patch looks like?


diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..180f342f1898 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3807,20 +3807,35 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct netlink_ext_ack *extack)
 {
struct virtnet_info *vi = netdev_priv(dev);
+   bool update = false;
int i;
 
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
 
if (rxfh->indir) {
+   if (!vi->has_rss)
+   return -EINVAL;
+
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
+   update = true;
}
-   if (rxfh->key)
+
+   if (rxfh->key) {
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EINVAL;
+
memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
+   update = true;
+   }
 
-   virtnet_commit_rss_command(vi);
+   if (update)
+   virtnet_commit_rss_command(vi);
 
return 0;
 }
@@ -4729,13 +4744,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 



Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-27 Thread Breno Leitao
Hello Xuan,

On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote:
> On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> > Do not set virtnet_info->rss_indir_table_size if RSS is not available
> > for the device.
> >
> > Currently, rss_indir_table_size is set if either has_rss or
> > has_rss_hash_report is available, but, it should only be set if has_rss
> > is set.
> >
> > On the virtnet_set_rxfh(), return an invalid command if the request has
> > indirection table set, but virtnet does not support RSS.
> >
> > Suggested-by: Heng Qi 
> > Signed-off-by: Breno Leitao 
> > ---
> >  drivers/net/virtio_net.c | 9 +++--
> >  1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c22d1118a133..c640fdf28fc5 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > rxfh->hfunc != ETH_RSS_HASH_TOP)
> > return -EOPNOTSUPP;
> >
> > +   if (rxfh->indir && !vi->has_rss)
> > +   return -EINVAL;
> > +
> > if (rxfh->indir) {
> 
> Put !vi->has_rss here?

I am not sure I understand the suggestion. Where do you suggest we have
!vi->has_rss?

If we got this far, we either have:

a) rxfh->indir set and vi->has_rss is also set
b) rxfh->indir not set. (vi->has_rss could be set or not).

Thanks for the review,
Breno



Re: [PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported

2024-03-27 Thread Breno Leitao
On Wed, Mar 27, 2024 at 10:27:58AM +0800, Heng Qi wrote:
> 
> 
> 在 2024/3/26 下午11:19, Breno Leitao 写道:
> > There is a bug when setting the RSS options in virtio_net that can break
> > the whole machine, getting the kernel into an infinite loop.
> > 
> > Running the following command in any QEMU virtual machine with virtionet
> > will reproduce this problem:
> > 
> >  # ethtool -X eth0  hfunc toeplitz
> > 
> > This is how the problem happens:
> > 
> > 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
> > 
> > 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
> > 
> > 3) virtnet_commit_rss_command() populates 4 entries for the rss
> > scatter-gather
> > 
> > 4) Since the command above does not have a key, then the last
> > scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > sg_buf_size = vi->rss_key_size;
> > 
> > 5) This buffer is passed to qemu, but qemu is not happy with a buffer
> > with zero length, and do the following in virtqueue_map_desc() (QEMU
> > function):
> > 
> >if (!sz) {
> >virtio_error(vdev, "virtio: zero sized buffers are not allowed");
> > 
> > 6) virtio_error() (also QEMU function) set the device as broken
> > 
> >  vdev->broken = true;
> > 
> > 7) Qemu bails out, and do not repond this crazy kernel.
> > 
> > 8) The kernel is waiting for the response to come back (function
> > virtnet_send_command())
> > 
> > 9) The kernel is waiting doing the following :
> > 
> >while (!virtqueue_get_buf(vi->cvq, ) &&
> >  !virtqueue_is_broken(vi->cvq))
> >   cpu_relax();
> > 
> > 10) None of the following functions above is true, thus, the kernel
> > loops here forever. Keeping in mind that virtqueue_is_broken() does
> > not look at the qemu `vdev->broken`, so, it never realizes that the
> > vitio is broken at QEMU side.
> > 
> > Fix it by not sending RSS commands if the feature is not available in
> > the device.
> > 
> > Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> > Cc: sta...@vger.kernel.org
> > Cc: qemu-de...@nongnu.org
> > Signed-off-by: Breno Leitao 
> > ---
> >   drivers/net/virtio_net.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c640fdf28fc5..e6b0eaf08ac2 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > struct virtnet_info *vi = netdev_priv(dev);
> > int i;
> > +   if (!vi->has_rss && !vi->has_rss_hash_report)
> > +   return -EOPNOTSUPP;
> > +
> 
> Why not make the second patch as the first, this seems to work better.

Sure, that works for me. Let me update it in v2.



[PATCH net v2 2/2] virtio_net: Do not send RSS key if it is not supported

2024-03-26 Thread Breno Leitao
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
scatter-gather

4) Since the command above does not have a key, then the last
scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
with zero length, and do the following in virtqueue_map_desc() (QEMU
function):

  if (!sz) {
  virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
virtnet_send_command())

9) The kernel is waiting doing the following :

  while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending RSS commands if the feature is not available in
the device.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
Signed-off-by: Breno Leitao 
---
 drivers/net/virtio_net.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c640fdf28fc5..e6b0eaf08ac2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3809,6 +3809,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
+   if (!vi->has_rss && !vi->has_rss_hash_report)
+   return -EOPNOTSUPP;
+
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
-- 
2.43.0




[PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-26 Thread Breno Leitao
Do not set virtnet_info->rss_indir_table_size if RSS is not available
for the device.

Currently, rss_indir_table_size is set if either has_rss or
has_rss_hash_report is available, but, it should only be set if has_rss
is set.

On the virtnet_set_rxfh(), return an invalid command if the request has
indirection table set, but virtnet does not support RSS.

Suggested-by: Heng Qi 
Signed-off-by: Breno Leitao 
---
 drivers/net/virtio_net.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c22d1118a133..c640fdf28fc5 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;
 
+   if (rxfh->indir && !vi->has_rss)
+   return -EINVAL;
+
if (rxfh->indir) {
for (i = 0; i < vi->rss_indir_table_size; ++i)
vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
@@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
vi->has_rss_hash_report = true;
 
-   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
vi->has_rss = true;
 
-   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
+   }
+
+   if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));
 
-- 
2.43.0




Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Breno Leitao
Hello Xuan,

On Mon, Mar 25, 2024 at 01:57:53PM +0800, Xuan Zhuo wrote:
> On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> > Hello Xuan,
> >
> > On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  
> > > wrote:
> >
> > > > 4) Since the command above does not have a key, then the last
> > > >scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > > > sg_buf_size = vi->rss_key_size;
> > >
> > >
> > >
> > >   if (vi->has_rss || vi->has_rss_hash_report) {
> > >   vi->rss_indir_table_size =
> > >   virtio_cread16(vdev, offsetof(struct virtio_net_config,
> > >   rss_max_indirection_table_length));
> > >   vi->rss_key_size =
> > >   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> > > rss_max_key_size));
> > >
> > >   vi->rss_hash_types_supported =
> > >   virtio_cread32(vdev, offsetof(struct virtio_net_config, 
> > > supported_hash_types));
> > >   vi->rss_hash_types_supported &=
> > >   ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> > > VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> > > VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> > >
> > >   dev->hw_features |= NETIF_F_RXHASH;
> > >   }
> > >
> > >
> > > vi->rss_key_size is initiated here, I wonder if there is something wrong?
> >
> > Not really, the code above is never executed (in my machines). This is
> > because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
> >
> > Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> > VIRTIO_NET_F_HASH_REPORT features.
> >
> > Also, when I run `ethtool -x`, I got:
> >
> > # ethtool  -x eth0
> > RX flow hash indirection table for eth0 with 1 RX ring(s):
> > Operation not supported
> > RSS hash key:
> > Operation not supported
> > RSS hash function:
> > toeplitz: on
> > xor: off
> > crc32: off
> 
> 
> The spec saies:
>   Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
>   supports only one pair of virtqueues, it MUST support at least one of
>   commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
>   parameters:
> 
>   If the device offers VIRTIO_NET_F_RSS, it MUST support
>   VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.
> 
>   Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
>   per 5.1.6.5.6.4.
> 
> 
> So if we have not anyone of `vi->has_rss` and `vi->has_rss_hash_report`,
> we should return from virtnet_set_rxfh directly.

Makes sense. Although it is not clear to me how vi->has_rss_hash_report
is related here, but, I am convinced that we shouldn't do any RSS
operation if the device doesn't have the RSS feature, i.e, vi->has_rss
is false.

That said, I am thinking about something like this. How does it sound?

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5a7700b103f8..8c1ad7361cf2 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3780,6 +3780,9 @@ static int virtnet_set_rxfh(struct net_device 
*dev,
struct virtnet_info *vi = netdev_priv(dev);
int i;
 
+   if (!vi->has_rss)
+   return -EOPNOTSUPP;
+
if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
rxfh->hfunc != ETH_RSS_HASH_TOP)
return -EOPNOTSUPP;

Thanks!



Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-22 Thread Breno Leitao
Hello Xuan,

On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  wrote:

> > 4) Since the command above does not have a key, then the last
> >scatter-gatter entry will be zeroed, since rss_key_size == 0.
> > sg_buf_size = vi->rss_key_size;
> 
> 
> 
>   if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
> 
>   vi->rss_hash_types_supported =
>   virtio_cread32(vdev, offsetof(struct virtio_net_config, 
> supported_hash_types));
>   vi->rss_hash_types_supported &=
>   ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> 
>   dev->hw_features |= NETIF_F_RXHASH;
>   }
> 
> 
> vi->rss_key_size is initiated here, I wonder if there is something wrong?

Not really, the code above is never executed (in my machines). This is
because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.

Looking further, vdev does not have the VIRTIO_NET_F_RSS and
VIRTIO_NET_F_HASH_REPORT features.

Also, when I run `ethtool -x`, I got:

# ethtool  -x eth0
RX flow hash indirection table for eth0 with 1 RX ring(s):
Operation not supported
RSS hash key:
Operation not supported
RSS hash function:
toeplitz: on
xor: off
crc32: off



[PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-21 Thread Breno Leitao
There is a bug when setting the RSS options in virtio_net that can break
the whole machine, getting the kernel into an infinite loop.

Running the following command in any QEMU virtual machine with virtionet
will reproduce this problem:

# ethtool -X eth0  hfunc toeplitz

This is how the problem happens:

1) ethtool_set_rxfh() calls virtnet_set_rxfh()

2) virtnet_set_rxfh() calls virtnet_commit_rss_command()

3) virtnet_commit_rss_command() populates 4 entries for the rss
   scatter-gather

4) Since the command above does not have a key, then the last
   scatter-gatter entry will be zeroed, since rss_key_size == 0.
sg_buf_size = vi->rss_key_size;

5) This buffer is passed to qemu, but qemu is not happy with a buffer
   with zero length, and do the following in virtqueue_map_desc() (QEMU
   function):

  if (!sz) {
  virtio_error(vdev, "virtio: zero sized buffers are not allowed");

6) virtio_error() (also QEMU function) set the device as broken

vdev->broken = true;

7) Qemu bails out, and do not repond this crazy kernel.

8) The kernel is waiting for the response to come back (function
   virtnet_send_command())

9) The kernel is waiting doing the following :

  while (!virtqueue_get_buf(vi->cvq, ) &&
 !virtqueue_is_broken(vi->cvq))
  cpu_relax();

10) None of the following functions above is true, thus, the kernel
loops here forever. Keeping in mind that virtqueue_is_broken() does
not look at the qemu `vdev->broken`, so, it never realizes that the
vitio is broken at QEMU side.

Fix it by not sending the key scatter-gatter key if it is not set.

Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
Signed-off-by: Breno Leitao 
Cc: sta...@vger.kernel.org
Cc: qemu-de...@nongnu.org
---
 drivers/net/virtio_net.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d7ce4a1011ea..5a7700b103f8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3041,11 +3041,16 @@ static int virtnet_set_ringparam(struct net_device *dev,
 static bool virtnet_commit_rss_command(struct virtnet_info *vi)
 {
struct net_device *dev = vi->dev;
+   int has_key = vi->rss_key_size;
struct scatterlist sgs[4];
unsigned int sg_buf_size;
+   int nents = 3;
+
+   if (has_key)
+   nents += 1;
 
/* prepare sgs */
-   sg_init_table(sgs, 4);
+   sg_init_table(sgs, nents);
 
sg_buf_size = offsetof(struct virtio_net_ctrl_rss, indirection_table);
sg_set_buf([0], >ctrl->rss, sg_buf_size);
@@ -3057,8 +3062,13 @@ static bool virtnet_commit_rss_command(struct 
virtnet_info *vi)
- offsetof(struct virtio_net_ctrl_rss, max_tx_vq);
sg_set_buf([2], >ctrl->rss.max_tx_vq, sg_buf_size);
 
-   sg_buf_size = vi->rss_key_size;
-   sg_set_buf([3], vi->ctrl->rss.key, sg_buf_size);
+   if (has_key) {
+   /* Only populate if key is available, otherwise
+* populating a buffer with zero size breaks virtio
+*/
+   sg_buf_size = vi->rss_key_size;
+   sg_set_buf([3], vi->ctrl->rss.key, sg_buf_size);
+   }
 
if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
  vi->has_rss ? VIRTIO_NET_CTRL_MQ_RSS_CONFIG
-- 
2.43.0




[PATCH net-next v4] net: dqs: add NIC stall detector based on BQL

2024-03-04 Thread Breno Leitao
From: Jakub Kicinski 

softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski 
Signed-off-by: Breno Leitao 
---
v1:
  * https://lore.kernel.org/netdev/202306172057.jx7yhliu-...@intel.com/T/
v2:
  * Fix the documentation file in patch 0001, since patch 0002 will
touch it later.
  * Fix the kernel test robot issues, marking functions as statics.
  * Use #include  instead of .
  * Added some new comments around, mainly around barriers.
  * Format struct `netdev_queue_attribute` assignments to make
checkpatch happy.
  * Updated and fixed the path in sysfs-class-net-queues
documentation.
v3:
  * Sent patch 0002 against net-next.
- The first patch was accepted against 'net'
v4:
 * Added code documentation to clarify the history usage
 * Added better documentation for TX completion stall in
   Documentation/ABI/testing/sysfs-class-net-queues.
 * Changed stall_thrs and stall_max from "unsigned char" to "unsigned
   short"

 .../ABI/testing/sysfs-class-net-queues| 23 ++
 include/linux/dynamic_queue_limits.h  | 45 +++
 include/trace/events/napi.h   | 33 +
 lib/dynamic_queue_limits.c| 74 +++
 net/core/net-sysfs.c  | 62 
 5 files changed, 237 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
b/Documentation/ABI/testing/sysfs-class-net-queues
index 5bff64d256c2..84aa25e0d14d 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -96,3 +96,26 @@ Description:
Indicates the absolute minimum limit of bytes allowed to be
queued on this network device transmit queue. Default value is
0.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_thrs
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Tx completion stall detection threshold in ms. Kernel will
+   guarantee to detect all stalls longer than this threshold but
+   may also detect stalls longer than half of the threshold.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_cnt
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Number of detected Tx completion stalls.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_max
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Longest detected Tx completion stall. Write 0 to clear.
diff --git a/include/linux/dynamic_queue_limits.h 
b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..5693a4be0d9a 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_li

[PATCH net-next 2/2] net/vsockmon: Do not set zeroed statistics

2024-02-23 Thread Breno Leitao
Do not set rtnl_link_stats64 fields to zero, since they are zeroed
before ops->ndo_get_stats64 is called in core dev_get_stats() function.

Signed-off-by: Breno Leitao 
---
 drivers/net/vsockmon.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index a0b4dca36baf..a1ba5169ed5d 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -46,9 +46,6 @@ static void
 vsockmon_get_stats64(struct net_device *dev, struct rtnl_link_stats64 *stats)
 {
dev_lstats_read(dev, >rx_packets, >rx_bytes);
-
-   stats->tx_packets = 0;
-   stats->tx_bytes = 0;
 }
 
 static int vsockmon_is_valid_mtu(int new_mtu)
-- 
2.39.3




[PATCH net-next 1/2] net/vsockmon: Leverage core stats allocator

2024-02-23 Thread Breno Leitao
With commit 34d21de99cea9 ("net: Move {l,t,d}stats allocation to core and
convert veth & vrf"), stats allocation could be done on net core
instead of this driver.

With this new approach, the driver doesn't have to bother with error
handling (allocation failure checking, making sure free happens in the
right spot, etc). This is core responsibility now.

Remove the allocation in the vsockmon driver and leverage the network
core allocation instead.

Signed-off-by: Breno Leitao 
---
 drivers/net/vsockmon.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/drivers/net/vsockmon.c b/drivers/net/vsockmon.c
index b1bb1b04b664..a0b4dca36baf 100644
--- a/drivers/net/vsockmon.c
+++ b/drivers/net/vsockmon.c
@@ -13,19 +13,6 @@
 #define DEFAULT_MTU (VIRTIO_VSOCK_MAX_PKT_BUF_SIZE + \
 sizeof(struct af_vsockmon_hdr))
 
-static int vsockmon_dev_init(struct net_device *dev)
-{
-   dev->lstats = netdev_alloc_pcpu_stats(struct pcpu_lstats);
-   if (!dev->lstats)
-   return -ENOMEM;
-   return 0;
-}
-
-static void vsockmon_dev_uninit(struct net_device *dev)
-{
-   free_percpu(dev->lstats);
-}
-
 struct vsockmon {
struct vsock_tap vt;
 };
@@ -79,8 +66,6 @@ static int vsockmon_change_mtu(struct net_device *dev, int 
new_mtu)
 }
 
 static const struct net_device_ops vsockmon_ops = {
-   .ndo_init = vsockmon_dev_init,
-   .ndo_uninit = vsockmon_dev_uninit,
.ndo_open = vsockmon_open,
.ndo_stop = vsockmon_close,
.ndo_start_xmit = vsockmon_xmit,
@@ -112,6 +97,7 @@ static void vsockmon_setup(struct net_device *dev)
dev->flags = IFF_NOARP;
 
dev->mtu = DEFAULT_MTU;
+   dev->pcpu_stat_type = NETDEV_PCPU_STAT_LSTATS;
 }
 
 static struct rtnl_link_ops vsockmon_link_ops __read_mostly = {
-- 
2.39.3




Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Breno Leitao
On Wed, Feb 14, 2024 at 05:58:37PM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 5:49 PM Breno Leitao  wrote:
> >
> > On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> > > On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
> > > >
> > > > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > > > Please note that adding other sysfs entries is expensive for 
> > > > > > workloads
> > > > > > creating/deleting netdev and netns often.
> > > > > >
> > > > > > I _think_ we should find a way for not creating
> > > > > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  
> > > > > > directory
> > > > > > and files
> > > > > > for non BQL enabled devices (like loopback !)
> > > > >
> > > > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > > > would be pointless"? Obviously better to annotate the drivers which
> > > > > do have BQL support, but there's >50 of them on a quick count..
> > > >
> > > > Let me make sure I understand the suggestion above. We want to disable
> > > > BQL completely for devices that has dev->features & NETIF_F_LLTX or
> > > > dev->priv_flags & IFF_NO_QUEUE, right?
> > > >
> > > > Maybe we can add a ->enabled field in struct dql, and set it according
> > > > to the features above. Then we can created the sysfs and process the dql
> > > > operations based on that field. This should avoid some unnecessary calls
> > > > also, if we are not display sysfs.
> > > >
> > > > Here is a very simple PoC to represent what I had in mind. Am I in the
> > > > right direction?
> > >
> > > No, this was really about sysfs entries (aka dql_group)
> > >
> > > Partial patch would be:
> >
> > That is simpler than what I imagined. Thanks!
> >
> 
> >
> > for netdev_uses_bql(), would it be similar to what I proposed in the
> > previous message? Let me copy it here.
> >
> > static bool netdev_uses_bql(struct net_device *dev)
> > {
> >if (dev->features & NETIF_F_LLTX ||
> >dev->priv_flags & IFF_NO_QUEUE)
> >return false;
> >
> >return true;
> > }
> 
> I think this should be fine, yes.

Awesome, thanks.

I am planning to send this in separate from the "net: dqs: add NIC stall
detector based on BQL" patch since there isn't really a dependency here.



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Breno Leitao
On Wed, Feb 14, 2024 at 04:41:36PM +0100, Eric Dumazet wrote:
> On Wed, Feb 14, 2024 at 3:45 PM Breno Leitao  wrote:
> >
> > On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> > > On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > > > Please note that adding other sysfs entries is expensive for workloads
> > > > creating/deleting netdev and netns often.
> > > >
> > > > I _think_ we should find a way for not creating
> > > > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > > > and files
> > > > for non BQL enabled devices (like loopback !)
> > >
> > > We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> > > NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL
> > > would be pointless"? Obviously better to annotate the drivers which
> > > do have BQL support, but there's >50 of them on a quick count..
> >
> > Let me make sure I understand the suggestion above. We want to disable
> > BQL completely for devices that has dev->features & NETIF_F_LLTX or
> > dev->priv_flags & IFF_NO_QUEUE, right?
> >
> > Maybe we can add a ->enabled field in struct dql, and set it according
> > to the features above. Then we can created the sysfs and process the dql
> > operations based on that field. This should avoid some unnecessary calls
> > also, if we are not display sysfs.
> >
> > Here is a very simple PoC to represent what I had in mind. Am I in the
> > right direction?
> 
> No, this was really about sysfs entries (aka dql_group)
> 
> Partial patch would be:

That is simpler than what I imagined. Thanks!

> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 
> a09d507c5b03d24a829bf7af0b7cf1e6a0bdb65a..094e3b2d78cca40d810b2fa3bd4393d22b30e6ad
> 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -1709,9 +1709,11 @@ static int netdev_queue_add_kobject(struct
> net_device *dev, int index)
> goto err;
> 
>  #ifdef CONFIG_BQL
> -   error = sysfs_create_group(kobj, _group);
> -   if (error)
> -   goto err;
> +   if (netdev_uses_bql(dev)) {

for netdev_uses_bql(), would it be similar to what I proposed in the
previous message? Let me copy it here.

static bool netdev_uses_bql(struct net_device *dev)
{
   if (dev->features & NETIF_F_LLTX ||
   dev->priv_flags & IFF_NO_QUEUE)
   return false;

   return true;
}

Thanks



Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-14 Thread Breno Leitao
On Tue, Feb 13, 2024 at 10:04:57AM -0800, Jakub Kicinski wrote:
> On Tue, 13 Feb 2024 14:57:49 +0100 Eric Dumazet wrote:
> > Please note that adding other sysfs entries is expensive for workloads
> > creating/deleting netdev and netns often.
> > 
> > I _think_ we should find a way for not creating
> > /sys/class/net//queues/tx-{Q}/byte_queue_limits  directory
> > and files
> > for non BQL enabled devices (like loopback !)
> 
> We should try, see if anyone screams. We could use IFF_NO_QUEUE, and
> NETIF_F_LLTX as a proxy for "device doesn't have a real queue so BQL 
> would be pointless"? Obviously better to annotate the drivers which
> do have BQL support, but there's >50 of them on a quick count..

Let me make sure I understand the suggestion above. We want to disable
BQL completely for devices that has dev->features & NETIF_F_LLTX or
dev->priv_flags & IFF_NO_QUEUE, right?

Maybe we can add a ->enabled field in struct dql, and set it according
to the features above. Then we can created the sysfs and process the dql
operations based on that field. This should avoid some unnecessary calls
also, if we are not display sysfs.

Here is a very simple PoC to represent what I had in mind. Am I in the
right direction?

diff --git a/include/linux/dynamic_queue_limits.h 
b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..a9d17597ea80 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -62,6 +62,7 @@ struct dql {
unsigned intmax_limit;  /* Max limit */
unsigned intmin_limit;  /* Minimum limit */
unsigned intslack_hold_time;/* Time to measure slack */
+   boolenabled;/* Queue is active */
 };
 
 /* Set some static maximums */
@@ -101,7 +102,7 @@ void dql_completed(struct dql *dql, unsigned int count);
 void dql_reset(struct dql *dql);
 
 /* Initialize dql state */
-void dql_init(struct dql *dql, unsigned int hold_time);
+void dql_init(struct net_device *dev, struct dql *dql, unsigned int hold_time);
 
 #endif /* _KERNEL_ */
 
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef7bfbb98497..5c69bbf4267d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3541,6 +3541,9 @@ static inline void netdev_tx_sent_queue(struct 
netdev_queue *dev_queue,
unsigned int bytes)
 {
 #ifdef CONFIG_BQL
+   if (!dev_queue->dql.enabled)
+   return
+
dql_queued(_queue->dql, bytes);
 
if (likely(dql_avail(_queue->dql) >= 0))
@@ -3573,7 +3576,8 @@ static inline bool __netdev_tx_sent_queue(struct 
netdev_queue *dev_queue,
 {
if (xmit_more) {
 #ifdef CONFIG_BQL
-   dql_queued(_queue->dql, bytes);
+   if (dev_queue->dql.enabled)
+   dql_queued(_queue->dql, bytes);
 #endif
return netif_tx_queue_stopped(dev_queue);
}
@@ -3617,7 +3621,7 @@ static inline void netdev_tx_completed_queue(struct 
netdev_queue *dev_queue,
 unsigned int pkts, unsigned int 
bytes)
 {
 #ifdef CONFIG_BQL
-   if (unlikely(!bytes))
+   if (unlikely(!bytes) || !dev_queue->dql.enabled)
return;
 
dql_completed(_queue->dql, bytes);
@@ -3656,6 +3660,9 @@ static inline void netdev_completed_queue(struct 
net_device *dev,
 static inline void netdev_tx_reset_queue(struct netdev_queue *q)
 {
 #ifdef CONFIG_BQL
+   if (!q->dql.enabled)
+   return;
+
clear_bit(__QUEUE_STATE_STACK_XOFF, >state);
dql_reset(>dql);
 #endif
diff --git a/lib/dynamic_queue_limits.c b/lib/dynamic_queue_limits.c
index fde0aa244148..0a0a51f06c3b 100644
--- a/lib/dynamic_queue_limits.c
+++ b/lib/dynamic_queue_limits.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define POSDIFF(A, B) ((int)((A) - (B)) > 0 ? (A) - (B) : 0)
 #define AFTER_EQ(A, B) ((int)((A) - (B)) >= 0)
@@ -128,11 +129,21 @@ void dql_reset(struct dql *dql)
 }
 EXPORT_SYMBOL(dql_reset);
 
-void dql_init(struct dql *dql, unsigned int hold_time)
+static bool netdev_dql_supported(struct net_device *dev)
+{
+   if (dev->features & NETIF_F_LLTX ||
+   dev->priv_flags & IFF_NO_QUEUE)
+   return false;
+
+   return true;
+}
+
+void dql_init(struct net_device *dev, struct dql *dql, unsigned int hold_time)
 {
dql->max_limit = DQL_MAX_LIMIT;
dql->min_limit = 0;
dql->slack_hold_time = hold_time;
dql_reset(dql);
+   dql->enabled = netdev_dql_supported(dev);
 }
 EXPORT_SYMBOL(dql_init);
diff --git a/net/core/dev.c b/net/core/dev.c
index 9bb792cecc16..76aa70ee2c87 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -10052,7 +10052,7 @@ static void netdev_init_one_queue(struct net_device 
*dev,
netdev_queue_numa_node_write(queue, NUMA_NO_NODE);
queue->dev = dev;
 #ifdef CONFIG_BQL
-   dql_init(>dql, HZ);
+   

Re: [PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-13 Thread Breno Leitao
On Tue, Feb 06, 2024 at 12:40:13PM +0100, Paolo Abeni wrote:
> On Fri, 2024-02-02 at 08:53 -0800, Breno Leitao wrote:
> > From: Jakub Kicinski 
> > 
> > softnet_data->time_squeeze is sometimes used as a proxy for
> > host overload or indication of scheduling problems. In practice
> > this statistic is very noisy and has hard to grasp units -
> > e.g. is 10 squeezes a second to be expected, or high?
> > 
> > Delaying network (NAPI) processing leads to drops on NIC queues
> > but also RTT bloat, impacting pacing and CA decisions.
> > Stalls are a little hard to detect on the Rx side, because
> > there may simply have not been any packets received in given
> > period of time. Packet timestamps help a little bit, but
> > again we don't know if packets are stale because we're
> > not keeping up or because someone (*cough* cgroups)
> > disabled IRQs for a long time.
> > 
> > We can, however, use Tx as a proxy for Rx stalls. Most drivers
> > use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
> > On the Tx side we know exactly when packets get queued,
> > and completed, so there is no uncertainty.
> > 
> > This patch adds stall checks to BQL. Why BQL? Because
> > it's a convenient place to add such checks, already
> > called by most drivers, and it has copious free space
> > in its structures (this patch adds no extra cache
> > references or dirtying to the fast path).
> > 
> > The algorithm takes one parameter - max delay AKA stall
> > threshold and increments a counter whenever NAPI got delayed
> > for at least that amount of time. It also records the length
> > of the longest stall.
> > 
> > To be precise every time NAPI has not polled for at least
> > stall thrs we check if there were any Tx packets queued
> > between last NAPI run and now - stall_thrs/2.
> > 
> > Unlike the classic Tx watchdog this mechanism does not
> > ignore stalls caused by Tx being disabled, or loss of link.
> > I don't think the check is worth the complexity, and
> > stall is a stall, whether due to host overload, flow
> > control, link down... doesn't matter much to the application.
> > 
> > We have been running this detector in production at Meta
> > for 2 years, with the threshold of 8ms. It's the lowest
> > value where false positives become rare. There's still
> > a constant stream of reported stalls (especially without
> > the ksoftirqd deferral patches reverted), those who like
> > their stall metrics to be 0 may prefer higher value.
> > 
> > Signed-off-by: Jakub Kicinski 
> > Signed-off-by: Breno Leitao 
> > ---
> > Changelog:
> > 
> > v1:
> >   * https://lore.kernel.org/netdev/202306172057.jx7yhliu-...@intel.com/T/
> > 
> > v2:
> >   * Fix the documentation file in patch 0001, since patch 0002 will
> > touch it later.
> >   * Fix the kernel test robot issues, marking functions as statics.
> >   * Use #include  instead of .
> >   * Added some new comments around, mainly around barriers.
> >   * Format struct `netdev_queue_attribute` assignments to make
> > checkpatch happy.
> >   * Updated and fixed the path in sysfs-class-net-queues
> > documentation.
> > 
> > v3:
> >   * Sending patch 0002 against net-next.
> > - The first patch was accepted against 'net'
> > 
> > ---
> >  .../ABI/testing/sysfs-class-net-queues| 23 +++
> >  include/linux/dynamic_queue_limits.h  | 35 +++
> >  include/trace/events/napi.h   | 33 ++
> >  lib/dynamic_queue_limits.c| 58 +
> >  net/core/net-sysfs.c  | 62 +++
> >  5 files changed, 211 insertions(+)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
> > b/Documentation/ABI/testing/sysfs-class-net-queues
> > index 5bff64d256c2..45bab9b6e3ae 100644
> > --- a/Documentation/ABI/testing/sysfs-class-net-queues
> > +++ b/Documentation/ABI/testing/sysfs-class-net-queues
> > @@ -96,3 +96,26 @@ Description:
> > Indicates the absolute minimum limit of bytes allowed to be
> > queued on this network device transmit queue. Default value is
> > 0.
> > +
> > +What:  
> > /sys/class/net//queues/tx-/byte_queue_limits/stall_thrs
> > +Date:  Jan 2024
> > +KernelVersion: 6.9
> > +Contact:   net...@vger.kernel.org
> > +Description:
> > +   Tx completion stall detection threshold. 
> 
> Possibly wo

[PATCH net-next v3] net: dqs: add NIC stall detector based on BQL

2024-02-02 Thread Breno Leitao
From: Jakub Kicinski 

softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski 
Signed-off-by: Breno Leitao 
---
Changelog:

v1:
  * https://lore.kernel.org/netdev/202306172057.jx7yhliu-...@intel.com/T/

v2:
  * Fix the documentation file in patch 0001, since patch 0002 will
touch it later.
  * Fix the kernel test robot issues, marking functions as statics.
  * Use #include  instead of .
  * Added some new comments around, mainly around barriers.
  * Format struct `netdev_queue_attribute` assignments to make
checkpatch happy.
  * Updated and fixed the path in sysfs-class-net-queues
documentation.

v3:
  * Sending patch 0002 against net-next.
- The first patch was accepted against 'net'

---
 .../ABI/testing/sysfs-class-net-queues| 23 +++
 include/linux/dynamic_queue_limits.h  | 35 +++
 include/trace/events/napi.h   | 33 ++
 lib/dynamic_queue_limits.c| 58 +
 net/core/net-sysfs.c  | 62 +++
 5 files changed, 211 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
b/Documentation/ABI/testing/sysfs-class-net-queues
index 5bff64d256c2..45bab9b6e3ae 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -96,3 +96,26 @@ Description:
Indicates the absolute minimum limit of bytes allowed to be
queued on this network device transmit queue. Default value is
0.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_thrs
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Tx completion stall detection threshold. Kernel will guarantee
+   to detect all stalls longer than this threshold but may also
+   detect stalls longer than half of the threshold.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_cnt
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Number of detected Tx completion stalls.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_max
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Longest detected Tx completion stall. Write 0 to clear.
diff --git a/include/linux/dynamic_queue_limits.h 
b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..288e98fe85f0 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -38,14 +38,21 @@
 
 #ifdef __KERNEL__
 
+#include 
 #include 
 
+#define DQL_HIST_LEN   4
+#define DQL_HIST_ENT(dql, idx) ((dql)->history[(idx) % DQL_HIST_LEN])
+
 struct dql {
/* Fields accessed in enqueue path (dql_

[PATCH net v2 2/2] net: dqs: add NIC stall detector based on BQL

2024-01-31 Thread Breno Leitao
From: Jakub Kicinski 

softnet_data->time_squeeze is sometimes used as a proxy for
host overload or indication of scheduling problems. In practice
this statistic is very noisy and has hard to grasp units -
e.g. is 10 squeezes a second to be expected, or high?

Delaying network (NAPI) processing leads to drops on NIC queues
but also RTT bloat, impacting pacing and CA decisions.
Stalls are a little hard to detect on the Rx side, because
there may simply have not been any packets received in given
period of time. Packet timestamps help a little bit, but
again we don't know if packets are stale because we're
not keeping up or because someone (*cough* cgroups)
disabled IRQs for a long time.

We can, however, use Tx as a proxy for Rx stalls. Most drivers
use combined Rx+Tx NAPIs so if Tx gets starved so will Rx.
On the Tx side we know exactly when packets get queued,
and completed, so there is no uncertainty.

This patch adds stall checks to BQL. Why BQL? Because
it's a convenient place to add such checks, already
called by most drivers, and it has copious free space
in its structures (this patch adds no extra cache
references or dirtying to the fast path).

The algorithm takes one parameter - max delay AKA stall
threshold and increments a counter whenever NAPI got delayed
for at least that amount of time. It also records the length
of the longest stall.

To be precise every time NAPI has not polled for at least
stall thrs we check if there were any Tx packets queued
between last NAPI run and now - stall_thrs/2.

Unlike the classic Tx watchdog this mechanism does not
ignore stalls caused by Tx being disabled, or loss of link.
I don't think the check is worth the complexity, and
stall is a stall, whether due to host overload, flow
control, link down... doesn't matter much to the application.

We have been running this detector in production at Meta
for 2 years, with the threshold of 8ms. It's the lowest
value where false positives become rare. There's still
a constant stream of reported stalls (especially without
the ksoftirqd deferral patches reverted), those who like
their stall metrics to be 0 may prefer higher value.

Signed-off-by: Jakub Kicinski 
Signed-off-by: Breno Leitao 
---
 .../ABI/testing/sysfs-class-net-queues| 23 +++
 include/linux/dynamic_queue_limits.h  | 35 +++
 include/trace/events/napi.h   | 33 ++
 lib/dynamic_queue_limits.c| 58 +
 net/core/net-sysfs.c  | 62 +++
 5 files changed, 211 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-net-queues 
b/Documentation/ABI/testing/sysfs-class-net-queues
index 5bff64d256c2..45bab9b6e3ae 100644
--- a/Documentation/ABI/testing/sysfs-class-net-queues
+++ b/Documentation/ABI/testing/sysfs-class-net-queues
@@ -96,3 +96,26 @@ Description:
Indicates the absolute minimum limit of bytes allowed to be
queued on this network device transmit queue. Default value is
0.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_thrs
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Tx completion stall detection threshold. Kernel will guarantee
+   to detect all stalls longer than this threshold but may also
+   detect stalls longer than half of the threshold.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_cnt
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Number of detected Tx completion stalls.
+
+What:  
/sys/class/net//queues/tx-/byte_queue_limits/stall_max
+Date:  Jan 2024
+KernelVersion: 6.9
+Contact:   net...@vger.kernel.org
+Description:
+   Longest detected Tx completion stall. Write 0 to clear.
diff --git a/include/linux/dynamic_queue_limits.h 
b/include/linux/dynamic_queue_limits.h
index 407c2f281b64..288e98fe85f0 100644
--- a/include/linux/dynamic_queue_limits.h
+++ b/include/linux/dynamic_queue_limits.h
@@ -38,14 +38,21 @@
 
 #ifdef __KERNEL__
 
+#include 
 #include 
 
+#define DQL_HIST_LEN   4
+#define DQL_HIST_ENT(dql, idx) ((dql)->history[(idx) % DQL_HIST_LEN])
+
 struct dql {
/* Fields accessed in enqueue path (dql_queued) */
unsigned intnum_queued; /* Total ever queued */
unsigned intadj_limit;  /* limit + num_completed */
unsigned intlast_obj_cnt;   /* Count at last queuing */
 
+   unsigned long   history_head;
+   unsigned long   history[DQL_HIST_LEN];
+
/* Fields accessed only by completion path (dql_completed) */
 
unsigned intlimit cacheline_aligned_in_smp; /* Current limit */
@@ -62,6 +69,11 @@ struct dql {
unsigned intmax_limit;  /* Max limit */
unsigned intmin

[PATCH v6 06/13] x86/bugs: Rename SLS to CONFIG_MITIGATION_SLS

2023-11-21 Thread Breno Leitao
CPU mitigations config entries are inconsistent, and names are hard to
related. There are concrete benefits for both users and developers of
having all the mitigation config options living in the same config
namespace.

The mitigation options should have consistency and start with
MITIGATION.

Rename the Kconfig entry from SLS to MITIGATION_SLS.

Suggested-by: Josh Poimboeuf 
Signed-off-by: Breno Leitao 
---
 arch/x86/Kconfig   | 2 +-
 arch/x86/Makefile  | 2 +-
 arch/x86/include/asm/linkage.h | 4 ++--
 arch/x86/kernel/alternative.c  | 4 ++--
 arch/x86/kernel/ftrace.c   | 3 ++-
 arch/x86/net/bpf_jit_comp.c| 4 ++--
 scripts/Makefile.lib   | 2 +-
 7 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 862be9b3b216..fa246de60cdb 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2580,7 +2580,7 @@ config CPU_SRSO
help
  Enable the SRSO mitigation needed on AMD Zen1-4 machines.
 
-config SLS
+config MITIGATION_SLS
bool "Mitigate Straight-Line-Speculation"
depends on CC_HAS_SLS && X86_64
select OBJTOOL if HAVE_OBJTOOL
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index b8d23ed059fb..5ce8c30e7701 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -205,7 +205,7 @@ ifdef CONFIG_MITIGATION_RETPOLINE
   endif
 endif
 
-ifdef CONFIG_SLS
+ifdef CONFIG_MITIGATION_SLS
   KBUILD_CFLAGS += -mharden-sls=all
 endif
 
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index c5165204c66f..09e2d026df33 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -43,7 +43,7 @@
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && 
!defined(BUILD_VDSO)
 #define RETjmp __x86_return_thunk
 #else /* CONFIG_MITIGATION_RETPOLINE */
-#ifdef CONFIG_SLS
+#ifdef CONFIG_MITIGATION_SLS
 #define RETret; int3
 #else
 #define RETret
@@ -55,7 +55,7 @@
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && 
!defined(BUILD_VDSO)
 #define ASM_RET"jmp __x86_return_thunk\n\t"
 #else /* CONFIG_MITIGATION_RETPOLINE */
-#ifdef CONFIG_SLS
+#ifdef CONFIG_MITIGATION_SLS
 #define ASM_RET"ret; int3\n\t"
 #else
 #define ASM_RET"ret\n\t"
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 5ec887d065ce..b01d49862497 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -637,8 +637,8 @@ static int patch_retpoline(void *addr, struct insn *insn, 
u8 *bytes)
/*
 * The compiler is supposed to EMIT an INT3 after every unconditional
 * JMP instruction due to AMD BTC. However, if the compiler is too old
-* or SLS isn't enabled, we still need an INT3 after indirect JMPs
-* even on Intel.
+* or MITIGATION_SLS isn't enabled, we still need an INT3 after
+* indirect JMPs even on Intel.
 */
if (op == JMP32_INSN_OPCODE && i < insn->length)
bytes[i++] = INT3_INSN_OPCODE;
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 93bc52d4a472..70139d9d2e01 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -307,7 +307,8 @@ union ftrace_op_code_union {
} __attribute__((packed));
 };
 
-#define RET_SIZE   (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ? 5 : 1 + 
IS_ENABLED(CONFIG_SLS))
+#define RET_SIZE \
+   (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ? 5 : 1 + 
IS_ENABLED(CONFIG_MITIGATION_SLS))
 
 static unsigned long
 create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ef732f323926..96a63c4386a9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -469,7 +469,7 @@ static void emit_indirect_jump(u8 **pprog, int reg, u8 *ip)
emit_jump(, &__x86_indirect_thunk_array[reg], ip);
} else {
EMIT2(0xFF, 0xE0 + reg);/* jmp *%\reg */
-   if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) || 
IS_ENABLED(CONFIG_SLS))
+   if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) || 
IS_ENABLED(CONFIG_MITIGATION_SLS))
EMIT1(0xCC);/* int3 */
}
 
@@ -484,7 +484,7 @@ static void emit_return(u8 **pprog, u8 *ip)
emit_jump(, x86_return_thunk, ip);
} else {
EMIT1(0xC3);/* ret */
-   if (IS_ENABLED(CONFIG_SLS))
+   if (IS_ENABLED(CONFIG_MITIGATION_SLS))
EMIT1(0xCC);/* int3 */
}
 
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index d6e157938b5f..0d5461276179 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -264,7 +264,7 @@ endif
 objtool-args-$(CONFIG_UNWINDER_ORC)+= --orc
 ob

[PATCH v5 06/12] x86/bugs: Rename SLS to CONFIG_MITIGATION_SLS

2023-10-19 Thread Breno Leitao
CPU mitigations config entries are inconsistent, and names are hard to
related. There are concrete benefits for both users and developers of
having all the mitigation config options living in the same config
namespace.

The mitigation options should have consistency and start with
MITIGATION.

Rename the Kconfig entry from SLS to MITIGATION_SLS.

Suggested-by: Josh Poimboeuf 
Signed-off-by: Breno Leitao 
---
 arch/x86/Kconfig   | 2 +-
 arch/x86/Makefile  | 2 +-
 arch/x86/include/asm/linkage.h | 4 ++--
 arch/x86/kernel/alternative.c  | 4 ++--
 arch/x86/kernel/ftrace.c   | 3 ++-
 arch/x86/net/bpf_jit_comp.c| 4 ++--
 scripts/Makefile.lib   | 2 +-
 7 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3593461ce35..9dd2fb555973 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2556,7 +2556,7 @@ config CPU_SRSO
help
  Enable the SRSO mitigation needed on AMD Zen1-4 machines.
 
-config SLS
+config MITIGATION_SLS
bool "Mitigate Straight-Line-Speculation"
depends on CC_HAS_SLS && X86_64
select OBJTOOL if HAVE_OBJTOOL
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 3053b60f017b..1ac5d6002f5f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -204,7 +204,7 @@ ifdef CONFIG_MITIGATION_RETPOLINE
   endif
 endif
 
-ifdef CONFIG_SLS
+ifdef CONFIG_MITIGATION_SLS
   KBUILD_CFLAGS += -mharden-sls=all
 endif
 
diff --git a/arch/x86/include/asm/linkage.h b/arch/x86/include/asm/linkage.h
index c5165204c66f..09e2d026df33 100644
--- a/arch/x86/include/asm/linkage.h
+++ b/arch/x86/include/asm/linkage.h
@@ -43,7 +43,7 @@
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && 
!defined(BUILD_VDSO)
 #define RETjmp __x86_return_thunk
 #else /* CONFIG_MITIGATION_RETPOLINE */
-#ifdef CONFIG_SLS
+#ifdef CONFIG_MITIGATION_SLS
 #define RETret; int3
 #else
 #define RETret
@@ -55,7 +55,7 @@
 #if defined(CONFIG_RETHUNK) && !defined(__DISABLE_EXPORTS) && 
!defined(BUILD_VDSO)
 #define ASM_RET"jmp __x86_return_thunk\n\t"
 #else /* CONFIG_MITIGATION_RETPOLINE */
-#ifdef CONFIG_SLS
+#ifdef CONFIG_MITIGATION_SLS
 #define ASM_RET"ret; int3\n\t"
 #else
 #define ASM_RET"ret\n\t"
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8932f524c935..ea9652eb455b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -624,8 +624,8 @@ static int patch_retpoline(void *addr, struct insn *insn, 
u8 *bytes)
/*
 * The compiler is supposed to EMIT an INT3 after every unconditional
 * JMP instruction due to AMD BTC. However, if the compiler is too old
-* or SLS isn't enabled, we still need an INT3 after indirect JMPs
-* even on Intel.
+* or MITIGATION_SLS isn't enabled, we still need an INT3 after
+* indirect JMPs even on Intel.
 */
if (op == JMP32_INSN_OPCODE && i < insn->length)
bytes[i++] = INT3_INSN_OPCODE;
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0f26758c7a93..b000158b781a 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -297,7 +297,8 @@ union ftrace_op_code_union {
} __attribute__((packed));
 };
 
-#define RET_SIZE   (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ? 5 : 1 + 
IS_ENABLED(CONFIG_SLS))
+#define RET_SIZE \
+   (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) ? 5 : 1 + 
IS_ENABLED(CONFIG_MITIGATION_SLS))
 
 static unsigned long
 create_trampoline(struct ftrace_ops *ops, unsigned int *tramp_size)
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index ef732f323926..96a63c4386a9 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -469,7 +469,7 @@ static void emit_indirect_jump(u8 **pprog, int reg, u8 *ip)
emit_jump(, &__x86_indirect_thunk_array[reg], ip);
} else {
EMIT2(0xFF, 0xE0 + reg);/* jmp *%\reg */
-   if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) || 
IS_ENABLED(CONFIG_SLS))
+   if (IS_ENABLED(CONFIG_MITIGATION_RETPOLINE) || 
IS_ENABLED(CONFIG_MITIGATION_SLS))
EMIT1(0xCC);/* int3 */
}
 
@@ -484,7 +484,7 @@ static void emit_return(u8 **pprog, u8 *ip)
emit_jump(, x86_return_thunk, ip);
} else {
EMIT1(0xC3);/* ret */
-   if (IS_ENABLED(CONFIG_SLS))
+   if (IS_ENABLED(CONFIG_MITIGATION_SLS))
EMIT1(0xCC);/* int3 */
}
 
diff --git a/scripts/Makefile.lib b/scripts/Makefile.lib
index d6e157938b5f..0d5461276179 100644
--- a/scripts/Makefile.lib
+++ b/scripts/Makefile.lib
@@ -264,7 +264,7 @@ endif
 objtool-args-$(CONFIG_UNWINDER_ORC)+= --orc
 ob

[PATCH] scripts: Add ppc64le support for checkstack.pl

2017-11-28 Thread Breno Leitao
64-bit ELF v2 ABI specification for POWER describes, on section "General
Stack Frame Requirements", that the stack should use the following
instructions when compiled with backchain:

  mflr r0
  std  r0, 16(r1)
  stdu r1, -XX(r1)

Where XX is the frame size for that function, and this is the value
checkstack.pl will find the stack size for each function.

This patch also simplifies the entire Powerpc section, since just two
type of instructions are used, 'stdu' for 64 bits and 'stwu' for 32 bits
platform.

Signed-off-by: Breno Leitao <lei...@debian.org>
---
 scripts/checkstack.pl | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/scripts/checkstack.pl b/scripts/checkstack.pl
index 7f4c41717e26..8ed217ddf2c9 100755
--- a/scripts/checkstack.pl
+++ b/scripts/checkstack.pl
@@ -14,6 +14,7 @@
 #  M68k port by Geert Uytterhoeven and Andreas Schwab
 #  AArch64, PARISC ports by Kyle McMartin
 #  sparc port by Martin Habets <errandir_n...@mph.eclipse.co.uk>
+#  ppc64le port by Breno Leitao <lei...@debian.org>
 #
 #  Usage:
 #  objdump -d vmlinux | scripts/checkstack.pl [arch]
@@ -81,13 +82,9 @@ my (@stack, $re, $dre, $x, $xs, $funcre);
$re = qr/.*l\.addi.*r1,r1,-(([0-9]{2}|[3-9])[0-9]{2})/o;
} elsif ($arch eq 'parisc' || $arch eq 'parisc64') {
$re = qr/.*ldo ($x{1,8})\(sp\),sp/o;
-   } elsif ($arch eq 'ppc') {
-   #c00029f4:   94 21 ff 30 stwur1,-208(r1)
-   $re = qr/.*stwu.*r1,-($x{1,8})\(r1\)/o;
-   } elsif ($arch eq 'ppc64') {
-   #XXX
-   $re = qr/.*stdu.*r1,-($x{1,8})\(r1\)/o;
-   } elsif ($arch eq 'powerpc') {
+   } elsif ($arch eq 'powerpc' || $arch =~ /^ppc(64)?(le)?$/ ) {
+   # powerpc: 94 21 ff 30 stwur1,-208(r1)
+   # ppc64(le)  : 81 ff 21 f8 stdur1,-128(r1)
$re = qr/.*st[dw]u.*r1,-($x{1,8})\(r1\)/o;
} elsif ($arch =~ /^s390x?$/) {
#   11160:   a7 fb ff 60 aghi   %r15,-160
-- 
2.15.0



[PATCH] scripts: Add ppc64le support for checkstack.pl

2017-11-28 Thread Breno Leitao
64-bit ELF v2 ABI specification for POWER describes, on section "General
Stack Frame Requirements", that the stack should use the following
instructions when compiled with backchain:

  mflr r0
  std  r0, 16(r1)
  stdu r1, -XX(r1)

Where XX is the frame size for that function, and this is the value
checkstack.pl will find the stack size for each function.

This patch also simplifies the entire Powerpc section, since just two
type of instructions are used, 'stdu' for 64 bits and 'stwu' for 32 bits
platform.

Signed-off-by: Breno Leitao 
---
 scripts/checkstack.pl | 11 ---
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/scripts/checkstack.pl b/scripts/checkstack.pl
index 7f4c41717e26..8ed217ddf2c9 100755
--- a/scripts/checkstack.pl
+++ b/scripts/checkstack.pl
@@ -14,6 +14,7 @@
 #  M68k port by Geert Uytterhoeven and Andreas Schwab
 #  AArch64, PARISC ports by Kyle McMartin
 #  sparc port by Martin Habets 
+#  ppc64le port by Breno Leitao 
 #
 #  Usage:
 #  objdump -d vmlinux | scripts/checkstack.pl [arch]
@@ -81,13 +82,9 @@ my (@stack, $re, $dre, $x, $xs, $funcre);
$re = qr/.*l\.addi.*r1,r1,-(([0-9]{2}|[3-9])[0-9]{2})/o;
} elsif ($arch eq 'parisc' || $arch eq 'parisc64') {
$re = qr/.*ldo ($x{1,8})\(sp\),sp/o;
-   } elsif ($arch eq 'ppc') {
-   #c00029f4:   94 21 ff 30 stwur1,-208(r1)
-   $re = qr/.*stwu.*r1,-($x{1,8})\(r1\)/o;
-   } elsif ($arch eq 'ppc64') {
-   #XXX
-   $re = qr/.*stdu.*r1,-($x{1,8})\(r1\)/o;
-   } elsif ($arch eq 'powerpc') {
+   } elsif ($arch eq 'powerpc' || $arch =~ /^ppc(64)?(le)?$/ ) {
+   # powerpc: 94 21 ff 30 stwur1,-208(r1)
+   # ppc64(le)  : 81 ff 21 f8 stdur1,-128(r1)
$re = qr/.*st[dw]u.*r1,-($x{1,8})\(r1\)/o;
} elsif ($arch =~ /^s390x?$/) {
#   11160:   a7 fb ff 60 aghi   %r15,-160
-- 
2.15.0



Re: [PATCH v9 44/51] selftest/vm: powerpc implementation for generic abstraction

2017-11-10 Thread Breno Leitao
Hi Ram,

On Thu, Nov 09, 2017 at 03:37:46PM -0800, Ram Pai wrote:
> On Thu, Nov 09, 2017 at 04:47:15PM -0200, Breno Leitao wrote:
> > On Mon, Nov 06, 2017 at 12:57:36AM -0800, Ram Pai wrote:
> > > @@ -206,12 +209,14 @@ void signal_handler(int signum, siginfo_t *si, void 
> > > *vucontext)
> > >  
> > >   trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
> > >   ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
> > > - fpregset = uctxt->uc_mcontext.fpregs;
> > > - fpregs = (void *)fpregset;
> > 
> > Since you removed all references for fpregset now, you probably want to
> > remove the declaration of the variable above.
> 
> fpregs is still needed.

Right, fpregs is still needed, but not fpregset. Every reference for this
variable was removed with your patch.

Grepping this variable identifier on a tree with your patches, I see:

 $ grep fpregset protection_keys.c 
 fpregset_t fpregset;


Re: [PATCH v9 44/51] selftest/vm: powerpc implementation for generic abstraction

2017-11-10 Thread Breno Leitao
Hi Ram,

On Thu, Nov 09, 2017 at 03:37:46PM -0800, Ram Pai wrote:
> On Thu, Nov 09, 2017 at 04:47:15PM -0200, Breno Leitao wrote:
> > On Mon, Nov 06, 2017 at 12:57:36AM -0800, Ram Pai wrote:
> > > @@ -206,12 +209,14 @@ void signal_handler(int signum, siginfo_t *si, void 
> > > *vucontext)
> > >  
> > >   trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
> > >   ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
> > > - fpregset = uctxt->uc_mcontext.fpregs;
> > > - fpregs = (void *)fpregset;
> > 
> > Since you removed all references for fpregset now, you probably want to
> > remove the declaration of the variable above.
> 
> fpregs is still needed.

Right, fpregs is still needed, but not fpregset. Every reference for this
variable was removed with your patch.

Grepping this variable identifier on a tree with your patches, I see:

 $ grep fpregset protection_keys.c 
 fpregset_t fpregset;


Re: [PATCH v9 44/51] selftest/vm: powerpc implementation for generic abstraction

2017-11-09 Thread Breno Leitao
Hi Ram,

On Mon, Nov 06, 2017 at 12:57:36AM -0800, Ram Pai wrote:
> @@ -206,12 +209,14 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>  
>   trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
>   ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
> - fpregset = uctxt->uc_mcontext.fpregs;
> - fpregs = (void *)fpregset;

Since you removed all references for fpregset now, you probably want to
remove the declaration of the variable above.

> @@ -219,20 +224,21 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>* state.  We just assume that it is here.
>*/
>   fpregs += 0x70;
> -#endif
> - pkey_reg_offset = pkey_reg_xstate_offset();

With this code, you removed all the reference for variable
pkey_reg_offset, thus, its declaration could be removed also.

> - *(u64 *)pkey_reg_ptr = 0x;
> + dprintf1("si_pkey from siginfo: %lx\n", si_pkey);
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
> + dprintf1("signal pkey_reg from xsave: %016lx\n", *pkey_reg_ptr);
> + *(u64 *)pkey_reg_ptr &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
> +#elif __powerpc64__

Since the variable pkey_reg_ptr is only used for Intel code (inside
#ifdefs), you probably want to #ifdef the variable declaration also,
avoid triggering "unused variable" warning on non-Intel machines.


Re: [PATCH v9 44/51] selftest/vm: powerpc implementation for generic abstraction

2017-11-09 Thread Breno Leitao
Hi Ram,

On Mon, Nov 06, 2017 at 12:57:36AM -0800, Ram Pai wrote:
> @@ -206,12 +209,14 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>  
>   trapno = uctxt->uc_mcontext.gregs[REG_TRAPNO];
>   ip = uctxt->uc_mcontext.gregs[REG_IP_IDX];
> - fpregset = uctxt->uc_mcontext.fpregs;
> - fpregs = (void *)fpregset;

Since you removed all references for fpregset now, you probably want to
remove the declaration of the variable above.

> @@ -219,20 +224,21 @@ void signal_handler(int signum, siginfo_t *si, void 
> *vucontext)
>* state.  We just assume that it is here.
>*/
>   fpregs += 0x70;
> -#endif
> - pkey_reg_offset = pkey_reg_xstate_offset();

With this code, you removed all the reference for variable
pkey_reg_offset, thus, its declaration could be removed also.

> - *(u64 *)pkey_reg_ptr = 0x;
> + dprintf1("si_pkey from siginfo: %lx\n", si_pkey);
> +#if defined(__i386__) || defined(__x86_64__) /* arch */
> + dprintf1("signal pkey_reg from xsave: %016lx\n", *pkey_reg_ptr);
> + *(u64 *)pkey_reg_ptr &= reset_bits(si_pkey, PKEY_DISABLE_ACCESS);
> +#elif __powerpc64__

Since the variable pkey_reg_ptr is only used for Intel code (inside
#ifdefs), you probably want to #ifdef the variable declaration also,
avoid triggering "unused variable" warning on non-Intel machines.


Re: Re: collect2: error: ld returned 1 exit status

2015-11-24 Thread Breno Leitao
Hi Fengguang,

On 11/23/2015 10:45 AM, Fengguang Wu wrote:
> I'm using Debian's cross gcc 5.2.1:
> 
> gcc-powerpc64le-linux-gnu5.2.1-14+really5.2.1-13
> 
> It may take some time for debian to catchup up with the newer version.
I already opened a Debian bug to track the backport of this fix into
Debian's GCC 5.2.1.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=806160

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Re: collect2: error: ld returned 1 exit status

2015-11-24 Thread Breno Leitao
Hi Fengguang,

On 11/23/2015 10:45 AM, Fengguang Wu wrote:
> I'm using Debian's cross gcc 5.2.1:
> 
> gcc-powerpc64le-linux-gnu5.2.1-14+really5.2.1-13
> 
> It may take some time for debian to catchup up with the newer version.
I already opened a Debian bug to track the backport of this fix into
Debian's GCC 5.2.1.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=806160

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/