Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-25 Thread Zhu Yanjun



On 2024/1/26 11:11, Zhu Yanjun wrote:



On 2024/1/22 15:02, Xuan Zhuo wrote:

On Mon, 22 Jan 2024 14:58:09 +0800, Jason Wang  wrote:

On Mon, Jan 22, 2024 at 2:55 PM Jason Wang  wrote:

On Mon, Jan 22, 2024 at 2:20 PM Xuan Zhuo  wrote:

On Mon, 22 Jan 2024 12:16:27 +0800, Jason Wang  wrote:

On Mon, Jan 22, 2024 at 12:00 PM Xuan Zhuo  wrote:

On Mon, 22 Jan 2024 11:14:30 +0800, Jason Wang  wrote:

On Mon, Jan 22, 2024 at 10:12 AM Zhu Yanjun  wrote:

On 2024/1/20 1:29, Andrew Lunn wrote:

while (!virtqueue_get_buf(vi->cvq, &tmp) &&
-   !virtqueue_is_broken(vi->cvq))
+   !virtqueue_is_broken(vi->cvq)) {
+if (timeout)
+timeout--;

This is not really a timeout, just a loop counter. 200 iterations could
be a very short time on reasonable H/W. I guess this avoids the soft
lockup, but possibly (likely?) breaks the functionality when we need to
loop for some non-negligible time.

I fear we need a more complex solution, as mentioned by Michael in the
thread you quoted.

Got it. I also look forward to a more complex solution to this problem.

Can we add a device capability (new feature bit) such as ctrq_wait_timeout
to get a reasonable timeout?

The usual solution to this is include/linux/iopoll.h: if you can sleep,
use read_poll_timeout(); otherwise, read_poll_timeout_atomic().

I read the functions read_poll_timeout() and
read_poll_timeout_atomic() carefully. The timeout is set by the caller
of the two functions.
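
For reference, here is a minimal sketch (an assumption, not a posted
patch) of how the cvq wait in virtnet_send_command() could use
read_poll_timeout(); the 100 ms budget is illustrative only:

	{
		unsigned int tmp;
		void *buf;
		int ret;

		/* Sleep up to 1 ms between polls; stop early when the
		 * device returns the buffer or the vq is marked broken.
		 */
		ret = read_poll_timeout(virtqueue_get_buf, buf,
					buf || virtqueue_is_broken(vi->cvq),
					1000, 100000, false, vi->cvq, &tmp);
		if (ret == -ETIMEDOUT)
			dev_warn(&vi->vdev->dev, "cvq command timed out\n");
	}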

FYI, in order to avoid switching between atomic and non-atomic variants,
we need to convert the rx mode setting to a workqueue first:

https://www.mail-archive.com/virtualization@lists.linux-foundation.org/msg60298.html


As such, can we add a module parameter so that the user can customize
this timeout value?

Who is the "user" here, or how can the "user" know the value?


Or could this timeout value be stored in a device register, so that the
virtio_net driver reads it at initialization?

See another thread. The design needs to be general, or you can post an RFC.

On another note, we already have a tx watchdog; maybe we can have
something similar for the cvq and use timeout + reset in that case.

But we may be blocked by the reset ^_^ if the device is broken?

I mean vq reset here.

I see.

I mean that when the device is broken, the vq reset may also be blocked.

 void vp_modern_set_queue_reset(struct virtio_pci_modern_device *mdev, u16 index)
 {
 	struct virtio_pci_modern_common_cfg __iomem *cfg;

 	cfg = (struct virtio_pci_modern_common_cfg __iomem *)mdev->common;

 	vp_iowrite16(index, &cfg->cfg.queue_select);
 	vp_iowrite16(1, &cfg->queue_reset);

 	while (vp_ioread16(&cfg->queue_reset))
 		msleep(1);

 	while (vp_ioread16(&cfg->cfg.queue_enable))
 		msleep(1);
 }
 EXPORT_SYMBOL_GPL(vp_modern_set_queue_reset);

In this function, if the device is broken, we cannot expect these loops to terminate.

Yes, it's best effort; there's no guarantee then. But it doesn't hurt to try.
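
For illustration, a minimal best-effort sketch (an assumption, not a
posted patch) of bounding those polling loops so a broken device cannot
wedge the reset path forever; the 100 ms budget is arbitrary:

	unsigned long deadline = jiffies + msecs_to_jiffies(100);

	while (vp_ioread16(&cfg->queue_reset)) {
		/* Give up on an unresponsive device instead of spinning. */
		if (time_after(jiffies, deadline))
			return;
		msleep(1);
	}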

Thanks


It looks like we have multiple goals here:

1) avoid lockups: using workqueue + cond_resched() seems to be
sufficient; it has issues, but nothing new
2) recover from an unresponsive device: the issue with a timeout is that
it needs to deal with false positives

I agree.

But I want to add a new goal: an async cvq. With netdim, we will
send many requests via the cvq, so an async cvq would be nice.

Then you need an interrupt for cvq.

FYI, I've posted a series that uses an interrupt for the cvq in the past:

https://lore.kernel.org/lkml/6026e801-6fda-fee9-a69b-d06a80368...@redhat.com/t/

I know this. But the interrupt may not be a good solution without new space.


I haven't found time to keep working on this; maybe we can start from
this, or not.

I said async, but my aim is to queue many requests on the cvq before
waiting for the responses.

Heng Qi posted this:
https://lore.kernel.org/all/1705410693-118895-4-git-send-email-hen...@linux.alibaba.com/


Sorry, this mail was rejected by the netdev mailing list, so I have to resend it.


Thanks a lot. I read Heng Qi's commits carefully. This patch series is
similar to the NIC feature xmit_more.

But if a cvq command is urgent, can we let this urgent cvq command be
passed ASAP?

I mean, can we set a flag similar to xmit_more? If a cvq command is not
urgent, it can be queued; if it is urgent, the command is passed ASAP,
as in the sketch below.
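
For illustration, a hypothetical sketch of that idea
(virtnet_cvq_enqueue() and virtnet_cvq_batch_pending() are invented
helper names, not existing APIs):

static int virtnet_cvq_send(struct virtnet_info *vi,
			    struct scatterlist *sgs[], bool urgent)
{
	int ret;

	ret = virtnet_cvq_enqueue(vi, sgs);	/* hypothetical enqueue */
	if (ret)
		return ret;

	/* Like xmit_more: notify the device only when the batch is
	 * complete or the command cannot wait.
	 */
	if (urgent || !virtnet_cvq_batch_pending(vi))
		virtqueue_kick(vi->cvq);

	return 0;
}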

Zhu Yanjun


Zhu Yanjun


Thanks.



Thanks


Thanks.



Thanks


Thanks.



Thanks


Zhu Yanjun


   Andrew




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-21 Thread Zhu Yanjun



On 2024/1/20 1:29, Andrew Lunn wrote:

   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
-   !virtqueue_is_broken(vi->cvq))
+   !virtqueue_is_broken(vi->cvq)) {
+    if (timeout)
+    timeout--;

This is not really a timeout, just a loop counter. 200 iterations could
be a very short time on reasonable H/W. I guess this avoids the soft
lockup, but possibly (likely?) breaks the functionality when we need to
loop for some non-negligible time.

I fear we need a more complex solution, as mentioned by Michael in the
thread you quoted.

Got it. I also look forward to a more complex solution to this problem.

Can we add a device capability (new feature bit) such as ctrq_wait_timeout
to get a reasonable timeout?

The usual solution to this is include/linux/iopoll.h: if you can sleep,
use read_poll_timeout(); otherwise, read_poll_timeout_atomic().


I read the functions read_poll_timeout() and
read_poll_timeout_atomic() carefully. The timeout is set by the caller
of the two functions.


As such, can we add a module parameter so that the user can customize
this timeout value?


Or could this timeout value be stored in a device register, so that the
virtio_net driver reads it at initialization?


Zhu Yanjun



Andrew




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-19 Thread Zhu Yanjun



On 2024/1/20 1:29, Andrew Lunn wrote:

   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
-   !virtqueue_is_broken(vi->cvq))
+   !virtqueue_is_broken(vi->cvq)) {
+    if (timeout)
+    timeout--;

This is not really a timeout, just a loop counter. 200 iterations could
be a very short time on reasonable H/W. I guess this avoids the soft
lockup, but possibly (likely?) breaks the functionality when we need to
loop for some non-negligible time.

I fear we need a more complex solution, as mentioned by Michael in the
thread you quoted.

Got it. I also look forward to a more complex solution to this problem.

Can we add a device capability (new feature bit) such as ctrq_wait_timeout
to get a reasonable timeout?

The usual solution to this is include/linux/iopoll.h: if you can sleep,
use read_poll_timeout(); otherwise, read_poll_timeout_atomic().


Thanks. The two functions read_poll_timeout() and
read_poll_timeout_atomic() are interesting.


Zhu Yanjun



Andrew




Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-18 Thread Zhu Yanjun



On 2024/1/16 20:04, Paolo Abeni wrote:

On Mon, 2024-01-15 at 09:29 +0800, Zhu Yanjun wrote:

From: Zhu Yanjun 

Some devices emulate virtio_net hardware. When the virtio_net
driver sends commands to the emulated hardware, the hardware
normally needs time to respond, and sometimes that time is very
long. When that happens, the soft lockup below appears and the
whole system hangs.
Similar problems also occur in Intel NICs and Mellanox NICs,
so a similar solution is borrowed from them: a timeout value is
added and set as large as possible, to ensure that the driver
gets the best possible chance of a response from the hardware.

"
[  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
[(udev-worker):3157]
[  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr 
rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt 
dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class 
intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci 
isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus 
i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad 
acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul 
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit 
wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  213.796194] irq event stamp: 67740
[  213.796195] hardirqs last  enabled at (67739): [] 
asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796203] hardirqs last disabled at (67740): [] 
sysvec_apic_timer_interrupt+0xe/0x90
[  213.796208] softirqs last  enabled at (67686): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796214] softirqs last disabled at (67681): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 
6.7.0+ #9
[  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS 
SE5C741.86B.01.01.0001.2211140926 11/14/2022
[  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 
01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d 
c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
[  213.796233] RAX:  RBX: ff2f15095896f000 RCX: 0001
[  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
[  213.796236] RBP:  R08:  R09: 
[  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 0002
[  213.796239] R13: 0004 R14:  R15: ff2f1509534f3000
[  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0() 
knlGS:
[  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
[  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 00f71ef0
[  213.796245] DR0:  DR1:  DR2: 
[  213.796246] DR3:  DR6: fffe07f0 DR7: 0400
[  213.796247] PKRU: 5554
[  213.796249] Call Trace:
[  213.796250]  
[  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
[  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
[  213.796269]  ? hrtimer_interrupt+0xf8/0x230
[  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
[  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[  213.796282]  
[  213.796284]  
[  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
[  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
[  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
[  213.796328]  virtio_dev_probe+0x195/0x230
[  213.796333]  really_probe+0x19f/0x400
[  213.796338]  ? __pfx___driver_attach+0x10/0x10
[  213.796340]  __driver_probe_device+0x78/0x160
[  213.796343]  driver_probe_device+0x1f/0x90
[  213.796346]  __driver_attach+0xd6/0x1d0
[  213.796349]  bus_for_each_dev+0x8c/0xe0
[  213.796355]  bus_add_driver+0x119/0x220
[  213.796359]  driver_register+0x59/0x100
[  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
[  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
[  213.796375]  do_one_initcall+0x6f/0x380
[  213.796384]  do_init_module+0x60/0x240
[  213.796388]  init_module_from_file+0x86/0xc0
[  213.796396]  idempotent_init_module+0x129/0x2c0
[  213.796406]  __x64_sys_finit_module+0x5e/0xb0
[  213.796409]  do_syscall_64+0x60/0xe0
[  213.796415]  ? do_syscall_64+0x6f/0xe0
[ 

Re: [PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-15 Thread Zhu Yanjun



On 2024/1/15 10:20, Jason Wang wrote:

On Mon, Jan 15, 2024 at 9:35 AM Zhu Yanjun  wrote:

From: Zhu Yanjun 

Some devices emulate virtio_net hardware. When the virtio_net
driver sends commands to the emulated hardware, the hardware
normally needs time to respond, and sometimes that time is very
long. When that happens, the soft lockup below appears and the
whole system hangs.
Similar problems also occur in Intel NICs and Mellanox NICs,
so a similar solution is borrowed from them: a timeout value is
added and set as large as possible, to ensure that the driver
gets the best possible chance of a response from the hardware.

"
[  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
[(udev-worker):3157]
[  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr 
rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt 
dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class 
intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci 
isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus 
i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad 
acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul 
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit 
wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  213.796194] irq event stamp: 67740
[  213.796195] hardirqs last  enabled at (67739): [] 
asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796203] hardirqs last disabled at (67740): [] 
sysvec_apic_timer_interrupt+0xe/0x90
[  213.796208] softirqs last  enabled at (67686): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796214] softirqs last disabled at (67681): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 
6.7.0+ #9
[  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS 
SE5C741.86B.01.01.0001.2211140926 11/14/2022
[  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 43 78 
01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 5b 5d 
c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
[  213.796233] RAX:  RBX: ff2f15095896f000 RCX: 0001
[  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
[  213.796236] RBP:  R08:  R09: 
[  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 0002
[  213.796239] R13: 0004 R14:  R15: ff2f1509534f3000
[  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0() 
knlGS:
[  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
[  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 00f71ef0
[  213.796245] DR0:  DR1:  DR2: 
[  213.796246] DR3:  DR6: fffe07f0 DR7: 0400
[  213.796247] PKRU: 5554
[  213.796249] Call Trace:
[  213.796250]  
[  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
[  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
[  213.796269]  ? hrtimer_interrupt+0xf8/0x230
[  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
[  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[  213.796282]  
[  213.796284]  
[  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
[  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
[  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
[  213.796328]  virtio_dev_probe+0x195/0x230
[  213.796333]  really_probe+0x19f/0x400
[  213.796338]  ? __pfx___driver_attach+0x10/0x10
[  213.796340]  __driver_probe_device+0x78/0x160
[  213.796343]  driver_probe_device+0x1f/0x90
[  213.796346]  __driver_attach+0xd6/0x1d0
[  213.796349]  bus_for_each_dev+0x8c/0xe0
[  213.796355]  bus_add_driver+0x119/0x220
[  213.796359]  driver_register+0x59/0x100
[  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
[  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
[  213.796375]  do_one_initcall+0x6f/0x380
[  213.796384]  do_init_module+0x60/0x240
[  213.796388]  init_module_from_file+0x86/0xc0
[  213.796396]  idempotent_init_module+0x129/0x2c0
[  213.796406]  __x64_sys_finit_module+0x5e/0xb0
[  213.796409]  do_syscall_64+0x60/0xe0
[  213.796415]  ? do_syscall_64+0x6f/0xe0
[  213.796418]  ? lockdep_hardirqs

[PATCH 1/1] virtio_net: Add timeout handler to avoid kernel hang

2024-01-14 Thread Zhu Yanjun
From: Zhu Yanjun 

Some devices emulate virtio_net hardware. When the virtio_net
driver sends commands to the emulated hardware, the hardware
normally needs time to respond, and sometimes that time is very
long. When that happens, the soft lockup below appears and the
whole system hangs.
Similar problems also occur in Intel NICs and Mellanox NICs,
so a similar solution is borrowed from them: a timeout value is
added and set as large as possible, to ensure that the driver
gets the best possible chance of a response from the hardware.

"
[  213.795860] watchdog: BUG: soft lockup - CPU#108 stuck for 26s! 
[(udev-worker):3157]
[  213.796114] Modules linked in: virtio_net(+) net_failover failover qrtr 
rfkill sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency 
intel_uncore_frequency_common intel_ifs i10nm_edac nfit libnvdimm 
x86_pkg_temp_thermal intel_powerclamp coretemp iTCO_wdt rapl intel_pmc_bxt 
dax_hmem iTCO_vendor_support vfat cxl_acpi intel_cstate pmt_telemetry pmt_class 
intel_sdsi joydev intel_uncore cxl_core fat pcspkr mei_me isst_if_mbox_pci 
isst_if_mmio idxd i2c_i801 isst_if_common mei intel_vsec idxd_bus i2c_smbus 
i2c_ismt ipmi_ssif acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad 
acpi_power_meter pfr_telemetry pfr_update fuse loop zram xfs crct10dif_pclmul 
crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel 
sha512_ssse3 bnxt_en sha256_ssse3 sha1_ssse3 nvme ast nvme_core i2c_algo_bit 
wmi pinctrl_emmitsburg scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
[  213.796194] irq event stamp: 67740
[  213.796195] hardirqs last  enabled at (67739): [] 
asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796203] hardirqs last disabled at (67740): [] 
sysvec_apic_timer_interrupt+0xe/0x90
[  213.796208] softirqs last  enabled at (67686): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796214] softirqs last disabled at (67681): [] 
__irq_exit_rcu+0xbe/0xe0
[  213.796217] CPU: 108 PID: 3157 Comm: (udev-worker) Kdump: loaded Not tainted 
6.7.0+ #9
[  213.796220] Hardware name: Intel Corporation M50FCP2SBSTD/M50FCP2SBSTD, BIOS 
SE5C741.86B.01.01.0001.2211140926 11/14/2022
[  213.796221] RIP: 0010:virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796228] Code: 89 df e8 26 fe ff ff 0f b7 43 50 83 c0 01 66 89 43 50 f6 
43 78 01 75 12 80 7b 42 00 48 8b 4b 68 8b 53 58 74 0f 66 87 44 51 04 <48> 89 e8 
5b 5d c3 cc cc cc cc 66 89 44 51 04 0f ae f0 48 89 e8 5b
[  213.796230] RSP: 0018:ff4bbb362306f9b0 EFLAGS: 0246
[  213.796233] RAX:  RBX: ff2f15095896f000 RCX: 0001
[  213.796235] RDX:  RSI: ff4bbb362306f9cc RDI: ff2f15095896f000
[  213.796236] RBP:  R08:  R09: 
[  213.796237] R10: 0003 R11: ff2f15095893cc40 R12: 0002
[  213.796239] R13: 0004 R14:  R15: ff2f1509534f3000
[  213.796240] FS:  7f775847d0c0() GS:ff2f1528bac0() 
knlGS:
[  213.796242] CS:  0010 DS:  ES:  CR0: 80050033
[  213.796243] CR2: 557f987b6e70 CR3: 002098602006 CR4: 00f71ef0
[  213.796245] DR0:  DR1:  DR2: 
[  213.796246] DR3:  DR6: fffe07f0 DR7: 0400
[  213.796247] PKRU: 5554
[  213.796249] Call Trace:
[  213.796250]  
[  213.796252]  ? watchdog_timer_fn+0x1c0/0x220
[  213.796258]  ? __pfx_watchdog_timer_fn+0x10/0x10
[  213.796261]  ? __hrtimer_run_queues+0x1af/0x380
[  213.796269]  ? hrtimer_interrupt+0xf8/0x230
[  213.796274]  ? __sysvec_apic_timer_interrupt+0x64/0x1a0
[  213.796279]  ? sysvec_apic_timer_interrupt+0x6d/0x90
[  213.796282]  
[  213.796284]  
[  213.796285]  ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  213.796293]  ? virtqueue_get_buf_ctx_split+0x8d/0x110
[  213.796297]  virtnet_send_command+0x18a/0x1f0 [virtio_net]
[  213.796310]  _virtnet_set_queues+0xc6/0x120 [virtio_net]
[  213.796319]  virtnet_probe+0xa06/0xd50 [virtio_net]
[  213.796328]  virtio_dev_probe+0x195/0x230
[  213.796333]  really_probe+0x19f/0x400
[  213.796338]  ? __pfx___driver_attach+0x10/0x10
[  213.796340]  __driver_probe_device+0x78/0x160
[  213.796343]  driver_probe_device+0x1f/0x90
[  213.796346]  __driver_attach+0xd6/0x1d0
[  213.796349]  bus_for_each_dev+0x8c/0xe0
[  213.796355]  bus_add_driver+0x119/0x220
[  213.796359]  driver_register+0x59/0x100
[  213.796362]  ? __pfx_virtio_net_driver_init+0x10/0x10 [virtio_net]
[  213.796369]  virtio_net_driver_init+0x8e/0xff0 [virtio_net]
[  213.796375]  do_one_initcall+0x6f/0x380
[  213.796384]  do_init_module+0x60/0x240
[  213.796388]  init_module_from_file+0x86/0xc0
[  213.796396]  idempotent_init_module+0x129/0x2c0
[  213.796406]  __x64_sys_finit_module+0x5e/0xb0
[  213.796409]  do_syscall_64+0x60/0xe0
[  213.796415]  ? do_syscall_64+0x6f/0xe0
[  213.796418]  ? lockdep_hardirqs_on_prepare+0xe4/0x1a0
[  213.796424]  ? do_syscall_64+0x6f/0xe0
[  213.796427]  ? do_sysc

[PATCH v3 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2024-01-03 Thread Zhu Yanjun
From: Zhu Yanjun 

Fix the warnings when building the virtio_net driver.

"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 10 [-Wformat-overflow=]
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  |^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range 
[-2147483643, 65534]
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes 
into a destination of size 16
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 9 [-Wformat-overflow=]
 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range 
[-2147483643, 65534]
 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes 
into a destination of size 16
 4552 | sprintf(vi->sq[i].name, "output.%d", i);

"

Reviewed-by: Xuan Zhuo 
Signed-off-by: Zhu Yanjun 
---
v2 -> v3: Follow Jakub Kicinski's advice, repost it
v1 -> v2: Add commit logs. Format string is changed.
---
 drivers/net/virtio_net.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061..89a15cc81396 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4096,10 +4096,11 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 {
vq_callback_t **callbacks;
struct virtqueue **vqs;
-   int ret = -ENOMEM;
-   int i, total_vqs;
const char **names;
+   int ret = -ENOMEM;
+   int total_vqs;
bool *ctx;
+   u16 i;
 
/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
@@ -4136,8 +4137,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
for (i = 0; i < vi->max_queue_pairs; i++) {
callbacks[rxq2vq(i)] = skb_recv_done;
callbacks[txq2vq(i)] = skb_xmit_done;
-   sprintf(vi->rq[i].name, "input.%d", i);
-   sprintf(vi->sq[i].name, "output.%d", i);
+   sprintf(vi->rq[i].name, "input.%u", i);
+   sprintf(vi->sq[i].name, "output.%u", i);
names[rxq2vq(i)] = vi->rq[i].name;
names[txq2vq(i)] = vi->sq[i].name;
if (ctx)
-- 
2.27.0
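
As an aside on why the warning fires and why switching to u16 + "%u"
silences it (the 16-byte destination size comes from the warning text
itself; the snippet is illustrative):

	char name[16];	/* "destination of size 16" per the warning */
	int i = -5;
	u16 j = 65535;

	/* "output." (7) + up to 11 chars for "%d" of an int
	 * ("-2147483648") + '\0' = 19 > 16, hence -Wformat-overflow.
	 */
	sprintf(name, "output.%d", i);

	/* "output." (7) + at most 5 chars for "%u" of a u16 ("65535")
	 * + '\0' = 13 <= 16, so it always fits.
	 */
	sprintf(name, "output.%u", j);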




Re: [PATCH v2 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2024-01-03 Thread Zhu Yanjun



On 2024/1/4 8:55, Jakub Kicinski wrote:

On Wed, 27 Dec 2023 22:26:37 +0800 Zhu Yanjun wrote:

From: Zhu Yanjun 

Fix the warnings when building the virtio_net driver.

This got marked as Not Applicable in patchwork, not sure why.
Could you repost?


Got it. I will resend this commit very soon.

Best Regards,

Zhu Yanjun




Re: [PATCH net-next v1 3/6] virtio_net: support device stats

2024-01-02 Thread Zhu Yanjun



On 2024/1/2 3:56, kernel test robot wrote:

Hi Xuan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.7-rc8]
[cannot apply to net-next/main horms-ipvs/master next-20231222]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_net-introduce-device-stats-feature-and-structures/20231226-153227
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:
https://lore.kernel.org/r/20231226073103.116153-4-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next v1 3/6] virtio_net: support device stats
config: x86_64-randconfig-121-20240101 
(https://download.01.org/0day-ci/archive/20240102/202401020308.rvztx1oi-...@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240102/202401020308.rvztx1oi-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202401020308.rvztx1oi-...@intel.com/

sparse warnings: (new ones prefixed by >>)

drivers/net/virtio_net.c:3432:52: sparse: sparse: incorrect type in argument 2 
(different base types) @@ expected restricted __virtio16 [usertype] val @@  
   got restricted __le16 [usertype] vq_index @@

drivers/net/virtio_net.c:3432:52: sparse: expected restricted 
__virtio16 [usertype] val
drivers/net/virtio_net.c:3432:52: sparse: got restricted __le16 
[usertype] vq_index
drivers/net/virtio_net.c:3457:83: sparse: sparse: incorrect type in 
argument 2 (different base types) @@ expected restricted __virtio64 
[usertype] val @@ got unsigned long long [usertype] @@
drivers/net/virtio_net.c:3457:83: sparse: expected restricted 
__virtio64 [usertype] val
drivers/net/virtio_net.c:3457:83: sparse: got unsigned long long 
[usertype]

drivers/net/virtio_net.c:3429:81: sparse: sparse: incorrect type in argument 2 
(different base types) @@ expected restricted __virtio16 [usertype] val @@  
   got restricted __le16 [usertype] size @@

drivers/net/virtio_net.c:3429:81: sparse: expected restricted 
__virtio16 [usertype] val
drivers/net/virtio_net.c:3429:81: sparse: got restricted __le16 
[usertype] size

drivers/net/virtio_net.c:3519:82: sparse: sparse: incorrect type in argument 2 
(different base types) @@ expected restricted __virtio64 [usertype] val @@  
   got restricted __le64 [assigned] [usertype] v @@

drivers/net/virtio_net.c:3519:82: sparse: expected restricted 
__virtio64 [usertype] val
drivers/net/virtio_net.c:3519:82: sparse: got restricted __le64 
[assigned] [usertype] v


I can reproduce these warnings on my local host.

It seems that the following can fix these warnings.

"

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 1f4d9605552f..62e40234e29c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3426,10 +3426,10 @@ static int virtnet_get_hw_stats(struct virtnet_info *vi,

 	num_tx = VIRTNET_SQ_STATS_LEN + ctx->num_tx;
 	num_cq = ctx->num_tx;

-	for (p = reply; p - reply < res_size; p += virtio16_to_cpu(vi->vdev, hdr->size)) {
+	for (p = reply; p - reply < res_size; p += virtio16_to_cpu(vi->vdev, (__virtio16 __force)hdr->size)) {
 		hdr = p;

-		qid = virtio16_to_cpu(vi->vdev, hdr->vq_index);
+		qid = virtio16_to_cpu(vi->vdev, (__virtio16 __force)(hdr->vq_index));

 		if (qid == vi->max_queue_pairs * 2) {
 			offset = 0;
@@ -3454,7 +3454,7 @@ static int virtnet_get_hw_stats(struct virtnet_info *vi,

 		for (j = 0; j < m->num; ++j) {
 			v = p + m->desc[j].offset;
-			ctx->data[offset + j] = virtio64_to_cpu(vi->vdev, *v);
+			ctx->data[offset + j] = virtio64_to_cpu(vi->vdev, (__virtio64 __force)*v);
 		}

 		break;
@@ -3516,7 +3516,7 @@ static int virtnet_get_sset_count(struct net_device *dev, int sset)

 		__le64 v;

 		v = vi->ctrl->stats_cap.supported_stats_types[0];
-		vi->device_stats_cap = virtio64_to_cpu(vi->vdev, v);
+		vi->device_stats_cap = virtio64_to_cpu(vi->vdev, (__virtio64 __force)v);
 		}
 	}

"

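The pattern in the fix above: the reply fields are __le16/__le64, while
virtio16_to_cpu()/virtio64_to_cpu() expect the __virtio16/__virtio64
types, so a __force cast tells sparse that the bitwise-type
reinterpretation is intentional. A minimal illustration (variable names
are invented):

	__le16 raw = hdr->vq_index;
	u16 qid;

	/* Without __force, sparse warns about mismatched bitwise types. */
	qid = virtio16_to_cpu(vi->vdev, (__virtio16 __force)raw);
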
Re: [PATCH v2 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2023-12-27 Thread Zhu Yanjun



On 2023/12/27 22:26, Zhu Yanjun wrote:

From: Zhu Yanjun 

Fix the warnings when building the virtio_net driver.

"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 10 [-Wformat-overflow=]
  4551 | sprintf(vi->rq[i].name, "input.%d", i);
   |^~
In function ‘virtnet_find_vqs’,
 inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range 
[-2147483643, 65534]
  4551 | sprintf(vi->rq[i].name, "input.%d", i);
   | ^~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes 
into a destination of size 16
  4551 | sprintf(vi->rq[i].name, "input.%d", i);
   | ^~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 9 [-Wformat-overflow=]
  4552 | sprintf(vi->sq[i].name, "output.%d", i);
   | ^~
In function ‘virtnet_find_vqs’,
 inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range 
[-2147483643, 65534]
  4552 | sprintf(vi->sq[i].name, "output.%d", i);
   | ^~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes 
into a destination of size 16
  4552 | sprintf(vi->sq[i].name, "output.%d", i);

"


Hi, all

V1->V2: Add commit logs. Format string is changed.

Best Regards,

Zhu Yanjun


Reviewed-by: Xuan Zhuo 
Signed-off-by: Zhu Yanjun 
---
  drivers/net/virtio_net.c | 9 +
  1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061..89a15cc81396 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4096,10 +4096,11 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
  {
vq_callback_t **callbacks;
struct virtqueue **vqs;
-   int ret = -ENOMEM;
-   int i, total_vqs;
const char **names;
+   int ret = -ENOMEM;
+   int total_vqs;
bool *ctx;
+   u16 i;
  
	/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
	 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
@@ -4136,8 +4137,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
for (i = 0; i < vi->max_queue_pairs; i++) {
callbacks[rxq2vq(i)] = skb_recv_done;
callbacks[txq2vq(i)] = skb_xmit_done;
-   sprintf(vi->rq[i].name, "input.%d", i);
-   sprintf(vi->sq[i].name, "output.%d", i);
+   sprintf(vi->rq[i].name, "input.%u", i);
+   sprintf(vi->sq[i].name, "output.%u", i);
names[rxq2vq(i)] = vi->rq[i].name;
names[txq2vq(i)] = vi->sq[i].name;
if (ctx)




[PATCH v2 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2023-12-27 Thread Zhu Yanjun
From: Zhu Yanjun 

Fix the warnings when building the virtio_net driver.

"
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 10 [-Wformat-overflow=]
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  |^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range 
[-2147483643, 65534]
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 18 bytes 
into a destination of size 16
 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing between 1 and 
11 bytes into a region of size 9 [-Wformat-overflow=]
 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~
In function ‘virtnet_find_vqs’,
inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range 
[-2147483643, 65534]
 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 19 bytes 
into a destination of size 16
 4552 | sprintf(vi->sq[i].name, "output.%d", i);

"

Reviewed-by: Xuan Zhuo 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/virtio_net.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d16f592c2061..89a15cc81396 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4096,10 +4096,11 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 {
vq_callback_t **callbacks;
struct virtqueue **vqs;
-   int ret = -ENOMEM;
-   int i, total_vqs;
const char **names;
+   int ret = -ENOMEM;
+   int total_vqs;
bool *ctx;
+   u16 i;
 
/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
@@ -4136,8 +4137,8 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
for (i = 0; i < vi->max_queue_pairs; i++) {
callbacks[rxq2vq(i)] = skb_recv_done;
callbacks[txq2vq(i)] = skb_xmit_done;
-   sprintf(vi->rq[i].name, "input.%d", i);
-   sprintf(vi->sq[i].name, "output.%d", i);
+   sprintf(vi->rq[i].name, "input.%u", i);
+   sprintf(vi->sq[i].name, "output.%u", i);
names[rxq2vq(i)] = vi->rq[i].name;
names[txq2vq(i)] = vi->sq[i].name;
if (ctx)
-- 
2.27.0




Re: [PATCH 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2023-12-26 Thread Zhu Yanjun

The warnings are as below:

"

drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4551:48: warning: ‘%d’ directive writing 
between 1 and 11 bytes into a region of size 10 [-Wformat-overflow=]

 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  |    ^~
In function ‘virtnet_find_vqs’,
    inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4551:41: note: directive argument in the range 
[-2147483643, 65534]

 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c:4551:17: note: ‘sprintf’ output between 8 and 
18 bytes into a destination of size 16

 4551 | sprintf(vi->rq[i].name, "input.%d", i);
  | ^~
drivers/net/virtio_net.c: In function ‘init_vqs’:
drivers/net/virtio_net.c:4552:49: warning: ‘%d’ directive writing 
between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=]

 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~
In function ‘virtnet_find_vqs’,
    inlined from ‘init_vqs’ at drivers/net/virtio_net.c:4645:8:
drivers/net/virtio_net.c:4552:41: note: directive argument in the range 
[-2147483643, 65534]

 4552 | sprintf(vi->sq[i].name, "output.%d", i);
  | ^~~
drivers/net/virtio_net.c:4552:17: note: ‘sprintf’ output between 9 and 
19 bytes into a destination of size 16

 4552 | sprintf(vi->sq[i].name, "output.%d", i);

"

Please review.

Best Regards,

Zhu Yanjun

On 2023/12/26 19:45, Zhu Yanjun wrote:

From: Zhu Yanjun 

Fix a warning when building the virtio_net driver.

Signed-off-by: Zhu Yanjun 
---
  drivers/net/virtio_net.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 49625638ad43..cf57eddf768a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4508,10 +4508,11 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
  {
vq_callback_t **callbacks;
struct virtqueue **vqs;
-   int ret = -ENOMEM;
-   int i, total_vqs;
const char **names;
+   int ret = -ENOMEM;
+   int total_vqs;
bool *ctx;
+   u16 i;
  
	/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
	 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by




[PATCH 1/1] virtio_net: Fix "‘%d’ directive writing between 1 and 11 bytes into a region of size 10" warnings

2023-12-26 Thread Zhu Yanjun
From: Zhu Yanjun 

Fix a warning when building the virtio_net driver.

Signed-off-by: Zhu Yanjun 
---
 drivers/net/virtio_net.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 49625638ad43..cf57eddf768a 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4508,10 +4508,11 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
 {
vq_callback_t **callbacks;
struct virtqueue **vqs;
-   int ret = -ENOMEM;
-   int i, total_vqs;
const char **names;
+   int ret = -ENOMEM;
+   int total_vqs;
bool *ctx;
+   u16 i;
 
/* We expect 1 RX virtqueue followed by 1 TX virtqueue, followed by
 * possible N-1 RX/TX queue pairs used in multiqueue mode, followed by
-- 
2.41.0




Re: [PATCH net-next v1 1/6] virtio_net: introduce device stats feature and structures

2023-12-26 Thread Zhu Yanjun



On 2023/12/26 15:30, Xuan Zhuo wrote:

The virtio-net device stats spec:

https://github.com/oasis-tcs/virtio-spec/commit/42f389989823039724f95bbbd243291ab0064f82

This commit introduces the related feature and structures.

Signed-off-by: Xuan Zhuo 
---
  include/uapi/linux/virtio_net.h | 137 
  1 file changed, 137 insertions(+)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index cc65ef0f3c3e..8fca4d1b7635 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -56,6 +56,7 @@
  #define VIRTIO_NET_F_MQ   22  /* Device supports Receive Flow
 * Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_DEVICE_STATS 50   /* Device can provide device-level 
statistics. */
  #define VIRTIO_NET_F_VQ_NOTF_COAL 52  /* Device supports virtqueue 
notification coalescing */
  #define VIRTIO_NET_F_NOTF_COAL53  /* Device supports 
notifications coalescing */
  #define VIRTIO_NET_F_GUEST_USO4   54  /* Guest can handle USOv4 in. */
@@ -406,4 +407,140 @@ struct  virtio_net_ctrl_coal_vq {
struct virtio_net_ctrl_coal coal;
  };
  
+/*

+ * Device Statistics
+ */
+#define VIRTIO_NET_CTRL_STATS 8
+#define VIRTIO_NET_CTRL_STATS_QUERY   0
+#define VIRTIO_NET_CTRL_STATS_GET 1
+
+struct virtio_net_stats_capabilities {
+
+#define VIRTIO_NET_STATS_TYPE_CVQ   (1ULL << 32)
+
+#define VIRTIO_NET_STATS_TYPE_RX_BASIC  (1ULL << 0)
+#define VIRTIO_NET_STATS_TYPE_RX_CSUM   (1ULL << 1)
+#define VIRTIO_NET_STATS_TYPE_RX_GSO(1ULL << 2)
+#define VIRTIO_NET_STATS_TYPE_RX_SPEED  (1ULL << 3)
+
+#define VIRTIO_NET_STATS_TYPE_TX_BASIC  (1ULL << 16)
+#define VIRTIO_NET_STATS_TYPE_TX_CSUM   (1ULL << 17)
+#define VIRTIO_NET_STATS_TYPE_TX_GSO(1ULL << 18)
+#define VIRTIO_NET_STATS_TYPE_TX_SPEED  (1ULL << 19)
+
+   __le64 supported_stats_types[1];
+};
+
+struct virtio_net_ctrl_queue_stats {
+   struct {
+   __le16 vq_index;
+   __le16 reserved[3];
+   __le64 types_bitmap[1];
+   } stats[1];
+};
+
+struct virtio_net_stats_reply_hdr {
+#define VIRTIO_NET_STATS_TYPE_REPLY_CVQ   32
+
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_BASIC  0
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_CSUM   1
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_GSO2
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_SPEED  3
+
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_BASIC  16
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_CSUM   17
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_GSO18
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_SPEED  19
+   __u8 type;
+   __u8 reserved;


Thanks a lot. I have run tests; the mentioned errors are fixed by this
patch series.


Zhu Yanjun


+   __le16 vq_index;
+   __le16 reserved1;
+   __le16 size;
+};
+
+struct virtio_net_stats_cvq {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 command_num;
+   __le64 ok_num;
+};
+
+struct virtio_net_stats_rx_basic {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 rx_notifications;
+
+   __le64 rx_packets;
+   __le64 rx_bytes;
+
+   __le64 rx_interrupts;
+
+   __le64 rx_drops;
+   __le64 rx_drop_overruns;
+};
+
+struct virtio_net_stats_tx_basic {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 tx_notifications;
+
+   __le64 tx_packets;
+   __le64 tx_bytes;
+
+   __le64 tx_interrupts;
+
+   __le64 tx_drops;
+   __le64 tx_drop_malformed;
+};
+
+struct virtio_net_stats_rx_csum {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 rx_csum_valid;
+   __le64 rx_needs_csum;
+   __le64 rx_csum_none;
+   __le64 rx_csum_bad;
+};
+
+struct virtio_net_stats_tx_csum {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 tx_csum_none;
+   __le64 tx_needs_csum;
+};
+
+struct virtio_net_stats_rx_gso {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 rx_gso_packets;
+   __le64 rx_gso_bytes;
+   __le64 rx_gso_packets_coalesced;
+   __le64 rx_gso_bytes_coalesced;
+};
+
+struct virtio_net_stats_tx_gso {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 tx_gso_packets;
+   __le64 tx_gso_bytes;
+   __le64 tx_gso_segments;
+   __le64 tx_gso_segments_bytes;
+   __le64 tx_gso_packets_noseg;
+   __le64 tx_gso_bytes_noseg;
+};
+
+struct virtio_net_stats_rx_speed {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 rx_packets_allowance_exceeded;
+   __le64 rx_bytes_allowance_exceeded;
+};
+
+struct virtio_net_stats_tx_speed {
+   struct virtio_net_stats_reply_hdr hdr;
+
+   __le64 tx_packets_allowance_exceeded;
+   __le64 tx_bytes_allowance_exceeded;
+};
+
  #endif /* _UAPI_LINUX_VIRTIO_NET_H */
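
For reference, a minimal sketch (an assumption, not part of this patch)
of how a driver could walk a VIRTIO_NET_CTRL_STATS_GET reply built from
these structures; reply and len stand in for the cvq response buffer:

	const __u8 *p = reply;

	while (p + sizeof(struct virtio_net_stats_reply_hdr) <= reply + len) {
		const struct virtio_net_stats_reply_hdr *hdr = (const void *)p;

		switch (hdr->type) {
		case VIRTIO_NET_STATS_TYPE_REPLY_RX_BASIC:
			/* consume a struct virtio_net_stats_rx_basic */
			break;
		/* ... handle the other reply types ... */
		}

		/* hdr->size covers the whole entry (the quoted driver loop
		 * also advances by hdr->size), so this steps to the next
		 * reply header.
		 */
		p += le16_to_cpu(hdr->size);
	}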




Re: [PATCH net-next 1/6] virtio_net: introduce device stats feature and structures

2023-12-25 Thread Zhu Yanjun

On 2023/12/22 11:30, Xuan Zhuo wrote:

The virtio-net device stats spec:

https://github.com/oasis-tcs/virtio-spec/commit/42f389989823039724f95bbbd243291ab0064f82

This commit introduces the related feature and structures.


Hi, Xuan

After applying this patch series, with ethtool version 6.5,
I got the following NIC statistics. But I do not see the statistics
mentioned in this patch series.

Am I missing something?

"
NIC statistics:
 rx_packets: 3434812669
 rx_bytes: 5168475253690
 rx_drops: 0
 rx_xdp_packets: 0
 rx_xdp_tx: 0
 rx_xdp_redirects: 0
 rx_xdp_drops: 0
 rx_kicks: 57179891
 tx_packets: 187694230
 tx_bytes: 12423799040
 tx_xdp_tx: 0
 tx_xdp_tx_drops: 0
 tx_kicks: 187694230
 tx_timeouts: 0
 rx_queue_0_packets: 866027381
 rx_queue_0_bytes: 1302726908150
 rx_queue_0_drops: 0
 rx_queue_0_xdp_packets: 0
 rx_queue_0_xdp_tx: 0
 rx_queue_0_xdp_redirects: 0
 rx_queue_0_xdp_drops: 0
 rx_queue_0_kicks: 14567691
 rx_queue_1_packets: 856758801
 rx_queue_1_bytes: 1289899049042
 rx_queue_1_drops: 0
 rx_queue_1_xdp_packets: 0
 rx_queue_1_xdp_tx: 0
 rx_queue_1_xdp_redirects: 0
 rx_queue_1_xdp_drops: 0
 rx_queue_1_kicks: 14265201
 rx_queue_2_packets: 839291053
 rx_queue_2_bytes: 1261620863886
 rx_queue_2_drops: 0
 rx_queue_2_xdp_packets: 0
 rx_queue_2_xdp_tx: 0
 rx_queue_2_xdp_redirects: 0
 rx_queue_2_xdp_drops: 0
 rx_queue_2_kicks: 13857653
 rx_queue_3_packets: 872735434
 rx_queue_3_bytes: 1314228432612
 rx_queue_3_drops: 0
 rx_queue_3_xdp_packets: 0
 rx_queue_3_xdp_tx: 0
 rx_queue_3_xdp_redirects: 0
 rx_queue_3_xdp_drops: 0
 rx_queue_3_kicks: 14489346
 tx_queue_0_packets: 75723
 tx_queue_0_bytes: 4999030
 tx_queue_0_xdp_tx: 0
 tx_queue_0_xdp_tx_drops: 0
 tx_queue_0_kicks: 75723
 tx_queue_0_timeouts: 0
 tx_queue_1_packets: 62262921
 tx_queue_1_bytes: 4134803914
 tx_queue_1_xdp_tx: 0
 tx_queue_1_xdp_tx_drops: 0
 tx_queue_1_kicks: 62262921
 tx_queue_1_timeouts: 0
 tx_queue_2_packets: 83
 tx_queue_2_bytes: 5478
 tx_queue_2_xdp_tx: 0
 tx_queue_2_xdp_tx_drops: 0
 tx_queue_2_kicks: 83
 tx_queue_2_timeouts: 0
 tx_queue_3_packets: 125355503
 tx_queue_3_bytes: 8283990618
 tx_queue_3_xdp_tx: 0
 tx_queue_3_xdp_tx_drops: 0
 tx_queue_3_kicks: 125355503
 tx_queue_3_timeouts: 0
"



Signed-off-by: Xuan Zhuo 
---
  include/uapi/linux/virtio_net.h | 137 
  1 file changed, 137 insertions(+)

diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index cc65ef0f3c3e..129e0871d28f 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -56,6 +56,7 @@
  #define VIRTIO_NET_F_MQ   22  /* Device supports Receive Flow
 * Steering */
  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
+#define VIRTIO_NET_F_DEVICE_STATS 50   /* Device can provide device-level 
statistics. */
  #define VIRTIO_NET_F_VQ_NOTF_COAL 52  /* Device supports virtqueue 
notification coalescing */
  #define VIRTIO_NET_F_NOTF_COAL53  /* Device supports 
notifications coalescing */
  #define VIRTIO_NET_F_GUEST_USO4   54  /* Guest can handle USOv4 in. */
@@ -406,4 +407,140 @@ struct  virtio_net_ctrl_coal_vq {
struct virtio_net_ctrl_coal coal;
  };
  
+/*

+ * Device Statistics
+ */
+#define VIRTIO_NET_CTRL_STATS 8
+#define VIRTIO_NET_CTRL_STATS_QUERY   0
+#define VIRTIO_NET_CTRL_STATS_GET 1
+
+struct virtio_net_stats_capabilities {
+
+#define VIRTIO_NET_STATS_TYPE_CVQ   (1L << 32)
+
+#define VIRTIO_NET_STATS_TYPE_RX_BASIC  (1 << 0)
+#define VIRTIO_NET_STATS_TYPE_RX_CSUM   (1 << 1)
+#define VIRTIO_NET_STATS_TYPE_RX_GSO(1 << 2)
+#define VIRTIO_NET_STATS_TYPE_RX_SPEED  (1 << 3)
+
+#define VIRTIO_NET_STATS_TYPE_TX_BASIC  (1 << 16)
+#define VIRTIO_NET_STATS_TYPE_TX_CSUM   (1 << 17)
+#define VIRTIO_NET_STATS_TYPE_TX_GSO(1 << 18)
+#define VIRTIO_NET_STATS_TYPE_TX_SPEED  (1 << 19)
+
+   __le64 supported_stats_types[1];
+};
+
+struct virtio_net_ctrl_queue_stats {
+   struct {
+   __le16 vq_index;
+   __le16 reserved[3];
+   __le64 types_bitmap[1];
+   } stats[1];
+};
+
+struct virtio_net_stats_reply_hdr {
+#define VIRTIO_NET_STATS_TYPE_REPLY_CVQ   32
+
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_BASIC  0
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_CSUM   1
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_GSO2
+#define VIRTIO_NET_STATS_TYPE_REPLY_RX_SPEED  3
+
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_BASIC  16
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_CSUM   17
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_GSO18
+#define VIRTIO_NET_STATS_TYPE_REPLY_TX_SPEED  19
+   u8 type;
+   u8 reserved;
+   __le16 vq_index;
+   __le16 reserved1;
+   __le16 size;
+};
+
+struc

Re: [PATCH net-next 3/6] virtio_net: support device stats

2023-12-24 Thread Zhu Yanjun

On 2023/12/24 1:23, kernel test robot wrote:

Hi Xuan,

kernel test robot noticed the following build warnings:

[auto build test WARNING on mst-vhost/linux-next]
[also build test WARNING on linus/master v6.7-rc6]
[cannot apply to net-next/main horms-ipvs/master next-20231222]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_net-introduce-device-stats-feature-and-structures/20231222-175505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:
https://lore.kernel.org/r/20231222033021.20649-4-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next 3/6] virtio_net: support device stats
config: arc-haps_hs_defconfig 
(https://download.01.org/0day-ci/archive/20231224/202312240155.ow7kvqzo-...@intel.com/config)
compiler: arc-elf-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231224/202312240155.ow7kvqzo-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312240155.ow7kvqzo-...@intel.com/

All warnings (new ones prefixed by >>):

In file included from include/linux/virtio_net.h:8,
 from drivers/net/virtio_net.c:12:

include/uapi/linux/virtio_net.h:419:45: warning: left shift count >= width of 
type [-Wshift-count-overflow]

  419 | #define VIRTIO_NET_STATS_TYPE_CVQ   (1L << 32)
  | ^~
drivers/net/virtio_net.c:215:17: note: in expansion of macro 
'VIRTIO_NET_STATS_TYPE_CVQ'
  215 | VIRTIO_NET_STATS_TYPE_##TYPE,   \
  | ^~
drivers/net/virtio_net.c:224:9: note: in expansion of macro 
'VIRTNET_DEVICE_STATS_MAP_ITEM'
  224 | VIRTNET_DEVICE_STATS_MAP_ITEM(CVQ, cvq, CQ),
  | ^


vim +419 include/uapi/linux/virtio_net.h

ba106d1c676c80 Xuan Zhuo 2023-12-22  418
ba106d1c676c80 Xuan Zhuo 2023-12-22 @419  #define VIRTIO_NET_STATS_TYPE_CVQ   (1L << 32)


#define VIRTIO_NET_STATS_TYPE_CVQ   (1ULL << 32)
The above can fix this problem.
Not sure whether this is appropriate for the rest of the patch series.
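
For illustration, the difference on a 32-bit target (where long is 32
bits, as in the arc config from the report):

#define STATS_TYPE_CVQ_BAD   (1L << 32)    /* shift count == width of long: overflow */
#define STATS_TYPE_CVQ_GOOD  (1ULL << 32)  /* unsigned long long is at least 64 bits */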

Zhu Yanjun


ba106d1c676c80 Xuan Zhuo 2023-12-22  420






Re: [PATCH net-next 1/6] virtio_net: introduce device stats feature and structures

2023-12-24 Thread Zhu Yanjun

On 2023/12/24 1:55, kernel test robot wrote:

Hi Xuan,

kernel test robot noticed the following build errors:

[auto build test ERROR on mst-vhost/linux-next]
[also build test ERROR on linus/master v6.7-rc6 next-20231222]
[cannot apply to net-next/main horms-ipvs/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Xuan-Zhuo/virtio_net-introduce-device-stats-feature-and-structures/20231222-175505
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git linux-next
patch link:
https://lore.kernel.org/r/20231222033021.20649-2-xuanzhuo%40linux.alibaba.com
patch subject: [PATCH net-next 1/6] virtio_net: introduce device stats feature 
and structures
config: x86_64-buildonly-randconfig-002-20231223 
(https://download.01.org/0day-ci/archive/20231224/202312240125.00z3nxgy-...@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git 
ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20231224/202312240125.00z3nxgy-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202312240125.00z3nxgy-...@intel.com/

All errors (new ones prefixed by >>):

In file included from :1:

./usr/include/linux/virtio_net.h:454:2: error: unknown type name 'u8'

u8 type;
^
./usr/include/linux/virtio_net.h:455:2: error: unknown type name 'u8'
u8 reserved;

I can reproduce this problem.
Replacing u8 with __u8 fixes it.
Not sure whether __u8 is correct for the rest of the patch series.
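
For illustration: UAPI headers are also compiled by userspace, where
only the double-underscore types from <linux/types.h> are available
(the struct name here is invented):

#include <linux/types.h>

struct example_uapi_hdr {
	__u8 type;	/* OK: __u8 is exported to userspace */
	__u8 reserved;	/* a bare u8 is a kernel-internal typedef and
			 * breaks userspace builds, as the robot shows */
};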

Zhu Yanjun


^
2 errors generated.






Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-21 Thread Zhu Yanjun



On 2023/12/21 13:20, Heng Qi wrote:



On 2023/12/21 11:02 AM, Zhu Yanjun wrote:

On 2023/12/20 16:07, Heng Qi wrote:

virtio-net has two ways to switch napi_tx: one is through the
module parameter, and the other is through the coalescing parameter
settings (provided that the nic is down).

Sometimes we face a performance regression caused by napi_tx,
and then we need to toggle napi_tx while debugging. However, the
existing methods are a bit troublesome, such as needing to
reload the driver or bring the network card down. So try to make
this update.


What scenario can trigger this? We want to run tests on our device.


Hi Zhu Yanjun, you can use the following cmds:

ethtool -C tx-frames 0, to disable napi_tx
ethtool -C tx-frames 1, to enable napi_tx



Thanks a lot. I just ran tests on our device and confirmed that the
virtio_net driver works well after running "ethtool -C NIC tx-frames
0 && sleep 3 && ethtool -C NIC tx-frames 1".


You can add "Reviewed-and-tested-by: Zhu Yanjun "

Thanks,

Zhu Yanjun



Thanks.



Zhu Yanjun



Signed-off-by: Heng Qi 
Reviewed-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 81 
++--

  1 file changed, 37 insertions(+), 44 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 10614e9f7cad..12f8e1f9971c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3559,16 +3559,37 @@ static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)

  return 0;
  }
-static int virtnet_should_update_vq_weight(int dev_flags, int weight,
-   int vq_weight, bool *should_update)
+static void virtnet_switch_napi_tx(struct virtnet_info *vi, u32 qstart,
+   u32 qend, u32 tx_frames)
  {
-    if (weight ^ vq_weight) {
-    if (dev_flags & IFF_UP)
-    return -EBUSY;
-    *should_update = true;
-    }
+    struct net_device *dev = vi->dev;
+    int new_weight, cur_weight;
+    struct netdev_queue *txq;
+    struct send_queue *sq;
  -    return 0;
+    new_weight = tx_frames ? NAPI_POLL_WEIGHT : 0;
+    for (; qstart < qend; qstart++) {
+    sq = &vi->sq[qstart];
+    cur_weight = sq->napi.weight;
+    if (!(new_weight ^ cur_weight))
+    continue;
+
+    if (!(dev->flags & IFF_UP)) {
+    sq->napi.weight = new_weight;
+    continue;
+    }
+
+    if (cur_weight)
+    virtnet_napi_tx_disable(&sq->napi);
+
+    txq = netdev_get_tx_queue(dev, qstart);
+    __netif_tx_lock_bh(txq);
+    sq->napi.weight = new_weight;
+    __netif_tx_unlock_bh(txq);
+
+    if (!cur_weight)
+    virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
+    }
  }
    static int virtnet_set_coalesce(struct net_device *dev,
@@ -3577,25 +3598,11 @@ static int virtnet_set_coalesce(struct 
net_device *dev,

  struct netlink_ext_ack *extack)
  {
  struct virtnet_info *vi = netdev_priv(dev);
-    int ret, queue_number, napi_weight;
-    bool update_napi = false;
-
-    /* Can't change NAPI weight if the link is up */
-    napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
-    for (queue_number = 0; queue_number < vi->max_queue_pairs; queue_number++) {
-    ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
- vi->sq[queue_number].napi.weight,
-  &update_napi);
-    if (ret)
-    return ret;
-
-    if (update_napi) {
-    /* All queues that belong to [queue_number, vi->max_queue_pairs] will be
- * updated for the sake of simplicity, which might not be necessary
- */
-    break;
-    }
-    }
+    int ret;
+
+    /* Param tx_frames can be used to switch napi_tx */
+    virtnet_switch_napi_tx(vi, 0, vi->max_queue_pairs,
+   ec->tx_max_coalesced_frames);
    if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_NOTF_COAL))
  ret = virtnet_send_notf_coal_cmds(vi, ec);
@@ -3605,11 +3612,6 @@ static int virtnet_set_coalesce(struct 
net_device *dev,

  if (ret)
  return ret;
  -    if (update_napi) {
-    for (; queue_number < vi->max_queue_pairs; queue_number++)
-    vi->sq[queue_number].napi.weight = napi_weight;
-    }
-
  return ret;
  }
  @@ -3641,19 +3643,13 @@ static int 
virtnet_set_per_queue_coalesce(struct net_device *dev,

    struct ethtool_coalesce *ec)
  {
  struct virtnet_info *vi = netdev_priv(dev);
-    int ret, napi_weight;
-    bool update_napi = false;
+    int ret;
    if (queue >= vi->max_queue_pairs)
  return -EINVAL;
  -    /* Can't change NAPI weight if the link is up */
-    napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
-    ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
-

Re: [PATCH net-next] virtio-net: switch napi_tx without downing nic

2023-12-20 Thread Zhu Yanjun

On 2023/12/20 16:07, Heng Qi wrote:

virtio-net has two ways to switch napi_tx: one is through the
module parameter, and the other is through the coalescing parameter
settings (provided that the nic is down).

Sometimes we face a performance regression caused by napi_tx,
and then we need to toggle napi_tx while debugging. However, the
existing methods are a bit troublesome, such as needing to
reload the driver or bring the network card down. So try to make
this update.


What scenario can trigger this? We want to run tests on our device.

Zhu Yanjun



Signed-off-by: Heng Qi 
Reviewed-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 81 ++--
  1 file changed, 37 insertions(+), 44 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 10614e9f7cad..12f8e1f9971c 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3559,16 +3559,37 @@ static int virtnet_coal_params_supported(struct ethtool_coalesce *ec)
 	return 0;
 }
 
-static int virtnet_should_update_vq_weight(int dev_flags, int weight,
-					   int vq_weight, bool *should_update)
+static void virtnet_switch_napi_tx(struct virtnet_info *vi, u32 qstart,
+				   u32 qend, u32 tx_frames)
 {
-	if (weight ^ vq_weight) {
-		if (dev_flags & IFF_UP)
-			return -EBUSY;
-		*should_update = true;
-	}
+	struct net_device *dev = vi->dev;
+	int new_weight, cur_weight;
+	struct netdev_queue *txq;
+	struct send_queue *sq;
 
-	return 0;
+	new_weight = tx_frames ? NAPI_POLL_WEIGHT : 0;
+	for (; qstart < qend; qstart++) {
+		sq = &vi->sq[qstart];
+		cur_weight = sq->napi.weight;
+		if (!(new_weight ^ cur_weight))
+			continue;
+
+		if (!(dev->flags & IFF_UP)) {
+			sq->napi.weight = new_weight;
+			continue;
+		}
+
+		if (cur_weight)
+			virtnet_napi_tx_disable(&sq->napi);
+
+		txq = netdev_get_tx_queue(dev, qstart);
+		__netif_tx_lock_bh(txq);
+		sq->napi.weight = new_weight;
+		__netif_tx_unlock_bh(txq);
+
+		if (!cur_weight)
+			virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
+	}
 }
 
 static int virtnet_set_coalesce(struct net_device *dev,
@@ -3577,25 +3598,11 @@ static int virtnet_set_coalesce(struct net_device *dev,
 				struct netlink_ext_ack *extack)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int ret, queue_number, napi_weight;
-	bool update_napi = false;
-
-	/* Can't change NAPI weight if the link is up */
-	napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
-	for (queue_number = 0; queue_number < vi->max_queue_pairs; queue_number++) {
-		ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
-						      vi->sq[queue_number].napi.weight,
-						      &update_napi);
-		if (ret)
-			return ret;
-
-		if (update_napi) {
-			/* All queues that belong to [queue_number, vi->max_queue_pairs] will be
-			 * updated for the sake of simplicity, which might not be necessary
-			 */
-			break;
-		}
-	}
+	int ret;
+
+	/* Param tx_frames can be used to switch napi_tx */
+	virtnet_switch_napi_tx(vi, 0, vi->max_queue_pairs,
+			       ec->tx_max_coalesced_frames);
 
 	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_NOTF_COAL))
 		ret = virtnet_send_notf_coal_cmds(vi, ec);
@@ -3605,11 +3612,6 @@ static int virtnet_set_coalesce(struct net_device *dev,
 	if (ret)
 		return ret;
 
-	if (update_napi) {
-		for (; queue_number < vi->max_queue_pairs; queue_number++)
-			vi->sq[queue_number].napi.weight = napi_weight;
-	}
-
 	return ret;
 }
 
@@ -3641,19 +3643,13 @@ static int virtnet_set_per_queue_coalesce(struct net_device *dev,
 					  struct ethtool_coalesce *ec)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
-	int ret, napi_weight;
-	bool update_napi = false;
+	int ret;
 
 	if (queue >= vi->max_queue_pairs)
 		return -EINVAL;
 
-	/* Can't change NAPI weight if the link is up */
-	napi_weight = ec->tx_max_coalesced_frames ? NAPI_POLL_WEIGHT : 0;
-	ret = virtnet_should_update_vq_weight(dev->flags, napi_weight,
-
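
For reference, with this patch applied the switch can be driven from
userspace without bringing the NIC down. A minimal sketch of the ioctl
path that reaches virtnet_set_coalesce() (assuming an interface named
"eth0"; equivalent to "ethtool -C eth0 tx-frames 0" or "tx-frames 1"):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

int main(void)
{
	struct ethtool_coalesce ec = { .cmd = ETHTOOL_GCOALESCE };
	struct ifreq ifr = { 0 };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0)
		return 1;

	strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);
	ifr.ifr_data = (void *)&ec;

	if (ioctl(fd, SIOCETHTOOL, &ifr) == 0) {	/* read current values */
		ec.cmd = ETHTOOL_SCOALESCE;
		ec.tx_max_coalesced_frames = 1;		/* non-zero: napi_tx on; 0: off */
		if (ioctl(fd, SIOCETHTOOL, &ifr))
			perror("ETHTOOL_SCOALESCE");
	}

	close(fd);
	return 0;
}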

Re: [PATCH net-next v2] net/mlx5e: Fix uninitialised struct field moder.comps

2021-04-20 Thread Zhu Yanjun
On Tue, Apr 20, 2021 at 5:21 PM Leon Romanovsky  wrote:
>
> On Tue, Apr 20, 2021 at 03:09:03PM +0800, Zhu Yanjun wrote:
> > On Tue, Apr 20, 2021 at 3:01 PM wangyunjian  wrote:
> > >
> > > From: Yunjian Wang 
> > >
> > > The 'comps' struct field in 'moder' is not being initialized in
> > > mlx5e_get_def_rx_moderation() and mlx5e_get_def_tx_moderation().
> > > So initialize 'moder' to zero to avoid the issue.
>
> Please state that it is false alarm and this patch doesn't fix anything
> except broken static analyzer tool.
>
> > >
> > > Addresses-Coverity: ("Uninitialized scalar variable")
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > > v2: update mlx5e_get_def_tx_moderation() also needs fixing
> > > ---
> > >  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++--
> > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> > > b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > index 5db63b9f3b70..17a817b7e539 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > > @@ -4868,7 +4868,7 @@ static bool slow_pci_heuristic(struct mlx5_core_dev 
> > > *mdev)
> > >
> > >  static struct dim_cq_moder mlx5e_get_def_tx_moderation(u8 cq_period_mode)
> > >  {
> > > -   struct dim_cq_moder moder;
> >
> > > +   struct dim_cq_moder moder = {};
> >
> > If I remember correctly, some gcc compiler will report errors about this 
> > "{}".
>
> Kernel doesn't support such compilers.

Are you sure? What makes you so certain?

Zhu Yanjun

>
> Thanks
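
For context, here is a standalone sketch of the disputed initializer
(the struct is a stand-in, not the real dim_cq_moder). The empty-brace
form is a GNU C extension that was only standardized in C23, which is
presumably what the "some gcc will report errors" concern refers to;
the kernel build has relied on it for a long time.

struct moder {
	unsigned char cq_period_mode;
	unsigned short pkts;
	unsigned short usec;
	unsigned char comps;
};

int main(void)
{
	struct moder a = {};	/* GNU C / C23: all members zeroed */
	struct moder b = {0};	/* strictly conforming C89/C99 equivalent */

	return a.comps + b.comps;	/* both are zero */
}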


Re: [PATCH net-next v2] net/mlx5e: Fix uninitialised struct field moder.comps

2021-04-20 Thread Zhu Yanjun
On Tue, Apr 20, 2021 at 3:01 PM wangyunjian  wrote:
>
> From: Yunjian Wang 
>
> The 'comps' struct field in 'moder' is not being initialized in
> mlx5e_get_def_rx_moderation() and mlx5e_get_def_tx_moderation().
> So initialize 'moder' to zero to avoid the issue.
>
> Addresses-Coverity: ("Uninitialized scalar variable")
> Signed-off-by: Yunjian Wang 
> ---
> v2: update mlx5e_get_def_tx_moderation() also needs fixing
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 5db63b9f3b70..17a817b7e539 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -4868,7 +4868,7 @@ static bool slow_pci_heuristic(struct mlx5_core_dev 
> *mdev)
>
>  static struct dim_cq_moder mlx5e_get_def_tx_moderation(u8 cq_period_mode)
>  {
> -   struct dim_cq_moder moder;

> +   struct dim_cq_moder moder = {};

If I remember correctly, some gcc compilers will report errors for this "{}".

Zhu Yanjun

>
> moder.cq_period_mode = cq_period_mode;
> moder.pkts = MLX5E_PARAMS_DEFAULT_TX_CQ_MODERATION_PKTS;
> @@ -4881,7 +4881,7 @@ static struct dim_cq_moder 
> mlx5e_get_def_tx_moderation(u8 cq_period_mode)
>
>  static struct dim_cq_moder mlx5e_get_def_rx_moderation(u8 cq_period_mode)
>  {
> -   struct dim_cq_moder moder;
> +   struct dim_cq_moder moder = {};
>
> moder.cq_period_mode = cq_period_mode;
> moder.pkts = MLX5E_PARAMS_DEFAULT_RX_CQ_MODERATION_PKTS;
> --
> 2.23.0
>


Re: [PATCH net-next v2] net: psample: Introduce stubs to remove NIC driver dependency

2021-01-27 Thread Zhu Yanjun
On Wed, Jan 27, 2021 at 5:06 PM Chris Mi  wrote:
>
> In order to send sampled packets to userspace, NIC driver calls
> psample api directly. But it creates a hard dependency on module
> psample. Introduce psample_ops to remove the hard dependency.
> It is initialized when psample module is loaded and set to NULL
> when the module is unloaded.
>
> Reported-by: kernel test robot 
> Signed-off-by: Chris Mi 
> Reviewed-by: Jiri Pirko 
> ---
> v1->v2:
>  - fix sparse errors
>
>  include/net/psample.h| 27 +++
>  net/psample/psample.c| 13 -
>  net/sched/Makefile   |  2 +-
>  net/sched/psample_stub.c |  7 +++
>  4 files changed, 47 insertions(+), 2 deletions(-)
>  create mode 100644 net/sched/psample_stub.c
>
> diff --git a/include/net/psample.h b/include/net/psample.h
> index 68ae16bb0a4a..e6a73128de59 100644
> --- a/include/net/psample.h
> +++ b/include/net/psample.h
> @@ -4,6 +4,7 @@
>
>  #include 
>  #include 
> +#include 
>
>  struct psample_group {
> struct list_head list;
> @@ -14,6 +15,15 @@ struct psample_group {
> struct rcu_head rcu;
>  };
>
> +struct psample_ops {
> +   void (*sample_packet)(struct psample_group *group, struct sk_buff 
> *skb,
> + u32 trunc_size, int in_ifindex, int out_ifindex,
> + u32 sample_rate);
> +
> +};
> +
> +extern const struct psample_ops __rcu *psample_ops __read_mostly;
> +
>  struct psample_group *psample_group_get(struct net *net, u32 group_num);
>  void psample_group_take(struct psample_group *group);
>  void psample_group_put(struct psample_group *group);
> @@ -35,4 +45,21 @@ static inline void psample_sample_packet(struct 
> psample_group *group,
>
>  #endif
>
> +static inline void

The inline keyword is not needed here; the compiler can decide that on its own.

Zhu Yanjun

> +psample_nic_sample_packet(struct psample_group *group,
> + struct sk_buff *skb, u32 trunc_size,
> + int in_ifindex, int out_ifindex,
> + u32 sample_rate)
> +{
> +   const struct psample_ops *ops;
> +
> +   rcu_read_lock();
> +   ops = rcu_dereference(psample_ops);
> +   if (ops)
> +   ops->sample_packet(group, skb, trunc_size,
> +  in_ifindex, out_ifindex,
> +  sample_rate);
> +   rcu_read_unlock();
> +}
> +
>  #endif /* __NET_PSAMPLE_H */
> diff --git a/net/psample/psample.c b/net/psample/psample.c
> index 33e238c965bd..2a9fbfe09395 100644
> --- a/net/psample/psample.c
> +++ b/net/psample/psample.c
> @@ -8,6 +8,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -35,6 +36,10 @@ static const struct genl_multicast_group 
> psample_nl_mcgrps[] = {
>
>  static struct genl_family psample_nl_family __ro_after_init;
>
> +static const struct psample_ops psample_sample_ops = {
> +   .sample_packet  = psample_sample_packet,
> +};
> +
>  static int psample_group_nl_fill(struct sk_buff *msg,
>  struct psample_group *group,
>  enum psample_command cmd, u32 portid, u32 
> seq,
> @@ -456,11 +461,17 @@ EXPORT_SYMBOL_GPL(psample_sample_packet);
>
>  static int __init psample_module_init(void)
>  {
> -   return genl_register_family(&psample_nl_family);
> +   int ret;
> +
> +   ret = genl_register_family(&psample_nl_family);
> +   if (!ret)
> +   RCU_INIT_POINTER(psample_ops, &psample_sample_ops);
> +   return ret;
>  }
>
>  static void __exit psample_module_exit(void)
>  {
> +   RCU_INIT_POINTER(psample_ops, NULL);
> genl_unregister_family(&psample_nl_family);
>  }
>
> diff --git a/net/sched/Makefile b/net/sched/Makefile
> index dd14ef413fda..0d92bb98bb26 100644
> --- a/net/sched/Makefile
> +++ b/net/sched/Makefile
> @@ -3,7 +3,7 @@
>  # Makefile for the Linux Traffic Control Unit.
>  #
>
> -obj-y  := sch_generic.o sch_mq.o
> +obj-y  := sch_generic.o sch_mq.o psample_stub.o
>
>  obj-$(CONFIG_INET) += sch_frag.o
>  obj-$(CONFIG_NET_SCHED)+= sch_api.o sch_blackhole.o
> diff --git a/net/sched/psample_stub.c b/net/sched/psample_stub.c
> new file mode 100644
> index ..0615a7b64000
> --- /dev/null
> +++ b/net/sched/psample_stub.c
> @@ -0,0 +1,7 @@
> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB
> +/* Copyright (c) 2021 Mellanox Technologies. */
> +
> +#include 
> +
> +const struct psample_ops __rcu *psample_ops __read_mostly;
> +EXPORT_SYMBOL_GPL(psample_ops);
> --
> 2.26.2
>
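
To illustrate what the stub buys a NIC driver, here is a hypothetical
call site (the mydrv_* name and the argument values are made up, and
this assumes the patch above is applied): the driver no longer needs a
symbol from psample.ko, and when that module is not loaded
psample_nic_sample_packet() sees a NULL psample_ops and silently does
nothing.

#include <net/psample.h>
#include <linux/skbuff.h>

static void mydrv_report_sample(struct psample_group *group,
				struct sk_buff *skb)
{
	/* safe whether or not psample.ko is loaded */
	psample_nic_sample_packet(group, skb,
				  128,			/* trunc_size */
				  skb->dev->ifindex,	/* in_ifindex */
				  -1,			/* out_ifindex unknown */
				  1000);		/* sample_rate */
}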


[PATCH v3 1/1] xdp: avoid calling kfree twice

2020-12-09 Thread Zhu Yanjun
In the function xdp_umem_pin_pages, if npgs != umem->npgs and
npgs >= 0, the function xdp_umem_unpin_pages is called. In this
function, kfree is called to handle umem->pgs, and then in the
function xdp_umem_pin_pages, kfree is called again to handle
umem->pgs. Eventually, kfree is called twice on umem->pgs.

Since umem->pgs is set to NULL after the first kfree, the second
kfree does not trigger a call trace.

Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
CC: Ye Dong 
Acked-by: Björn Töpel 
Signed-off-by: Zhu Yanjun 
---
 net/xdp/xdp_umem.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 56a28a686988..01b31c56cead 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -97,7 +97,6 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
unsigned int gup_flags = FOLL_WRITE;
long npgs;
-   int err;
 
umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
GFP_KERNEL | __GFP_NOWARN);
@@ -112,20 +111,14 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
if (npgs != umem->npgs) {
if (npgs >= 0) {
umem->npgs = npgs;
-   err = -ENOMEM;
-   goto out_pin;
+   xdp_umem_unpin_pages(umem);
+   return -ENOMEM;
}
-   err = npgs;
-   goto out_pgs;
+   kfree(umem->pgs);
+   umem->pgs = NULL;
+   return (int)npgs;
}
return 0;
-
-out_pin:
-   xdp_umem_unpin_pages(umem);
-out_pgs:
-   kfree(umem->pgs);
-   umem->pgs = NULL;
-   return err;
 }
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
-- 
2.18.4



Re: [PATCH v2 1/1] xdp: avoid calling kfree twice

2020-12-09 Thread Zhu Yanjun
On Wed, Dec 9, 2020 at 6:44 PM Toke Høiland-Jørgensen  wrote:
>
> Zhu Yanjun  writes:
>
> > On Wed, Dec 9, 2020 at 1:12 AM Daniel Borkmann  wrote:
> >>
> >> On 12/9/20 6:03 AM, Zhu Yanjun wrote:
> >> > In the function xdp_umem_pin_pages, if npgs != umem->npgs and
> >> > npgs >= 0, the function xdp_umem_unpin_pages is called. In this
> >> > function, kfree is called to handle umem->pgs, and then in the
> >> > function xdp_umem_pin_pages, kfree is called again to handle
> >> > umem->pgs. Eventually, umem->pgs is freed twice.
> >> >
> >> > Acked-by: Björn Töpel 
> >> > Signed-off-by: Zhu Yanjun 
> >>
> >> Please also fix up the commit log according to Bjorn's prior feedback [0].
> >> If it's just a cleanup, it should state so, the commit message right now
> >> makes it sound like an actual double free bug.
> >
> > The umem->pgs is actually freed twice. Since umem->pgs is set to NULL
> > after the first kfree,
> > the second kfree would not trigger call trace.
> > IMO, the commit log is very clear about this.
>
> Yes, it is very clear; and also wrong. As someone already pointed out,
> passing a NULL pointer to kfree() doesn't actually lead to a double
> free:

By "double free", do you mean an actual call-trace bug? If so, I
will correct my commit log. There I only meant that kfree is called
twice; the second call is meaningless, and since NULL is passed to it,
it will not trigger a "double free" call trace.

I will send the latest version soon.

Zhu Yanjun
>
> https://elixir.bootlin.com/linux/latest/source/mm/slub.c#L4106
>
> -Toke
>
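
A minimal kernel-side sketch of Toke's point: kfree(NULL) is a defined
no-op, so the "second kfree" in the old error path could never free
anything twice — the patch is a cleanup, not a double-free fix.

#include <linux/slab.h>

static int kfree_null_demo(void)
{
	void *p = kmalloc(16, GFP_KERNEL);

	if (!p)
		return -ENOMEM;

	kfree(p);
	p = NULL;
	kfree(p);	/* returns immediately; nothing is freed twice */

	return 0;
}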


Re: [PATCH v2 1/1] xdp: avoid calling kfree twice

2020-12-08 Thread Zhu Yanjun
On Wed, Dec 9, 2020 at 1:12 AM Daniel Borkmann  wrote:
>
> On 12/9/20 6:03 AM, Zhu Yanjun wrote:
> > In the function xdp_umem_pin_pages, if npgs != umem->npgs and
> > npgs >= 0, the function xdp_umem_unpin_pages is called. In this
> > function, kfree is called to handle umem->pgs, and then in the
> > function xdp_umem_pin_pages, kfree is called again to handle
> > umem->pgs. Eventually, umem->pgs is freed twice.
> >
> > Acked-by: Björn Töpel 
> > Signed-off-by: Zhu Yanjun 
>
> Please also fix up the commit log according to Bjorn's prior feedback [0].
> If it's just a cleanup, it should state so, the commit message right now
> makes it sound like an actual double free bug.

kfree is actually called twice on umem->pgs. Since umem->pgs is set to
NULL after the first kfree, the second kfree does not trigger a call
trace.
IMO, the commit log is very clear about this.

Zhu Yanjun
>
>[0] 
> https://lore.kernel.org/netdev/0fef898d-cf5e-ef1b-6c35-c98669e9e...@intel.com/


[PATCH v2 1/1] xdp: avoid calling kfree twice

2020-12-07 Thread Zhu Yanjun
In the function xdp_umem_pin_pages, if npgs != umem->npgs and
npgs >= 0, the function xdp_umem_unpin_pages is called. In this
function, kfree is called to handle umem->pgs, and then in the
function xdp_umem_pin_pages, kfree is called again to handle
umem->pgs. Eventually, umem->pgs is freed twice.

Acked-by: Björn Töpel 
Signed-off-by: Zhu Yanjun 
---
 net/xdp/xdp_umem.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 56a28a686988..01b31c56cead 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -97,7 +97,6 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
unsigned int gup_flags = FOLL_WRITE;
long npgs;
-   int err;
 
umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
GFP_KERNEL | __GFP_NOWARN);
@@ -112,20 +111,14 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
if (npgs != umem->npgs) {
if (npgs >= 0) {
umem->npgs = npgs;
-   err = -ENOMEM;
-   goto out_pin;
+   xdp_umem_unpin_pages(umem);
+   return -ENOMEM;
}
-   err = npgs;
-   goto out_pgs;
+   kfree(umem->pgs);
+   umem->pgs = NULL;
+   return (int)npgs;
}
return 0;
-
-out_pin:
-   xdp_umem_unpin_pages(umem);
-out_pgs:
-   kfree(umem->pgs);
-   umem->pgs = NULL;
-   return err;
 }
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
-- 
2.18.4



[PATCH 1/1] xdp: avoid calling kfree twice

2020-12-07 Thread Zhu Yanjun
From: Zhu Yanjun 

In the function xdp_umem_pin_pages, if npgs != umem->npgs and
npgs >= 0, the function xdp_umem_unpin_pages is called. In this
function, kfree is called to handle umem->pgs, and then in the
function xdp_umem_pin_pages, kfree is called again to handle
umem->pgs. Eventually, umem->pgs is freed twice.

Signed-off-by: Zhu Yanjun 
---
 net/xdp/xdp_umem.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 56a28a686988..ff5173f72920 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -97,7 +97,6 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
 {
unsigned int gup_flags = FOLL_WRITE;
long npgs;
-   int err;
 
umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
GFP_KERNEL | __GFP_NOWARN);
@@ -112,20 +111,14 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
if (npgs != umem->npgs) {
if (npgs >= 0) {
umem->npgs = npgs;
-   err = -ENOMEM;
-   goto out_pin;
+   xdp_umem_unpin_pages(umem);
+   return -ENOMEM;
}
-   err = npgs;
-   goto out_pgs;
+   kfree(umem->pgs);
+   umem->pgs = NULL;
+   return npgs;
}
return 0;
-
-out_pin:
-   xdp_umem_unpin_pages(umem);
-out_pgs:
-   kfree(umem->pgs);
-   umem->pgs = NULL;
-   return err;
 }
 
 static int xdp_umem_account_pages(struct xdp_umem *umem)
-- 
2.18.4



[PATCH v5 1/1] xdp: remove the functions xsk_map_inc and xsk_map_put

2020-11-26 Thread Zhu Yanjun
From: Zhu Yanjun 

The functions xsk_map_put and xsk_map_inc are simple wrappers.
As such, replace them with bpf_map_inc and bpf_map_put and
remove some test code.

Fixes: d20a1676df7e ("xsk: Move xskmap.c to net/xdp/")
Signed-off-by: Zhu Yanjun 
---
 net/xdp/xsk.c|  4 ++--
 net/xdp/xsk.h|  2 --
 net/xdp/xskmap.c | 20 ++--
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec3989a76..4f0250f5d676 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -548,7 +548,7 @@ static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
node);
if (node) {
-   WARN_ON(xsk_map_inc(node->map));
+   bpf_map_inc(&node->map->map);
map = node->map;
*map_entry = node->map_entry;
}
@@ -578,7 +578,7 @@ static void xsk_delete_from_maps(struct xdp_sock *xs)
 
while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
xsk_map_try_sock_delete(map, xs, map_entry);
-   xsk_map_put(map);
+   bpf_map_put(&map->map);
}
 }
 
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index b9e896cee5bb..edcf249ad1f1 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,8 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 struct xdp_sock **map_entry);
-int xsk_map_inc(struct xsk_map *map);
-void xsk_map_put(struct xsk_map *map);
 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
u16 queue_id);
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 49da2b8ace8b..66231ba6c348 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -11,32 +11,16 @@
 
 #include "xsk.h"
 
-int xsk_map_inc(struct xsk_map *map)
-{
-   bpf_map_inc(&map->map);
-   return 0;
-}
-
-void xsk_map_put(struct xsk_map *map)
-{
-   bpf_map_put(&map->map);
-}
-
 static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
   struct xdp_sock **map_entry)
 {
struct xsk_map_node *node;
-   int err;
 
node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
if (!node)
return ERR_PTR(-ENOMEM);
 
-   err = xsk_map_inc(map);
-   if (err) {
-   kfree(node);
-   return ERR_PTR(err);
-   }
+   bpf_map_inc(&map->map);
 
node->map = map;
node->map_entry = map_entry;
@@ -45,7 +29,7 @@ static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
 
 static void xsk_map_node_free(struct xsk_map_node *node)
 {
-   xsk_map_put(node->map);
+   bpf_map_put(&node->map->map);
kfree(node);
 }
 
-- 
2.25.1



[PATCH v4 1/1] xdp: remove the function xsk_map_inc

2020-11-26 Thread Zhu Yanjun
From: Zhu Yanjun 

The functions xsk_map_put and xsk_map_inc are simple wrappers.
As such, replace them with bpf_map_inc and bpf_map_put and
remove some test code.

Fixes: d20a1676df7e ("xsk: Move xskmap.c to net/xdp/")
Signed-off-by: Zhu Yanjun 
---
 net/xdp/xsk.c|  4 ++--
 net/xdp/xsk.h|  2 --
 net/xdp/xskmap.c | 20 ++--
 3 files changed, 4 insertions(+), 22 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec3989a76..4f0250f5d676 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -548,7 +548,7 @@ static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
node);
if (node) {
-   WARN_ON(xsk_map_inc(node->map));
+   bpf_map_inc(&node->map->map);
map = node->map;
*map_entry = node->map_entry;
}
@@ -578,7 +578,7 @@ static void xsk_delete_from_maps(struct xdp_sock *xs)
 
while ((map = xsk_get_map_list_entry(xs, &map_entry))) {
xsk_map_try_sock_delete(map, xs, map_entry);
-   xsk_map_put(map);
+   bpf_map_put(&map->map);
}
 }
 
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index b9e896cee5bb..edcf249ad1f1 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,8 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 struct xdp_sock **map_entry);
-int xsk_map_inc(struct xsk_map *map);
-void xsk_map_put(struct xsk_map *map);
 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
u16 queue_id);
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 49da2b8ace8b..66231ba6c348 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -11,32 +11,16 @@
 
 #include "xsk.h"
 
-int xsk_map_inc(struct xsk_map *map)
-{
-   bpf_map_inc(&map->map);
-   return 0;
-}
-
-void xsk_map_put(struct xsk_map *map)
-{
-   bpf_map_put(&map->map);
-}
-
 static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
   struct xdp_sock **map_entry)
 {
struct xsk_map_node *node;
-   int err;
 
node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
if (!node)
return ERR_PTR(-ENOMEM);
 
-   err = xsk_map_inc(map);
-   if (err) {
-   kfree(node);
-   return ERR_PTR(err);
-   }
+   bpf_map_inc(&map->map);
 
node->map = map;
node->map_entry = map_entry;
@@ -45,7 +29,7 @@ static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
 
 static void xsk_map_node_free(struct xsk_map_node *node)
 {
-   xsk_map_put(node->map);
+   bpf_map_put(&node->map->map);
kfree(node);
 }
 
-- 
2.25.1



Re: [PATCH v3 1/1] xdp: remove the function xsk_map_inc

2020-11-26 Thread Zhu Yanjun
On Wed, Nov 25, 2020 at 4:33 PM Magnus Karlsson
 wrote:
>
> On Wed, Nov 25, 2020 at 1:02 AM Daniel Borkmann  wrote:
> >
> > On 11/23/20 4:05 PM, Zhu Yanjun wrote:
> > > From: Zhu Yanjun 
> > >
> > > The function xsk_map_inc is a simple wrapper of bpf_map_inc and
> > > always returns zero. As such, replacing this function with bpf_map_inc
> > > and removing the test code.
> > >
> > > Signed-off-by: Zhu Yanjun 
> > > ---
> > >   net/xdp/xsk.c|  2 +-
> > >   net/xdp/xsk.h|  1 -
> > >   net/xdp/xskmap.c | 13 +
> > >   3 files changed, 2 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index cfbec3989a76..a3c1f07d77d8 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -548,7 +548,7 @@ static struct xsk_map *xsk_get_map_list_entry(struct 
> > > xdp_sock *xs,
> > >   node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> > >   node);
> > >   if (node) {
> > > - WARN_ON(xsk_map_inc(node->map));
> > > + bpf_map_inc(&node->map->map);
> > >   map = node->map;
> > >   *map_entry = node->map_entry;
> > >   }
> > > diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
> > > index b9e896cee5bb..0aad25c0e223 100644
> > > --- a/net/xdp/xsk.h
> > > +++ b/net/xdp/xsk.h
> > > @@ -41,7 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
> > >
> > >   void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> > >struct xdp_sock **map_entry);
> > > -int xsk_map_inc(struct xsk_map *map);
> > >   void xsk_map_put(struct xsk_map *map);
> > >   void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
> > >   int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool 
> > > *pool,
> > > diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
> > > index 49da2b8ace8b..6b7e9a72b101 100644
> > > --- a/net/xdp/xskmap.c
> > > +++ b/net/xdp/xskmap.c
> > > @@ -11,12 +11,6 @@
> > >
> > >   #include "xsk.h"
> > >
> > > -int xsk_map_inc(struct xsk_map *map)
> > > -{
> > > - bpf_map_inc(&map->map);
> > > - return 0;
> > > -}
> > > -
> > >   void xsk_map_put(struct xsk_map *map)
> > >   {
> >
> > So, the xsk_map_put() is defined as:
> >
> >void xsk_map_put(struct xsk_map *map)
> >{
> >  bpf_map_put(&map->map);
> >}
> >
> > What is the reason to get rid of xsk_map_inc() but not xsk_map_put() 
> > wrapper?
> > Can't we just remove both while we're at it?
>
> Yes, why not. Makes sense.
>
> Yanjun, could you please send a new version that removes this too?

OK. I will.

Zhu Yanjun

>
> Thank you both!
>
> > Thanks,
> > Daniel


[PATCH v3 1/1] xdp: remove the function xsk_map_inc

2020-11-23 Thread Zhu Yanjun
From: Zhu Yanjun 

The function xsk_map_inc is a simple wrapper of bpf_map_inc and
always returns zero. As such, replace it with bpf_map_inc and
remove the test code.

Signed-off-by: Zhu Yanjun 
---
 net/xdp/xsk.c|  2 +-
 net/xdp/xsk.h|  1 -
 net/xdp/xskmap.c | 13 +
 3 files changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec3989a76..a3c1f07d77d8 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -548,7 +548,7 @@ static struct xsk_map *xsk_get_map_list_entry(struct xdp_sock *xs,
node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
node);
if (node) {
-   WARN_ON(xsk_map_inc(node->map));
+   bpf_map_inc(&node->map->map);
map = node->map;
*map_entry = node->map_entry;
}
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index b9e896cee5bb..0aad25c0e223 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,7 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 struct xdp_sock **map_entry);
-int xsk_map_inc(struct xsk_map *map);
 void xsk_map_put(struct xsk_map *map);
 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 49da2b8ace8b..6b7e9a72b101 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -11,12 +11,6 @@
 
 #include "xsk.h"
 
-int xsk_map_inc(struct xsk_map *map)
-{
-   bpf_map_inc(&map->map);
-   return 0;
-}
-
 void xsk_map_put(struct xsk_map *map)
 {
bpf_map_put(&map->map);
@@ -26,17 +20,12 @@ static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
   struct xdp_sock **map_entry)
 {
struct xsk_map_node *node;
-   int err;
 
node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
if (!node)
return ERR_PTR(-ENOMEM);
 
-   err = xsk_map_inc(map);
-   if (err) {
-   kfree(node);
-   return ERR_PTR(err);
-   }
+   bpf_map_inc(&map->map);
 
node->map = map;
node->map_entry = map_entry;
-- 
2.25.1



Re: [PATCH v3 1/1] xdp: remove the function xsk_map_inc

2020-11-23 Thread Zhu Yanjun
On Mon, Nov 23, 2020 at 10:27 PM  wrote:
>
> From: Zhu Yanjun 
>
> The function xsk_map_inc is a simple wrapper of bpf_map_inc and
> always returns zero. As such, replacing this function with bpf_map_inc
> and removing the test code.
>
> Signed-off-by: Zhu Yanjun 
> ---
>  net/xdp/xsk.c|  2 +-
>  net/xdp/xsk.h|  1 -
>  net/xdp/xskmap.c | 13 +
>  3 files changed, 2 insertions(+), 14 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cfbec3989a76..a3c1f07d77d8 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -548,7 +548,7 @@ static struct xsk_map *xsk_get_map_list_entry(struct 
> xdp_sock *xs,
> node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> node);
> if (node) {
> -   WARN_ON(xsk_map_inc(node->map));
> +       bpf_map_inc(&node->map->map);

Thanks. This is the latest version.

Zhu Yanjun

> map = node->map;
> *map_entry = node->map_entry;
> }
> diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
> index b9e896cee5bb..0aad25c0e223 100644
> --- a/net/xdp/xsk.h
> +++ b/net/xdp/xsk.h
> @@ -41,7 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
>
>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>  struct xdp_sock **map_entry);
> -int xsk_map_inc(struct xsk_map *map);
>  void xsk_map_put(struct xsk_map *map);
>  void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
>  int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
> diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
> index 49da2b8ace8b..6b7e9a72b101 100644
> --- a/net/xdp/xskmap.c
> +++ b/net/xdp/xskmap.c
> @@ -11,12 +11,6 @@
>
>  #include "xsk.h"
>
> -int xsk_map_inc(struct xsk_map *map)
> -{
> -   bpf_map_inc(&map->map);
> -   return 0;
> -}
> -
>  void xsk_map_put(struct xsk_map *map)
>  {
> bpf_map_put(&map->map);
> @@ -26,17 +20,12 @@ static struct xsk_map_node *xsk_map_node_alloc(struct 
> xsk_map *map,
>struct xdp_sock **map_entry)
>  {
> struct xsk_map_node *node;
> -   int err;
>
> node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
> if (!node)
> return ERR_PTR(-ENOMEM);
>
> -   err = xsk_map_inc(map);
> -   if (err) {
> -   kfree(node);
> -   return ERR_PTR(err);
> -   }
> +   bpf_map_inc(&map->map);
>
> node->map = map;
> node->map_entry = map_entry;
> --
> 2.25.1
>


Re: [PATCHv2 1/1] xdp: remove the function xsk_map_inc

2020-11-23 Thread Zhu Yanjun
On Mon, Nov 23, 2020 at 8:19 PM Magnus Karlsson
 wrote:
>
> On Mon, Nov 23, 2020 at 1:11 PM Zhu Yanjun  wrote:
> >
> > On Mon, Nov 23, 2020 at 8:05 PM  wrote:
> > >
> > > From: Zhu Yanjun 
> > >
> > > The function xsk_map_inc is a simple wrapper of bpf_map_inc and
> > > always returns zero. As such, replacing this function with bpf_map_inc
> > > and removing the test code.
> > >
> > > Signed-off-by: Zhu Yanjun 
> >
> >
> > > ---
> > >  net/xdp/xsk.c|  1 -
> > >  net/xdp/xsk.h|  1 -
> > >  net/xdp/xskmap.c | 13 +
> > >  3 files changed, 1 insertion(+), 14 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index cfbec3989a76..c1b8a888591c 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -548,7 +548,6 @@ static struct xsk_map *xsk_get_map_list_entry(struct 
> > > xdp_sock *xs,
> > > node = list_first_entry_or_null(&xs->map_list, struct 
> > > xsk_map_node,
> > > node);
> > > if (node) {
> > > -   WARN_ON(xsk_map_inc(node->map));
>
> This should be bpf_map_inc(&node->map->map); Think you forgot to
> convert this one.

In include/linux/bpf.h:
"
...
1213 void bpf_map_inc(struct bpf_map *map);
...
"

Zhu Yanjun
>
> > > map = node->map;
> > > *map_entry = node->map_entry;
> > > }
> > > diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
> > > index b9e896cee5bb..0aad25c0e223 100644
> > > --- a/net/xdp/xsk.h
> > > +++ b/net/xdp/xsk.h
> > > @@ -41,7 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
> > >
> > >  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
> > >  struct xdp_sock **map_entry);
> > > -int xsk_map_inc(struct xsk_map *map);
> > >  void xsk_map_put(struct xsk_map *map);
> > >  void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
> > >  int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool 
> > > *pool,
> > > diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
> > > index 49da2b8ace8b..6b7e9a72b101 100644
> > > --- a/net/xdp/xskmap.c
> > > +++ b/net/xdp/xskmap.c
> > > @@ -11,12 +11,6 @@
> > >
> > >  #include "xsk.h"
> > >
> > > -int xsk_map_inc(struct xsk_map *map)
> > > -{
> > > -   bpf_map_inc(&map->map);
> > > -   return 0;
> > > -}
> >
> > Hi, Magnus
> >
> > The function xsk_map_inc is replaced with bpf_map_inc.
> >
> > Zhu Yanjun
> >
> > > -
> > >  void xsk_map_put(struct xsk_map *map)
> > >  {
> > > bpf_map_put(&map->map);
> > > @@ -26,17 +20,12 @@ static struct xsk_map_node *xsk_map_node_alloc(struct 
> > > xsk_map *map,
> > >struct xdp_sock 
> > > **map_entry)
> > >  {
> > > struct xsk_map_node *node;
> > > -   int err;
> > >
> > > node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
> > > if (!node)
> > > return ERR_PTR(-ENOMEM);
> > >
> > > -   err = xsk_map_inc(map);
> > > -   if (err) {
> > > -   kfree(node);
> > > -   return ERR_PTR(err);
> > > -   }
> > > +   bpf_map_inc(&map->map);
> > >
> > > node->map = map;
> > > node->map_entry = map_entry;
> > > --
> > > 2.25.1
> > >


Re: [PATCHv2 1/1] xdp: remove the function xsk_map_inc

2020-11-23 Thread Zhu Yanjun
On Mon, Nov 23, 2020 at 8:05 PM  wrote:
>
> From: Zhu Yanjun 
>
> The function xsk_map_inc is a simple wrapper of bpf_map_inc and
> always returns zero. As such, replacing this function with bpf_map_inc
> and removing the test code.
>
> Signed-off-by: Zhu Yanjun 


> ---
>  net/xdp/xsk.c|  1 -
>  net/xdp/xsk.h|  1 -
>  net/xdp/xskmap.c | 13 +
>  3 files changed, 1 insertion(+), 14 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cfbec3989a76..c1b8a888591c 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -548,7 +548,6 @@ static struct xsk_map *xsk_get_map_list_entry(struct 
> xdp_sock *xs,
> node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
> node);
> if (node) {
> -   WARN_ON(xsk_map_inc(node->map));
> map = node->map;
> *map_entry = node->map_entry;
> }
> diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
> index b9e896cee5bb..0aad25c0e223 100644
> --- a/net/xdp/xsk.h
> +++ b/net/xdp/xsk.h
> @@ -41,7 +41,6 @@ static inline struct xdp_sock *xdp_sk(struct sock *sk)
>
>  void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
>  struct xdp_sock **map_entry);
> -int xsk_map_inc(struct xsk_map *map);
>  void xsk_map_put(struct xsk_map *map);
>  void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
>  int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
> diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
> index 49da2b8ace8b..6b7e9a72b101 100644
> --- a/net/xdp/xskmap.c
> +++ b/net/xdp/xskmap.c
> @@ -11,12 +11,6 @@
>
>  #include "xsk.h"
>
> -int xsk_map_inc(struct xsk_map *map)
> -{
> -   bpf_map_inc(&map->map);
> -   return 0;
> -}

Hi, Magnus

The function xsk_map_inc is replaced with bpf_map_inc.

Zhu Yanjun

> -
>  void xsk_map_put(struct xsk_map *map)
>  {
> bpf_map_put(&map->map);
> @@ -26,17 +20,12 @@ static struct xsk_map_node *xsk_map_node_alloc(struct 
> xsk_map *map,
>struct xdp_sock **map_entry)
>  {
> struct xsk_map_node *node;
> -   int err;
>
> node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
> if (!node)
> return ERR_PTR(-ENOMEM);
>
> -   err = xsk_map_inc(map);
> -   if (err) {
> -   kfree(node);
> -   return ERR_PTR(err);
> -   }
> +   bpf_map_inc(&map->map);
>
> node->map = map;
> node->map_entry = map_entry;
> --
> 2.25.1
>


[PATCH 1/1] xdp: compact the function xsk_map_inc

2020-11-22 Thread Zhu Yanjun
From: Zhu Yanjun 

The function xsk_map_inc always returns zero. As such, change the
return type to void and remove the test code.

Signed-off-by: Zhu Yanjun 
---
 net/xdp/xsk.c|1 -
 net/xdp/xsk.h|2 +-
 net/xdp/xskmap.c |   10 ++
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec39..c1b8a88 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -548,7 +548,6 @@ static void xsk_unbind_dev(struct xdp_sock *xs)
node = list_first_entry_or_null(&xs->map_list, struct xsk_map_node,
node);
if (node) {
-   WARN_ON(xsk_map_inc(node->map));
map = node->map;
*map_entry = node->map_entry;
}
diff --git a/net/xdp/xsk.h b/net/xdp/xsk.h
index b9e896c..766b9e2 100644
--- a/net/xdp/xsk.h
+++ b/net/xdp/xsk.h
@@ -41,7 +41,7 @@ struct xsk_map_node {
 
 void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
 struct xdp_sock **map_entry);
-int xsk_map_inc(struct xsk_map *map);
+void xsk_map_inc(struct xsk_map *map);
 void xsk_map_put(struct xsk_map *map);
 void xsk_clear_pool_at_qid(struct net_device *dev, u16 queue_id);
 int xsk_reg_pool_at_qid(struct net_device *dev, struct xsk_buff_pool *pool,
diff --git a/net/xdp/xskmap.c b/net/xdp/xskmap.c
index 49da2b8..c7dd94a 100644
--- a/net/xdp/xskmap.c
+++ b/net/xdp/xskmap.c
@@ -11,10 +11,9 @@
 
 #include "xsk.h"
 
-int xsk_map_inc(struct xsk_map *map)
+void xsk_map_inc(struct xsk_map *map)
 {
bpf_map_inc(&map->map);
-   return 0;
 }
 
 void xsk_map_put(struct xsk_map *map)
@@ -26,17 +25,12 @@ void xsk_map_put(struct xsk_map *map)
   struct xdp_sock **map_entry)
 {
struct xsk_map_node *node;
-   int err;
 
node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
if (!node)
return ERR_PTR(-ENOMEM);
 
-   err = xsk_map_inc(map);
-   if (err) {
-   kfree(node);
-   return ERR_PTR(err);
-   }
+   xsk_map_inc(map);
 
node->map = map;
node->map_entry = map_entry;
-- 
1.7.1



Re: [PATCH 1/1] RDMA/rxe: Fetch skb packets from ethernet layer

2020-11-11 Thread Zhu Yanjun
On Tue, Nov 10, 2020 at 9:58 AM Zhu Yanjun  wrote:
>
> On Tue, Nov 10, 2020 at 2:25 AM Jakub Kicinski  wrote:
> >
> > On Sun, 8 Nov 2020 13:27:32 +0800 Zhu Yanjun wrote:
> > > On Sun, Nov 8, 2020 at 1:24 PM Zhu Yanjun  wrote:
> > > > On Thu, 5 Nov 2020 19:12:01 +0800 Zhu Yanjun wrote:
> > > >
> > > > In the original design, in rx, skb packet would pass ethernet
> > > > layer and IP layer, eventually reach udp tunnel.
> > > >
> > > > Now rxe fetches the skb packets from the ethernet layer directly.
> > > > So this bypasses the IP and UDP layer. As such, the skb packets
> > > > are sent to the upper protocols directly from the ethernet layer.
> > > >
> > > > This increases bandwidth and decreases latency.
> > > >
> > > > Signed-off-by: Zhu Yanjun 
> > > >
> > > >
> > > > Nope, no stealing UDP packets with some random rx handlers.
> > >
> > > Why? Is there any risks?
> >
> > Are there risks in layering violations? Yes.
> >
> > For example - you do absolutely no protocol parsing,
>
> Protocol parsing is in rxe driver.
>
> > checksum validation, only support IPv4, etc.
>
> Since only ipv4 is supported in rxe, if ipv6 is supported in rxe, I
> will add ipv6.
>
> >
> > Besides it also makes the code far less maintainable, rx_handler is a
>
> This rx_handler is also used in openvswitch and bridge.

I am on vacation. I will reply as soon as I am back.

Zhu Yanjun

>
> Zhu Yanjun
>
> > singleton, etc. etc.
> >
> > > > The tunnel socket is a correct approach.


Re: [PATCH 1/1] RDMA/rxe: Fetch skb packets from ethernet layer

2020-11-09 Thread Zhu Yanjun
On Tue, Nov 10, 2020 at 2:25 AM Jakub Kicinski  wrote:
>
> On Sun, 8 Nov 2020 13:27:32 +0800 Zhu Yanjun wrote:
> > On Sun, Nov 8, 2020 at 1:24 PM Zhu Yanjun  wrote:
> > > On Thu, 5 Nov 2020 19:12:01 +0800 Zhu Yanjun wrote:
> > >
> > > In the original design, in rx, skb packet would pass ethernet
> > > layer and IP layer, eventually reach udp tunnel.
> > >
> > > Now rxe fetches the skb packets from the ethernet layer directly.
> > > So this bypasses the IP and UDP layer. As such, the skb packets
> > > are sent to the upper protocols directly from the ethernet layer.
> > >
> > > This increases bandwidth and decreases latency.
> > >
> > > Signed-off-by: Zhu Yanjun 
> > >
> > >
> > > Nope, no stealing UDP packets with some random rx handlers.
> >
> > Why? Is there any risks?
>
> Are there risks in layering violations? Yes.
>
> For example - you do absolutely no protocol parsing,

Protocol parsing is done in the rxe driver.

> checksum validation, only support IPv4, etc.

Currently rxe only supports IPv4; if rxe gains IPv6 support, I
will add IPv6 handling here.

>
> Besides it also makes the code far less maintainable, rx_handler is a

This rx_handler is also used in openvswitch and bridge.

Zhu Yanjun

> singleton, etc. etc.
>
> > > The tunnel socket is a correct approach.


Re: [PATCH 1/1] RDMA/rxe: Fetch skb packets from ethernet layer

2020-11-07 Thread Zhu Yanjun
On Sun, Nov 8, 2020 at 1:24 PM Zhu Yanjun  wrote:
>
>
>
>
>  Forwarded Message 
> Subject: Re: [PATCH 1/1] RDMA/rxe: Fetch skb packets from ethernet layer
> Date: Sat, 7 Nov 2020 12:26:17 -0800
> From: Jakub Kicinski 
> To: Zhu Yanjun 
> CC: dledf...@redhat.com, j...@ziepe.ca, linux-r...@vger.kernel.org, 
> netdev@vger.kernel.org
>
>
> On Thu, 5 Nov 2020 19:12:01 +0800 Zhu Yanjun wrote:
>
> In the original design, in rx, skb packet would pass ethernet
> layer and IP layer, eventually reach udp tunnel.
>
> Now rxe fetches the skb packets from the ethernet layer directly.
> So this bypasses the IP and UDP layer. As such, the skb packets
> are sent to the upper protocols directly from the ethernet layer.
>
> This increases bandwidth and decreases latency.
>
> Signed-off-by: Zhu Yanjun 
>
>
> Nope, no stealing UDP packets with some random rx handlers.

Why? Are there any risks?

Zhu Yanjun
>
> The tunnel socket is a correct approach.


[PATCH 1/1] net/mlx5e: remove unnecessary memset

2020-11-06 Thread Zhu Yanjun
Since kvzalloc zero-initializes the allocated memory, it is not
necessary to initialize it again with memset.

Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
Signed-off-by: Zhu Yanjun 
---
 .../ethernet/mellanox/mlx5/core/eswitch_offloads.c |1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 1bcf260..35c5629 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1528,7 +1528,6 @@ static int esw_create_restore_table(struct mlx5_eswitch *esw)
goto out_free;
}
 
-   memset(flow_group_in, 0, inlen);
match_criteria = MLX5_ADDR_OF(create_flow_group_in, flow_group_in,
  match_criteria);
misc = MLX5_ADDR_OF(fte_match_param, match_criteria,
-- 
1.7.1
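
A short sketch (the helper name is illustrative) of why the memset()
is redundant: kvzalloc() is kvmalloc() with __GFP_ZERO, so the buffer
is already zero-filled on return.

#include <linux/mm.h>

static void *alloc_flow_group_in(size_t inlen)
{
	void *flow_group_in = kvzalloc(inlen, GFP_KERNEL);

	/* if non-NULL, all inlen bytes are already zero here */
	return flow_group_in;
}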



[PATCH 1/1] RDMA/rxe: Fetch skb packets from ethernet layer

2020-11-05 Thread Zhu Yanjun
In the original design, on rx an skb packet would pass through the
ethernet layer and the IP layer, eventually reaching the udp tunnel.

Now rxe fetches the skb packets from the ethernet layer directly,
bypassing the IP and UDP layers. As such, the skb packets
are sent to the upper protocols directly from the ethernet layer.

This increases bandwidth and decreases latency.

Signed-off-by: Zhu Yanjun 
---
 drivers/infiniband/sw/rxe/rxe_net.c |   45 ++-
 1 files changed, 44 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/sw/rxe/rxe_net.c b/drivers/infiniband/sw/rxe/rxe_net.c
index 2e490e5..8ea68b6 100644
--- a/drivers/infiniband/sw/rxe/rxe_net.c
+++ b/drivers/infiniband/sw/rxe/rxe_net.c
@@ -18,6 +18,7 @@
 #include "rxe_loc.h"
 
 static struct rxe_recv_sockets recv_sockets;
+static struct net_device *g_ndev;
 
 struct device *rxe_dma_device(struct rxe_dev *rxe)
 {
@@ -113,7 +114,7 @@ static int rxe_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
}
 
tnl_cfg.encap_type = 1;
-   tnl_cfg.encap_rcv = rxe_udp_encap_recv;
+   tnl_cfg.encap_rcv = NULL;
 
/* Setup UDP tunnel */
setup_udp_tunnel_sock(net, sock, &tnl_cfg);
@@ -357,6 +358,38 @@ struct sk_buff *rxe_init_packet(struct rxe_dev *rxe, struct rxe_av *av,
return rxe->ndev->name;
 }
 
+static rx_handler_result_t rxe_handle_frame(struct sk_buff **pskb)
+{
+   struct sk_buff *skb = *pskb;
+   struct iphdr *iph;
+   struct udphdr *udph;
+
+   if (unlikely(skb->pkt_type == PACKET_LOOPBACK))
+   return RX_HANDLER_PASS;
+
+   if (!is_valid_ether_addr(eth_hdr(skb)->h_source)) {
+   kfree(skb);
+   return RX_HANDLER_CONSUMED;
+   }
+
+   if (eth_hdr(skb)->h_proto != cpu_to_be16(ETH_P_IP))
+   return RX_HANDLER_PASS;
+
+   iph = ip_hdr(skb);
+
+   if (iph->protocol != IPPROTO_UDP)
+   return RX_HANDLER_PASS;
+
+   udph = udp_hdr(skb);
+
+   if (udph->dest != cpu_to_be16(ROCE_V2_UDP_DPORT))
+   return RX_HANDLER_PASS;
+
+   rxe_udp_encap_recv(NULL, skb);
+
+   return RX_HANDLER_CONSUMED;
+}
+
 int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
 {
int err;
@@ -367,6 +400,7 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
return -ENOMEM;
 
rxe->ndev = ndev;
+   g_ndev = ndev;
 
err = rxe_add(rxe, ndev->mtu, ibdev_name);
if (err) {
@@ -374,6 +408,12 @@ int rxe_net_add(const char *ibdev_name, struct net_device *ndev)
return err;
}
 
+   rtnl_lock();
+   err = netdev_rx_handler_register(ndev, rxe_handle_frame, rxe);
+   rtnl_unlock();
+   if (err)
+   return err;
+
return 0;
 }
 
@@ -498,6 +538,9 @@ static int rxe_net_ipv6_init(void)
 
 void rxe_net_exit(void)
 {
+   rtnl_lock();
+   netdev_rx_handler_unregister(g_ndev);
+   rtnl_unlock();
rxe_release_udp_tunnel(recv_sockets.sk6);
rxe_release_udp_tunnel(recv_sockets.sk4);
unregister_netdevice_notifier(&rxe_net_notifier);
-- 
1.7.1



Re: [PATCH 1/1] MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address

2020-08-16 Thread Zhu Yanjun
On Sun, Aug 16, 2020 at 3:45 PM Leon Romanovsky  wrote:
>
> On Sun, Aug 16, 2020 at 01:25:50PM +0800, Zhu Yanjun wrote:
> > I prefer to use this email address for kernel related work.
> >
> > Signed-off-by: Zhu Yanjun 
> > ---
> >  MAINTAINERS |2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
>
> It was already handled.
> https://lore.kernel.org/lkml/20200810091100.243932-1-l...@kernel.org/

Cool!

>
> Thanks


[PATCH 1/1] MAINTAINERS: SOFT-ROCE: Change Zhu Yanjun's email address

2020-08-15 Thread Zhu Yanjun
I prefer to use this email address for kernel-related work.

Signed-off-by: Zhu Yanjun 
---
 MAINTAINERS |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index e02479a..065225f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -15833,7 +15833,7 @@ F:  drivers/infiniband/sw/siw/
 F: include/uapi/rdma/siw-abi.h
 
 SOFT-ROCE DRIVER (rxe)
-M: Zhu Yanjun 
+M: Zhu Yanjun 
 L: linux-r...@vger.kernel.org
 S: Supported
 F: drivers/infiniband/sw/rxe/
-- 
1.7.1



Re: Bonding driver unexpected behaviour

2020-07-16 Thread Zhu Yanjun
On Thu, Jul 16, 2020 at 6:20 PM Riccardo Paolo Bestetti  wrote:
>
> Hello Zhu Yanjun,
>
> On Thursday, July 16, 2020 11:45 CEST, Zhu Yanjun  
> wrote:
>
> > On Thu, Jul 16, 2020 at 4:08 PM Riccardo Paolo Bestetti  
> > wrote:
> > >
> > >
> > >
> > > On Thursday, July 16, 2020 09:45 CEST, Zhu Yanjun  
> > > wrote:
> > > > You can use team to make tests.
> > > I'm not sure I understand what you mean. Could you point me to relevant 
> > > documentation?
> >
> > https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-comparison_of_network_teaming_to_bonding
> >
> > Use team instead of bonding to make tests.
> That seems like a Red Hat-specific feature. Unfortunately, I do not know Red 
> Hat.

It is just for testing.
The team driver does not belong to Red Hat,
and I am not a Red Hat employee either.

You can run tests with the team driver to find the root cause, then fix it.

IMHO, you can build the bonding and gretap drivers, run tests with
them to find out where the packets are dropped, and finally track down
the root cause.
This is the most direct method.

It is up to you how to find the root cause.

Zhu Yanjun
> Nor I would have the possibility of using Red Hat in production even if I 
> could get teaming to work instead of bonding.
>
> Riccardo P. Bestetti
>


Re: Bonding driver unexpected behaviour

2020-07-16 Thread Zhu Yanjun
On Thu, Jul 16, 2020 at 4:08 PM Riccardo Paolo Bestetti  wrote:
>
> Hello Zhu Yanjun,
>
> On Thursday, July 16, 2020 09:45 CEST, Zhu Yanjun  
> wrote:
> > You can use team to make tests.
> I'm not sure I understand what you mean. Could you point me to relevant 
> documentation?

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/networking_guide/sec-comparison_of_network_teaming_to_bonding

Use team instead of bonding for these tests.

>
> Riccardo P. Bestetti
>


Re: Bonding driver unexpected behaviour

2020-07-16 Thread Zhu Yanjun
On Thu, Jul 16, 2020 at 3:08 PM Riccardo Paolo Bestetti  wrote:
>
> Hello Zhu Yanjun,
>
> On Thursday, July 16, 2020 05:41 CEST, Zhu Yanjun  
> wrote:
>
> >
> > Please check this
> > https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels/#gre
> >
> > Perhaps gretap only forwards ip (with L2 header) packets.
>
> That does not seem to be the case.
> E.g.
> root@fo-exit:/home/user# tcpdump -i intra16
> tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
> listening on intra16, link-type EN10MB (Ethernet), capture size 262144 bytes
> 09:05:12.619206 IP 10.88.16.100 > 10.88.16.200: GREv0, length 46: ARP, 
> Request who-has 10.42.42.200 tell 10.42.42.100, length 28
> 09:05:12.619278 IP 10.88.16.200 > 10.88.16.100: GREv0, length 46: ARP, Reply 
> 10.42.42.200 is-at da:9d:34:64:cb:8d (oui Unknown), length 28
> 09:05:14.054026 IP 10.88.16.200 > 10.88.16.100: GREv0, length 46: ARP, 
> Request who-has 10.42.42.100 tell 10.42.42.200, length 28
> 09:05:14.107143 IP 10.88.16.100 > 10.88.16.200: GREv0, length 46: ARP, Reply 
> 10.42.42.100 is-at d6:49:e5:19:52:16 (oui Unknown), length 28

Interesting problem. You can run tests with the team driver.

Zhu Yanjun

> ^C
>
> >
> > Possibly "arp -s" could help to workaround this.
>
> Riccardo P. Bestetti
>


Re: Bonding driver unexpected behaviour

2020-07-15 Thread Zhu Yanjun
On Wed, Jul 15, 2020 at 8:49 PM p...@bestov.io  wrote:
>
> I'm attempting to set up the bonding driver on two gretap interfaces, 
> gretap15 and gretap16
> but I'm observing unexpected (to me) behaviour.
> The underlying interfaces for those two are respectively intra15 (ipv4: 
> 10.88.15.100/24) and
> intra16 (ipv4: 10.88.16.100/24). These two are e1000 virtual network cards, 
> connected through
> virtual cables. As such, I would exclude any hardware issues. As a peer, I 
> have another Linux
> system configured similarly (ipv4s: 10.88.15.200 on intra15, 10.88.16.200 on 
> intra16).
>
> The gretap tunnels work as expected. They have the following ipv4 addresses:
>   host   peer
> gretap15  10.188.15.100  10.188.15.200
> gretap16  10.188.16.100  10.188.16.200
>
> When not enslaved by the bond interface, I'm able to exchange packets in the 
> tunnel using the
> internal ip addresses.
>
> I then set up the bonding driver as follows:
> # ip link add bond-15-16 type bond
> # ip link set bond-15-16 type bond mode active-backup
> # ip link set gretap15 down
> # ip link set gretap16 down
> # ip link set gretap15 master bond-15-16
> # ip link set gretap16 master bond-15-16
> # ip link set bond-15-16 mtu 1462
> # ip addr add 10.42.42.100/24 dev bond-15-16
> # ip link set bond-15-16 type bond arp_interval 100 arp_ip_target 10.42.42.200
> # ip link set bond-15-16 up
>
> I do the same on the peer system, inverting the interface and ARP target IP 
> addresses.
>
> At this point, IP communication using the addresses on the bond interfaces 
> works as expected.
> E.g.
> # ping 10.24.24.200
> gets responses from the other peer.
> Using tcpdump on the other peer shows the GRE packets coming into intra15, 
> and identical ICMP
> packets coming through gretap15 and bond-15-16.
>
> If I then disconnect the (virtual) network cable of intra15, the bonding 
> driver switches to
> intra16, as the GRE tunnel can no longer pass packets. However, despite 
> having primary_reselect=0,
> when I reconnect the network cable of intra15, the driver doesn't switch back 
> to gretap15. In fact,
> it doesn't even attempt sending any probes through it.
>
> Fiddling with the cables (e.g. reconnecting intra15 and then disconnecting 
> intra16) and/or bringing
> the bond interface down and up usually results in the driver ping-ponging a 
> bit between gretap15
> and gretap16, before usually settling on gretap16 (but never on gretap15, it 
> seems). Or,
> sometimes, it results in the driver marking both slaves down and not doing 
> anything ever again
> until manual intervention (e.g. manually selecting a new active_slave, or 
> down -> up).
>
> Trying to ping the gretap15 address of the peer (10.188.15.200) from the host 
> while gretap16 is the
> active slave results in ARP traffic being temporarily exchanged on gretap15. 
> I'm not sure whether
> it originates from the bonding driver, as it seems like the generated 
> requests are the cartesian
> product of all address couples on the network segments of gretap15 and 
> bond-15-16 (e.g. who-has
> 10.188.15.100 tell 10.188.15.100, who-has 10.188.15.100 tell 10.188.15.200, 
> ..., who-hash
> 10.42.42.200 tell 10.42.42.200).

Please check this
https://developers.redhat.com/blog/2019/05/17/an-introduction-to-linux-virtual-interfaces-tunnels/#gre

Perhaps gretap only forwards IP packets (with the L2 header).

Possibly "arp -s" could help to work around this.

Zhu Yanjun
>
> uname -a:
> Linux fo-gw 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) 
> x86_64 GNU/Linux
> (same on peer system)
>
> Am I misunderstanding how the driver works? Have I made any mistakes in the 
> configuration?
>
> Best regards,
> Riccardo P. Bestetti
>


Re: [PATCH] net: forcedeth: add xmit_more support

2019-10-22 Thread Zhu Yanjun



On 2019/10/22 23:40, Jakub Kicinski wrote:

On Tue, 22 Oct 2019 13:32:35 +0800, Zhu Yanjun wrote:

On 2019/10/21 23:33, Jakub Kicinski wrote:

On Mon, 21 Oct 2019 17:56:06 +0800, Zhu Yanjun wrote:

On 2019/10/19 6:48, Jakub Kicinski wrote:

On Fri, 18 Oct 2019 06:01:25 -0400, Zhu Yanjun wrote:

This change adds support for xmit_more based on the igb commit 6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data") that
were made to igb to support this feature. The function netif_xmit_stopped
is called to check if transmit queue on device is currently unable to send
to determine if we must write the tail because we can add no further
buffers.
When normal packets and/or xmit_more packets fill up tx_desc, it is
necessary to trigger NIC tx reg.

Looks broken. You gotta make sure you check the kick on _every_ return
path. There are 4 return statements in each function, you only touched
2.

In nv_start_xmit,

[...]

The above are dma_mapping_error paths. It seems that triggering the
NIC HW xmit is not needed there.

So on a "tx_desc full" error, the NIC HW xmit is triggered; on a
dma_mapping_error, it is not.

That is why only 2 "return" paths are touched.

Imagine you have the following sequence of frames:

skbA  | xmit_more() == true
skbB  | xmit_more() == true
skbC  | xmit_more() == true
skbD  | xmit_more() == false

A, B, and C got queued successfully but the driver didn't kick the
queue because of xmit_more(). Now D gets dropped due to a DMA error.
Queue never gets kicked.

DMA errors are a complicated problem. We will delve into them later.

  From the above commit log, this commit is based on the igb commit
6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data").

It seems that the 2 commits did not consider the DMA errors that you
mentioned.

Then igb is buggy, too.


Then if the igb problem is fixed, I will follow. ;-)

Zhu Yanjun
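
A minimal sketch of the pattern discussed above (the mydrv_* helpers
are hypothetical; this is not the actual forcedeth code): every exit
path, including the drop on a DMA mapping error, funnels through the
kick check, so frames queued earlier under xmit_more() are not left
stranded.

static netdev_tx_t mydrv_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	netdev_tx_t ret = NETDEV_TX_OK;

	if (!mydrv_ring_has_room(dev)) {
		netif_stop_queue(dev);
		ret = NETDEV_TX_BUSY;		/* frame will be requeued */
		goto txkick;
	}

	if (mydrv_map_and_queue(dev, skb)) {	/* DMA mapping failed */
		dev_kfree_skb_any(skb);		/* drop this frame ... */
		goto txkick;			/* ... but still kick below */
	}

txkick:
	/* kick the HW when the stack has no more frames, or the queue stopped */
	if (netif_queue_stopped(dev) || !netdev_xmit_more())
		mydrv_ring_doorbell(dev);

	return ret;
}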





[PATCHv2 1/1] net: forcedeth: add xmit_more support

2019-10-22 Thread Zhu Yanjun
This change adds support for xmit_more based on the igb commit 6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data") that
were made to igb to support this feature. The function netif_xmit_stopped
is called to check if transmit queue on device is currently unable to send
to determine if we must write the tail because we can add no further
buffers.
When normal packets and/or xmit_more packets fill up tx_desc, it is
necessary to trigger NIC tx reg.

Tested:
  - pktgen (xmit_more packets) SMP x86_64 ->
Test command:
./pktgen_sample03_burst_single_flow.sh ... -b 8 -n 100
Test results:
Params:
...
burst: 8
...
Result: OK: 12194004(c12188996+d5007) usec, 101 (1500byte,0frags)
82007pps 984Mb/sec (984084000bps) errors: 0

  - iperf (normal packets) SMP x86_64 ->
Test command:
Server: iperf -s
Client: iperf -c serverip
Result:
TCP window size: 85.0 KByte (default)

[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec

CC: Joe Jin 
CC: JUNXIAO_BI 
Reported-and-tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
V1->V2: use the lower case label.
---
 drivers/net/ethernet/nvidia/forcedeth.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 05d2b47..e2bb0cd 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2225,6 +2225,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
struct nv_skb_map *prev_tx_ctx;
struct nv_skb_map *tmp_tx_ctx = NULL, *start_tx_ctx = NULL;
unsigned long flags;
+   netdev_tx_t ret = NETDEV_TX_OK;
 
/* add fragments to entries count */
for (i = 0; i < fragments; i++) {
@@ -2240,7 +2241,12 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
netif_stop_queue(dev);
np->tx_stop = 1;
spin_unlock_irqrestore(&np->lock, flags);
-   return NETDEV_TX_BUSY;
+
+   /* When normal packets and/or xmit_more packets fill up
+* tx_desc, it is necessary to trigger NIC tx reg.
+*/
+   ret = NETDEV_TX_BUSY;
+   goto txkick;
}
spin_unlock_irqrestore(&np->lock, flags);
 
@@ -2357,8 +2363,14 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
spin_unlock_irqrestore(&np->lock, flags);
 
-   writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + 
NvRegTxRxControl);
-   return NETDEV_TX_OK;
+txkick:
+   if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
+   u32 txrxctl_kick = NVREG_TXRXCTL_KICK | np->txrxctl_bits;
+
+   writel(txrxctl_kick, get_hwbase(dev) + NvRegTxRxControl);
+   }
+
+   return ret;
 }
 
 static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,
@@ -2381,6 +2393,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
struct nv_skb_map *start_tx_ctx = NULL;
struct nv_skb_map *tmp_tx_ctx = NULL;
unsigned long flags;
+   netdev_tx_t ret = NETDEV_TX_OK;
 
/* add fragments to entries count */
for (i = 0; i < fragments; i++) {
@@ -2396,7 +2409,13 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
netif_stop_queue(dev);
np->tx_stop = 1;
spin_unlock_irqrestore(&np->lock, flags);
-   return NETDEV_TX_BUSY;
+
+   /* When normal packets and/or xmit_more packets fill up
+* tx_desc, it is necessary to trigger NIC tx reg.
+*/
+   ret = NETDEV_TX_BUSY;
+
+   goto txkick;
}
spin_unlock_irqrestore(&np->lock, flags);
 
@@ -2542,8 +2561,14 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
 
spin_unlock_irqrestore(&np->lock, flags);
 
-   writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + 
NvRegTxRxControl);
-   return NETDEV_TX_OK;
+txkick:
+   if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
+   u32 txrxctl_kick = NVREG_TXRXCTL_KICK | np->txrxctl_bits;
+
+   writel(txrxctl_kick, get_hwbase(dev) + NvRegTxRxControl);
+   }
+
+   return ret;
 }
 
 static inline void nv_tx_flip_ownership(struct net_device *dev)
-- 
2.7.4



Re: [PATCH] net: forcedeth: add xmit_more support

2019-10-21 Thread Zhu Yanjun



On 2019/10/21 23:33, Jakub Kicinski wrote:

On Mon, 21 Oct 2019 17:56:06 +0800, Zhu Yanjun wrote:

On 2019/10/19 6:48, Jakub Kicinski wrote:

On Fri, 18 Oct 2019 06:01:25 -0400, Zhu Yanjun wrote:

This change adds support for xmit_more based on the igb commit 6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data") that
were made to igb to support this feature. The function netif_xmit_stopped
is called to check whether the device's transmit queue is currently unable
to send, to determine if we must write the tail because we can add no
further buffers.
When normal packets and/or xmit_more packets fill up the tx descriptors,
it is necessary to trigger the NIC tx register.

Looks broken. You gotta make sure you check the kick on _every_ return
path. There are 4 return statements in each function, you only touched
2.

In nv_start_xmit,

[...]

The above are dma_mapping_error paths. It seems that triggering the NIC
HW xmit is not needed there.

So when "tx_desc full" error, HW NIC xmit is triggerred. When
dma_mapping_error,

NIC HW xmit is not triggerred.

That is why only 2 "return" are touched.

Imagine you have the following sequence of frames:

skbA  | xmit_more() == true
skbB  | xmit_more() == true
skbC  | xmit_more() == true
skbD  | xmit_more() == false

A, B, and C got queued successfully but the driver didn't kick the
queue because of xmit_more(). Now D gets dropped due to a DMA error.
Queue never gets kicked.


DMA error is a complicated problem. We will delve into this problem later.

From the above commit log, this commit is based on the igb commit
6f19e12f6230 ("igb: flush when in xmit_more mode and under descriptor
pressure") and commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to
softnet data").

It seems that the 2 commits did not consider the DMA errors that you 
mentioned.





Also the labels should be lower case.

This patch passes checkpatch.pl. It seems that "not lower case" is not a
problem?

If you think it is a problem, please show me where it is defined.

Look at this driver and at other kernel code. Labels are lower case,
upper case is for constants and macros.


It sounds reasonable. I will send V2 to fix this problem. Thanks.

Zhu Yanjun





Re: [PATCH] net: forcedeth: add xmit_more support

2019-10-21 Thread Zhu Yanjun



On 2019/10/19 6:48, Jakub Kicinski wrote:

On Fri, 18 Oct 2019 06:01:25 -0400, Zhu Yanjun wrote:

This change adds support for xmit_more based on the igb commit 6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data") that
were made to igb to support this feature. The function netif_xmit_stopped
is called to check whether the device's transmit queue is currently unable
to send, to determine if we must write the tail because we can add no
further buffers.
When normal packets and/or xmit_more packets fill up the tx descriptors,
it is necessary to trigger the NIC tx register.

Looks broken. You gotta make sure you check the kick on _every_ return
path. There are 4 return statements in each function, you only touched
2.


In nv_start_xmit,

2240 if (unlikely(empty_slots <= entries)) {
2241 netif_stop_queue(dev);
2242 np->tx_stop = 1;
2243 spin_unlock_irqrestore(&np->lock, flags);
2244
2245 /* When normal packets and/or xmit_more packets fill up
2246  * tx_desc, it is necessary to trigger NIC tx reg.
2247  */
2248 ret = NETDEV_TX_BUSY;
2249 goto TXKICK;
2250 }
The above indicates tx_desc is full, it is necessary to trigger NIC HW xmit.

2261 if (unlikely(dma_mapping_error(&np->pci_dev->dev,
2262 np->put_tx_ctx->dma))) {
2263 /* on DMA mapping error - drop the packet */
2264 dev_kfree_skb_any(skb);
2265 u64_stats_update_begin(&np->swstats_tx_syncp);
2266 nv_txrx_stats_inc(stat_tx_dropped);
2267 u64_stats_update_end(&np->swstats_tx_syncp);
2268 return NETDEV_TX_OK;
2269 }

and

2300 if (unlikely(dma_mapping_error(&np->pci_dev->dev,
2301                                np->put_tx_ctx->dma))) {
2302
2303         /* Unwind the mapped fragments */
2304         do {
2305                 nv_unmap_txskb(np, start_tx_ctx);
2306                 if (unlikely(tmp_tx_ctx++ == np->last_tx_ctx))
2307                         tmp_tx_ctx = np->tx_skb;
2308         } while (tmp_tx_ctx != np->put_tx_ctx);
2309         dev_kfree_skb_any(skb);
2310         np->put_tx_ctx = start_tx_ctx;
2311         u64_stats_update_begin(&np->swstats_tx_syncp);
2312         nv_txrx_stats_inc(stat_tx_dropped);
2313         u64_stats_update_end(&np->swstats_tx_syncp);
2314         return NETDEV_TX_OK;
2315 }

The above are dma_mapping_error paths. It seems that triggering the NIC
HW xmit is not needed there.

So on a "tx_desc full" error, the NIC HW xmit is triggered. On a
dma_mapping_error, the NIC HW xmit is not triggered.

That is why only 2 "return" statements are touched.



Also the labels should be lower case.


This patch passes checkpatch.pl. It seems that "not lower case" is not a 
problem?


If you think it is a problem, please show me where it is defined.

Zhu Yanjun





[PATCH] net: forcedeth: add xmit_more support

2019-10-18 Thread Zhu Yanjun
This change adds support for xmit_more based on the igb commit 6f19e12f6230
("igb: flush when in xmit_more mode and under descriptor pressure") and
commit 6b16f9ee89b8 ("net: move skb->xmit_more hint to softnet data") that
were made to igb to support this feature. The function netif_xmit_stopped
is called to check whether the device's transmit queue is currently unable
to send, to determine if we must write the tail because we can add no
further buffers.
When normal packets and/or xmit_more packets fill up the tx descriptors,
it is necessary to trigger the NIC tx register.

Tested:
  - pktgen (xmit_more packets) SMP x86_64 ->
Test command:
./pktgen_sample03_burst_single_flow.sh ... -b 8 -n 100
Test results:
Params:
...
burst: 8
...
Result: OK: 12194004(c12188996+d5007) usec, 101 (1500byte,0frags)
82007pps 984Mb/sec (984084000bps) errors: 0

  - iperf (normal packets) SMP x86_64 ->
Test command:
Server: iperf -s
Client: iperf -c serverip
Result:
TCP window size: 85.0 KByte (default)

[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   942 Mbits/sec

CC: Joe Jin 
CC: JUNXIAO_BI 
Reported-and-tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 37 +++--
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 05d2b47..7417bac 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2225,6 +2225,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
struct nv_skb_map *prev_tx_ctx;
struct nv_skb_map *tmp_tx_ctx = NULL, *start_tx_ctx = NULL;
unsigned long flags;
+   netdev_tx_t ret = NETDEV_TX_OK;
 
/* add fragments to entries count */
for (i = 0; i < fragments; i++) {
@@ -2240,7 +2241,12 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
netif_stop_queue(dev);
np->tx_stop = 1;
spin_unlock_irqrestore(&np->lock, flags);
-   return NETDEV_TX_BUSY;
+
+   /* When normal packets and/or xmit_more packets fill up
+* tx_desc, it is necessary to trigger NIC tx reg.
+*/
+   ret = NETDEV_TX_BUSY;
+   goto TXKICK;
}
spin_unlock_irqrestore(&np->lock, flags);
 
@@ -2357,8 +2363,14 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
spin_unlock_irqrestore(&np->lock, flags);
 
-   writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + 
NvRegTxRxControl);
-   return NETDEV_TX_OK;
+TXKICK:
+   if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
+   u32 txrxctl_kick = NVREG_TXRXCTL_KICK | np->txrxctl_bits;
+
+   writel(txrxctl_kick, get_hwbase(dev) + NvRegTxRxControl);
+   }
+
+   return ret;
 }
 
 static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,
@@ -2381,6 +2393,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
struct nv_skb_map *start_tx_ctx = NULL;
struct nv_skb_map *tmp_tx_ctx = NULL;
unsigned long flags;
+   netdev_tx_t ret = NETDEV_TX_OK;
 
/* add fragments to entries count */
for (i = 0; i < fragments; i++) {
@@ -2396,7 +2409,13 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
netif_stop_queue(dev);
np->tx_stop = 1;
spin_unlock_irqrestore(&np->lock, flags);
-   return NETDEV_TX_BUSY;
+
+   /* When normal packets and/or xmit_more packets fill up
+* tx_desc, it is necessary to trigger NIC tx reg.
+*/
+   ret = NETDEV_TX_BUSY;
+
+   goto TXKICK;
}
spin_unlock_irqrestore(&np->lock, flags);
 
@@ -2542,8 +2561,14 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
 
spin_unlock_irqrestore(&np->lock, flags);
 
-   writel(NVREG_TXRXCTL_KICK|np->txrxctl_bits, get_hwbase(dev) + 
NvRegTxRxControl);
-   return NETDEV_TX_OK;
+TXKICK:
+   if (netif_queue_stopped(dev) || !netdev_xmit_more()) {
+   u32 txrxctl_kick = NVREG_TXRXCTL_KICK | np->txrxctl_bits;
+
+   writel(txrxctl_kick, get_hwbase(dev) + NvRegTxRxControl);
+   }
+
+   return ret;
 }
 
 static inline void nv_tx_flip_ownership(struct net_device *dev)
-- 
2.7.4



[PATCHv3 0/1] Fix deadlock problem and make performance better

2019-09-05 Thread Zhu Yanjun
When running at about 1 Gbit/sec for a very long time, running ifconfig
and netstat causes a deadlock. These symptoms are similar to those in
commit 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics"). After
replacing the network device statistics with per-cpu 64-bit statistics,
the deadlocks disappear even after running for a very long time at
1 Gbit/sec.

V2->V3:
Based on David's advice, "Never use the inline keyword in foo.c files,
let the compiler decide.".

The inline keyword is removed from the functions nv_get_stats and
rx_missing_handler.

V1->V2:
Based on Eric's advice, "If the loops are ever restarted, the
storage->fields will have been modified multiple times.".

A similar change in the commit 5f6b4e14cada ("net: dsa: User per-cpu
64-bit statistics") is borrowed to fix the above problem.

Zhu Yanjun (1):
  forcedeth: use per cpu to collect xmit/recv statistics

 drivers/net/ethernet/nvidia/forcedeth.c | 143 ++--
 1 file changed, 99 insertions(+), 44 deletions(-)

-- 
2.7.4



[PATCHv3 1/1] forcedeth: use per cpu to collect xmit/recv statistics

2019-09-05 Thread Zhu Yanjun
When testing with a background iperf pushing 1Gbit/sec traffic and running
both ifconfig and netstat to collect statistics, some deadlocks occurred.

Ifconfig and netstat will call nv_get_stats64 to get software xmit/recv
statistics. In commit f5d827aece36 ("forcedeth: implement
ndo_get_stats64() API"), normal tx/rx variables are used to collect
tx/rx statistics. The fix is to replace the normal tx/rx variables with
per-cpu 64-bit variables to collect xmit/recv statistics. The per-cpu
variables avoid the deadlocks and provide fast, efficient statistics
updates.

In nv_probe, the per cpu variable is initialized. In nv_remove, this
per cpu variable is freed.

In xmit/recv process, this per cpu variable will be updated.

In nv_get_stats64, this per cpu variable on each cpu is added up. Then
the driver can get xmit/recv packets statistics.
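
A minimal sketch of that aggregation (nv_get_stats() is the per-cpu
helper this patch adds; the surrounding function shape is an assumption,
and the real nv_get_stats64 also updates the hardware statistics):

	static void nv_get_stats64(struct net_device *dev,
				   struct rtnl_link_stats64 *storage)
	{
		struct fe_priv *np = netdev_priv(dev);
		int cpu;

		/* sum each cpu's consistent snapshot into *storage */
		for_each_possible_cpu(cpu)
			nv_get_stats(cpu, np, storage);
	}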

A test ran for several days with this commit; the deadlocks disappeared
and the performance is better.

Tested:
   - iperf SMP x86_64 ->
   Client connecting to 1.1.1.108, TCP port 5001
   TCP window size: 85.0 KByte (default)
   
   [  3] local 1.1.1.105 port 3 connected with 1.1.1.108 port 5001
   [ ID] Interval   Transfer Bandwidth
   [  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec

   ifconfig results:

   enp0s9 Link encap:Ethernet  HWaddr 00:21:28:6f:de:0f
  inet addr:1.1.1.105  Bcast:0.0.0.0  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:5774764531 errors:0 dropped:0 overruns:0 frame:0
  TX packets:633534193 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:7646159340904 (7.6 TB) TX bytes:11425340407722 (11.4 TB)

   netstat results:

   Kernel Interface table
   Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
   ...
   enp0s9 1500 0  5774764531 00 0  633534193  0  0  0 BMRU
   ...

Fixes: f5d827aece36 ("forcedeth: implement ndo_get_stats64() API")
CC: Joe Jin 
CC: JUNXIAO_BI 
Reported-and-tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
V2->V3: Following David's advice, fix the problem "Never use the inline
 keyword in foo.c files, let the compiler decide."
V1->V2: Following Eric's advice fix the problem "If the loops are ever
 restarted, the storage->fields will have been modified multiple
 times."
---
 drivers/net/ethernet/nvidia/forcedeth.c | 143 ++--
 1 file changed, 99 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..a6b4bfa 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -713,6 +713,21 @@ struct nv_skb_map {
struct nv_skb_map *next_tx_ctx;
 };
 
+struct nv_txrx_stats {
+   u64 stat_rx_packets;
+   u64 stat_rx_bytes; /* not always available in HW */
+   u64 stat_rx_missed_errors;
+   u64 stat_rx_dropped;
+   u64 stat_tx_packets; /* not always available in HW */
+   u64 stat_tx_bytes;
+   u64 stat_tx_dropped;
+};
+
+#define nv_txrx_stats_inc(member) \
+   __this_cpu_inc(np->txrx_stats->member)
+#define nv_txrx_stats_add(member, count) \
+   __this_cpu_add(np->txrx_stats->member, (count))
+
 /*
  * SMP locking:
  * All hardware access under netdev_priv(dev)->lock, except the performance
@@ -797,10 +812,7 @@ struct fe_priv {
 
/* RX software stats */
struct u64_stats_sync swstats_rx_syncp;
-   u64 stat_rx_packets;
-   u64 stat_rx_bytes; /* not always available in HW */
-   u64 stat_rx_missed_errors;
-   u64 stat_rx_dropped;
+   struct nv_txrx_stats __percpu *txrx_stats;
 
/* media detection workaround.
 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
@@ -826,9 +838,6 @@ struct fe_priv {
 
/* TX software stats */
struct u64_stats_sync swstats_tx_syncp;
-   u64 stat_tx_packets; /* not always available in HW */
-   u64 stat_tx_bytes;
-   u64 stat_tx_dropped;
 
/* msi/msi-x fields */
u32 msi_flags;
@@ -1721,6 +1730,39 @@ static void nv_update_stats(struct net_device *dev)
}
 }
 
+static void nv_get_stats(int cpu, struct fe_priv *np,
+struct rtnl_link_stats64 *storage)
+{
+   struct nv_txrx_stats *src = per_cpu_ptr(np->txrx_stats, cpu);
+   unsigned int syncp_start;
+   u64 rx_packets, rx_bytes, rx_dropped, rx_missed_errors;
+   u64 tx_packets, tx_bytes, tx_dropped;
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_rx_syncp);
+   rx_packets   = src->stat_rx_packets;
+   rx_bytes = src->stat_rx_bytes;
+   rx_dropped   = src->stat_rx_dropped;
+ 

Re: [PATCHv2 1/1] forcedeth: use per cpu to collect xmit/recv statistics

2019-09-04 Thread Zhu Yanjun



On 2019/9/5 6:22, David Miller wrote:

From: Zhu Yanjun 
Date: Sun,  1 Sep 2019 03:26:13 -0400


+static inline void nv_get_stats(int cpu, struct fe_priv *np,
+   struct rtnl_link_stats64 *storage)

  ...

+static inline void rx_missing_handler(u32 flags, struct fe_priv *np)
+{

Never use the inline keyword in foo.c files, let the compiler decide.


Thanks a lot for your advice. I will pay attention to the usage of
inline in the source code.

If you agree, I will send V3 about this soon.

Zhu Yanjun





Re: [PATCHv2 1/1] net: rds: add service level support in rds-info

2019-09-03 Thread Zhu Yanjun



On 2019/9/3 9:58, Gustavo A. R. Silva wrote:

Hi,

On 8/23/19 8:04 PM, Zhu Yanjun wrote:

[..]


diff --git a/net/rds/ib.c b/net/rds/ib.c
index ec05d91..45acab2 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -291,7 +291,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
void *buffer)
  {
struct rds_info_rdma_connection *iinfo = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
  
  	/* We will only ever look at IB transports */

if (conn->c_trans != &rds_ib_transport)
@@ -301,15 +301,16 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
  
  	iinfo->src_addr = conn->c_laddr.s6_addr32[3];

iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
-   iinfo->tos = conn->c_tos;
+   if (ic) {

Is this null-check actually necessary? (see related comments below...)


+   iinfo->tos = conn->c_tos;
+   iinfo->sl = ic->i_sl;
+   }
  
	memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
	memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
	if (rds_conn_state(conn) == RDS_CONN_UP) {
		struct rds_ib_device *rds_ibdev;

-		ic = conn->c_transport_data;
-
		rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo->src_gid,

Notice that *ic* is dereferenced here without null-checking it. More
comments below...


   (union ib_gid *)&iinfo->dst_gid);
  
@@ -329,7 +330,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection *conn,

 void *buffer)
  {
struct rds6_info_rdma_connection *iinfo6 = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
  
  	/* We will only ever look at IB transports */

if (conn->c_trans != &rds_ib_transport)
@@ -337,6 +338,10 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
  
  	iinfo6->src_addr = conn->c_laddr;

iinfo6->dst_addr = conn->c_faddr;
+   if (ic) {
+   iinfo6->tos = conn->c_tos;
+   iinfo6->sl = ic->i_sl;
+   }
  
	memset(&iinfo6->src_gid, 0, sizeof(iinfo6->src_gid));
	memset(&iinfo6->dst_gid, 0, sizeof(iinfo6->dst_gid));
@@ -344,7 +349,6 @@ static int rds6_ib_conn_info_visitor(struct rds_connection *conn,
	if (rds_conn_state(conn) == RDS_CONN_UP) {
		struct rds_ib_device *rds_ibdev;

-		ic = conn->c_transport_data;

		rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo6->src_gid,

Again, *ic* is being dereferenced here without a previous null-check.


Please check when "rds_conn_state(conn) == RDS_CONN_UP" is true.

Thanks a lot.

Zhu Yanjun




   (union ib_gid *)&iinfo6->dst_gid);
rds_ibdev = ic->rds_ibdev;


--
Gustavo



[PATCHv2 1/1] forcedeth: use per cpu to collect xmit/recv statistics

2019-09-01 Thread Zhu Yanjun
When testing with a background iperf pushing 1Gbit/sec traffic and running
both ifconfig and netstat to collect statistics, some deadlocks occurred.

Ifconfig and netstat will call nv_get_stats64 to get software xmit/recv
statistics. In commit f5d827aece36 ("forcedeth: implement
ndo_get_stats64() API"), normal tx/rx variables are used to collect
tx/rx statistics. The fix is to replace the normal tx/rx variables with
per-cpu 64-bit variables to collect xmit/recv statistics. The per-cpu
variables avoid the deadlocks and provide fast, efficient statistics
updates.

In nv_probe, the per cpu variable is initialized. In nv_remove, this
per cpu variable is freed.

In xmit/recv process, this per cpu variable will be updated.

In nv_get_stats64, this per cpu variable on each cpu is added up. Then
the driver can get xmit/recv packets statistics.

A test ran for several days with this commit; the deadlocks disappeared
and the performance is better.

Tested:
   - iperf SMP x86_64 ->
   Client connecting to 1.1.1.108, TCP port 5001
   TCP window size: 85.0 KByte (default)
   
   [  3] local 1.1.1.105 port 3 connected with 1.1.1.108 port 5001
   [ ID] Interval   Transfer Bandwidth
   [  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec

   ifconfig results:

   enp0s9 Link encap:Ethernet  HWaddr 00:21:28:6f:de:0f
  inet addr:1.1.1.105  Bcast:0.0.0.0  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:5774764531 errors:0 dropped:0 overruns:0 frame:0
  TX packets:633534193 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:7646159340904 (7.6 TB) TX bytes:11425340407722 (11.4 TB)

   netstat results:

   Kernel Interface table
   Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
   ...
   enp0s9 1500 0  5774764531 00 0  633534193  0  0  0 BMRU
   ...

Fixes: f5d827aece36 ("forcedeth: implement ndo_get_stats64() API")
CC: Joe Jin 
CC: JUNXIAO_BI 
Reported-and-tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
V1->V2: Following Eric's advice fix the problem "If the loops are ever
 restarted, the storage->fields will have been modified multiple
 times."
---
 drivers/net/ethernet/nvidia/forcedeth.c | 143 ++--
 1 file changed, 99 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..07dd017 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -713,6 +713,21 @@ struct nv_skb_map {
struct nv_skb_map *next_tx_ctx;
 };
 
+struct nv_txrx_stats {
+   u64 stat_rx_packets;
+   u64 stat_rx_bytes; /* not always available in HW */
+   u64 stat_rx_missed_errors;
+   u64 stat_rx_dropped;
+   u64 stat_tx_packets; /* not always available in HW */
+   u64 stat_tx_bytes;
+   u64 stat_tx_dropped;
+};
+
+#define nv_txrx_stats_inc(member) \
+   __this_cpu_inc(np->txrx_stats->member)
+#define nv_txrx_stats_add(member, count) \
+   __this_cpu_add(np->txrx_stats->member, (count))
+
 /*
  * SMP locking:
  * All hardware access under netdev_priv(dev)->lock, except the performance
@@ -797,10 +812,7 @@ struct fe_priv {
 
/* RX software stats */
struct u64_stats_sync swstats_rx_syncp;
-   u64 stat_rx_packets;
-   u64 stat_rx_bytes; /* not always available in HW */
-   u64 stat_rx_missed_errors;
-   u64 stat_rx_dropped;
+   struct nv_txrx_stats __percpu *txrx_stats;
 
/* media detection workaround.
 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
@@ -826,9 +838,6 @@ struct fe_priv {
 
/* TX software stats */
struct u64_stats_sync swstats_tx_syncp;
-   u64 stat_tx_packets; /* not always available in HW */
-   u64 stat_tx_bytes;
-   u64 stat_tx_dropped;
 
/* msi/msi-x fields */
u32 msi_flags;
@@ -1721,6 +1730,39 @@ static void nv_update_stats(struct net_device *dev)
}
 }
 
+static inline void nv_get_stats(int cpu, struct fe_priv *np,
+   struct rtnl_link_stats64 *storage)
+{
+   struct nv_txrx_stats *src = per_cpu_ptr(np->txrx_stats, cpu);
+   unsigned int syncp_start;
+   u64 rx_packets, rx_bytes, rx_dropped, rx_missed_errors;
+   u64 tx_packets, tx_bytes, tx_dropped;
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_rx_syncp);
+   rx_packets   = src->stat_rx_packets;
+   rx_bytes = src->stat_rx_bytes;
+   rx_dropped   = src->stat_rx_dropped;
+   rx_missed_errors = src->stat_rx_missed_errors;
+   } while (u64_stats_fetch_retry_irq(&np->swstats_rx_syn

[PATCHv2 0/1] Fix deadlock problem and make performance better

2019-09-01 Thread Zhu Yanjun
When running at about 1 Gbit/sec for a very long time, running ifconfig
and netstat causes a deadlock. These symptoms are similar to those in
commit 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics"). After
replacing the network device statistics with per-cpu 64-bit statistics,
the deadlocks disappear even after running for a very long time at
1 Gbit/sec.

Based on Eric's advice, "If the loops are ever restarted, the
storage->fields will have been modified multiple times.".

A similar change in the commit 5f6b4e14cada ("net: dsa: User per-cpu
64-bit statistics") is borrowed to fix the above problem.

Zhu Yanjun (1):
  forcedeth: use per cpu to collect xmit/recv statistics

 drivers/net/ethernet/nvidia/forcedeth.c | 143 ++--
 1 file changed, 99 insertions(+), 44 deletions(-)

-- 
2.7.4



Re: [PATCH 1/1] forcedeth: use per cpu to collect xmit/recv statistics

2019-08-30 Thread Zhu Yanjun



On 2019/8/30 17:32, Eric Dumazet wrote:


On 8/30/19 10:35 AM, Zhu Yanjun wrote:

When testing with a background iperf pushing 1Gbit/sec traffic and running
both ifconfig and netstat to collect statistics, some deadlocks occurred.


This is quite a heavy patch trying to fix a bug...

This is to use per-cpu variables, so perhaps the changes are big.


I suspect the root cause has nothing to do with stat
collection since on 64bit arches there is no additional synchronization.
This bug is similar to the one that the commit 5f6b4e14cada ("net: dsa: 
User per-cpu 64-bit statistics") tries to fix.

So a similar patch is used to fix this similar bug in forcedeth.

(u64_stats_update_begin(), u64_stats_update_end() are nops)

Sure. Exactly.



+static inline void nv_get_stats(int cpu, struct fe_priv *np,
+   struct rtnl_link_stats64 *storage)
+{
+   struct nv_txrx_stats *src = per_cpu_ptr(np->txrx_stats, cpu);
+   unsigned int syncp_start;
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_rx_syncp);
+   storage->rx_packets   += src->stat_rx_packets;
+   storage->rx_bytes += src->stat_rx_bytes;
+   storage->rx_dropped   += src->stat_rx_dropped;
+   storage->rx_missed_errors += src->stat_rx_missed_errors;
+   } while (u64_stats_fetch_retry_irq(&np->swstats_rx_syncp, syncp_start));
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_tx_syncp);
+   storage->tx_packets += src->stat_tx_packets;
+   storage->tx_bytes   += src->stat_tx_bytes;
+   storage->tx_dropped += src->stat_tx_dropped;
+   } while (u64_stats_fetch_retry_irq(&np->swstats_tx_syncp, syncp_start));
+}
+


This is buggy :
If the loops are ever restarted, the storage->fields will have
been modified multiple times.

Sure. Sorry, my bad.
A similar change was made in commit 5f6b4e14cada ("net: dsa: User per-cpu
64-bit statistics").

I will use the same approach.
I will send V2 soon.
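
The corrected pattern, as a minimal sketch of what V2 will do: read the
per-cpu values into locals inside the retry loop and accumulate them into
storage only once, after a consistent snapshot has been taken.

	unsigned int syncp_start;
	u64 rx_packets, rx_bytes;

	do {
		syncp_start = u64_stats_fetch_begin_irq(&np->swstats_rx_syncp);
		/* locals only inside the loop; a retry just overwrites them */
		rx_packets = src->stat_rx_packets;
		rx_bytes   = src->stat_rx_bytes;
	} while (u64_stats_fetch_retry_irq(&np->swstats_rx_syncp, syncp_start));

	/* accumulate exactly once per cpu */
	storage->rx_packets += rx_packets;
	storage->rx_bytes   += rx_bytes;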

Thanks a lot for your comments.

Zhu Yanjun






[PATCH 0/1] Fix deadlock problem and make performance better

2019-08-30 Thread Zhu Yanjun
When running at about 1 Gbit/sec for a very long time, running ifconfig
and netstat causes a deadlock. These symptoms are similar to those in
commit 5f6b4e14cada ("net: dsa: User per-cpu 64-bit statistics"). After
replacing the network device statistics with per-cpu 64-bit statistics,
the deadlocks disappear even after running for a very long time at
1 Gbit/sec.

Zhu Yanjun (1):
  forcedeth: use per cpu to collect xmit/recv statistics

 drivers/net/ethernet/nvidia/forcedeth.c | 132 +---
 1 file changed, 88 insertions(+), 44 deletions(-)

-- 
2.7.4



[PATCH 1/1] forcedeth: use per cpu to collect xmit/recv statistics

2019-08-30 Thread Zhu Yanjun
When testing with a background iperf pushing 1Gbit/sec traffic and running
both ifconfig and netstat to collect statistics, some deadlocks occurred.

Ifconfig and netstat will call nv_get_stats64 to get software xmit/recv
statistics. In commit f5d827aece36 ("forcedeth: implement
ndo_get_stats64() API"), normal tx/rx variables are used to collect
tx/rx statistics. The fix is to replace the normal tx/rx variables with
per-cpu 64-bit variables to collect xmit/recv statistics. The per-cpu
variables avoid the deadlocks and provide fast, efficient statistics
updates.

In nv_probe, the per cpu variable is initialized. In nv_remove, this
per cpu variable is freed.

In xmit/recv process, this per cpu variable will be updated.

In nv_get_stats64, this per cpu variable on each cpu is added up. Then
the driver can get xmit/recv packets statistics.

A test ran for several days with this commit; the deadlocks disappeared
and the performance is better.

Tested:
- iperf SMP x86_64 ->
Client connecting to 1.1.1.108, TCP port 5001
TCP window size: 85.0 KByte (default)

[  3] local 1.1.1.105 port 3 connected with 1.1.1.108 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  1.10 GBytes   943 Mbits/sec

ifconfig results:

enp0s9 Link encap:Ethernet  HWaddr 00:21:28:6f:de:0f
  inet addr:1.1.1.105  Bcast:0.0.0.0  Mask:255.255.255.0
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:5774764531 errors:0 dropped:0 overruns:0 frame:0
  TX packets:633534193 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:7646159340904 (7.6 TB) TX bytes:11425340407722 (11.4 
TB)

netstat results:

Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
...
enp0s9 1500 0  5774764531 00 0  633534193  0  0  0 BMRU
...

Fixes: f5d827aece36 ("forcedeth: implement ndo_get_stats64() API")
CC: Joe Jin 
CC: JUNXIAO_BI 
Reported-and-tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 132 +---
 1 file changed, 88 insertions(+), 44 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..ee8bb9d 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -713,6 +713,21 @@ struct nv_skb_map {
struct nv_skb_map *next_tx_ctx;
 };
 
+struct nv_txrx_stats {
+   u64 stat_rx_packets;
+   u64 stat_rx_bytes; /* not always available in HW */
+   u64 stat_rx_missed_errors;
+   u64 stat_rx_dropped;
+   u64 stat_tx_packets; /* not always available in HW */
+   u64 stat_tx_bytes;
+   u64 stat_tx_dropped;
+};
+
+#define nv_txrx_stats_inc(member) \
+   __this_cpu_inc(np->txrx_stats->member)
+#define nv_txrx_stats_add(member, count) \
+   __this_cpu_add(np->txrx_stats->member, (count))
+
 /*
  * SMP locking:
  * All hardware access under netdev_priv(dev)->lock, except the performance
@@ -797,10 +812,7 @@ struct fe_priv {
 
/* RX software stats */
struct u64_stats_sync swstats_rx_syncp;
-   u64 stat_rx_packets;
-   u64 stat_rx_bytes; /* not always available in HW */
-   u64 stat_rx_missed_errors;
-   u64 stat_rx_dropped;
+   struct nv_txrx_stats __percpu *txrx_stats;
 
/* media detection workaround.
 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
@@ -826,9 +838,6 @@ struct fe_priv {
 
/* TX software stats */
struct u64_stats_sync swstats_tx_syncp;
-   u64 stat_tx_packets; /* not always available in HW */
-   u64 stat_tx_bytes;
-   u64 stat_tx_dropped;
 
/* msi/msi-x fields */
u32 msi_flags;
@@ -1721,6 +1730,28 @@ static void nv_update_stats(struct net_device *dev)
}
 }
 
+static inline void nv_get_stats(int cpu, struct fe_priv *np,
+   struct rtnl_link_stats64 *storage)
+{
+   struct nv_txrx_stats *src = per_cpu_ptr(np->txrx_stats, cpu);
+   unsigned int syncp_start;
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_rx_syncp);
+   storage->rx_packets   += src->stat_rx_packets;
+   storage->rx_bytes += src->stat_rx_bytes;
+   storage->rx_dropped   += src->stat_rx_dropped;
+   storage->rx_missed_errors += src->stat_rx_missed_errors;
+   } while (u64_stats_fetch_retry_irq(&np->swstats_rx_syncp, syncp_start));
+
+   do {
+   syncp_start = u64_stats_fetch_begin_irq(&np->swstats_tx

Re: [PATCHv2 1/1] net: rds: add service level support in rds-info

2019-08-25 Thread Zhu Yanjun



On 2019/8/25 7:58, David Miller wrote:

From: Zhu Yanjun 
Date: Fri, 23 Aug 2019 21:04:16 -0400


diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index fd6b5f6..cba368e 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u8sl;
__u32   cache_allocs;
  };

I'm applying this, but I am once again severely disappointed in how
RDS development is being handled.

From the Fixes: commit:

Since rds.h in rds-tools is not related with the kernel rds.h,
the change in kernel rds.h does not affect rds-tools.

This is the height of arrogance and shows a lack of understanding of
what user ABI requirements are all about.

It is possible for other userland components to be built by other
people, outside of your controlled eco-system and tools, that use
these interfaces.

And you cannot control that.

Therefore you cannot make arbitrary changes to UABI data strucures
just because the tool you use and maintain is not effected by it.

Please stop making these incredibly incompatible user interface
changes in the RDS stack.

I am, from this point forward, going to be extra strict on RDS stack
changes especially in this area.


OK. It is up to you to decide to merge this commit or not.

Zhu Yanjun






Re: [PATCHv2 1/1] net: rds: add service level support in rds-info

2019-08-23 Thread Zhu Yanjun



On 2019/8/24 9:25, santosh.shilim...@oracle.com wrote:

On 8/23/19 6:04 PM, Zhu Yanjun wrote:

From IB specification 7.6.5 SERVICE LEVEL, the Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.

Before this commit, run "rds-info -I". The output is as below:
"
RDS IB Connections:
  LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39 fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
  LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39 fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39 fe80::21:28:10:b9
"

The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
 __u32   rdma_mr_max;
 __u32   rdma_mr_size;
 __u8    tos;
 __u32   cache_allocs;
  };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
 uint32_t    rdma_mr_max;
 uint32_t    rdma_mr_size;
 uint8_t tos;
 uint8_t sl;
 uint32_t    cache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This
introduces risk, so this commit is necessary to avoid it.


Fixes: fe3475af3bdf ("net: rds: add per rds connection cache 
statistics")

CC: Joe Jin 
CC: JUNXIAO_BI 
Suggested-by: Gerd Rausch 
Signed-off-by: Zhu Yanjun 
---
V1->V2: fix typos in commit logs.
---

I did ask you when you posted the patch whether you did
backward compatibility tests, to which you said you did all the
tests: "So do not worry about backward compatibility. This
commit will work well with older rds-tools 2.0.5 and 2.0.6."

https://www.spinics.net/lists/netdev/msg574691.html

I was worried about exactly such an issue as described in the commit.


Sorry, my bad. I will do more work to make rds robust.

Thanks a lot for your Ack.

Zhu Yanjun



Anyways thanks for the fixup patch. Should be applied to stable
as well.

Acked-by: Santosh Shilimkar 

Regards,
Santosh




[PATCHv2 1/1] net: rds: add service level support in rds-info

2019-08-23 Thread Zhu Yanjun
From IB specification 7.6.5 SERVICE LEVEL, the Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.

Before this commit, run "rds-info -I". The output is as below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"

The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
__u32   cache_allocs;
 };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
uint32_trdma_mr_max;
uint32_trdma_mr_size;
uint8_t tos;
uint8_t sl;
uint32_tcache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This
introduces risk, so this commit is necessary to avoid it.

Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
CC: Joe Jin 
CC: JUNXIAO_BI 
Suggested-by: Gerd Rausch 
Signed-off-by: Zhu Yanjun 
---
V1->V2: fix typos in commit logs.
---
 include/uapi/linux/rds.h |2 ++
 net/rds/ib.c |   16 ++--
 net/rds/ib.h |1 +
 net/rds/ib_cm.c  |3 +++
 net/rds/rdma_transport.c |   10 --
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index fd6b5f6..cba368e 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u8sl;
__u32   cache_allocs;
 };
 
@@ -265,6 +266,7 @@ struct rds6_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u8sl;
__u32   cache_allocs;
 };
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index ec05d91..45acab2 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -291,7 +291,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
void *buffer)
 {
struct rds_info_rdma_connection *iinfo = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
 
/* We will only ever look at IB transports */
if (conn->c_trans != &rds_ib_transport)
@@ -301,15 +301,16 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
 
iinfo->src_addr = conn->c_laddr.s6_addr32[3];
iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
-   iinfo->tos = conn->c_tos;
+   if (ic) {
+   iinfo->tos = conn->c_tos;
+   iinfo->sl = ic->i_sl;
+   }
 
memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
if (rds_conn_state(conn) == RDS_CONN_UP) {
struct rds_ib_device *rds_ibdev;
 
-   ic = conn->c_transport_data;
-
rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo->src_gid,
   (union ib_gid *)&iinfo->dst_gid);
 
@@ -329,7 +330,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
 void *buffer)
 {
struct rds6_info_rdma_connection *iinfo6 = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
 
/* We will only ever look at IB transports */
if (conn->c_trans != &rds_ib_transport)
@@ -337,6 +338,10 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
 
iinfo6->src_addr = conn->c_laddr;
iinfo6->dst_addr = conn->c_faddr;
+   if (ic) {
+   iinfo6->tos = conn->c_tos;
+   iinfo6->sl = ic->i_sl;
+   }
 
memset(&iinfo6->src_gid, 0, sizeof(iinfo6->src_gid));
memset(&

Re: [PATCH 1/1] net: rds: add service level support in rds-info

2019-08-20 Thread Zhu Yanjun

Hi,Doug

My reply is in line.

On 2019/8/20 23:28, Doug Ledford wrote:

On Mon, 2019-08-19 at 20:52 -0400, Zhu Yanjun wrote:

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index fd6b5f6..cba368e 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
 __u32   rdma_mr_max;
 __u32   rdma_mr_size;
 __u8tos;
+   __u8sl;
 __u32   cache_allocs;
  };
  
@@ -265,6 +266,7 @@ struct rds6_info_rdma_connection {

 __u32   rdma_mr_max;
 __u32   rdma_mr_size;
 __u8tos;
+   __u8sl;
 __u32   cache_allocs;
  };
  

This is a user space API break (as was the prior patch mentioned
below)...


The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
 __u32   rdma_mr_max;
 __u32   rdma_mr_size;
 __u8tos;
 __u32   cache_allocs;
  };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
 uint32_trdma_mr_max;
 uint32_trdma_mr_size;
 uint8_t tos;
 uint8_t sl;
 uint32_tcache_allocs;
};

Why are the user space rds tools not using the kernel provided abi
files?

Perhaps it is a long story.


In order to know if this ABI breakage is safe, we need to know what
versions of rds-tools are out in the wild and have their own headers
that we need to match up with.


From my work in the lab and on customer hosts, rds-tools 2.0.7 is the
popular version. Other rds-tools versions are used less.


   Are there any versions of rds-tools that
actually use the kernel provided headers?


"the kernel provided headers", do you mean include/uapi/linux/rds.h?

I checked the rds-tools source code. I do not find any version of 
rds-tools us this header files.



Are there any other users of
uapi/linux/rds.h besides rds-tools?


Not sure. But in Oracle there are some rds applications, and I am not
sure whether these rds applications use the include/uapi/linux/rds.h
file or not.

I will investigate it.



Once the kernel and rds-tools package are in sync,


After this commit is merged into mainline, the kernel and the rds-tools
package are in sync.

I will investigate having rds-tools use the kernel header
include/uapi/linux/rds.h.


Thanks a lot for your comments.

Zhu Yanjun


  rds-tools needs to be
modified to use the kernel header and proper ABI maintenance needs to be
started.



[PATCH 1/1] net: rds: add service level support in rds-info

2019-08-19 Thread Zhu Yanjun
From IB specification 7.6.5 SERVICE LEVEL, the Service Level (SL)
is used to identify different flows within an IBA subnet.
It is carried in the local route header of the packet.

Before this commit, run "rds-info -I". The output is as
below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   0  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"
After this commit, the output is as below:
"
RDS IB Connections:
 LocalAddr  RemoteAddr Tos SL  LocalDev   RemoteDev
192.2.95.3  192.2.95.1  2   2  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  1   1  fe80::21:28:1a:39  fe80::21:28:10:b9
192.2.95.3  192.2.95.1  0   0  fe80::21:28:1a:39  fe80::21:28:10:b9
"

The commit fe3475af3bdf ("net: rds: add per rds connection cache
statistics") adds cache_allocs in struct rds_info_rdma_connection
as below:
struct rds_info_rdma_connection {
...
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
__u32   cache_allocs;
 };
The peer struct in rds-tools of struct rds_info_rdma_connection is as
below:
struct rds_info_rdma_connection {
...
uint32_trdma_mr_max;
uint32_trdma_mr_size;
uint8_t tos;
uint8_t sl;
uint32_tcache_allocs;
};
The difference between userspace and kernel is the member variable sl.
In the kernel struct, the member variable sl is missing. This
introduces risk, so this commit is necessary to avoid it.

Fixes: fe3475af3bdf ("net: rds: add per rds connection cache statistics")
CC: Joe Jin 
CC: JUNXIAO_BI 
Suggested-by: Gerd Rausch 
Signed-off-by: Zhu Yanjun 
---
 include/uapi/linux/rds.h |2 ++
 net/rds/ib.c |   16 ++--
 net/rds/ib.h |1 +
 net/rds/ib_cm.c  |3 +++
 net/rds/rdma_transport.c |   10 --
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index fd6b5f6..cba368e 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u8sl;
__u32   cache_allocs;
 };
 
@@ -265,6 +266,7 @@ struct rds6_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u8sl;
__u32   cache_allocs;
 };
 
diff --git a/net/rds/ib.c b/net/rds/ib.c
index ec05d91..45acab2 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -291,7 +291,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
void *buffer)
 {
struct rds_info_rdma_connection *iinfo = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
 
/* We will only ever look at IB transports */
if (conn->c_trans != &rds_ib_transport)
@@ -301,15 +301,16 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
 
iinfo->src_addr = conn->c_laddr.s6_addr32[3];
iinfo->dst_addr = conn->c_faddr.s6_addr32[3];
-   iinfo->tos = conn->c_tos;
+   if (ic) {
+   iinfo->tos = conn->c_tos;
+   iinfo->sl = ic->i_sl;
+   }
 
memset(&iinfo->src_gid, 0, sizeof(iinfo->src_gid));
memset(&iinfo->dst_gid, 0, sizeof(iinfo->dst_gid));
if (rds_conn_state(conn) == RDS_CONN_UP) {
struct rds_ib_device *rds_ibdev;
 
-   ic = conn->c_transport_data;
-
rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo->src_gid,
   (union ib_gid *)&iinfo->dst_gid);
 
@@ -329,7 +330,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
 void *buffer)
 {
struct rds6_info_rdma_connection *iinfo6 = buffer;
-   struct rds_ib_connection *ic;
+   struct rds_ib_connection *ic = conn->c_transport_data;
 
/* We will only ever look at IB transports */
if (conn->c_trans != &rds_ib_transport)
@@ -337,6 +338,10 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
 
iinfo6->src_addr = conn->c_laddr;
iinfo6->dst_addr = conn->c_faddr;
+   if (ic) {
+   iinfo6->tos = conn->c_tos;
+   iinfo6->sl = ic->i_sl;
+   }
 
memset(&iinfo6->src_gid, 0, sizeof(iinfo6->src_gid));
memset(&iinfo6->dst_gid, 0, sizeof(iinfo6->dst_gid))

[PATCHv2 1/2] forcedeth: add recv cache to make nic work steadily

2019-07-21 Thread Zhu Yanjun
A recv cache is added. Its length is 1000 Mbit (125 MiB) of buffering
divided by the skb length, i.e. on the order of 80,000 skbs for a
1500-byte buffer. When system memory is low, this recv cache can make
the NIC work steadily.
When the NIC is up, the recv cache and a work queue are created. When
the NIC is down, the recv cache is destroyed and the delayed work is
canceled.
When the NIC is polled or an rx interrupt is triggered, the rx handler
gets an skb from the recv cache, and a work item is queued to fill the
cache back up.
When the skb size changes, the old recv cache is destroyed and a new
one is created.
When system memory is low, atomic skb allocation can fail; the recv
cache then keeps allocating skbs with GFP_KERNEL until it is filled up,
so the NIC keeps working steadily under memory pressure. Because of the
recv cache, the NIC's performance is enhanced.
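
A minimal sketch of the refill worker described above (the field names
follow this patch; the exact body is an assumption):

	static void nv_recv_cache_worker(struct work_struct *work)
	{
		struct fe_priv *np = container_of(to_delayed_work(work),
						  struct fe_priv,
						  recv_cache_work);

		while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
			/* GFP_KERNEL may sleep, so this runs in process
			 * context and can ride out short memory shortages
			 * instead of dropping packets in the rx path.
			 */
			struct sk_buff *skb = __netdev_alloc_skb(np->dev,
					np->rx_buf_sz + NV_RX_ALLOC_PAD,
					GFP_KERNEL);

			if (unlikely(!skb))
				break;
			skb_queue_tail(&np->recv_list, skb);
		}
	}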

CC: Joe Jin 
CC: Junxiao Bi 
Tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 103 +++-
 1 file changed, 101 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..f8e766f 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -674,6 +674,11 @@ struct nv_ethtool_stats {
u64 tx_broadcast;
 };
 
+/* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
+ * The length of recv cache is 125M / skb_length
+ */
+#define RECV_CACHE_LIST_LENGTH (125 * 1024 * 1024 / np->rx_buf_sz)
+
 #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct 
nv_ethtool_stats)/sizeof(u64))
 #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
 #define NV_DEV_STATISTICS_V1_COUNT (NV_DEV_STATISTICS_V2_COUNT - 6)
@@ -844,6 +849,11 @@ struct fe_priv {
char name_rx[IFNAMSIZ + 3];   /* -rx*/
char name_tx[IFNAMSIZ + 3];   /* -tx*/
char name_other[IFNAMSIZ + 6];/* -other */
+
+   /* This is to schedule work */
+   struct delayed_work recv_cache_work;
+   /* This list is to store skb queue for recv */
+   struct sk_buff_head recv_list;
 };
 
 /*
@@ -1804,7 +1814,8 @@ static int nv_alloc_rx(struct net_device *dev)
less_rx = np->last_rx.orig;
 
while (np->put_rx.orig != less_rx) {
-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1829,9 +1840,15 @@ static int nv_alloc_rx(struct net_device *dev)
u64_stats_update_begin(&np->swstats_rx_syncp);
np->stat_rx_dropped++;
u64_stats_update_end(&np->swstats_rx_syncp);
+
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
return 1;
}
}
+
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
return 0;
 }
 
@@ -1845,7 +1862,8 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
less_rx = np->last_rx.ex;
 
while (np->put_rx.ex != less_rx) {
-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1871,9 +1889,15 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
u64_stats_update_begin(&np->swstats_rx_syncp);
np->stat_rx_dropped++;
u64_stats_update_end(&np->swstats_rx_syncp);
+
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
return 1;
}
}
+
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
return 0;
 }
 
@@ -1957,6 +1981,43 @@ static void nv_init_tx(struct net_device *dev)
}
 }
 
+static void nv_init_recv_cache(struct net_device *dev)
+{
+   struct fe_priv *np = netdev_priv(dev);
+
+   skb_queue_head_init(&np->recv_list);
+   while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
+   struct sk_buff *skb = netdev_alloc_skb(dev,
+np->rx_buf_sz + NV_RX_ALLOC_PAD);
+   /* skb is null. This indicates that memory is not
+* enough.
+*/
+   if (unlikely(!skb)) {
+   /* When allocating memory with GFP_ATOMIC fails,
+* allocating with GFP_KERNEL will get memory
+  

[PATCHv2 2/2] forcedeth: disable recv cache by default

2019-07-21 Thread Zhu Yanjun
The recv cache reserves 125 MiB of memory for the NIC.
In the past this recv cache has worked very well: when memory is short,
it reserves memory for the NIC, so communication through the NIC is not
affected by the memory shortage, and the NIC's performance is better
because of the cache.
But the recv cache reserves 125 MiB for each NIC port, and normally
there are 2 NIC ports on one card, so a host reserves about 250 MiB for
NIC ports. On a host where such communication is not mandatory, it is
not necessary to reserve this memory.
So this recv cache is disabled by default.

CC: Joe Jin 
CC: Junxiao Bi 
Tested-by: Nan san 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/Kconfig | 11 
 drivers/net/ethernet/nvidia/Makefile|  1 +
 drivers/net/ethernet/nvidia/forcedeth.c | 46 ++---
 3 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/Kconfig 
b/drivers/net/ethernet/nvidia/Kconfig
index faacbd1..9a9f42a 100644
--- a/drivers/net/ethernet/nvidia/Kconfig
+++ b/drivers/net/ethernet/nvidia/Kconfig
@@ -26,4 +26,15 @@ config FORCEDETH
  To compile this driver as a module, choose M here. The module
  will be called forcedeth.
 
+config FORCEDETH_RECV_CACHE
+   bool "nForce Ethernet recv cache support"
+   depends on FORCEDETH
+   default n
+   ---help---
+ The recv cache can make nic work steadily when the system memory is
+ not enough. And it can also enhance nic performance. But to a host
+ on which the communications are not mandatory, it is not necessary
+ to reserve 125MiB memory for NIC.
+ So recv cache is disabled by default.
+
 endif # NET_VENDOR_NVIDIA
diff --git a/drivers/net/ethernet/nvidia/Makefile 
b/drivers/net/ethernet/nvidia/Makefile
index 8935699..40c055e 100644
--- a/drivers/net/ethernet/nvidia/Makefile
+++ b/drivers/net/ethernet/nvidia/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-$(CONFIG_FORCEDETH) += forcedeth.o
+ccflags-$(CONFIG_FORCEDETH_RECV_CACHE) :=  -DFORCEDETH_RECV_CACHE
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index f8e766f..deda276 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -674,10 +674,12 @@ struct nv_ethtool_stats {
u64 tx_broadcast;
 };
 
+#ifdef FORCEDETH_RECV_CACHE
 /* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
  * The length of recv cache is 125M / skb_length
  */
 #define RECV_CACHE_LIST_LENGTH (125 * 1024 * 1024 / np->rx_buf_sz)
+#endif
 
 #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct 
nv_ethtool_stats)/sizeof(u64))
 #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
@@ -850,10 +852,12 @@ struct fe_priv {
char name_tx[IFNAMSIZ + 3];   /* -tx*/
char name_other[IFNAMSIZ + 6];/* -other */
 
+#ifdef FORCEDETH_RECV_CACHE
/* This is to schedule work */
struct delayed_work recv_cache_work;
/* This list is to store skb queue for recv */
struct sk_buff_head recv_list;
+#endif
 };
 
 /*
@@ -1814,8 +1818,12 @@ static int nv_alloc_rx(struct net_device *dev)
less_rx = np->last_rx.orig;
 
while (np->put_rx.orig != less_rx) {
+#ifdef FORCEDETH_RECV_CACHE
struct sk_buff *skb = skb_dequeue(&np->recv_list);
-
+#else
+   struct sk_buff *skb = netdev_alloc_skb(np->dev,
+np->rx_buf_sz + NV_RX_ALLOC_PAD);
+#endif
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1840,15 +1848,15 @@ static int nv_alloc_rx(struct net_device *dev)
u64_stats_update_begin(&np->swstats_rx_syncp);
np->stat_rx_dropped++;
u64_stats_update_end(&np->swstats_rx_syncp);
-
+#ifdef FORCEDETH_RECV_CACHE
schedule_delayed_work(&np->recv_cache_work, 0);
-
+#endif
return 1;
}
}
-
+#ifdef FORCEDETH_RECV_CACHE
schedule_delayed_work(&np->recv_cache_work, 0);
-
+#endif
return 0;
 }
 
@@ -1862,7 +1870,12 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
less_rx = np->last_rx.ex;
 
while (np->put_rx.ex != less_rx) {
+#ifdef FORCEDETH_RECV_CACHE
struct sk_buff *skb = skb_dequeue(&np->recv_list);
+#else
+   struct sk_buff *skb = netdev_alloc_skb(np->dev,
+   np->rx_buf_sz + NV_RX_ALLOC_PAD);
+#endif
 
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
@@ -1889,15 +1902,15 @@ static int nv_a

[PATCHv2 0/2] forcedeth: recv cache to make NIC work steadily

2019-07-21 Thread Zhu Yanjun
These patches address the following scenario:

"
When the host runs for a long time, memory in the host becomes heavily
fragmented. The kernel may compact the fragments, but it is still often
difficult for the NIC driver to allocate memory. From the counter
stat_rx_dropped, we can confirm that the NIC driver frequently fails to
allocate an skb.
"

Since the NIC driver cannot allocate skbs in time, some important tasks
cannot complete in time.
To avoid this, a recv cache is created to pre-allocate skbs for the NIC
driver. This lets the important tasks complete in time.
From Nan's tests in the lab, these patches make the NIC driver work
steadily. The patches are now applied on production hosts.

With these patches, one NIC port needs 125MiB of reserved memory that
cannot be used by anything else. On a host where communications over
this NIC are not mandatory, it is not necessary to reserve that much
memory, so this recv cache is disabled by default.
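
As a rough illustration of the approach, here is a minimal sketch of the
refill worker these patches add: a delayed work item tops up an
sk_buff_head so the rx path can dequeue pre-allocated skbs instead of
calling the allocator directly. The names mirror the patches below, but
the body is a simplified sketch, not the exact driver code:

"
static void recv_cache_refill(struct work_struct *work)
{
	struct fe_priv *np = container_of(to_delayed_work(work),
					  struct fe_priv, recv_cache_work);

	while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
		/* Process context: GFP_KERNEL may sleep and reclaim,
		 * unlike the atomic allocation on the rx path.
		 */
		struct sk_buff *skb = __netdev_alloc_skb(np->dev,
				np->rx_buf_sz + NV_RX_ALLOC_PAD,
				GFP_KERNEL);

		if (!skb)
			break;	/* retry on the next scheduled refill */

		skb_queue_tail(&np->recv_list, skb);
	}
}
"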

V1->V2:
1. ndelay() is replaced with the GFP_KERNEL allocator __netdev_alloc_skb().
2. skb_queue_purge() is used when the recv cache is destroyed.
3. The RECV_LIST_ALLOCATE bit is removed.
4. schedule_delayed_work() is moved out of the while loop.
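
Items 1 and 2 of the changelog combine into a teardown path roughly like
the sketch below (recv_cache_destroy is an illustrative name, not
necessarily the exact function in the patches):

"
static void recv_cache_destroy(struct fe_priv *np)
{
	/* Stop the refill worker first, so it cannot re-populate
	 * the queue behind our back.
	 */
	cancel_delayed_work_sync(&np->recv_cache_work);

	/* skb_queue_purge() walks the queue and kfree_skb()s every
	 * remaining entry, replacing an open-coded dequeue loop.
	 */
	skb_queue_purge(&np->recv_list);
}
"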

Zhu Yanjun (2):
  forcedeth: add recv cache make nic work steadily
  forcedeth: disable recv cache by default

 drivers/net/ethernet/nvidia/Kconfig |  11 +++
 drivers/net/ethernet/nvidia/Makefile|   1 +
 drivers/net/ethernet/nvidia/forcedeth.c | 129 +++-
 3 files changed, 139 insertions(+), 2 deletions(-)

-- 
2.7.4



Re: [PATCH 1/2] forcedeth: add recv cache make nic work steadily

2019-07-05 Thread Zhu Yanjun

Add Nan, who is interested in this commit.

On 2019/7/5 14:19, Zhu Yanjun wrote:

A recv cache is added. Its length is 1000Mb (125M bytes) divided by
skb_length.
When system memory is not enough, this recv cache keeps the nic working
steadily.
When the nic is brought up, the recv cache and a work queue are
created. When the nic is brought down, the recv cache is destroyed and
the delayed work is canceled.
When the nic is polled or an rx interrupt is triggered, the rx handler
gets an skb from the recv cache. Then the state of the recv cache is
checked. If the recv cache is not already being filled, a work item is
queued to fill it up.
When the skb size changes, the old recv cache is destroyed and a new
one is created.
When system memory is not enough, skb allocation fails; the recv cache
worker keeps allocating skbs until the cache is filled up. This keeps
the nic working steadily under memory pressure.
Because of the recv cache, nic performance is also enhanced.

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
  drivers/net/ethernet/nvidia/forcedeth.c | 100 +++-
  1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..a673005 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -674,6 +674,11 @@ struct nv_ethtool_stats {
u64 tx_broadcast;
  };
  
+/* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
+ * The length of recv cache is 125M / skb_length
+ */
+#define RECV_CACHE_LIST_LENGTH (125 * 1024 * 1024 / np->rx_buf_sz)
+
  #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct 
nv_ethtool_stats)/sizeof(u64))
  #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
  #define NV_DEV_STATISTICS_V1_COUNT (NV_DEV_STATISTICS_V2_COUNT - 6)
@@ -844,8 +849,18 @@ struct fe_priv {
char name_rx[IFNAMSIZ + 3];   /* -rx*/
char name_tx[IFNAMSIZ + 3];   /* -tx*/
char name_other[IFNAMSIZ + 6];/* -other */
+
+   /* This is to schedule work */
+   struct delayed_work recv_cache_work;
+   /* This list is to store skb queue for recv */
+   struct sk_buff_head recv_list;
+   unsigned long nv_recv_list_state;
  };
  
+/* This is recv list state to fill up recv cache */
+enum recv_list_state {
+   RECV_LIST_ALLOCATE
+};
  /*
   * Maximum number of loops until we assume that a bit in the irq mask
   * is stuck. Overridable with module param.
@@ -1804,7 +1819,11 @@ static int nv_alloc_rx(struct net_device *dev)
less_rx = np->last_rx.orig;
  
  	while (np->put_rx.orig != less_rx) {

-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
+   if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1845,7 +1864,11 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
less_rx = np->last_rx.ex;
  
  	while (np->put_rx.ex != less_rx) {

-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
+   if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1957,6 +1980,40 @@ static void nv_init_tx(struct net_device *dev)
}
  }
  
+static void nv_init_recv_cache(struct net_device *dev)
+{
+   struct fe_priv *np = netdev_priv(dev);
+
+   skb_queue_head_init(&np->recv_list);
+   while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
+   struct sk_buff *skb = netdev_alloc_skb(dev,
+np->rx_buf_sz + NV_RX_ALLOC_PAD);
+   /* skb is null. This indicates that memory is not
+* enough.
+*/
+   if (unlikely(!skb)) {
+   ndelay(3);
+   continue;
+   }
+
+   skb_queue_tail(&np->recv_list, skb);
+   }
+}
+
+static void nv_destroy_recv_cache(struct net_device *dev)
+{
+   struct sk_buff *skb;
+   struct fe_priv *np = netdev_priv(dev);
+
+   cancel_delayed_work_sync(&np->recv_cache_work);
+   WARN_ON(delayed_work_pending(&np->recv_cache_work));
+
+   while ((skb = skb_dequeue(&

[PATCH 2/2] forcedeth: disable recv cache by default

2019-07-04 Thread Zhu Yanjun
The recv cache reserves 125MiB of memory for the NIC.
Over a long period, this recv cache has worked very well: when memory
is short, the cache keeps memory reserved for the NIC, communications
through the NIC are unaffected by the memory shortage, and NIC
performance is better because of the cache.
But the cache reserves 125MiB per NIC port, and there are normally two
NIC ports on one card, so a host reserves about 250MiB for NIC ports.
On a host where communications are not mandatory, it is not necessary
to reserve this memory. So this recv cache is disabled by default.
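
For reference, given the Kconfig and Makefile changes below, enabling
the cache should come down to a one-line config change, roughly (a
hypothetical .config fragment):

"
CONFIG_FORCEDETH=m
CONFIG_FORCEDETH_RECV_CACHE=y
"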

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/Kconfig | 11 +++
 drivers/net/ethernet/nvidia/Makefile|  1 +
 drivers/net/ethernet/nvidia/forcedeth.c | 34 ++---
 3 files changed, 43 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/Kconfig 
b/drivers/net/ethernet/nvidia/Kconfig
index faacbd1..9a9f42a 100644
--- a/drivers/net/ethernet/nvidia/Kconfig
+++ b/drivers/net/ethernet/nvidia/Kconfig
@@ -26,4 +26,15 @@ config FORCEDETH
  To compile this driver as a module, choose M here. The module
  will be called forcedeth.
 
+config FORCEDETH_RECV_CACHE
+   bool "nForce Ethernet recv cache support"
+   depends on FORCEDETH
+   default n
+   ---help---
+ The recv cache can make nic work steadily when the system memory is
+ not enough. And it can also enhance nic performance. But to a host
+ on which the communications are not mandatory, it is not necessary
+ to reserve 125MiB memory for NIC.
+ So recv cache is disabled by default.
+
 endif # NET_VENDOR_NVIDIA
diff --git a/drivers/net/ethernet/nvidia/Makefile 
b/drivers/net/ethernet/nvidia/Makefile
index 8935699..40c055e 100644
--- a/drivers/net/ethernet/nvidia/Makefile
+++ b/drivers/net/ethernet/nvidia/Makefile
@@ -4,3 +4,4 @@
 #
 
 obj-$(CONFIG_FORCEDETH) += forcedeth.o
+ccflags-$(CONFIG_FORCEDETH_RECV_CACHE) :=  -DFORCEDETH_RECV_CACHE
diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index a673005..59f813b 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -674,10 +674,12 @@ struct nv_ethtool_stats {
u64 tx_broadcast;
 };
 
+#ifdef FORCEDETH_RECV_CACHE
 /* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
  * The length of recv cache is 125M / skb_length
  */
 #define RECV_CACHE_LIST_LENGTH (125 * 1024 * 1024 / np->rx_buf_sz)
+#endif
 
 #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct 
nv_ethtool_stats)/sizeof(u64))
 #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
@@ -850,17 +852,22 @@ struct fe_priv {
char name_tx[IFNAMSIZ + 3];   /* -tx*/
char name_other[IFNAMSIZ + 6];/* -other */
 
+#ifdef FORCEDETH_RECV_CACHE
/* This is to schedule work */
struct delayed_work recv_cache_work;
/* This list is to store skb queue for recv */
struct sk_buff_head recv_list;
unsigned long nv_recv_list_state;
+#endif
 };
 
+#ifdef FORCEDETH_RECV_CACHE
 /* This is recv list state to fill up recv cache */
 enum recv_list_state {
RECV_LIST_ALLOCATE
 };
+#endif
+
 /*
  * Maximum number of loops until we assume that a bit in the irq mask
  * is stuck. Overridable with module param.
@@ -1819,11 +1826,15 @@ static int nv_alloc_rx(struct net_device *dev)
less_rx = np->last_rx.orig;
 
while (np->put_rx.orig != less_rx) {
+#ifdef FORCEDETH_RECV_CACHE
struct sk_buff *skb = skb_dequeue(&np->recv_list);
 
if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
schedule_delayed_work(&np->recv_cache_work, 0);
-
+#else
+   struct sk_buff *skb = netdev_alloc_skb(np->dev,
+np->rx_buf_sz + NV_RX_ALLOC_PAD);
+#endif
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1864,11 +1875,15 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
less_rx = np->last_rx.ex;
 
while (np->put_rx.ex != less_rx) {
+#ifdef FORCEDETH_RECV_CACHE
struct sk_buff *skb = skb_dequeue(&np->recv_list);
 
if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
schedule_delayed_work(&np->recv_cache_work, 0);
-
+#else
+   struct sk_buff *skb = netdev_alloc_skb(np->dev,
+   np->rx_buf_sz + NV_RX_ALLOC_PAD);
+#endif
if (likely(skb)) {
np->put_rx_ctx->skb = skb;

[PATCH 1/2] forcedeth: add recv cache make nic work steadily

2019-07-04 Thread Zhu Yanjun
A recv cache is added. Its length is 1000Mb (125M bytes) divided by
skb_length.
When system memory is not enough, this recv cache keeps the nic working
steadily.
When the nic is brought up, the recv cache and a work queue are
created. When the nic is brought down, the recv cache is destroyed and
the delayed work is canceled.
When the nic is polled or an rx interrupt is triggered, the rx handler
gets an skb from the recv cache. Then the state of the recv cache is
checked. If the recv cache is not already being filled, a work item is
queued to fill it up.
When the skb size changes, the old recv cache is destroyed and a new
one is created.
When system memory is not enough, skb allocation fails; the recv cache
worker keeps allocating skbs until the cache is filled up. This keeps
the nic working steadily under memory pressure.
Because of the recv cache, nic performance is also enhanced.

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 100 +++-
 1 file changed, 98 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index b327b29..a673005 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -674,6 +674,11 @@ struct nv_ethtool_stats {
u64 tx_broadcast;
 };
 
+/* 1000Mb is 125M bytes, 125 * 1024 * 1024 bytes
+ * The length of recv cache is 125M / skb_length
+ */
+#define RECV_CACHE_LIST_LENGTH (125 * 1024 * 1024 / np->rx_buf_sz)
+
 #define NV_DEV_STATISTICS_V3_COUNT (sizeof(struct 
nv_ethtool_stats)/sizeof(u64))
 #define NV_DEV_STATISTICS_V2_COUNT (NV_DEV_STATISTICS_V3_COUNT - 3)
 #define NV_DEV_STATISTICS_V1_COUNT (NV_DEV_STATISTICS_V2_COUNT - 6)
@@ -844,8 +849,18 @@ struct fe_priv {
char name_rx[IFNAMSIZ + 3];   /* -rx*/
char name_tx[IFNAMSIZ + 3];   /* -tx*/
char name_other[IFNAMSIZ + 6];/* -other */
+
+   /* This is to schedule work */
+   struct delayed_work recv_cache_work;
+   /* This list is to store skb queue for recv */
+   struct sk_buff_head recv_list;
+   unsigned long nv_recv_list_state;
 };
 
+/* This is recv list state to fill up recv cache */
+enum recv_list_state {
+   RECV_LIST_ALLOCATE
+};
 /*
  * Maximum number of loops until we assume that a bit in the irq mask
  * is stuck. Overridable with module param.
@@ -1804,7 +1819,11 @@ static int nv_alloc_rx(struct net_device *dev)
less_rx = np->last_rx.orig;
 
while (np->put_rx.orig != less_rx) {
-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
+   if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1845,7 +1864,11 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
less_rx = np->last_rx.ex;
 
while (np->put_rx.ex != less_rx) {
-   struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
+   struct sk_buff *skb = skb_dequeue(&np->recv_list);
+
+   if (!test_bit(RECV_LIST_ALLOCATE, &np->nv_recv_list_state))
+   schedule_delayed_work(&np->recv_cache_work, 0);
+
if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
@@ -1957,6 +1980,40 @@ static void nv_init_tx(struct net_device *dev)
}
 }
 
+static void nv_init_recv_cache(struct net_device *dev)
+{
+   struct fe_priv *np = netdev_priv(dev);
+
+   skb_queue_head_init(&np->recv_list);
+   while (skb_queue_len(&np->recv_list) < RECV_CACHE_LIST_LENGTH) {
+   struct sk_buff *skb = netdev_alloc_skb(dev,
+np->rx_buf_sz + NV_RX_ALLOC_PAD);
+   /* skb is null. This indicates that memory is not
+* enough.
+*/
+   if (unlikely(!skb)) {
+   ndelay(3);
+   continue;
+   }
+
+   skb_queue_tail(&np->recv_list, skb);
+   }
+}
+
+static void nv_destroy_recv_cache(struct net_device *dev)
+{
+   struct sk_buff *skb;
+   struct fe_priv *np = netdev_priv(dev);
+
+   cancel_delayed_work_sync(&np->recv_cache_work);
+   WARN_ON(delayed_work_pending(&np->recv_cache_work));
+
+   while ((skb = skb_dequeue(&np->recv_list)))
+   kfree_skb(skb);
+
+   WARN_ON(skb_queue_len

[PATCH 0/2] forcedeth: recv cache support

2019-07-04 Thread Zhu Yanjun
This recv cache is to make NIC work steadily when the system memory is
not enough.

From long-term testing, the NIC works very well even when system
memory is not enough. And NIC throughput improves from about 920Mb/s
to about 940Mb/s.

Some simple tests were run:

ip link set forcedeth_nic down/up
modprobe/rmmod forcedeth
ip link set mtu 1500 dev forcedeth_nic
ethtool -G forcedeth_nic tx 512 rx 1024
In these and other tests, the NIC with the recv cache works well.

Since the recv cache reserves 125MiB of memory per NIC port, this recv
cache is disabled by default.

Zhu Yanjun (2):
  forcedeth: add recv cache make nic work steadily
  forcedeth: disable recv cache by default

 drivers/net/ethernet/nvidia/Kconfig |  11 +++
 drivers/net/ethernet/nvidia/Makefile|   1 +
 drivers/net/ethernet/nvidia/forcedeth.c | 128 +++-
 3 files changed, 138 insertions(+), 2 deletions(-)

-- 
2.7.4



[PATCH 1/1] net: rds: fix memory leak in rds_ib_flush_mr_pool

2019-06-06 Thread Zhu Yanjun
When the following tests run for several hours, the problem occurs.

Server:
rds-stress -r 1.1.1.16 -D 1M
Client:
rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M -T 30

The following will occur.

"
Starting up
tsks  tx/s  rx/s  tx+rx K/s  mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
"
From the vmcore, we can see that clean_list is NULL.

From the source code, rds_mr_flushd calls rds_ib_mr_pool_flush_worker.
Then rds_ib_mr_pool_flush_worker calls
"
 rds_ib_flush_mr_pool(pool, 0, NULL);
"
Then in function
"
int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
 int free_all, struct rds_ib_mr **ibmr_ret)
"
ibmr_ret is NULL.

In the source code,
"
...
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, &clean_tail);
if (ibmr_ret)
*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);

/* more than one entry in llist nodes */
if (clean_nodes->next)
llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);
...
"
When ibmr_ret is NULL, llist_entry is not executed, yet
clean_nodes->next instead of clean_nodes is added to clean_list.
So clean_nodes is discarded and can never be used again.
The workqueue runs periodically, so more and more clean_nodes are
discarded. Finally clean_list becomes NULL, and the problem above
occurs.
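
To make the leak concrete, here is a commented walkthrough of the buggy
path, assuming the unmap list became the chain A -> B -> C (a sketch
annotating the code quoted above, not new source):

"
/* After list_to_llist_nodes(): clean_nodes -> A -> B -> C */

if (ibmr_ret)	/* NULL when called from the flush worker... */
	*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, llnode);
	/* ...so node A is never handed to a caller */

/* The batch starts at B, skipping A */
if (clean_nodes->next)
	llist_add_batch(clean_nodes->next, clean_tail, &pool->clean_list);

/* Node A is neither returned nor re-added: one MR is leaked per
 * flush. The fix below advances clean_nodes only after the first
 * entry has actually been consumed through ibmr_ret.
 */
"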

Fixes: 1bc144b62524 ("net, rds, Replace xlist in net/rds/xlist.h with llist")
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_rdma.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index d664e9a..0b347f4 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -428,12 +428,14 @@ int rds_ib_flush_mr_pool(struct rds_ib_mr_pool *pool,
wait_clean_list_grace();
 
list_to_llist_nodes(pool, &unmap_list, &clean_nodes, 
&clean_tail);
-   if (ibmr_ret)
+   if (ibmr_ret) {
*ibmr_ret = llist_entry(clean_nodes, struct rds_ib_mr, 
llnode);
-
+   clean_nodes = clean_nodes->next;
+   }
/* more than one entry in llist nodes */
-   if (clean_nodes->next)
-   llist_add_batch(clean_nodes->next, clean_tail, 
&pool->clean_list);
+   if (clean_nodes)
+   llist_add_batch(clean_nodes, clean_tail,
+   &pool->clean_list);
 
}
 
-- 
2.7.4



[PATCH 1/1] net: rds: fix memory leak when unload rds_rdma

2019-06-03 Thread Zhu Yanjun
When KASAN is enabled, after several rds connections are
created, then "rmmod rds_rdma" is run. The following will
appear.

"
BUG rds_ib_incoming (Not tainted): Objects remaining
in rds_ib_incoming on __kmem_cache_shutdown()

Call Trace:
 dump_stack+0x71/0xab
 slab_err+0xad/0xd0
 __kmem_cache_shutdown+0x17d/0x370
 shutdown_cache+0x17/0x130
 kmem_cache_destroy+0x1df/0x210
 rds_ib_recv_exit+0x11/0x20 [rds_rdma]
 rds_ib_exit+0x7a/0x90 [rds_rdma]
 __x64_sys_delete_module+0x224/0x2c0
 ? __ia32_sys_delete_module+0x2c0/0x2c0
 do_syscall_64+0x73/0x190
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
"
This is an rds connection memory leak. The root cause is:
When "rmmod rds_rdma" is run, rds_ib_remove_one calls
rds_ib_dev_shutdown to drop the rds connections.
rds_ib_dev_shutdown calls rds_conn_drop to drop the rds
connections as below.
"
rds_conn_path_drop(&conn->c_path[0], false);
"
In the call above, destroy is set to false.
void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
{
atomic_set(&cp->cp_state, RDS_CONN_ERROR);

rcu_read_lock();
if (!destroy && rds_destroy_pending(cp->cp_conn)) {
rcu_read_unlock();
return;
}
queue_work(rds_wq, &cp->cp_down_w);
rcu_read_unlock();
}
In the function above, with destroy set to false, rds_destroy_pending
is checked, and the rds connections are not moved to ib_nodev_conns.
So destroy is set to true to move the rds connections to
ib_nodev_conns.
In rds_ib_unregister_client, flush_workqueue is called to let rds_wq
finish shutting down the rds connections. The function
rds_ib_destroy_nodev_conns is then called to finally shut down the rds
connections.
Then rds_ib_recv_exit is called to destroy the slabs.

void rds_ib_recv_exit(void)
{
kmem_cache_destroy(rds_ib_incoming_slab);
kmem_cache_destroy(rds_ib_frag_slab);
}
The above slab memory leak will not occur again.

From tests:

256 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma

real    0m16.522s
user    0m0.000s
sys     0m8.152s

512 rds connections
[root@ca-dev14 ~]# time rmmod rds_rdma

real    0m32.054s
user    0m0.000s
sys     0m15.568s

To rmmod rds_rdma with 256 rds connections takes about 16 seconds;
with 512 rds connections, about 32 seconds.
From ftrace, when one rds connection is destroyed:

"
 19)   |  rds_conn_destroy [rds]() {
 19)   7.782 us|rds_conn_path_drop [rds]();
 15)   |  rds_shutdown_worker [rds]() {
 15)   |rds_conn_shutdown [rds]() {
 15)   1.651 us|  rds_send_path_reset [rds]();
 15)   7.195 us|}
 15) + 11.434 us   |  }
 19)   2.285 us|rds_cong_remove_conn [rds]();
 19) * 24062.76 us |  }
"
So when many rds connections are destroyed, the function
rds_ib_destroy_nodev_conns consumes most of the time.

Suggested-by: Håkon Bugge 
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib.c  | 2 +-
 net/rds/ib_recv.c | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index f9baf2d..ec05d91 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -87,7 +87,7 @@ static void rds_ib_dev_shutdown(struct rds_ib_device 
*rds_ibdev)
 
spin_lock_irqsave(&rds_ibdev->spinlock, flags);
list_for_each_entry(ic, &rds_ibdev->conn_list, ib_node)
-   rds_conn_drop(ic->conn);
+   rds_conn_path_drop(&ic->conn->c_path[0], true);
spin_unlock_irqrestore(&rds_ibdev->spinlock, flags);
 }
 
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 8946c89..3cae88c 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -168,6 +168,7 @@ void rds_ib_recv_free_caches(struct rds_ib_connection *ic)
list_del(&inc->ii_cache_entry);
WARN_ON(!list_empty(&inc->ii_frags));
kmem_cache_free(rds_ib_incoming_slab, inc);
+   atomic_dec(&rds_ib_allocation);
}
 
rds_ib_cache_xfer_to_ready(&ic->i_cache_frags);
@@ -1057,6 +1058,8 @@ int rds_ib_recv_init(void)
 
 void rds_ib_recv_exit(void)
 {
+   WARN_ON(atomic_read(&rds_ib_allocation));
+
kmem_cache_destroy(rds_ib_incoming_slab);
kmem_cache_destroy(rds_ib_frag_slab);
 }
-- 
2.7.4



[PATCHv2 1/1] net: rds: add per rds connection cache statistics

2019-06-02 Thread Zhu Yanjun
The variable cache_allocs indicates how many frags (in KiB) are in one
rds connection's frag cache.
The command "rds-info -Iv" will output the rds connection cache
statistics as below:
"
RDS IB Connections:
  LocalAddr RemoteAddr Tos SL  LocalDev  RemoteDev
  1.1.1.14 1.1.1.14   58 255  fe80::2:c903:a:7a31 fe80::2:c903:a:7a31
  send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096,
  rdma_mr_size=257, cache_allocs=12
"
This means there are about 12KiB of frags in this rds connection's
frag cache.
Since rds.h in rds-tools is independent of the kernel rds.h, the change
to the kernel rds.h does not affect rds-tools.
rds-info from rds-tools 2.0.5 and 2.0.6 was tested with this commit and
works well.

Signed-off-by: Zhu Yanjun 
---
V1->V2: RDS CI is removed. 
---
 include/uapi/linux/rds.h | 2 ++
 net/rds/ib.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 5d0f76c..fd6b5f6 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u32   cache_allocs;
 };
 
 struct rds6_info_rdma_connection {
@@ -264,6 +265,7 @@ struct rds6_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u32   cache_allocs;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2da9b75..f9baf2d 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -318,6 +318,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo->max_send_sge = rds_ibdev->max_sge;
rds_ib_get_mr_info(rds_ibdev, iinfo);
+   iinfo->cache_allocs = atomic_read(&ic->i_cache_allocs);
}
return 1;
 }
@@ -351,6 +352,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
iinfo6->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo6->max_send_sge = rds_ibdev->max_sge;
rds6_ib_get_mr_info(rds_ibdev, iinfo6);
+   iinfo6->cache_allocs = atomic_read(&ic->i_cache_allocs);
}
return 1;
 }
-- 
2.7.4



[PATCH 1/1] net: rds: add per rds connection cache statistics

2019-06-01 Thread Zhu Yanjun
The variable cache_allocs indicates how many frags (in KiB) are in one
rds connection's frag cache.
The command "rds-info -Iv" will output the rds connection cache
statistics as below:
"
RDS IB Connections:
  LocalAddr RemoteAddr Tos SL  LocalDev  RemoteDev
  1.1.1.14 1.1.1.14   58 255  fe80::2:c903:a:7a31 fe80::2:c903:a:7a31
  send_wr=256, recv_wr=1024, send_sge=8, rdma_mr_max=4096,
  rdma_mr_size=257, cache_allocs=12
"
This means there are about 12KiB of frags in this rds connection's
frag cache.

Tested-by: RDS CI 
Signed-off-by: Zhu Yanjun 
---
 include/uapi/linux/rds.h | 2 ++
 net/rds/ib.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 5d0f76c..fd6b5f6 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -250,6 +250,7 @@ struct rds_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u32   cache_allocs;
 };
 
 struct rds6_info_rdma_connection {
@@ -264,6 +265,7 @@ struct rds6_info_rdma_connection {
__u32   rdma_mr_max;
__u32   rdma_mr_size;
__u8tos;
+   __u32   cache_allocs;
 };
 
 /* RDS message Receive Path Latency points */
diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2da9b75..f9baf2d 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -318,6 +318,7 @@ static int rds_ib_conn_info_visitor(struct rds_connection 
*conn,
iinfo->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo->max_send_sge = rds_ibdev->max_sge;
rds_ib_get_mr_info(rds_ibdev, iinfo);
+   iinfo->cache_allocs = atomic_read(&ic->i_cache_allocs);
}
return 1;
 }
@@ -351,6 +352,7 @@ static int rds6_ib_conn_info_visitor(struct rds_connection 
*conn,
iinfo6->max_recv_wr = ic->i_recv_ring.w_nr;
iinfo6->max_send_sge = rds_ibdev->max_sge;
rds6_ib_get_mr_info(rds_ibdev, iinfo6);
+   iinfo6->cache_allocs = atomic_read(&ic->i_cache_allocs);
}
return 1;
 }
-- 
2.7.4



[PATCH 1/1] net: rds: exchange of 8K and 1M pool

2019-04-23 Thread Zhu Yanjun
Before commit 490ea5967b0d ("RDS: IB: move FMR code to its own file"),
when dirty_count grew beyond 9/10 of max_items of the 8K pool, the
1M pool was used instead, and vice versa. That commit removed this
behavior. When we run the following tests:

Server:
  rds-stress -r 1.1.1.16 -D 1M

Client:
  rds-stress -r 1.1.1.14 -s 1.1.1.16 -D 1M

The following will appear.
"
connecting to 1.1.1.16:4000
negotiated options, tasks will start in 2 seconds
Starting up..header from 1.1.1.166:4001 to id 4001 bogus
..
tsks  tx/s  rx/s  tx+rx K/s  mbi K/s  mbo K/s  tx us/c  rtt us  cpu %
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
   1     0     0       0.00     0.00     0.00     0.00    0.00  -1.00
... 
"
So this exchange between the 8K and 1M pools is added back.

Fixes: 490ea5967b0d ("RDS: IB: move FMR code to its own file")
Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_fmr.c  | 11 +++
 net/rds/ib_rdma.c |  3 ---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib_fmr.c b/net/rds/ib_fmr.c
index 4fe8f4f..6fc5e2c 100644
--- a/net/rds/ib_fmr.c
+++ b/net/rds/ib_fmr.c
@@ -44,6 +44,17 @@ struct rds_ib_mr *rds_ib_alloc_fmr(struct rds_ib_device 
*rds_ibdev, int npages)
else
pool = rds_ibdev->mr_1m_pool;
 
+   if (atomic_read(&pool->dirty_count) >= pool->max_items / 10)
+   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
+
+   /* Switch pools if one of the pool is reaching upper limit */
+   if (atomic_read(&pool->dirty_count) >=  pool->max_items * 9 / 10) {
+   if (pool->pool_type == RDS_IB_MR_8K_POOL)
+   pool = rds_ibdev->mr_1m_pool;
+   else
+   pool = rds_ibdev->mr_8k_pool;
+   }
+
ibmr = rds_ib_try_reuse_ibmr(pool);
if (ibmr)
return ibmr;
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index f7164ac..eb0b1cd 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -441,9 +441,6 @@ struct rds_ib_mr *rds_ib_try_reuse_ibmr(struct 
rds_ib_mr_pool *pool)
struct rds_ib_mr *ibmr = NULL;
int iter = 0;
 
-   if (atomic_read(&pool->dirty_count) >= pool->max_items_soft / 10)
-   queue_delayed_work(rds_ib_mr_wq, &pool->flush_worker, 10);
-
while (1) {
ibmr = rds_ib_reuse_mr(pool);
if (ibmr)
-- 
2.7.4



Re: [net-next PATCH] net/rds: Return proper "tos" value to user-space

2019-03-08 Thread Zhu Yanjun



On 2019/3/9 6:37, Gerd Rausch wrote:

On 07/03/2019 17.37, santosh.shilim...@oracle.com wrote:

--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -736,6 +736,7 @@ static int rds_conn_info_visitor(struct rds_conn_path *cp, 
void *buffer)
   cinfo->next_rx_seq = cp->cp_next_rx_seq;
   cinfo->laddr = conn->c_laddr.s6_addr32[3];
   cinfo->faddr = conn->c_faddr.s6_addr32[3];
+    cinfo->tos = conn->c_tos;
   strncpy(cinfo->transport, conn->c_trans->t_name,
   sizeof(cinfo->transport));
   cinfo->flags = 0;


The transport function populates "iinfo->tos", so 'rds-info -I'
should already be showing the correct output, but we should populate
it here too for the socket option. So this looks good.


"rds-info -I" did show the correct output, but

"rds-info -n" did not:


Thanks.

Reviewed-by: Zhu Yanjun 




% rds-info -n
RDS Connections:
   LocalAddr  RemoteAddr  Tos   NextTX   NextRX Flgs
   192.168.253.2   192.168.253.1  1594694046264 --C-


Acked-by: Santosh Shilimkar 


Thanks,

   Gerd


Re: [PATCH] net: nvidia: forcedeth: Fix two possible concurrency use-after-free bugs

2019-01-08 Thread Zhu Yanjun



On 2019/1/8 20:45, Jia-Ju Bai wrote:

In drivers/net/ethernet/nvidia/forcedeth.c, the functions
nv_start_xmit() and nv_start_xmit_optimized() can be concurrently
executed with nv_poll_controller().

nv_start_xmit
   line 2321: prev_tx_ctx->skb = skb;

nv_start_xmit_optimized
   line 2479: prev_tx_ctx->skb = skb;

nv_poll_controller
   nv_do_nic_poll
 line 4134: spin_lock(&np->lock);
 nv_drain_rxtx
   nv_drain_tx
 nv_release_txskb
   line 2004: dev_kfree_skb_any(tx_skb->skb);

Thus, two possible concurrency use-after-free bugs may occur.

To fix these possible bugs,



Does this really occur? Can you reproduce it?



  the calls to spin_lock_irqsave() in
nv_start_xmit() and nv_start_xmit_optimized() are moved to the
front of "prev_tx_ctx->skb = skb;"

Signed-off-by: Jia-Ju Bai 
---
  drivers/net/ethernet/nvidia/forcedeth.c | 8 
  1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 1d9b0d44ddb6..48fa5a0bd2cb 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2317,6 +2317,8 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
/* set last fragment flag  */
prev_tx->flaglen |= cpu_to_le32(tx_flags_extra);
  
+	spin_lock_irqsave(&np->lock, flags);

+
/* save skb in this slot's context area */
prev_tx_ctx->skb = skb;
  
@@ -2326,8 +2328,6 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, struct net_device *dev)

tx_flags_extra = skb->ip_summed == CHECKSUM_PARTIAL ?
 NV_TX2_CHECKSUM_L3 | NV_TX2_CHECKSUM_L4 : 0;
  
-	spin_lock_irqsave(&np->lock, flags);

-
/* set tx flags */
start_tx->flaglen |= cpu_to_le32(tx_flags | tx_flags_extra);
  
@@ -2475,6 +2475,8 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,

/* set last fragment flag  */
prev_tx->flaglen |= cpu_to_le32(NV_TX2_LASTPACKET);
  
+	spin_lock_irqsave(&np->lock, flags);

+
/* save skb in this slot's context area */
prev_tx_ctx->skb = skb;
  
@@ -2491,8 +2493,6 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff *skb,

else
start_tx->txvlan = 0;
  
-	spin_lock_irqsave(&np->lock, flags);

-
if (np->tx_limit) {
/* Limit the number of outstanding tx. Setup all fragments, but
 * do not set the VALID bit on the first descriptor. Save a 
pointer


[PATCH 1/1] net: rds: remove unnecessary NULL check

2018-12-30 Thread Zhu Yanjun
kfree() itself performs the NULL check, so the check before calling it
is unnecessary.
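
For reference, the kernel guarantees that kfree() accepts NULL:

"
kfree(NULL);	/* defined as a no-op, so callers need no NULL guard */
"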

Signed-off-by: Zhu Yanjun 
---
 net/rds/tcp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index b9bbcf3d6c63..c16f0a362c32 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -623,7 +623,7 @@ static void __net_exit rds_tcp_exit_net(struct net *net)
if (rtn->rds_tcp_sysctl)
unregister_net_sysctl_table(rtn->rds_tcp_sysctl);
 
-   if (net != &init_net && rtn->ctl_table)
+   if (net != &init_net)
kfree(rtn->ctl_table);
 }
 
-- 
2.17.1



[PATCHv2 net-next 1/1] net: rds: use memset to optimize the recv

2018-09-16 Thread Zhu Yanjun
The function rds_inc_init is on the receive path. Using memset
optimizes it.
The test result:

 Before:
 1) + 24.950 us   |rds_inc_init [rds]();
 After:
 1) + 10.990 us   |rds_inc_init [rds]();

Acked-by: Santosh Shilimkar 
Signed-off-by: Zhu Yanjun 
---
V1->V2: a new patch for net-next
---
 net/rds/recv.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 12719653188a..727639dac8a7 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -43,8 +43,6 @@
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 struct in6_addr *saddr)
 {
-   int i;
-
refcount_set(&inc->i_refcount, 1);
INIT_LIST_HEAD(&inc->i_item);
inc->i_conn = conn;
@@ -52,8 +50,7 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_rdma_cookie = 0;
inc->i_rx_tstamp = ktime_set(0, 0);
 
-   for (i = 0; i < RDS_RX_MAX_TRACES; i++)
-   inc->i_rx_lat_trace[i] = 0;
+   memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
-- 
2.17.1



[PATCH 1/1] net: rds: use memset to optimize the recv

2018-09-14 Thread Zhu Yanjun
The function rds_inc_init is on the receive path. Using memset
optimizes it.
The test result:

Before:
1) + 24.950 us   |rds_inc_init [rds]();
After:
1) + 10.990 us   |rds_inc_init [rds]();

Signed-off-by: Zhu Yanjun 
---
 net/rds/recv.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/net/rds/recv.c b/net/rds/recv.c
index 504cd6bcc54c..a9399ddbb7bf 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -43,8 +43,6 @@
 void rds_inc_init(struct rds_incoming *inc, struct rds_connection *conn,
 struct in6_addr *saddr)
 {
-   int i;
-
refcount_set(&inc->i_refcount, 1);
INIT_LIST_HEAD(&inc->i_item);
inc->i_conn = conn;
@@ -53,8 +51,7 @@ void rds_inc_init(struct rds_incoming *inc, struct 
rds_connection *conn,
inc->i_rx_tstamp.tv_sec = 0;
inc->i_rx_tstamp.tv_usec = 0;
 
-   for (i = 0; i < RDS_RX_MAX_TRACES; i++)
-   inc->i_rx_lat_trace[i] = 0;
+   memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
 }
 EXPORT_SYMBOL_GPL(rds_inc_init);
 
-- 
2.17.1



[PATCH 1/1] net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6

2018-08-25 Thread Zhu Yanjun
In IPv4, the newly introduced rdma_read_gids is used to read the
SGID/DGID for the connection; it returns the GIDs correctly for the
RoCE transport as well.

In IPv6, rdma_read_gids is also used. The following are the reasons
rdma_read_gids was introduced:

For RoCE, rdma_addr_get_dgid() returns the MAC address instead of the
DGID for client-side connections.
For RoCE, rdma_addr_get_sgid() doesn't return the correct SGID for
IPv6, or when more than one IP address is assigned to the netdevice.

So the transport-agnostic rdma_read_gids() API provided by the rdma_cm
module is used here as well.

Signed-off-by: Zhu Yanjun 
---
 net/rds/ib.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index c1d97640c0be..eba75c1ba359 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -341,15 +341,10 @@ static int rds6_ib_conn_info_visitor(struct 
rds_connection *conn,
 
if (rds_conn_state(conn) == RDS_CONN_UP) {
struct rds_ib_device *rds_ibdev;
-   struct rdma_dev_addr *dev_addr;
 
ic = conn->c_transport_data;
-   dev_addr = &ic->i_cm_id->route.addr.dev_addr;
-   rdma_addr_get_sgid(dev_addr,
-  (union ib_gid *)&iinfo6->src_gid);
-   rdma_addr_get_dgid(dev_addr,
-  (union ib_gid *)&iinfo6->dst_gid);
-
+   rdma_read_gids(ic->i_cm_id, (union ib_gid *)&iinfo6->src_gid,
+  (union ib_gid *)&iinfo6->dst_gid);
rds_ibdev = ic->rds_ibdev;
iinfo6->max_send_wr = ic->i_send_ring.w_nr;
iinfo6->max_recv_wr = ic->i_recv_ring.w_nr;
-- 
2.17.1



[PATCH 1/1] Revert "rds: ib: add error handle"

2018-04-23 Thread Zhu Yanjun
This reverts commit 3b12f73a5c2977153f28a224392fd4729b50d1dc.

After a long discussion and investigation, it turns out that there is
no memory leak. So the original commit is reverted.

Signed-off-by: Zhu Yanjun 
---
 net/rds/ib_cm.c | 47 +++
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index eea1d86..d64bfaf 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -443,7 +443,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_send_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_scq_vector);
rdsdebug("ib_create_cq send failed: %d\n", ret);
-   goto rds_ibdev_out;
+   goto out;
}
 
ic->i_rcq_vector = ibdev_get_unused_vector(rds_ibdev);
@@ -457,19 +457,19 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_recv_cq = NULL;
ibdev_put_vector(rds_ibdev, ic->i_rcq_vector);
rdsdebug("ib_create_cq recv failed: %d\n", ret);
-   goto send_cq_out;
+   goto out;
}
 
ret = ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
if (ret) {
rdsdebug("ib_req_notify_cq send failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
ret = ib_req_notify_cq(ic->i_recv_cq, IB_CQ_SOLICITED);
if (ret) {
rdsdebug("ib_req_notify_cq recv failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
/* XXX negotiate max send/recv with remote? */
@@ -495,7 +495,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ret = rdma_create_qp(ic->i_cm_id, ic->i_pd, &attr);
if (ret) {
rdsdebug("rdma_create_qp failed: %d\n", ret);
-   goto recv_cq_out;
+   goto out;
}
 
ic->i_send_hdrs = ib_dma_alloc_coherent(dev,
@@ -505,7 +505,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_send_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent send failed\n");
-   goto qp_out;
+   goto out;
}
 
ic->i_recv_hdrs = ib_dma_alloc_coherent(dev,
@@ -515,7 +515,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recv_hdrs) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent recv failed\n");
-   goto send_hdrs_dma_out;
+   goto out;
}
 
ic->i_ack = ib_dma_alloc_coherent(dev, sizeof(struct rds_header),
@@ -523,7 +523,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_ack) {
ret = -ENOMEM;
rdsdebug("ib_dma_alloc_coherent ack failed\n");
-   goto recv_hdrs_dma_out;
+   goto out;
}
 
ic->i_sends = vzalloc_node(ic->i_send_ring.w_nr * sizeof(struct 
rds_ib_send_work),
@@ -531,7 +531,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_sends) {
ret = -ENOMEM;
rdsdebug("send allocation failed\n");
-   goto ack_dma_out;
+   goto out;
}
 
ic->i_recvs = vzalloc_node(ic->i_recv_ring.w_nr * sizeof(struct 
rds_ib_recv_work),
@@ -539,7 +539,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
if (!ic->i_recvs) {
ret = -ENOMEM;
rdsdebug("recv allocation failed\n");
-   goto sends_out;
+   goto out;
}
 
rds_ib_recv_init_ack(ic);
@@ -547,33 +547,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
 ic->i_send_cq, ic->i_recv_cq);
 
-   return ret;
-
-sends_out:
-   vfree(ic->i_sends);
-ack_dma_out:
-   ib_dma_free_coherent(dev, sizeof(struct rds_header),
-ic->i_ack, ic->i_ack_dma);
-recv_hdrs_dma_out:
-   ib_dma_free_coherent(dev, ic->i_recv_ring.w_nr *
-   sizeof(struct rds_header),
-   ic->i_recv_hdrs, ic->i_recv_hdrs_dma);
-send_hdrs_dma_out:
-   ib_dma_free_coherent(dev, ic->i_send_ring.w_nr *
-   sizeof(struct rds_header),
-   ic->i_send_hdrs, ic->i_send_hdrs_dma);
-qp_out:
-   rdma_destroy_qp(ic->i_cm_id);
-recv_cq_out:
-   if (!ib_destroy_cq(ic->i_recv_cq))
-   ic->i_recv_cq = NULL;
-send_cq_out:
-   if (!ib_destroy_cq(ic->i_send_cq))
-   ic->i_send_cq = NULL;
-rds_ibdev_out:
-   rds_ib_remove_conn(rds_ibdev, conn);
+out:
rds_ib_dev_put(rds_ibdev);
-
return ret;
 }
 
-- 
2.7.4



[PATCHv2 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-18 Thread Zhu Yanjun
When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. When the driver accesses this offline device, the
following call trace appears.

"
...
  [] dump_stack+0x63/0x81
  [] panic+0xcc/0x21b
  [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
  [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
  [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
  [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
  [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
  [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.

This is not reasonable: since the HCA device is already offline when it
is accessed, it should not be reset again.

In this patch, when the HCA is offline, the function mlx4_cmd_post
returns -EINVAL. On -EINVAL, the function mlx4_cmd_poll returns
directly instead of resetting the HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Suggested-by: Tariq Toukan 
Signed-off-by: Zhu Yanjun 
---
V1->V2: Follow Tariq's advice to avoid disturbance from other returned
errors. The function mlx4_cmd_post returns either -EIO or -EINVAL. On
-EIO, the HCA device should be reset. On -EINVAL, the function
mlx4_cmd_post was accessing an offline device, so it is not necessary
to reset the HCA; go to the out label directly.
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..df735b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 
in_param, u64 out_param,
 * Device is going through error recovery
 * and cannot accept commands.
 */
+   mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+   ret = -EINVAL;
goto out;
}
 
@@ -610,8 +612,11 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
 
err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
in_modifier, op_modifier, op, CMD_POLL_TOKEN, 0);
-   if (err)
+   if (err) {
+   if (err == -EINVAL)
+   goto out;
goto out_reset;
+   }
 
end = msecs_to_jiffies(timeout) + jiffies;
while (cmd_pending(dev) && time_before(jiffies, end)) {
@@ -710,8 +715,11 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
 
err = mlx4_cmd_post(dev, in_param, out_param ? *out_param : 0,
in_modifier, op_modifier, op, context->token, 1);
-   if (err)
+   if (err) {
+   if (err == -EINVAL)
+   goto out;
goto out_reset;
+   }
 
if (op == MLX4_CMD_SENSE_PORT) {
ret_wait =
-- 
2.7.4



[PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device

2018-04-15 Thread Zhu Yanjun
When a faulty cable is used or the HCA firmware hits an error, the HCA
device goes offline. When the driver accesses this offline device, the
following call trace appears.

"
...
  [] dump_stack+0x63/0x81
  [] panic+0xcc/0x21b
  [] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
  [] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
  [] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
  [] __mlx4_cmd+0xb0/0x160 [mlx4_core]
  [] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
  [] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
...
"
In the above call trace, the function mlx4_cmd_poll calls the function
mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.

This is not reasonable: since the HCA device is already offline when it
is accessed, it should not be reset again.

In this patch, when the HCA is offline, the function mlx4_cmd_post
returns -EINVAL. On -EINVAL, the function mlx4_cmd_poll returns
directly instead of resetting the HCA.

CC: Srinivas Eeda 
CC: Junxiao Bi 
Suggested-by: Håkon Bugge 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c 
b/drivers/net/ethernet/mellanox/mlx4/cmd.c
index 6a9086d..f1c8c42 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
@@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 
in_param, u64 out_param,
 * Device is going through error recovery
 * and cannot accept commands.
 */
+   mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
+   ret = -EINVAL;
goto out;
}
 
@@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
}
 
 out_reset:
+   if (err == -EINVAL)
+   goto out;
+
if (err)
err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
@@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 
in_param, u64 *out_param,
*out_param = context->out_param;
 
 out_reset:
+   if (err == -EINVAL)
+   goto out;
+
if (err)
err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
 out:
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: remove duplicate structure member in rx

2018-01-22 Thread Zhu Yanjun
Since both first_rx_ctx and rx_skb point to the head of the rx context
array, it is not necessary to use two structure members to statically
indicate the head of the rx ctx. So first_rx_ctx is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index a3f6d51..66c665d 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -795,7 +795,7 @@ struct fe_priv {
 */
union ring_type get_rx, put_rx, last_rx;
struct nv_skb_map *get_rx_ctx, *put_rx_ctx;
-   struct nv_skb_map *first_rx_ctx, *last_rx_ctx;
+   struct nv_skb_map *last_rx_ctx;
struct nv_skb_map *rx_skb;
 
union ring_type rx_ring;
@@ -1835,7 +1835,7 @@ static int nv_alloc_rx(struct net_device *dev)
if (unlikely(np->put_rx.orig++ == np->last_rx.orig))
np->put_rx.orig = np->rx_ring.orig;
if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
-   np->put_rx_ctx = np->first_rx_ctx;
+   np->put_rx_ctx = np->rx_skb;
} else {
 packet_dropped:
u64_stats_update_begin(&np->swstats_rx_syncp);
@@ -1877,7 +1877,7 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
if (unlikely(np->put_rx.ex++ == np->last_rx.ex))
np->put_rx.ex = np->rx_ring.ex;
if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
-   np->put_rx_ctx = np->first_rx_ctx;
+   np->put_rx_ctx = np->rx_skb;
} else {
 packet_dropped:
u64_stats_update_begin(&np->swstats_rx_syncp);
@@ -1910,7 +1910,8 @@ static void nv_init_rx(struct net_device *dev)
np->last_rx.orig = &np->rx_ring.orig[np->rx_ring_size-1];
else
np->last_rx.ex = &np->rx_ring.ex[np->rx_ring_size-1];
-   np->get_rx_ctx = np->put_rx_ctx = np->first_rx_ctx = np->rx_skb;
+   np->get_rx_ctx = np->rx_skb;
+   np->put_rx_ctx = np->rx_skb;
np->last_rx_ctx = &np->rx_skb[np->rx_ring_size-1];
 
for (i = 0; i < np->rx_ring_size; i++) {
@@ -2914,7 +2915,7 @@ static int nv_rx_process(struct net_device *dev, int 
limit)
if (unlikely(np->get_rx.orig++ == np->last_rx.orig))
np->get_rx.orig = np->rx_ring.orig;
if (unlikely(np->get_rx_ctx++ == np->last_rx_ctx))
-   np->get_rx_ctx = np->first_rx_ctx;
+   np->get_rx_ctx = np->rx_skb;
 
rx_work++;
}
@@ -3003,7 +3004,7 @@ static int nv_rx_process_optimized(struct net_device 
*dev, int limit)
if (unlikely(np->get_rx.ex++ == np->last_rx.ex))
np->get_rx.ex = np->rx_ring.ex;
if (unlikely(np->get_rx_ctx++ == np->last_rx_ctx))
-   np->get_rx_ctx = np->first_rx_ctx;
+   np->get_rx_ctx = np->rx_skb;
 
rx_work++;
}
-- 
2.7.4



[PATCHv2 net-next 1/1] forcedeth: remove unused variable

2018-01-16 Thread Zhu Yanjun
The variable miistat is never used, so it is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
v1->v2: Keep readl function
---
 drivers/net/ethernet/nvidia/forcedeth.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 21e15cb..a3f6d51 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5510,11 +5510,9 @@ static int nv_open(struct net_device *dev)
/* One manual link speed update: Interrupts are enabled, future link
 * speed changes cause interrupts and are handled by nv_link_irq().
 */
-   {
-   u32 miistat;
-   miistat = readl(base + NvRegMIIStatus);
-   writel(NVREG_MIISTAT_MASK_ALL, base + NvRegMIIStatus);
-   }
+   readl(base + NvRegMIIStatus);
+   writel(NVREG_MIISTAT_MASK_ALL, base + NvRegMIIStatus);
+
/* set linkspeed to invalid value, thus force nv_update_linkspeed
 * to init hw */
np->linkspeed = 0;
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: remove unused variable

2018-01-14 Thread Zhu Yanjun
The variable miistat is never used, so it is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 21e15cb..c518f8c 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -5510,11 +5510,8 @@ static int nv_open(struct net_device *dev)
/* One manual link speed update: Interrupts are enabled, future link
 * speed changes cause interrupts and are handled by nv_link_irq().
 */
-   {
-   u32 miistat;
-   miistat = readl(base + NvRegMIIStatus);
-   writel(NVREG_MIISTAT_MASK_ALL, base + NvRegMIIStatus);
-   }
+   writel(NVREG_MIISTAT_MASK_ALL, base + NvRegMIIStatus);
+
/* set linkspeed to invalid value, thus force nv_update_linkspeed
 * to init hw */
np->linkspeed = 0;
-- 
2.7.4



[PATCH NET-NEXT 1/1] forcedeth: remove duplicate structure member in rx

2018-01-04 Thread Zhu Yanjun
Since both first_rx and rx_ring point to the head of the rx ring, it
is not necessary to use two structure members to statically indicate
the head of the rx ring. So first_rx is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index a79b9f8..21e15cb 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -793,7 +793,7 @@ struct fe_priv {
/* rx specific fields.
 * Locking: Within irq hander or disable_irq+spin_lock(&np->lock);
 */
-   union ring_type get_rx, put_rx, first_rx, last_rx;
+   union ring_type get_rx, put_rx, last_rx;
struct nv_skb_map *get_rx_ctx, *put_rx_ctx;
struct nv_skb_map *first_rx_ctx, *last_rx_ctx;
struct nv_skb_map *rx_skb;
@@ -1812,7 +1812,7 @@ static int nv_alloc_rx(struct net_device *dev)
struct ring_desc *less_rx;
 
less_rx = np->get_rx.orig;
-   if (less_rx-- == np->first_rx.orig)
+   if (less_rx-- == np->rx_ring.orig)
less_rx = np->last_rx.orig;
 
while (np->put_rx.orig != less_rx) {
@@ -1833,7 +1833,7 @@ static int nv_alloc_rx(struct net_device *dev)
wmb();
np->put_rx.orig->flaglen = cpu_to_le32(np->rx_buf_sz | 
NV_RX_AVAIL);
if (unlikely(np->put_rx.orig++ == np->last_rx.orig))
-   np->put_rx.orig = np->first_rx.orig;
+   np->put_rx.orig = np->rx_ring.orig;
if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
np->put_rx_ctx = np->first_rx_ctx;
} else {
@@ -1853,7 +1853,7 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
struct ring_desc_ex *less_rx;
 
less_rx = np->get_rx.ex;
-   if (less_rx-- == np->first_rx.ex)
+   if (less_rx-- == np->rx_ring.ex)
less_rx = np->last_rx.ex;
 
while (np->put_rx.ex != less_rx) {
@@ -1875,7 +1875,7 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
wmb();
np->put_rx.ex->flaglen = cpu_to_le32(np->rx_buf_sz | 
NV_RX2_AVAIL);
if (unlikely(np->put_rx.ex++ == np->last_rx.ex))
-   np->put_rx.ex = np->first_rx.ex;
+   np->put_rx.ex = np->rx_ring.ex;
if (unlikely(np->put_rx_ctx++ == np->last_rx_ctx))
np->put_rx_ctx = np->first_rx_ctx;
} else {
@@ -1903,7 +1903,8 @@ static void nv_init_rx(struct net_device *dev)
struct fe_priv *np = netdev_priv(dev);
int i;
 
-   np->get_rx = np->put_rx = np->first_rx = np->rx_ring;
+   np->get_rx = np->rx_ring;
+   np->put_rx = np->rx_ring;
 
if (!nv_optimized(np))
np->last_rx.orig = &np->rx_ring.orig[np->rx_ring_size-1];
@@ -2911,7 +2912,7 @@ static int nv_rx_process(struct net_device *dev, int 
limit)
u64_stats_update_end(&np->swstats_rx_syncp);
 next_pkt:
if (unlikely(np->get_rx.orig++ == np->last_rx.orig))
-   np->get_rx.orig = np->first_rx.orig;
+   np->get_rx.orig = np->rx_ring.orig;
if (unlikely(np->get_rx_ctx++ == np->last_rx_ctx))
np->get_rx_ctx = np->first_rx_ctx;
 
@@ -3000,7 +3001,7 @@ static int nv_rx_process_optimized(struct net_device 
*dev, int limit)
}
 next_pkt:
if (unlikely(np->get_rx.ex++ == np->last_rx.ex))
-   np->get_rx.ex = np->first_rx.ex;
+   np->get_rx.ex = np->rx_ring.ex;
if (unlikely(np->get_rx_ctx++ == np->last_rx_ctx))
np->get_rx_ctx = np->first_rx_ctx;
 
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: optimize the rx with likely

2017-12-25 Thread Zhu Yanjun
In the rx fastpath, the function netdev_alloc_skb rarely fails.
Therefore, a likely() optimization is added to this error check
conditional.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 49d6d78..a79b9f8 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1817,7 +1817,7 @@ static int nv_alloc_rx(struct net_device *dev)
 
while (np->put_rx.orig != less_rx) {
struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
-   if (skb) {
+   if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
 skb->data,
@@ -1858,7 +1858,7 @@ static int nv_alloc_rx_optimized(struct net_device *dev)
 
while (np->put_rx.ex != less_rx) {
struct sk_buff *skb = netdev_alloc_skb(dev, np->rx_buf_sz + 
NV_RX_ALLOC_PAD);
-   if (skb) {
+   if (likely(skb)) {
np->put_rx_ctx->skb = skb;
np->put_rx_ctx->dma = dma_map_single(&np->pci_dev->dev,
 skb->data,
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: remove duplicate structure member in xmit

2017-12-16 Thread Zhu Yanjun
Since both first_tx_ctx and tx_skb point to the head of the tx context
array, it is not necessary to use two structure members to statically
indicate the head of the tx ctx. So first_tx_ctx is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index cadea67..49d6d78 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -824,7 +824,7 @@ struct fe_priv {
 */
union ring_type get_tx, put_tx, last_tx;
struct nv_skb_map *get_tx_ctx, *put_tx_ctx;
-   struct nv_skb_map *first_tx_ctx, *last_tx_ctx;
+   struct nv_skb_map *last_tx_ctx;
struct nv_skb_map *tx_skb;
 
union ring_type tx_ring;
@@ -1939,7 +1939,8 @@ static void nv_init_tx(struct net_device *dev)
np->last_tx.orig = &np->tx_ring.orig[np->tx_ring_size-1];
else
np->last_tx.ex = &np->tx_ring.ex[np->tx_ring_size-1];
-   np->get_tx_ctx = np->put_tx_ctx = np->first_tx_ctx = np->tx_skb;
+   np->get_tx_ctx = np->tx_skb;
+   np->put_tx_ctx = np->tx_skb;
np->last_tx_ctx = &np->tx_skb[np->tx_ring_size-1];
netdev_reset_queue(np->dev);
np->tx_pkts_in_progress = 0;
@@ -2251,7 +2252,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
if (unlikely(put_tx++ == np->last_tx.orig))
put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
-   np->put_tx_ctx = np->first_tx_ctx;
+   np->put_tx_ctx = np->tx_skb;
} while (size);
 
/* setup the fragments */
@@ -2277,7 +2278,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
do {
nv_unmap_txskb(np, start_tx_ctx);
if (unlikely(tmp_tx_ctx++ == 
np->last_tx_ctx))
-   tmp_tx_ctx = np->first_tx_ctx;
+   tmp_tx_ctx = np->tx_skb;
} while (tmp_tx_ctx != np->put_tx_ctx);
dev_kfree_skb_any(skb);
np->put_tx_ctx = start_tx_ctx;
@@ -2297,7 +2298,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
if (unlikely(put_tx++ == np->last_tx.orig))
put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
-   np->put_tx_ctx = np->first_tx_ctx;
+   np->put_tx_ctx = np->tx_skb;
} while (frag_size);
}
 
@@ -2306,7 +2307,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
else
prev_tx = put_tx - 1;
 
-   if (unlikely(np->put_tx_ctx == np->first_tx_ctx))
+   if (unlikely(np->put_tx_ctx == np->tx_skb))
prev_tx_ctx = np->last_tx_ctx;
else
prev_tx_ctx = np->put_tx_ctx - 1;
@@ -2409,7 +2410,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
if (unlikely(put_tx++ == np->last_tx.ex))
put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
-   np->put_tx_ctx = np->first_tx_ctx;
+   np->put_tx_ctx = np->tx_skb;
} while (size);
 
/* setup the fragments */
@@ -2435,7 +2436,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
do {
nv_unmap_txskb(np, start_tx_ctx);
if (unlikely(tmp_tx_ctx++ == 
np->last_tx_ctx))
-   tmp_tx_ctx = np->first_tx_ctx;
+   tmp_tx_ctx = np->tx_skb;
} while (tmp_tx_ctx != np->put_tx_ctx);
dev_kfree_skb_any(skb);
np->put_tx_ctx = start_tx_ctx;
@@ -2455,7 +2456,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
if (unlikely(put_tx++ == np->last_tx.ex))
put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
-   np->put_tx_ctx = np->first_tx_ctx;
+   np->put_tx_ctx = np->tx_skb;

[PATCHv2 net-next 1/1] forcedeth: remove unnecessary structure member

2017-12-09 Thread Zhu Yanjun
Since both tx_ring and first_tx point to the head of the tx ring, it is
not necessary to keep two structure members that statically indicate the
same head. So first_tx is removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 53614ed..cadea67 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -822,7 +822,7 @@ struct fe_priv {
/*
 * tx specific fields.
 */
-   union ring_type get_tx, put_tx, first_tx, last_tx;
+   union ring_type get_tx, put_tx, last_tx;
struct nv_skb_map *get_tx_ctx, *put_tx_ctx;
struct nv_skb_map *first_tx_ctx, *last_tx_ctx;
struct nv_skb_map *tx_skb;
@@ -1932,7 +1932,8 @@ static void nv_init_tx(struct net_device *dev)
struct fe_priv *np = netdev_priv(dev);
int i;
 
-   np->get_tx = np->put_tx = np->first_tx = np->tx_ring;
+   np->get_tx = np->tx_ring;
+   np->put_tx = np->tx_ring;
 
if (!nv_optimized(np))
np->last_tx.orig = &np->tx_ring.orig[np->tx_ring_size-1];
@@ -2248,7 +2249,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2294,13 +2295,13 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.orig))
+   if (unlikely(put_tx == np->tx_ring.orig))
prev_tx = np->last_tx.orig;
else
prev_tx = put_tx - 1;
@@ -2406,7 +2407,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2452,13 +2453,13 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.ex))
+   if (unlikely(put_tx == np->tx_ring.ex))
prev_tx = np->last_tx.ex;
else
prev_tx = put_tx - 1;
@@ -2597,7 +2598,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
}
}
if (unlikely(np->get_tx.orig++ == np->last_tx.orig))
-   np->get_tx.orig = np->first_tx.orig;
+   np->get_tx.orig = np->tx_ring.orig;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
@@ -2651,7 +2652,7 @@ static int nv_tx_done_optimized(struct net_device *dev, 
int limit)
}
 
if (unlikely(np->get_tx.ex++ == np->last_tx.ex))
-   np->get_tx.ex = np->first_tx.ex;
+   np->get_tx.ex = np->tx_ring.ex;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: remove unnecessary variable

2017-12-06 Thread Zhu Yanjun
Since both tx_ring and first_tx point to the head of the tx ring, it is
not necessary to keep two variables for the same purpose. So first_tx is
removed.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 53614ed..cadea67 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -822,7 +822,7 @@ struct fe_priv {
/*
 * tx specific fields.
 */
-   union ring_type get_tx, put_tx, first_tx, last_tx;
+   union ring_type get_tx, put_tx, last_tx;
struct nv_skb_map *get_tx_ctx, *put_tx_ctx;
struct nv_skb_map *first_tx_ctx, *last_tx_ctx;
struct nv_skb_map *tx_skb;
@@ -1932,7 +1932,8 @@ static void nv_init_tx(struct net_device *dev)
struct fe_priv *np = netdev_priv(dev);
int i;
 
-   np->get_tx = np->put_tx = np->first_tx = np->tx_ring;
+   np->get_tx = np->tx_ring;
+   np->put_tx = np->tx_ring;
 
if (!nv_optimized(np))
np->last_tx.orig = &np->tx_ring.orig[np->tx_ring_size-1];
@@ -2248,7 +2249,7 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2294,13 +2295,13 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.orig))
-   put_tx = np->first_tx.orig;
+   put_tx = np->tx_ring.orig;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.orig))
+   if (unlikely(put_tx == np->tx_ring.orig))
prev_tx = np->last_tx.orig;
else
prev_tx = put_tx - 1;
@@ -2406,7 +2407,7 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
offset += bcnt;
size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (size);
@@ -2452,13 +2453,13 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
offset += bcnt;
frag_size -= bcnt;
if (unlikely(put_tx++ == np->last_tx.ex))
-   put_tx = np->first_tx.ex;
+   put_tx = np->tx_ring.ex;
if (unlikely(np->put_tx_ctx++ == np->last_tx_ctx))
np->put_tx_ctx = np->first_tx_ctx;
} while (frag_size);
}
 
-   if (unlikely(put_tx == np->first_tx.ex))
+   if (unlikely(put_tx == np->tx_ring.ex))
prev_tx = np->last_tx.ex;
else
prev_tx = put_tx - 1;
@@ -2597,7 +2598,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
}
}
if (unlikely(np->get_tx.orig++ == np->last_tx.orig))
-   np->get_tx.orig = np->first_tx.orig;
+   np->get_tx.orig = np->tx_ring.orig;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
@@ -2651,7 +2652,7 @@ static int nv_tx_done_optimized(struct net_device *dev, 
int limit)
}
 
if (unlikely(np->get_tx.ex++ == np->last_tx.ex))
-   np->get_tx.ex = np->first_tx.ex;
+   np->get_tx.ex = np->tx_ring.ex;
if (unlikely(np->get_tx_ctx++ == np->last_tx_ctx))
np->get_tx_ctx = np->first_tx_ctx;
}
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: optimize the xmit with unlikely

2017-11-27 Thread Zhu Yanjun
In xmit, it is very unlikely that TX_ERROR occurs. So an unlikely()
annotation optimizes the xmit process.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index ac8439c..e65276f 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2563,7 +2563,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
 
if (np->desc_ver == DESC_VER_1) {
if (flags & NV_TX_LASTPACKET) {
-   if (flags & NV_TX_ERROR) {
+   if (unlikely(flags & NV_TX_ERROR)) {
if ((flags & NV_TX_RETRYERROR)
&& !(flags & NV_TX_RETRYCOUNT_MASK))
nv_legacybackoff_reseed(dev);
@@ -2580,7 +2580,7 @@ static int nv_tx_done(struct net_device *dev, int limit)
}
} else {
if (flags & NV_TX2_LASTPACKET) {
-   if (flags & NV_TX2_ERROR) {
+   if (unlikely(flags & NV_TX2_ERROR)) {
if ((flags & NV_TX2_RETRYERROR)
&& !(flags & 
NV_TX2_RETRYCOUNT_MASK))
nv_legacybackoff_reseed(dev);
@@ -2626,7 +2626,7 @@ static int nv_tx_done_optimized(struct net_device *dev, 
int limit)
nv_unmap_txskb(np, np->get_tx_ctx);
 
if (flags & NV_TX2_LASTPACKET) {
-   if (flags & NV_TX2_ERROR) {
+   if (unlikely(flags & NV_TX2_ERROR)) {
if ((flags & NV_TX2_RETRYERROR)
&& !(flags & NV_TX2_RETRYCOUNT_MASK)) {
if (np->driver_data & DEV_HAS_GEAR_MODE)
-- 
2.7.4



[PATCHv2 net-next 1/1] forcedeth: replace pci_unmap_page with dma_unmap_page

2017-11-19 Thread Zhu Yanjun
The function pci_unmap_page is obsolete. So it is replaced with
the function dma_unmap_page.
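
A side-by-side sketch of the two calling conventions (kernel C, so it
builds only inside a kernel tree; sketch_unmap is a hypothetical helper):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    static void sketch_unmap(struct pci_dev *pdev, dma_addr_t dma, size_t len)
    {
            /* Legacy PCI DMA wrapper, since removed from the kernel:
             *
             *   pci_unmap_page(pdev, dma, len, PCI_DMA_TODEVICE);
             *
             * Generic DMA API equivalent: takes a struct device and a
             * typed enum dma_data_direction instead of an int flag.
             */
            dma_unmap_page(&pdev->dev, dma, len, DMA_TO_DEVICE);
    }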

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
V1->V2: fix direction flag error.
---
 drivers/net/ethernet/nvidia/forcedeth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index ac8439c..481876b 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1986,9 +1986,9 @@ static void nv_unmap_txskb(struct fe_priv *np, struct 
nv_skb_map *tx_skb)
 tx_skb->dma_len,
 DMA_TO_DEVICE);
else
-   pci_unmap_page(np->pci_dev, tx_skb->dma,
+   dma_unmap_page(&np->pci_dev->dev, tx_skb->dma,
   tx_skb->dma_len,
-  PCI_DMA_TODEVICE);
+  DMA_TO_DEVICE);
tx_skb->dma = 0;
}
 }
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: replace pci_unmap_page with dma_unmap_page

2017-11-19 Thread Zhu Yanjun
The function pci_unmap_page is obsolete. So it is replaced with
the function dma_unmap_page.

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index ac8439c..0febe41 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -1986,7 +1986,7 @@ static void nv_unmap_txskb(struct fe_priv *np, struct 
nv_skb_map *tx_skb)
 tx_skb->dma_len,
 DMA_TO_DEVICE);
else
-   pci_unmap_page(np->pci_dev, tx_skb->dma,
+   dma_unmap_page(&np->pci_dev->dev, tx_skb->dma,
   tx_skb->dma_len,
   PCI_DMA_TODEVICE);
tx_skb->dma = 0;
-- 
2.7.4



[PATCHv3 1/1] bnx2x: fix slowpath null crash

2017-11-11 Thread Zhu Yanjun
When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem allocates dma
failure. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
bnx2x_set_storm_rx_mode
bnx2x_set_q_rx_mode
bnx2x_set_q_rx_mode
bnx2x_config_rx_mode
bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.
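
The shape of the fix, as a standalone sketch (userspace C; nic_load and
nic_unload are hypothetical stand-ins for bnx2x_nic_load and
bnx2x_nic_unload):

    #include <errno.h>
    #include <stdio.h>

    static int  nic_load(void)   { return -ENOMEM; }  /* stand-in */
    static void nic_unload(void) { }

    int main(void)
    {
            /* on allocation failure, tear down and retry exactly once,
             * then report the error instead of crashing later
             */
            if (nic_load() == -ENOMEM) {
                    nic_unload();
                    if (nic_load())
                            fprintf(stderr, "Open the NIC fails again!\n");
            }
            return 0;
    }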

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
Acked-by: Ariel Elior 
---
v2->v3
Changes: fix the style of comments, add the leading space
V1->v2
Changes: add Acker and remove unnecessary brackets
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c12b4d3..fbd302a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -9332,7 +9332,7 @@ void bnx2x_chip_cleanup(struct bnx2x *bp, int 
unload_mode, bool keep_link)
/* Schedule the rx_mode command */
if (test_bit(BNX2X_FILTER_RX_MODE_PENDING, &bp->sp_state))
set_bit(BNX2X_FILTER_RX_MODE_SCHED, &bp->sp_state);
-   else
+   else if (bp->slowpath)
bnx2x_set_storm_rx_mode(bp);
 
/* Cleanup multicast configuration */
@@ -10271,8 +10271,15 @@ static void bnx2x_sp_rtnl_task(struct work_struct 
*work)
smp_mb();
 
bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);
-   bnx2x_nic_load(bp, LOAD_NORMAL);
-
+   /* If the load fails with an allocation failure, the nic
+* is rebooted once more. If open still fails, an error
+* message notifies the user.
+*/
+   if (bnx2x_nic_load(bp, LOAD_NORMAL) == -ENOMEM) {
+   bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);
+   if (bnx2x_nic_load(bp, LOAD_NORMAL))
+   BNX2X_ERR("Open the NIC fails again!\n");
+   }
rtnl_unlock();
return;
}
-- 
2.7.4



[PATCH net-next 1/1] forcedeth: remove redundant assignments in xmit

2017-11-10 Thread Zhu Yanjun
In the xmit process, the variables prev_tx and prev_tx_ctx are set on
every loop iteration. In fact, it is enough to set them once, after the
loops complete. After long-term testing, the throughput performance is
better than before.
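
A standalone sketch of the change (userspace C, hypothetical names): the
previous-slot pointer is derived once after the loop rather than tracked
on every iteration:

    #include <stdio.h>

    #define RING_SIZE 4

    int main(void)
    {
            int ring[RING_SIZE];
            int *last = &ring[RING_SIZE - 1];
            int *put = ring, *prev;

            for (int i = 0; i < 3; i++) {  /* no per-iteration prev = put */
                    *put = i;
                    if (put++ == last)
                            put = ring;
            }

            /* compute prev once, handling the wrap-around case */
            if (put == ring)
                    prev = last;
            else
                    prev = put - 1;

            printf("prev holds %d\n", *prev);
            return 0;
    }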

CC: Srinivas Eeda 
CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
---
 drivers/net/ethernet/nvidia/forcedeth.c | 28 
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/nvidia/forcedeth.c 
b/drivers/net/ethernet/nvidia/forcedeth.c
index 63a9e1e..22912e7 100644
--- a/drivers/net/ethernet/nvidia/forcedeth.c
+++ b/drivers/net/ethernet/nvidia/forcedeth.c
@@ -2218,8 +2218,6 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
 
/* setup the header buffer */
do {
-   prev_tx = put_tx;
-   prev_tx_ctx = np->put_tx_ctx;
bcnt = (size > NV_TX2_TSO_MAX_SIZE) ? NV_TX2_TSO_MAX_SIZE : 
size;
np->put_tx_ctx->dma = dma_map_single(&np->pci_dev->dev,
 skb->data + offset, bcnt,
@@ -2254,8 +2252,6 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
offset = 0;
 
do {
-   prev_tx = put_tx;
-   prev_tx_ctx = np->put_tx_ctx;
if (!start_tx_ctx)
start_tx_ctx = tmp_tx_ctx = np->put_tx_ctx;
 
@@ -2296,6 +2292,16 @@ static netdev_tx_t nv_start_xmit(struct sk_buff *skb, 
struct net_device *dev)
} while (frag_size);
}
 
+   if (unlikely(put_tx == np->first_tx.orig))
+   prev_tx = np->last_tx.orig;
+   else
+   prev_tx = put_tx - 1;
+
+   if (unlikely(np->put_tx_ctx == np->first_tx_ctx))
+   prev_tx_ctx = np->last_tx_ctx;
+   else
+   prev_tx_ctx = np->put_tx_ctx - 1;
+
/* set last fragment flag  */
prev_tx->flaglen |= cpu_to_le32(tx_flags_extra);
 
@@ -2368,8 +2374,6 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
 
/* setup the header buffer */
do {
-   prev_tx = put_tx;
-   prev_tx_ctx = np->put_tx_ctx;
bcnt = (size > NV_TX2_TSO_MAX_SIZE) ? NV_TX2_TSO_MAX_SIZE : 
size;
np->put_tx_ctx->dma = dma_map_single(&np->pci_dev->dev,
 skb->data + offset, bcnt,
@@ -2405,8 +2409,6 @@ static netdev_tx_t nv_start_xmit_optimized(struct sk_buff 
*skb,
offset = 0;
 
do {
-   prev_tx = put_tx;
-   prev_tx_ctx = np->put_tx_ctx;
bcnt = (frag_size > NV_TX2_TSO_MAX_SIZE) ? 
NV_TX2_TSO_MAX_SIZE : frag_size;
if (!start_tx_ctx)
start_tx_ctx = tmp_tx_ctx = np->put_tx_ctx;
@@ -2447,6 +2449,16 @@ static netdev_tx_t nv_start_xmit_optimized(struct 
sk_buff *skb,
} while (frag_size);
}
 
+   if (unlikely(put_tx == np->first_tx.ex))
+   prev_tx = np->last_tx.ex;
+   else
+   prev_tx = put_tx - 1;
+
+   if (unlikely(np->put_tx_ctx == np->first_tx_ctx))
+   prev_tx_ctx = np->last_tx_ctx;
+   else
+   prev_tx_ctx = np->put_tx_ctx - 1;
+
/* set last fragment flag  */
prev_tx->flaglen |= cpu_to_le32(NV_TX2_LASTPACKET);
 
-- 
2.7.4



[PATCHv2 1/1] bnx2x: fix slowpath null crash

2017-11-08 Thread Zhu Yanjun
When "NETDEV WATCHDOG: em4 (bnx2x): transmit queue 2 timed out" occurs,
BNX2X_SP_RTNL_TX_TIMEOUT is set. In the function bnx2x_sp_rtnl_task,
bnx2x_nic_unload and bnx2x_nic_load are executed to shutdown and open
NIC. In the function bnx2x_nic_load, bnx2x_alloc_mem allocates dma
failure. The message "bnx2x: [bnx2x_alloc_mem:8399(em4)]Can't
allocate memory" pops out. The variable slowpath is set to NULL.
When shutdown the NIC, the function bnx2x_nic_unload is called. In
the function bnx2x_nic_unload, the following functions are executed.
bnx2x_chip_cleanup
bnx2x_set_storm_rx_mode
bnx2x_set_q_rx_mode
bnx2x_set_q_rx_mode
bnx2x_config_rx_mode
bnx2x_set_rx_mode_e2
In the function bnx2x_set_rx_mode_e2, the variable slowpath is operated.
Then the crash occurs.
To fix this crash, the variable slowpath is checked. And in the function
bnx2x_sp_rtnl_task, after dma memory allocation fails, another shutdown
and open NIC is executed.

CC: Joe Jin 
CC: Junxiao Bi 
Signed-off-by: Zhu Yanjun 
Acked-by: Ariel Elior 
---
V1->v2
Changes: add Acker and remove unnecessary brackets
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index c12b4d3..fbd302a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -9332,7 +9332,7 @@ void bnx2x_chip_cleanup(struct bnx2x *bp, int 
unload_mode, bool keep_link)
/* Schedule the rx_mode command */
if (test_bit(BNX2X_FILTER_RX_MODE_PENDING, &bp->sp_state))
set_bit(BNX2X_FILTER_RX_MODE_SCHED, &bp->sp_state);
-   else
+   else if (bp->slowpath)
bnx2x_set_storm_rx_mode(bp);
 
/* Cleanup multicast configuration */
@@ -10271,8 +10271,15 @@ static void bnx2x_sp_rtnl_task(struct work_struct 
*work)
smp_mb();
 
bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);
-   bnx2x_nic_load(bp, LOAD_NORMAL);
-
+   /*If the load fails with an allocation failure, the nic
+*is rebooted once more. If open still fails, an error
+*message notifies the user.
+*/
+   if (bnx2x_nic_load(bp, LOAD_NORMAL) == -ENOMEM) {
+   bnx2x_nic_unload(bp, UNLOAD_NORMAL, true);
+   if (bnx2x_nic_load(bp, LOAD_NORMAL))
+   BNX2X_ERR("Open the NIC fails again!\n");
+   }
rtnl_unlock();
return;
}
-- 
2.7.4


