Re: mlx5 bug in error path of mlx5e_open_channel()

2016-11-01 Thread Saeed Mahameed


On 11/01/2016 04:44 PM, Jesper Dangaard Brouer wrote:
> 
> In driver mlx5 function mlx5e_open_channel() does not handle error
> path correctly. (Tested by letting mlx5e_create_rq fail with -ENOMEM,
> propagates to mlx5e_open_rq)
> 
> This first seemed related to commit b5503b994ed5 ("net/mlx5e: XDP TX
> forwarding support").  As a failed call of mlx5e_open_rq() always
> calls mlx5e_close_sq(>xdp_sq) on "xdp_sq" even-though a XDP program
> is not attached.
> 

Indeed, Nice catch.

> Fixing this like:
> 
> @@ -1504,24 +1533,38 @@ static int mlx5e_open_channel(struct mlx5e_priv 
> *priv, int ix,
>  
> c->xdp = !!priv->xdp_prog;
> err = mlx5e_open_rq(c, >rq, >rq);
> -   if (err)
> -   goto err_close_xdp_sq;
> +   if (err) {
> +   if (c->xdp)
> +   goto err_close_xdp_sq;
> +   else
> +   goto err_close_sqs;
> +   }
> 
> The fix does remove one problem, but the error path still cause the
> kernel to crash.  This time it seems related to correct disabling of
> NAPI polling before disabling the queues.
> 

Well a more proper fix will be to add a xdp check in close_xdp_sq error flow,
rather than complicating the error handling branching decision.

i.e:
Keep:
 err = mlx5e_open_rq(c, >rq, >rq);
 if (err)
  goto err_close_xdp_sq;
[...]
err_close_xdp_sq:
-mlx5e_close_sq(>xdp_sq);   
+if (priv->xdp_prog)
+ mlx5e_close_sq(>xdp_sq);

One more thing, the error flow handling is missing 
mlx5e_close_cq(>xdp_sq.cq);
which might be related to the other bug you reported below.

> Now with another error:
> 
>  XXX: call mlx5e_close_sqs(c)
>  BUG: unable to handle kernel NULL pointer dereference at   (null)
>  IP: [<  (null)>]   (null)
>  PGD 401e00067
>  PUD 40746e067 PMD 0
>  Oops: 0010 [#1] PREEMPT SMP
>  Modules linked in: mlx5_core coretemp kvm_intel kvm irqbypass intel_cstate 
> mxm_wmi i2c_i801 i2c_smbus]
>  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc3-page_pool04+ #124
>  Hardware name: To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
>  task: 81c0c4c0 task.stack: 81c0
>  RIP: 0010:[<>]  [<  (null)>]   (null)
>  RSP: 0018:88041fa03e70  EFLAGS: 00010286
>  RAX:  RBX: 880401ecc000 RCX: 0005
>  RDX:  RSI: 880401c38000 RDI: 880401ecc000
>  RBP: 88041fa03e88 R08: 0001 R09: 8803ea6a7230
>  R10:  R11: 0040 R12: 880401c38000
>  R13: 880401ecf148 R14: 0040 R15: 880401ecc000
>  FS:  () GS:88041fa0() knlGS:
>  CS:  0010 DS:  ES:  CR0: 80050033
>  CR2:  CR3: 00040c468000 CR4: 001406f0
>  Stack:
>   a02e62e0  0001 88041fa03ed0
>   a02e84c2  0040 880401ecf148
>   0040  012c 
>  Call Trace:
>[  428.032595]  [] ? mlx5e_post_rx_wqes+0x80/0xc0 
> [mlx5_core]
>   [] mlx5e_napi_poll+0xf2/0x530 [mlx5_core]
>   [] net_rx_action+0x1fc/0x350
>   [] __do_softirq+0xc8/0x2c6
>   [] irq_exit+0xbe/0xd0
>   [] do_IRQ+0x54/0xd0
>   [] common_interrupt+0x7f/0x7f
>[  428.075157]  [] ? _raw_spin_unlock_irq+0x10/0x20
>   [] ? finish_task_switch+0x78/0x200
>   [] __schedule+0x17a/0x670
>   [] schedule+0x3d/0x90
>   [] schedule_preempt_disabled+0x15/0x20
>   [] cpu_startup_entry+0x12c/0x1c0
>   [] rest_init+0x84/0x90
>   [] start_kernel+0x3fe/0x40b
>   [] x86_64_start_reservations+0x2a/0x2c
>   [] x86_64_start_kernel+0x168/0x176
>  Code:  Bad RIP value.
>  RIP  [<  (null)>]   (null)
>   RSP 
>  CR2: 
>  ---[ end trace a871278f0d0523ac ]---
> 
> Could you please look at fixing your driver?
> 

Will handle it ASAP, Thank you Jesper.

-Saeed.


Re: mlx5 bug in error path of mlx5e_open_channel()

2016-11-01 Thread Jesper Dangaard Brouer

On Tue, 1 Nov 2016 17:48:31 +0200 Saeed Mahameed  wrote:

> On 11/01/2016 04:44 PM, Jesper Dangaard Brouer wrote:
> > 
> > In driver mlx5 function mlx5e_open_channel() does not handle error
> > path correctly. (Tested by letting mlx5e_create_rq fail with -ENOMEM,
> > propagates to mlx5e_open_rq)
> > 
> > This first seemed related to commit b5503b994ed5 ("net/mlx5e: XDP TX
> > forwarding support").  As a failed call of mlx5e_open_rq() always
> > calls mlx5e_close_sq(>xdp_sq) on "xdp_sq" even-though a XDP program
> > is not attached.
> >   
> 
> Indeed, Nice catch.
> 
> > Fixing this like:
> > 
> > @@ -1504,24 +1533,38 @@ static int mlx5e_open_channel(struct mlx5e_priv 
> > *priv, int ix,
> >  
> > c->xdp = !!priv->xdp_prog;
> > err = mlx5e_open_rq(c, >rq, >rq);
> > -   if (err)
> > -   goto err_close_xdp_sq;
> > +   if (err) {
> > +   if (c->xdp)
> > +   goto err_close_xdp_sq;
> > +   else
> > +   goto err_close_sqs;
> > +   }
> > 
> > The fix does remove one problem, but the error path still cause the
> > kernel to crash.  This time it seems related to correct disabling of
> > NAPI polling before disabling the queues.
> >   
> 
> Well a more proper fix will be to add a xdp check in close_xdp_sq error flow,
> rather than complicating the error handling branching decision.
> 
> i.e:
> Keep:
>  err = mlx5e_open_rq(c, >rq, >rq);
>  if (err)
>   goto err_close_xdp_sq;
> [...]
> err_close_xdp_sq:
> -mlx5e_close_sq(>xdp_sq);   
> +if (priv->xdp_prog)
> + mlx5e_close_sq(>xdp_sq);

Agree, that would be better.


> One more thing, the error flow handling is missing 
> mlx5e_close_cq(>xdp_sq.cq);
> which might be related to the other bug you reported below.
> 
> > Now with another error:
> > 
> >  XXX: call mlx5e_close_sqs(c)
> >  BUG: unable to handle kernel NULL pointer dereference at   (null)
> >  IP: [<  (null)>]   (null)
> >  PGD 401e00067
> >  PUD 40746e067 PMD 0
> >  Oops: 0010 [#1] PREEMPT SMP
> >  Modules linked in: mlx5_core coretemp kvm_intel kvm irqbypass intel_cstate 
> > mxm_wmi i2c_i801 i2c_smbus]
> >  CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc3-page_pool04+ #124
> >  Hardware name: To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
> >  task: 81c0c4c0 task.stack: 81c0
> >  RIP: 0010:[<>]  [<  (null)>]   (null)
> >  RSP: 0018:88041fa03e70  EFLAGS: 00010286
> >  RAX:  RBX: 880401ecc000 RCX: 0005
> >  RDX:  RSI: 880401c38000 RDI: 880401ecc000
> >  RBP: 88041fa03e88 R08: 0001 R09: 8803ea6a7230
> >  R10:  R11: 0040 R12: 880401c38000
> >  R13: 880401ecf148 R14: 0040 R15: 880401ecc000
> >  FS:  () GS:88041fa0() 
> > knlGS:
> >  CS:  0010 DS:  ES:  CR0: 80050033
> >  CR2:  CR3: 00040c468000 CR4: 001406f0
> >  Stack:
> >   a02e62e0  0001 88041fa03ed0
> >   a02e84c2  0040 880401ecf148
> >   0040  012c 
> >  Call Trace:
> >[  428.032595]  [] ? mlx5e_post_rx_wqes+0x80/0xc0 
> > [mlx5_core]
> >   [] mlx5e_napi_poll+0xf2/0x530 [mlx5_core]
> >   [] net_rx_action+0x1fc/0x350
> >   [] __do_softirq+0xc8/0x2c6
> >   [] irq_exit+0xbe/0xd0
> >   [] do_IRQ+0x54/0xd0
> >   [] common_interrupt+0x7f/0x7f
> >[  428.075157]  [] ? 
> > _raw_spin_unlock_irq+0x10/0x20
> >   [] ? finish_task_switch+0x78/0x200
> >   [] __schedule+0x17a/0x670
> >   [] schedule+0x3d/0x90
> >   [] schedule_preempt_disabled+0x15/0x20
> >   [] cpu_startup_entry+0x12c/0x1c0
> >   [] rest_init+0x84/0x90
> >   [] start_kernel+0x3fe/0x40b
> >   [] x86_64_start_reservations+0x2a/0x2c
> >   [] x86_64_start_kernel+0x168/0x176
> >  Code:  Bad RIP value.
> >  RIP  [<  (null)>]   (null)
> >   RSP 
> >  CR2: 
> >  ---[ end trace a871278f0d0523ac ]---
> > 
> > Could you please look at fixing your driver?
> >   
> 
> Will handle it ASAP, Thank you Jesper.

Thanks for your quick response :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


mlx5 bug in error path of mlx5e_open_channel()

2016-11-01 Thread Jesper Dangaard Brouer

In driver mlx5 function mlx5e_open_channel() does not handle error
path correctly. (Tested by letting mlx5e_create_rq fail with -ENOMEM,
propagates to mlx5e_open_rq)

This first seemed related to commit b5503b994ed5 ("net/mlx5e: XDP TX
forwarding support").  As a failed call of mlx5e_open_rq() always
calls mlx5e_close_sq(>xdp_sq) on "xdp_sq" even-though a XDP program
is not attached.

Fixing this like:

@@ -1504,24 +1533,38 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, 
int ix,
 
c->xdp = !!priv->xdp_prog;
err = mlx5e_open_rq(c, >rq, >rq);
-   if (err)
-   goto err_close_xdp_sq;
+   if (err) {
+   if (c->xdp)
+   goto err_close_xdp_sq;
+   else
+   goto err_close_sqs;
+   }

The fix does remove one problem, but the error path still cause the
kernel to crash.  This time it seems related to correct disabling of
NAPI polling before disabling the queues.

Now with another error:

 XXX: call mlx5e_close_sqs(c)
 BUG: unable to handle kernel NULL pointer dereference at   (null)
 IP: [<  (null)>]   (null)
 PGD 401e00067
 PUD 40746e067 PMD 0
 Oops: 0010 [#1] PREEMPT SMP
 Modules linked in: mlx5_core coretemp kvm_intel kvm irqbypass intel_cstate 
mxm_wmi i2c_i801 i2c_smbus]
 CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.9.0-rc3-page_pool04+ #124
 Hardware name: To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
 task: 81c0c4c0 task.stack: 81c0
 RIP: 0010:[<>]  [<  (null)>]   (null)
 RSP: 0018:88041fa03e70  EFLAGS: 00010286
 RAX:  RBX: 880401ecc000 RCX: 0005
 RDX:  RSI: 880401c38000 RDI: 880401ecc000
 RBP: 88041fa03e88 R08: 0001 R09: 8803ea6a7230
 R10:  R11: 0040 R12: 880401c38000
 R13: 880401ecf148 R14: 0040 R15: 880401ecc000
 FS:  () GS:88041fa0() knlGS:
 CS:  0010 DS:  ES:  CR0: 80050033
 CR2:  CR3: 00040c468000 CR4: 001406f0
 Stack:
  a02e62e0  0001 88041fa03ed0
  a02e84c2  0040 880401ecf148
  0040  012c 
 Call Trace:
   [  428.032595]  [] ? mlx5e_post_rx_wqes+0x80/0xc0 
[mlx5_core]
  [] mlx5e_napi_poll+0xf2/0x530 [mlx5_core]
  [] net_rx_action+0x1fc/0x350
  [] __do_softirq+0xc8/0x2c6
  [] irq_exit+0xbe/0xd0
  [] do_IRQ+0x54/0xd0
  [] common_interrupt+0x7f/0x7f
   [  428.075157]  [] ? _raw_spin_unlock_irq+0x10/0x20
  [] ? finish_task_switch+0x78/0x200
  [] __schedule+0x17a/0x670
  [] schedule+0x3d/0x90
  [] schedule_preempt_disabled+0x15/0x20
  [] cpu_startup_entry+0x12c/0x1c0
  [] rest_init+0x84/0x90
  [] start_kernel+0x3fe/0x40b
  [] x86_64_start_reservations+0x2a/0x2c
  [] x86_64_start_kernel+0x168/0x176
 Code:  Bad RIP value.
 RIP  [<  (null)>]   (null)
  RSP 
 CR2: 
 ---[ end trace a871278f0d0523ac ]---

Could you please look at fixing your driver?


Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer