Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Jack Morgenstein
On Wednesday 04 February 2009 18:16, Moni Shoua wrote: > This one looks good  to me. > Are you going to make a patch and submit it? > > I think it would be best if you run the same test on the patched IPoIB before > submission. > Do you agree? > I'll do a patch tomorrow. We'll run the test over

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Moni Shoua
> Still need some correction. If the path did not exist previously (i.e, !path > = TRUE, > and, below, had_path = 0), then need to call path_free or we will have a leak. > True > Maybe the correct patch is: >path = __path_find(dev, phdr->hwaddr + 4); > if (!path || !path->valid)

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Jack Morgenstein
On Wednesday 04 February 2009 17:45, Moni Shoua wrote: > Besides the locking issue that I hadn't think about yet what if we this fix > looks the right thing to do. > But what if we leave the path without freeing it even if path_rec_start() > fails? > This would leave a path which is not valid in

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Moni Shoua
> I doubt it. You are leaving a deleted path record as part of the path list. > This is list corruption (since the list pointers themselves are part of the > path record structure -- what if this returned storage is re-allocated?). > You are right. > I think the correct fix (after your previous

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Jack Morgenstein
On Wednesday 04 February 2009 15:33, Moni Shoua wrote: > Isn't the fix just as simple as this? > > void ipoib_mark_paths_invalid(struct net_device *dev) > { > struct ipoib_dev_priv *priv = netdev_priv(dev); > struct ipoib_path *path, *tp; > > spin_lock_irq(&priv->lock); >

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Moni Shoua
Yossi Etigin wrote: > I think it comes from unicast_arp_send. > > Consider this scenario: > - paths are flushed (opensm up/down). > - unicast_arp_send() is called with a path in priv->path_list. > path->valid is 0. > - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - > no sm a

Re: [ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-04 Thread Moni Shoua
> It was originally written without the path->valid check in the "if", and so > was based on the path record > being allocated within the "if". In this case, the path record was not yet > inserted into the path list. > When you added the "valid" processing, you did not take this into account. >

[ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-03 Thread Jack Morgenstein
On Wednesday 04 February 2009 08:46, Jack Morgenstein wrote: > On Tuesday 03 February 2009 19:56, Yossi Etigin wrote: > > I think it comes from unicast_arp_send. > > Consider this scenario: > > - paths are flushed (opensm up/down). > > - unicast_arp_send() is called with a path in priv->path_list.

[ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-03 Thread Jack Morgenstein
On Tuesday 03 February 2009 19:56, Yossi Etigin wrote: > I think it comes from unicast_arp_send. > Consider this scenario: > - paths are flushed (opensm up/down). > - unicast_arp_send() is called with a path in priv->path_list. path->valid is > 0. > - path_rec_start() fails with -EAGAIN (-11) beca

[ofa-general] Re: Kernel panic in IPoIB stability testing

2009-02-03 Thread Yossi Etigin
I think it comes from unicast_arp_send. Consider this scenario: - paths are flushed (opensm up/down). - unicast_arp_send() is called with a path in priv->path_list. path->valid is 0. - path_rec_start() fails with -EAGAIN (-11) because alloc_mad() fails - no sm ah (yet) (see the prints just befo