Re: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting rereg bit (Was: Re: [ofa-general] Nodes dropping out of IPoIB mcast group due to a temporary node soft lockup.)

2008-05-14 Thread Sasha Khapyorsky
On 07:59 Wed 14 May , Ira Weiny wrote:
> Yeah, I guess it should be.

OK. I applied this to the 1.3 branch too.

Sasha
___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general



2008-05-14 Thread Ira Weiny
Yeah, I guess it should be.

Ira


On Wed, 14 May 2008 04:24:04 -0700
Hal Rosenstock <[EMAIL PROTECTED]> wrote:

> On Wed, 2008-05-14 at 00:02 +, Sasha Khapyorsky wrote:
> > On 18:16 Thu 24 Apr , Ira Weiny wrote:
> > > 
> > > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> > > From: Ira K. Weiny <[EMAIL PROTECTED]>
> > > Date: Thu, 24 Apr 2008 18:05:01 -0700
> > > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting 
> > > rereg bit
> > > 
> > > 
> > > Signed-off-by: Ira K. Weiny <[EMAIL PROTECTED]>
> > 
> > Applied. Thanks.
> 
> Will this change also be applied to the ofed_1_3 branch?
> 
> -- Hal
> 
> > Sasha



2008-05-14 Thread Hal Rosenstock
On Wed, 2008-05-14 at 00:02 +, Sasha Khapyorsky wrote:
> On 18:16 Thu 24 Apr , Ira Weiny wrote:
> > 
> > From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> > From: Ira K. Weiny <[EMAIL PROTECTED]>
> > Date: Thu, 24 Apr 2008 18:05:01 -0700
> > Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting 
> > rereg bit
> > 
> > 
> > Signed-off-by: Ira K. Weiny <[EMAIL PROTECTED]>
> 
> Applied. Thanks.

Will this change also be applied to the ofed_1_3 branch?

-- Hal

> Sasha



2008-05-13 Thread Sasha Khapyorsky
On 18:16 Thu 24 Apr , Ira Weiny wrote:
> 
> From 2e5511d6daf9c586c39698416e4bd36e24b13e62 Mon Sep 17 00:00:00 2001
> From: Ira K. Weiny <[EMAIL PROTECTED]>
> Date: Thu, 24 Apr 2008 18:05:01 -0700
> Subject: [PATCH] opensm/opensm/osm_lid_mgr.c: set "send_set" when setting 
> rereg bit
> 
> 
> Signed-off-by: Ira K. Weiny <[EMAIL PROTECTED]>

Applied. Thanks.

Sasha



2008-04-28 Thread Ira Weiny
On Sun, 27 Apr 2008 11:47:54 +0300
Or Gerlitz <[EMAIL PROTECTED]> wrote:

> Ira Weiny wrote:
> >
> > I did not get any output with multicast_debug_level!
> Why should you? From the node's point of view nothing has happened
> (the exact param name is mcast_debug_level).
> >
> > Here is a patch which fixes the problem.  (At least with the partial sub-nets
> > configuration I explained before.)  I will have to verify this fixes the problem
> > I originally reported.
> OK, good. Does this problem exist in the released OpenSM? If yes, what
> would be the trigger for the SM to "really discover" (i.e. do a PortInfo
> SET) this sub-fabric, and how long would it take to reach this
> trigger in the worst case?

Yes, this is in the current released version of OpenSM, AFAICT.  The trigger
is this: the single link separating the partial subnet comes up, and the
resulting trap causes OpenSM to resweep.  I believe this will happen on the next
resweep cycle, which is 10 sec by default.  (But this is configurable.)  I don't
think there is an issue with allowing OpenSM to resweep as designed.

> 
> The failure configuration you have set up to reproduce the problem is very
> atypical, I think.

I agree.  I made a patch to turn off the processing of MADs in the kernel to
test my original theory, that the node is not responding to MADs.  Using this
patch I have been able to verify that if a node stops responding, the rereg
is sent by OpenSM when the node comes back.
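
For readers following the thread, the idea named in the patch subject can be
modeled roughly as below.  This is a toy Python sketch, not OpenSM code: the
function build_portinfo_update, the PortState fields, and the payload keys are
all hypothetical, and only the names "send_set" and the rereg bit come from the
subject line.  It assumes the SM skips the PortInfo SET when no field appears to
have changed, so a rereg request has to force the send.

```python
from dataclasses import dataclass


@dataclass
class PortState:
    """Hypothetical subset of the fields the SM tracks for a port."""
    lid: int
    mtu: int


def build_portinfo_update(current: PortState, desired: PortState,
                          request_rereg: bool):
    """Return (send_set, payload) for a PortInfo update.

    send_set starts False and flips to True whenever a field differs.
    As modeled here, asking for client reregistration must also flip
    send_set; otherwise a port whose other fields are unchanged never
    receives the SET that carries the rereg bit.
    """
    send_set = False
    payload = {}

    if current.lid != desired.lid:
        payload["lid"] = desired.lid
        send_set = True
    if current.mtu != desired.mtu:
        payload["mtu"] = desired.mtu
        send_set = True

    if request_rereg:
        payload["client_rereg"] = True
        send_set = True  # the step the patch subject describes

    return send_set, payload
```

In this model, a port with unchanged LID and MTU gets no SET at all unless
request_rereg is true, which matches the symptom discussed in this thread only
under the assumption stated above.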

See my next email response to Sasha concerning the original issue.

Ira

>
> Since under common Clos etc. topologies which don't
> have a 1:n blocking nature, failure of such a link would cause a re-route
> etc. by the SM, which would not (and should not) be noticed by the nodes (I
> hope I am not falling into another problem here...)
> 
> Or.
> 
> 
> 



2008-04-27 Thread Or Gerlitz

Ira Weiny wrote:
>
> I did not get any output with multicast_debug_level!

Why should you? From the node's point of view nothing has happened
(the exact param name is mcast_debug_level).

> Here is a patch which fixes the problem.  (At least with the partial sub-nets
> configuration I explained before.)  I will have to verify this fixes the problem
> I originally reported.

OK, good. Does this problem exist in the released OpenSM? If yes, what
would be the trigger for the SM to "really discover" (i.e. do a PortInfo
SET) this sub-fabric, and how long would it take to reach this
trigger in the worst case?

The failure configuration you have set up to reproduce the problem is very
atypical, I think, since under common Clos etc. topologies which don't
have a 1:n blocking nature, failure of such a link would cause a re-route
etc. by the SM, which would not (and should not) be noticed by the nodes (I
hope I am not falling into another problem here...)


Or.


