Re: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery from switch node.
Ira Weiny wrote:
> On Tue, 29 Sep 2009 18:16:21 +0200
> "Eli Dorfman (Voltaire)" wrote:
>
>> Ira Weiny wrote:
>>> Eli,
>>>
>>> On Wed, 26 Aug 2009 17:37:30 +0300
>>> "Eli Dorfman (Voltaire)" wrote:
>>>
>>>> Subject: [PATCH] Fix IB network discovery from switch node.
>>> Sorry for the late inquiry on this but what exactly was the bug here?
>> Sorry for the late response.
>> The problem is related to wrong discovery when running from the switch.
>> Without the patch ibnetdiscover finds only local switch
>
> Ok I see.
>
> [snip]
>
>> I think that the problem is related to NodeInfo:LocalPort which is 0 in case
>> of a switch.
>> I see that get_remote_node() sends direct route MAD to switch with path 0,0
>> and that fails (at least for Mellanox IS4 switch chips).
>> Another way to bypass this may be as follows:
>>
>> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> index 1e93ff8..3dd0dc6 100644
>> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
>> @@ -461,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
>> != IB_PORT_PHYS_STATE_LINKUP)
>> return -1;
>>
>> -if (extend_dpath(fabric, path, portnum) < 0)
>> +if (portnum > 0 && extend_dpath(fabric, path, portnum) < 0)
>> return -1;
>>
>> if (query_node(fabric, &node_buf, &port_buf, path)) {
>>
>>
>> Please check whether this is OK and I can send a new patch.
>>
>
> This seems to fix my issue. Here is a patch against master which works for
> me. If you want to verify that would be great.
Verified this again and it works.
Sasha, please apply this patch.
Thanks,
Eli
>
> Thanks for helping me out,
> Ira
>
> From: Ira Weiny
> Date: Tue, 22 Sep 2009 11:08:28 -0700
> Subject: [PATCH] infiniband-diags/libibnetdisc/src/ibnetdisc.c: fix bug in
> single node processing.
>
> Eli fixed an issue with running ibnetdiscover from a switch but it
> introduced a bug in processing a single switch:
>
> 17:19:42 > ./iblinkinfo -S 0x000b8c00490c
> Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
> ...
> 8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> 8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> [ ] "" ( )
> 8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> ...
>
> The port we "come in on" when discovering the switch is not reported
> properly.
>
> This patch, suggested by Eli, reverses Eli's patch and fixes his original
> bug in a way which does not introduce the above issue.
>
> Signed-off-by: Ira Weiny
> ---
> infiniband-diags/libibnetdisc/src/ibnetdisc.c | 18 ++++++++----------
> 1 files changed, 8 insertions(+), 10 deletions(-)
>
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index 97e369c..96f72c5 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -506,7 +506,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
> != IB_PORT_PHYS_STATE_LINKUP)
> return 1; /* positive == non-fatal error */
>
> - if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
> + if (portnum > 0 && extend_dpath(ibmad_port, fabric, path, portnum) < 0)
> return -1;
>
> if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
> @@ -600,15 +600,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
> if (!port)
> goto error;
>
> - if (node->type != IB_NODE_SWITCH) {
> - rc = get_remote_node(ibmad_port, fabric, node, port, from,
> - mad_get_field(node->info, 0,
> -IB_NODE_LOCAL_PORT_F), 0);
> - if (rc < 0)
> - goto error;
> - if (rc > 0) /* non-fatal error, nothing more to be done */
> - return ((ibnd_fabric_t *) fabric);
> - }
> + rc = get_remote_node(ibmad_port, fabric, node, port, from,
> + mad_get_field(node->info, 0,
> +IB_NODE_LOCAL_PORT_F), 0);
> + if (rc < 0)
> + goto error;
> + if (rc > 0) /* non-fatal error, nothing more to be done */
> + return ((ibnd_fabric_t *) fabric);
>
> for (dist = 0; dist <= max_hops; dist++) {
>
___
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery from switch node.
On Tue, 29 Sep 2009 18:16:21 +0200
"Eli Dorfman (Voltaire)" wrote:
> Ira Weiny wrote:
> > Eli,
> >
> > On Wed, 26 Aug 2009 17:37:30 +0300
> > "Eli Dorfman (Voltaire)" wrote:
> >
> >> Subject: [PATCH] Fix IB network discovery from switch node.
> >
> > Sorry for the late inquiry on this but what exactly was the bug here?
>
> Sorry for the late response.
> The problem is related to wrong discovery when running from the switch.
> Without the patch ibnetdiscover finds only local switch
Ok I see.
[snip]
>
> I think that the problem is related to NodeInfo:LocalPort which is 0 in case
> of a switch.
> I see that get_remote_node() sends direct route MAD to switch with path 0,0
> and that fails (at least for Mellanox IS4 switch chips).
> Another way to bypass this may be as follows:
>
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index 1e93ff8..3dd0dc6 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -461,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
> != IB_PORT_PHYS_STATE_LINKUP)
> return -1;
>
> - if (extend_dpath(fabric, path, portnum) < 0)
> + if (portnum > 0 && extend_dpath(fabric, path, portnum) < 0)
> return -1;
>
> if (query_node(fabric, &node_buf, &port_buf, path)) {
>
>
> Please check whether this is OK and I can send a new patch.
>
This seems to fix my issue. Here is a patch against master which works for
me. If you want to verify that would be great.
Thanks for helping me out,
Ira
From: Ira Weiny
Date: Tue, 22 Sep 2009 11:08:28 -0700
Subject: [PATCH] infiniband-diags/libibnetdisc/src/ibnetdisc.c: fix bug in
single node processing.
Eli fixed an issue with running ibnetdiscover from a switch but it
introduced a bug in processing a single switch:
17:19:42 > ./iblinkinfo -S 0x000b8c00490c
Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
...
8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> [ ] "" ( )
8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
...
The port we "come in on" when discovering the switch is not reported
properly.
This patch, suggested by Eli, reverses Eli's patch and fixes his original
bug in a way which does not introduce the above issue.
Signed-off-by: Ira Weiny
---
 infiniband-diags/libibnetdisc/src/ibnetdisc.c |   18 ++++++++----------
 1 files changed, 8 insertions(+), 10 deletions(-)

diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 97e369c..96f72c5 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -506,7 +506,7 @@ static int get_remote_node(struct ibmad_port *ibmad_port,
 	    != IB_PORT_PHYS_STATE_LINKUP)
 		return 1;	/* positive == non-fatal error */

-	if (extend_dpath(ibmad_port, fabric, path, portnum) < 0)
+	if (portnum > 0 && extend_dpath(ibmad_port, fabric, path, portnum) < 0)
 		return -1;

 	if (query_node(ibmad_port, fabric, &node_buf, &port_buf, path)) {
@@ -600,15 +600,13 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
 	if (!port)
 		goto error;

-	if (node->type != IB_NODE_SWITCH) {
-		rc = get_remote_node(ibmad_port, fabric, node, port, from,
-				     mad_get_field(node->info, 0,
-						   IB_NODE_LOCAL_PORT_F), 0);
-		if (rc < 0)
-			goto error;
-		if (rc > 0)	/* non-fatal error, nothing more to be done */
-			return ((ibnd_fabric_t *) fabric);
-	}
+	rc = get_remote_node(ibmad_port, fabric, node, port, from,
+			     mad_get_field(node->info, 0,
+					   IB_NODE_LOCAL_PORT_F), 0);
+	if (rc < 0)
+		goto error;
+	if (rc > 0)	/* non-fatal error, nothing more to be done */
+		return ((ibnd_fabric_t *) fabric);

 	for (dist = 0; dist <= max_hops; dist++) {
--
1.5.4.5
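
[Editorial sketch, not part of the patch.] The patch relies on a three-way return convention: a negative value is fatal, a positive value is a non-fatal error that ends discovery early with whatever was found, and zero means discovery continues. A minimal illustration of that convention follows. The names struct fabric, probe_first_hop() and discover() are hypothetical stand-ins, not the libibnetdisc API, and the two failure conditions are simplified assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fabric record: counts the nodes discovered so far. */
struct fabric {
	int nodes_found;
};

/* Hypothetical first-hop probe using the sign convention from the patch. */
int probe_first_hop(int link_up, int path_full)
{
	if (!link_up)
		return 1;	/* positive == non-fatal error */
	if (path_full)
		return -1;	/* negative == fatal error */
	return 0;		/* remote node was reached */
}

/* Returns NULL on a fatal error, otherwise the (possibly partial) fabric. */
struct fabric *discover(struct fabric *f, int link_up, int path_full)
{
	int rc;

	assert(f != NULL);
	f->nodes_found = 1;	/* the local node is always known */

	rc = probe_first_hop(link_up, path_full);
	if (rc < 0)
		return NULL;	/* fatal: give up */
	if (rc > 0)
		return f;	/* non-fatal: nothing more to be done */

	f->nodes_found++;	/* remote end of the first hop */
	return f;
}
```

With this shape, a caller only has to distinguish NULL from a valid pointer; a partial result is still a usable fabric.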
Re: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery from switch node.
Ira Weiny wrote:
> Eli,
>
> On Wed, 26 Aug 2009 17:37:30 +0300
> "Eli Dorfman (Voltaire)" wrote:
>
>> Subject: [PATCH] Fix IB network discovery from switch node.
>
> Sorry for the late inquiry on this but what exactly was the bug here?
Sorry for the late response.
The problem is wrong discovery when running from the switch.
Without the patch, ibnetdiscover finds only the local switch:
4036% ibnetdiscover
ibwarn: [2833] mad_rpc: _do_madrpc failed; dport (DR path slid 0; dlid 0; 0,0)
ibwarn: [2833] get_remote_node: NodeInfo on DR path slid 0; dlid 0; 0,0 failed, skipping port
#
# Topology file: generated on Tue Sep 29 15:29:50 2009
#
# Max of 1 hops discovered
# Initiated from node 0008f1050010006e port 0008f1050010006e
vendid=0x8f1
devid=0x5a5a
sysimgguid=0x8f1050010006f
switchguid=0x8f1050010006e(8f1050010006e)
Switch 36 "S-0008f1050010006e" # "Voltaire 4036 - 36 QDR ports switch"
enhanced port 0 lid 1 lmc 0
With the patch, we see that the switch is connected to 2 HCAs:
#
# Topology file: generated on Tue Sep 29 15:19:24 2009
#
# Max of 1 hops discovered
# Initiated from node 0008f1050010006e port 0008f1050010006e
vendid=0x8f1
devid=0x5a5a
sysimgguid=0x8f1050010006f
switchguid=0x8f1050010006e(8f1050010006e)
Switch 36 "S-0008f1050010006e" # "Voltaire 4036 - 36 QDR ports switch"
enhanced port 0 lid 1 lmc 0
[24]"H-0008f104039a0198"[2](8f104039a019a) # "luna6 HCA-1" lid 3 4xQDR
[29]"H-0008f1040399f444"[2](8f1040399f446) # "localhost HCA-1" lid 2 4xQDR
vendid=0x2c9
devid=0x673c
sysimgguid=0x8f1040399f447
caguid=0x8f1040399f444
Ca 2 "H-0008f1040399f444" # "localhost HCA-1"
[2](8f1040399f446) "S-0008f1050010006e"[29] # lid 2 lmc 0 "Voltaire 4036 - 36 QDR ports switch" lid 1 4xQDR
vendid=0x2c9
devid=0x673c
sysimgguid=0x8f104039a019b
caguid=0x8f104039a0198
Ca 2 "H-0008f104039a0198" # "luna6 HCA-1"
[2](8f104039a019a) "S-0008f1050010006e"[24] # lid 3 lmc 0 "Voltaire 4036 - 36 QDR ports switch" lid 1 4xQDR
>
> I just found that this change introduced a bug. The problem is that if you
> don't do this query, even when the first found node is a switch, the port you
> came into the switch on will not get reported properly. Here is what I mean.
>
> Running with the current master:
>
> 17:19:42 > ./iblinkinfo -S 0x000b8c00490c
> Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
> 8  1[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> ...
> 8  9[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> 8 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 15 24[ ] "ISR9024D Voltaire" ( )
> 8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> 8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> [ ] "" ( )
> 8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> ...
>
> The DR path "came in" on port 12 and is reported as Active/LinkUp but has no
> information on the other end. Here is what the output should look like with
> your change removed.
>
> 17:22:36 > ./iblinkinfo -S 0x000b8c00490c
> Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
> 8  1[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> ...
> 8  9[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> 8 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 15 24[ ] "ISR9024D Voltaire" ( )
> 8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> 8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7  8[ ] "Cisco Switch SFS7000D" ( )
> 8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
> ...
>
> This properly reports the other end of this link as another switch.
>
> Could you explain the problem a bit more so we can come up with a better
> solution?
I think that the problem is related to NodeInfo:LocalPort which is 0 in case of
a switch.
I see that get_remote_node() sends direct route MAD to switch with path 0,0 and
that fails (at least for Mellanox IS4 switch chips).
Another way to bypass this may be as follows:
diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
index 1e93ff8..3dd0dc6 100644
--- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
+++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
@@ -461,7 +461,7 @@ get_remote_node(struct ibnd_fabric *fabric, struct ibnd_node *node, struct ibnd_
!= IB_PORT_PHYS_STATE_LINKUP)
return -1;
- if (extend_dpath(fabric, path, portnum) < 0)
+ if (portnum > 0 && extend_dpath(fabric, path, portnum) < 0)
return -1;
if (query_node(fabric, &node_buf, &port_buf, path)) {
Please check whether thi
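
[Editorial sketch, not part of the original message.] The idea behind the portnum > 0 guard can be shown in isolation: when NodeInfo:LocalPort is 0, as it is on a switch, the direct-route path is left unchanged so the query targets the local node itself instead of taking an extra hop out of port 0. The struct dr_path type, the MAX_HOPS limit and both function names below are hypothetical stand-ins chosen for this illustration; they are not the libibmad or libibnetdisc definitions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HOPS 63	/* assumption: hop limit used only in this sketch */

/* Hypothetical direct-route path: hop count plus the exit port of each hop. */
struct dr_path {
	int cnt;			/* hops taken so far */
	unsigned char p[MAX_HOPS + 1];	/* p[i] = exit port of hop i, p[0] unused */
};

/* Append one hop. Returns 0 on success, -1 if the path is already full. */
int extend_path(struct dr_path *path, int portnum)
{
	assert(path != NULL);
	if (path->cnt >= MAX_HOPS)
		return -1;
	path->cnt++;
	path->p[path->cnt] = (unsigned char)portnum;
	return 0;
}

/*
 * Prepare the path for the next NodeInfo query.
 * portnum == 0 (a switch reporting LocalPort 0): keep the path as it is.
 * portnum  > 0: extend the path by one hop through that port.
 */
int path_to_remote(struct dr_path *path, int portnum)
{
	if (portnum > 0 && extend_path(path, portnum) < 0)
		return -1;
	return 0;
}
```

Because && short-circuits, extend_path() is never called for port 0, so a full path only causes a failure when a real hop is requested.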
Re: [ofa-general] [PATCH] infiniband-diags: Fix IB network discovery from switch node.
Eli,
On Wed, 26 Aug 2009 17:37:30 +0300
"Eli Dorfman (Voltaire)" wrote:
> Subject: [PATCH] Fix IB network discovery from switch node.
Sorry for the late inquiry on this but what exactly was the bug here?
I just found that this change introduced a bug. The problem is that if you
don't do this query, even when the first found node is a switch, the port you
came into the switch on will not get reported properly. Here is what I mean.
Running with the current master:
17:19:42 > ./iblinkinfo -S 0x000b8c00490c
Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
8  1[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
...
8  9[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
8 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 15 24[ ] "ISR9024D Voltaire" ( )
8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> [ ] "" ( )
8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
...
The DR path "came in" on port 12 and is reported as Active/LinkUp but has no
information on the other end. Here is what the output should look like with
your change removed.
17:22:36 > ./iblinkinfo -S 0x000b8c00490c
Switch 0x000b8c00490c MT47396 Infiniscale-III Mellanox Technologies:
8  1[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
...
8  9[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
8 10[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 15 24[ ] "ISR9024D Voltaire" ( )
8 11[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
8 12[ ] ==( 4X 5.0 Gbps Active/ LinkUp)==> 7  8[ ] "Cisco Switch SFS7000D" ( )
8 13[ ] ==( 4X 2.5 Gbps Down/ Polling)==> [ ] "" ( )
...
This properly reports the other end of this link as another switch.
Could you explain the problem a bit more so we can come up with a better
solution?
Thanks,
Ira
>
> Signed-off-by: Eli Dorfman
> ---
> infiniband-diags/libibnetdisc/src/ibnetdisc.c | 16 +++++++++-------
> 1 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/infiniband-diags/libibnetdisc/src/ibnetdisc.c b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> index c69467e..779e659 100644
> --- a/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> +++ b/infiniband-diags/libibnetdisc/src/ibnetdisc.c
> @@ -590,13 +590,15 @@ ibnd_fabric_t *ibnd_discover_fabric(struct ibmad_port *ibmad_port,
> if (!port)
> goto error;
>
> - rc = get_remote_node(ibmad_port, fabric, node, port, from,
> - mad_get_field(node->info, 0,
> -IB_NODE_LOCAL_PORT_F), 0);
> - if (rc < 0)
> - goto error;
> - if (rc > 0) /* non-fatal error, nothing more to be done */
> - return ((ibnd_fabric_t *) fabric);
> + if (node->node.type != IB_NODE_SWITCH) {
> + rc = get_remote_node(ibmad_port, fabric, node, port, from,
> + mad_get_field(node->info, 0,
> +IB_NODE_LOCAL_PORT_F), 0);
> + if (rc < 0)
> + goto error;
> + if (rc > 0) /* non-fatal error, nothing more to be done */
> + return ((ibnd_fabric_t *) fabric);
> + }
>
> for (dist = 0; dist <= max_hops; dist++) {
>
> --
> 1.5.5
>
--
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
[email protected]
