Re: IPoIB issues

2010-03-10 Thread Moni Shoua
Eli Cohen wrote: I just posted a patch which might fix your problem. Please try it and let us know if it fixed anything. Hi Eli, although Josh already reported that the patch seems to fix the issue, I have a question. The post_send failed prints appeared while working in datagram mode. I don't

[PATCH V3 0/2] Add support for enhanced atomic operations

2010-03-10 Thread Vladimir Sokolovsky
Hi Roland, This patchset adds support for the following enhanced atomic operations: - Masked atomic compare and swap - Masked atomic fetch and add These operations enable using a smaller amount of memory when using multiple locks by using portions of a 64 bit value in an atomic operation. For
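To illustrate how operating on a portion of a 64-bit value lets several locks share one word, here is a minimal software model of a masked compare-and-swap. It is a sketch only: the name `masked_cmp_swap` is hypothetical, the bit-level semantics (compare only the bits under `compare_mask`, replace only the bits under `swap_mask`) are an assumption about how such an operation behaves, and the model is neither atomic nor the verbs interface added by the patchset.

```c
#include <stdint.h>

/*
 * Hypothetical software model of a masked compare-and-swap on a 64-bit word.
 * Not atomic; it only illustrates the assumed bit-level semantics.
 * Returns the value the word held before the operation.
 */
static uint64_t masked_cmp_swap(uint64_t *target,
                                uint64_t compare, uint64_t compare_mask,
                                uint64_t swap, uint64_t swap_mask)
{
    uint64_t old = *target;

    /* Only the bits selected by compare_mask take part in the comparison. */
    if ((old & compare_mask) == (compare & compare_mask)) {
        /* Only the bits selected by swap_mask are replaced; the rest are kept. */
        *target = (old & ~swap_mask) | (swap & swap_mask);
    }
    return old;
}
```

With both masks set to 0xFF, the lowest byte acts as one lock: taking it changes only that byte, so the other seven bytes of the word remain free to hold other locks.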

[PATCH V3 1/2] IB/core: Add support for enhanced atomic operations

2010-03-10 Thread Vladimir Sokolovsky
- Add new IB_WR_MASKED_ATOMIC_CMP_AND_SWP and IB_WR_MASKED_ATOMIC_FETCH_AND_ADD send opcodes that can be used to mark a masked atomic compare and swap and a masked atomic fetch and add work request, respectively. - Add IB_DEVICE_MASKED_ATOMIC capability bit. - Add mask fields to atomic struct

[PATCH V3 2/2] mlx4/IB: Add support for enhanced atomic operations

2010-03-10 Thread Vladimir Sokolovsky
Added support for masked atomic operations: - Masked Compare and Swap - Masked Fetch and Add Signed-off-by: Vladimir Sokolovsky v...@mellanox.co.il --- drivers/infiniband/hw/mlx4/cq.c |8 drivers/infiniband/hw/mlx4/main.c |3 ++- drivers/infiniband/hw/mlx4/qp.c | 27

Re: kitten - mlx4: Unhandled interrupt - owner bit

2010-03-10 Thread Eli Cohen
On Wed, Mar 10, 2010 at 04:03:26PM +0100, Fredrik Unger wrote: When investigating the error it seems to stem from next_eqe_sw in drivers/net/mlx4/eq.c called by the interrupt handler. What happens is that (eqe->owner & 0x80) is true causing the routine to return NULL resulting in an
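The owner-bit test discussed in this message can be modelled in a few lines. This is a simplified sketch, not the driver source: the `demo_` structures and function are hypothetical stand-ins, and the ownership convention (an entry belongs to software when its owner bit matches the parity of the consumer's current pass through the ring) is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct demo_eqe {
    uint8_t owner;          /* bit 7 (0x80) is the ownership bit */
};

struct demo_eq {
    struct demo_eqe *ring;  /* array of nent entries */
    uint32_t nent;          /* number of entries, a power of two */
    uint32_t cons_index;    /* free-running consumer index */
};

/*
 * Return the next entry if software owns it, otherwise NULL.
 * Assumed convention: the owner bit of a software-owned entry equals the
 * parity of the consumer's pass through the ring (cons_index & nent).
 */
static struct demo_eqe *demo_next_eqe_sw(struct demo_eq *eq)
{
    struct demo_eqe *eqe = &eq->ring[eq->cons_index & (eq->nent - 1)];
    int owner_bit = !!(eqe->owner & 0x80);
    int pass_bit = !!(eq->cons_index & eq->nent);

    return (owner_bit ^ pass_bit) ? NULL : eqe;
}
```

Under this model, an entry whose owner bit is still 0x80 during the first pass yields NULL, which corresponds to the symptom described: the interrupt handler finds no entry it is allowed to consume.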

[PATCH v2 04/15] opensm: Track the minimum value in the fabric of data VLs supported.

2010-03-10 Thread Jim Schutt
A routing engine that wants to make contributions to SL2VL maps in support of routing free from credit loops may need to know the minimum number of supported data VLs in the fabric. This code tracks that value. Signed-off-by: Jim Schutt jasc...@sandia.gov --- opensm/include/opensm/osm_subnet.h

[PATCH v2 06/15] opensm: Make mcast_mgr_purge_tree() available outside osm_mcast_mgr.c.

2010-03-10 Thread Jim Schutt
A routing engine that needs to compute multicast spanning trees with special properties will need to delete old trees. There's already a function that does this: mcast_mgr_purge_tree(). Make it available outside osm_mcast_mgr.c, and change the name to follow the naming convention (osm_ prefix)

[PATCH v2 05/15] opensm: Add struct osm_routing_engine callback to build spanning trees for multicast.

2010-03-10 Thread Jim Schutt
If a routing engine needs to compute spanning trees with special properties, it needs a way to override the default implementation. A routing engine callback provides that mechanism. Routing engines that can use the default implementation can leave the callback pointer set to NULL.
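The optional-callback pattern described here can be sketched as follows. All names prefixed with `demo_` are hypothetical and the structure is a trimmed-down stand-in, not the actual struct osm_routing_engine; the sketch only shows a NULL pointer selecting the default implementation.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-in for a routing engine descriptor. */
struct demo_routing_engine {
    const char *name;
    /* Optional override; NULL means "use the default implementation". */
    int (*build_mcast_tree)(void *context);
};

/* Placeholder for the default spanning-tree builder. */
static int demo_default_build_mcast_tree(void *context)
{
    (void)context;
    return 0;
}

/* Placeholder for an engine-specific builder with special properties. */
static int demo_custom_build_mcast_tree(void *context)
{
    (void)context;
    return 1;
}

/* Dispatch: call the engine's override if set, else fall back to the default. */
static int demo_build_mcast_tree(const struct demo_routing_engine *re,
                                 void *context)
{
    if (re != NULL && re->build_mcast_tree != NULL)
        return re->build_mcast_tree(context);
    return demo_default_build_mcast_tree(context);
}
```

The distinct return values exist only so the two paths can be told apart in a test. An engine that is happy with the default simply leaves the pointer NULL and needs no extra code.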

[PATCH v2 01/15] opensm: Prepare for routing engine input to path record SL lookup and SL2VL map setup.

2010-03-10 Thread Jim Schutt
In the event a routing engine needs to participate in SL assignment and SL2VL map setup in order to avoid credit loops in a fabric, it will be useful to make the routing engine context more widely available. To this end, have osm_opensm_t save a pointer to the routing engine used, rather than its

[PATCH v2 08/15] opensm: Update documentation to describe torus-2QoS.

2010-03-10 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- opensm/doc/current-routing.txt | 269 +++- opensm/man/opensm.8.in |9 ++- 2 files changed, 275 insertions(+), 3 deletions(-) diff --git a/opensm/doc/current-routing.txt

[PATCH v2 02/15] opensm: Allow the routing engine to influence SL2VL calculations.

2010-03-10 Thread Jim Schutt
Note that the original code assumes that QoS setup is mostly static and based only on user configuration. As a result, there is no provision for routing engines that want to compute contributions to the SL2VL maps. Fix this up by adding a callback to struct osm_routing_engine that computes a

[PATCH v2 10/15] opensm: Add opensm option to specify file name for extra torus-2QoS configuration information.

2010-03-10 Thread Jim Schutt
Signed-off-by: Jim Schutt jasc...@sandia.gov --- opensm/include/opensm/osm_base.h | 18 ++ opensm/include/opensm/osm_subnet.h |5 + opensm/opensm/main.c |9 + opensm/opensm/osm_subnet.c |1 + opensm/opensm/osm_torus.c |

[PATCH v2 00/15] opensm: Add new torus routing engine: torus-2QoS

2010-03-10 Thread Jim Schutt
This is v2 of a patchset to add to opensm a new routing engine designed to handle large fabrics connected with a 2D/3D topology. Changes since initial version: - Merged my patchsets from 11/20/2009, 12/18/2009, 2/16/2010. - Moved information contained in the earlier patch series introduction

[PATCH v2 13/15] opensm: Avoid havoc in minhop caused by torus-2QoS persistent use of osm_port_t:priv.

2010-03-10 Thread Jim Schutt
Torus-2QoS makes persistent use of osm_port_t:priv to speed calculation of path SL values. It cannot clear osm_port_t:priv members when it tears down its persistent data for the following reason: If a port is removed from the fabric, the opensm core will delete the corresponding osm_port_t

[PATCH v2 11/15] opensm: Do not require -Q option for torus-2QoS routing engine.

2010-03-10 Thread Jim Schutt
The torus-2QoS engine provides deadlock-free routing for a 2D/3D torus, but requires that switch SL2VL maps be programmed. Before this change, opensm -Q was required for that to happen. When a routing engine sets the struct osm_routing_engine:update_sl2vl pointer, it is signalling its intent

[PATCH v2 07/15] opensm: Add torus-2QoS routing engine.

2010-03-10 Thread Jim Schutt
Generating routes for a torus that are free of credit loops requires the use of multiple virtual lanes, and thus SLs on IB. For IB fabrics it also requires that _every_ application use path record queries - any application that uses an SL that was not obtained via a path record query may cause

Re: [PATCH v2 00/15] opensm: torus-2QoS example input files

2010-03-10 Thread Jim Schutt
The attached files can be used to test the torus-2QoS routing engine using ibsim. fabric-torus-5x5x5 contains a fabric description that ibsim can read. Once ibsim is running, run opensm like this: opensm --config opensm.conf --torus_config torus-2QoS-5x5x5.conf or opensm --config

Re: kitten - mlx4: Unhandled interrupt - owner bit

2010-03-10 Thread Fredrik Unger
Eli Cohen wrote: On Wed, Mar 10, 2010 at 04:03:26PM +0100, Fredrik Unger wrote: When investigating the error it seems to stem from next_eqe_sw in drivers/net/mlx4/eq.c called by the interrupt handler. What happens is that (eqe->owner & 0x80) is true causing the routine to return NULL

Re: IPoIB issues

2010-03-10 Thread Eli Cohen
On Wed, Mar 10, 2010 at 05:30:38PM +0200, Moni Shoua wrote: Hi Eli, although Josh already reported that the patch seems to fix the issue, I have a question. The post_send failed prints appeared while working in datagram mode. I don't know if Josh verified that but I don't expect that these

Re: IPoIB issues

2010-03-10 Thread Or Gerlitz
Eli Cohen wrote: The patch does not address these failures directly but maybe as a side effect they would go away too. The patch seems to solve a case of possible livelock happening in a node which has both CM and datagram neighbors, e.g. where ipoib has called netif_stop etc. but there is now

Re: IPoIB issues

2010-03-10 Thread Eli Cohen
On Thu, Mar 11, 2010 at 09:47:31AM +0200, Or Gerlitz wrote: The patch does not address these failures directly but maybe as a side effect they would go away too. The patch seems to solve a case of possible livelock happening in a node which has both CM and datagram neighbors, e.g. where ipoib