Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

2018-09-20 Thread Pavel Shamis
More UCX packages:
Fedora: http://rpms.famillecollet.com/rpmphp/zoom.php?rpm=ucx
OpenSUSE: https://software.opensuse.org/package/openucx


On Thu, Sep 20, 2018 at 7:53 AM Yossi Itigin  wrote:

> Currently the target is RH 8
> And yes, UCX is also available on EPEL, for example:
> https://centos.pkgs.org/7/epel-x86_64/ucx-1.3.1-1.el7.x86_64.rpm.html
>
> -Original Message-
> From: Peter Kjellström 
> Sent: Thursday, September 20, 2018 2:11 PM
> To: Yossi Itigin 
> Cc: Open MPI Developers 
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>
> On Thu, 20 Sep 2018 09:03:44 +
> Yossi Itigin  wrote:
>
> > Hi,
> >
> > UCX is on the way into RH distro and will be available and ON by
> > default (auto-detectable by OMPI build process) automatically in the
> > near future.
>
> That is good to hear. Will it already be in the upcoming 7.6?
>
> > Meanwhile, users can enable UCX by two methods:
> > 1. Download and install UCX from openucx.org and build Open MPI with it.
> > 2. Download HPC-X from the Mellanox site (Open MPI pre-compiled and
> > packaged with UCX) for the distro of interest (users can re-compile the
> > package with site defaults as well).
>
> There's even:
>  3. enable epel and install it from there.
>
> /Peter K
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Pavel Shamis
It looks to me like an MXM-related failure?
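
For reference, a minimal window-creation program -- a sketch, not the IMB source -- that exercises the same path shown in the backtrace below, where PMPI_Win_create leads into the one-sided (osc) component selection:

#include <mpi.h>
#include <stdio.h>

/* Minimal sketch (not the IMB code): MPI_Win_create() triggers the
 * one-sided (osc) component selection seen in the backtrace
 * (PMPI_Win_create -> ompi_win_create -> ompi_osc_base_select). */
int main(int argc, char **argv)
{
    int buf[16] = {0};
    MPI_Win win;

    MPI_Init(&argc, &argv);

    /* Expose a small local buffer; osc component query/selection runs here. */
    MPI_Win_create(buf, sizeof(buf), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}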

On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. 
wrote:

> Hi,
>
> I ran some tests on Summitdev here at ORNL:
> - the UCX problem is solved and I get the expected results for the tests
> that I am running (netpipe and IMB).
> - without UCX:
> * the performance numbers are below what would be expected but I
> believe at this point that the slight performance deficiency is due to
> other users using other parts of the system.
> * I also encountered the following problem while running IMB_EXT
> and I now realize that I had the same problem with 2.1.4rc1 but did not
> catch it at the time:
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>  backtrace 
>  2 0x00073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>  backtrace 
>  2 0x00073864 mxm_handle_error()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()
> osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>
> FYI, the 2.x series is not important to me so it can stay as is. I will
> move on to testing 3.1.2rc1.
>
> Thanks,
>
>
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > Per our discussion over the weekend and on the weekly webex yesterday,
> we're releasing v2.1.5.  There are only two changes:
> >
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> >
> > - A subtle race condition bug was discovered in the "vader" BTL
> >  (shared memory communications) that, in rare instances, can cause
> >  MPI processes to crash or incorrectly classify (or effectively drop)
> >  an MPI message sent via shared memory.  If you are using the "ob1"
> >  PML with "vader" for shared memory communication (note that vader is
> >  the default for shared memory communication with ob1), you need to
> >  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
> >  following versions to fix this issue:
> >  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> >series
> >  - Open MPI v3.1.2 (expected end of August, 2018) or later
> >
> > This vader fix was deemed serious enough to warrant a 2.1.5
> release.  This really will be the end of the 2.1.x series.  Trust me; my
> name is Joe Isuzu.
> >
> > 2.1.5rc1 will be available from the usual location in a few minutes (the
> website will update in about 7 minutes):
> >
> >https://www.open-mpi.org/software/ompi/v2.1/
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI v2.1.4rc1

2018-08-09 Thread Pavel Shamis
Adding Alina and Yossi.

On Thu, Aug 9, 2018 at 2:34 PM Vallee, Geoffroy R. 
wrote:

> Hi,
>
> I tested on Summitdev here at ORNL and here are my comments (but I only
> have a limited set of data for summitdev so my feedback is somewhat
> limited):
> - netpipe/mpi is showing a slightly lower bandwidth than the 3.x series (I
> do not believe it is a problem).
> - I am facing a problem with UCX; it is unclear to me whether it is relevant,
> since I am using UCX master and I do not know whether it is expected to
> work with OMPI v2.1.x. Note that I am using the same tool for testing all
> other releases of Open MPI and I never had that problem before, keeping in
> mind that I have only tested the 3.x series so far.
>
> make[2]: Entering directory
> `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
> /bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -O3
> -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -module
> -avoid-version  -o mca_pml_ucx.la -rpath
> /ccs/home/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_install/lib/openmpi
> pml_ucx.lo pml_ucx_request.lo pml_ucx_datatype.lo pml_ucx_component.lo
> -lucp  -lrt -lm -lutil
> libtool: link: gcc -std=gnu99 -shared  -fPIC -DPIC  .libs/pml_ucx.o
> .libs/pml_ucx_request.o .libs/pml_ucx_datatype.o .libs/pml_ucx_component.o
>  -lucp -lrt -lm -lutil  -O3 -pthread   -pthread -Wl,-soname
> -Wl,mca_pml_ucx.so -o .libs/mca_pml_ucx.so
> /usr/bin/ld: cannot find -lucp
> collect2: error: ld returned 1 exit status
> make[2]: *** [mca_pml_ucx.la] Error 1
> make[2]: Leaving directory
> `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi'
> make: *** [all-recursive] Error 1
>
> My 2 cents,
>
> > On Aug 6, 2018, at 5:04 PM, Jeff Squyres (jsquyres) via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > Open MPI v2.1.4rc1 has been pushed.  It is likely going to be the last
> in the v2.1.x series (since v4.0.0 is now visible on the horizon).  It is
> just a bunch of bug fixes that have accumulated since v2.1.3; nothing
> huge.  We'll encourage users who are still using the v2.1.x series to
> upgrade to this release; it should be a non-event for anyone who has
> already upgraded to the v3.0.x or v3.1.x series.
> >
> >https://www.open-mpi.org/software/ompi/v2.1/
> >
> > If no serious-enough issues are found, we plan to release 2.1.4 this
> Friday, August 10, 2018.
> >
> > Please test!
> >
> > Bug fixes/minor improvements:
> > - Disable the POWER 7/BE block in configure.  Note that POWER 7/BE is
> >  still not a supported platform, but it is no longer automatically
> >  disabled.  See
> >  https://github.com/open-mpi/ompi/issues/4349#issuecomment-374970982
> >  for more information.
> > - Fix bug with request-based one-sided MPI operations when using the
> >  "rdma" component.
> > - Fix issue with large data structure in the TCP BTL causing problems
> >  in some environments.  Thanks to @lgarithm for reporting the issue.
> > - Minor Cygwin build fixes.
> > - Minor fixes for the openib BTL:
> >  - Support for the QLogic RoCE HCA
> >  - Support for the Broadcom Cumulus RoCE HCA
> >  - Enable support for HDR link speeds
> > - Fix MPI_FINALIZED hang if invoked from an attribute destructor
> >  during the MPI_COMM_SELF destruction in MPI_FINALIZE.  Thanks to
> >  @AndrewGaspar for reporting the issue.
> > - Java fixes:
> >  - Modernize Java framework detection, especially on OS X/MacOS.
> >Thanks to Bryce Glover for reporting and submitting the fixes.
> >  - Prefer "javac -h" to "javah" to support newer Java frameworks.
> > - Fortran fixes:
> >  - Use conformant dummy parameter names for Fortran bindings.  Thanks
> >to Themos Tsikas for reporting and submitting the fixes.
> >  - Build the MPI_SIZEOF() interfaces in the "TKR"-style "mpi" module
> >whenever possible.  Thanks to Themos Tsikas for reporting the
> >issue.
> >  - Fix array of argv handling for the Fortran bindings of
> >MPI_COMM_SPAWN_MULTIPLE (and its associated man page).
> >  - Make NAG Fortran compiler support more robust in configure.
> > - Disable the "pt2pt" one-sided MPI component when MPI_THREAD_MULTIPLE
> >  is used.  This component is simply not safe in MPI_THREAD_MULTIPLE
> >  scenarios, and will not be fixed in the v2.1.x series.
> > - Make the "external" hwloc component fail gracefully if it is tries
> >  to use an hwloc v2.x.y installation.  hwloc v2.x.y will not be
> >  supported in the Open MPI v2.1.x series.
> > - Fix "vader" shared memory support for messages larger than 2GB.
> >  Thanks to Heiko Bauke for the bug report.
> > - Configure fixes for external PMI directory detection.  Thanks to
> >  Davide Vanzo for the report.
> >
> >
> > --
> 

Re: [OMPI devel] Time to remove the openib btl? (Was: Users/Eager RDMA causing slow osu_bibw with 3.0.0)

2018-04-11 Thread Pavel Shamis
Personally, I haven't tried iWARP; I just don't have this hardware. As far as I
can remember, one of the "special" requirements for iWARP NICs is
RDMACM. UCT does have support for RDMACM, but somebody has to try it and see
if there are any other gaps.
P.

On Fri, Apr 6, 2018 at 8:17 AM, Jeff Squyres (jsquyres) 
wrote:

> Does UCT support iWARP?
>
>
> > On Apr 6, 2018, at 9:45 AM, Jeff Squyres (jsquyres) 
> wrote:
> >
> > That would be sweet.  Are you aiming for v4.0.0, perchance?
> >
> > I.e., should we have "remove openib BTL / add uct BTL" to the target
> feature list for v4.0.0?
> >
> > (I did laugh at the Kylo name, but I really, really don't want to keep
> propagating the idea of meaningless names for BTLs -- especially since
> BTL names, more so than most other component names, are visible to the
> user.  ...unless someone wants to finally finish the ideas and implement
> the whole network-transport-name system that we've talked about for a few
> years... :-) )
> >
> >
> >> On Apr 5, 2018, at 1:49 PM, Thananon Patinyasakdikul <
> tpati...@vols.utk.edu> wrote:
> >>
> >> Just more information to help with the decision:
> >>
> >> I am working on Nathan’s uct btl to make it work with ob1 and
> infiniband. So this could be a replacement for openib and honestly we
> should totally call this new uct btl Kylo.
> >>
> >> Arm
> >>
> >>> On Apr 5, 2018, at 1:37 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >>>
> >>> Below is an email exchange from the users mailing list.
> >>>
> >>> I'm moving this over to devel to talk among the developer community.
> >>>
> >>> Multiple times recently on the users list, we've told people with
> problems with the openib BTL that they should be using UCX (per Mellanox's
> publicly-stated support positions).
> >>>
> >>> Is it time to deprecate / print warning messages / remove the openib
> BTL?
> >>>
> >>>
> >>>
>  Begin forwarded message:
> 
>  From: Nathan Hjelm 
>  Subject: Re: [OMPI users] Eager RDMA causing slow osu_bibw with 3.0.0
>  Date: April 5, 2018 at 12:48:08 PM EDT
>  To: Open MPI Users 
>  Cc: Open MPI Users 
>  Reply-To: Open MPI Users 
> 
> 
>  Honestly, this is a configuration issue with the openib btl. There is
> no reason to keep eager RDMA, nor is there a reason to pipeline RDMA.
> I haven't found an app where either of these "features" helps you with
> infiniband. You have the right idea with the parameter changes but Howard
> is correct, for Mellanox the future is UCX not verbs. I would try it and
> see if it works for you but if it doesn't I would set those two parameters
> in your /etc/openmpi-mca-params.conf and run like that.
> 
>  -Nathan
> 
>  On Apr 05, 2018, at 01:18 AM, Ben Menadue 
> wrote:
> 
> > Hi,
> >
> > Another interesting point. I noticed that the last two message sizes
> tested (2MB and 4MB) are lower than expected for both osu_bw and osu_bibw.
> Increasing the minimum size to use the RDMA pipeline to above these sizes
> brings those two data-points up to scratch for both benchmarks:
> >
> > 3.0.0, osu_bw, no rdma for large messages
> >
> >> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -map-by
> ppr:1:node -np 2 -H r6,r7 ./osu_bw -m 2097152:4194304
> > # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> > # Size  Bandwidth (MB/s)
> > 2097152  6133.22
> > 4194304  6054.06
> >
> > 3.0.0, osu_bibw, eager rdma disabled, no rdma for large messages
> >
> >> mpirun -mca btl_openib_min_rdma_pipeline_size 4194304 -mca
> btl_openib_use_eager_rdma 0 -map-by ppr:1:node -np 2 -H r6,r7 ./osu_bibw -m
> 2097152:4194304
> > # OSU MPI Bi-Directional Bandwidth Test v5.4.0
> > # Size  Bandwidth (MB/s)
> > 2097152 11397.85
> > 4194304 11389.64
> >
> > This makes me think something odd is going on in the RDMA pipeline.
> >
> > Cheers,
> > Ben
> >
> >
> >
> >> On 5 Apr 2018, at 5:03 pm, Ben Menadue 
> wrote:
> >> Hi,
> >>
> >> We’ve just been running some OSU benchmarks with OpenMPI 3.0.0 and
> noticed that osu_bibw gives nowhere near the bandwidth I’d expect (this is
> on FDR IB). However, osu_bw is fine.
> >>
> >> If I disable eager RDMA, then osu_bibw gives the expected numbers.
> Similarly, if I increase the number of eager RDMA buffers, it gives the
> expected results.
> >>
> >> OpenMPI 1.10.7 gives consistent, reasonable numbers with default
> settings, but they’re not as good as 3.0.0 (when tuned) for large buffers.
> The same option changes produce no difference in performance for 1.10.7.
> >>
> >> I was wondering if anyone else has noticed anything similar, and if
> this is unexpected, if anyone has a suggestion on how to investigate
> further?
> >>
> >> Thanks,
> >> Ben
> >>
> >>
> >> Here’s are the numbers:
> >>
> >> 3.0.0, osu_bw,

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-25 Thread Pavel Shamis
Update: I got a conceptual OK from our legal team.
- Pasha

On Tue, Oct 18, 2016 at 11:15 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> Fair point; everyone needs to look at this and decide a) if that's ok, or
> b) if we want to change it to incorporate clause #3.
>
>
> > On Oct 12, 2016, at 12:14 PM, Pavel Shamis 
> wrote:
> >
> > I think one of the main add-ons of the CLA over BSD-3 license was the
> clause #3 (Grant of Patent License). As far as I can tell it does not
> appear in the new sign-off-CLA.
> >
> > On Wed, Oct 12, 2016 at 11:06 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> > On Oct 12, 2016, at 9:02 AM, Pavel Shamis 
> wrote:
> > >
> > > You mentioned that such a change will block contributions.  Did you
> mean only temporarily, while individual Contributor/Member organization
> legal departments are reviewing the new terms?  If so, that one-time "cost"
> may be acceptable, since the goal of the new terms are designed to put us
> in a better place, long-term.
> > >
> > > I hope this will be a temporary freeze, unless legal folks will have
> some other concerns. This is a substantial change since you modify existing
> terms and conditions (as I read it).
> >
> > Not sure what you mean: the license under which Open MPI is distributed
> is still the same.
> >
> > What terms and conditions are you referring to, specifically?
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
> > For corporate legal information go to: http://www.cisco.com/web/
> about/doing_business/legal/cri/
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/
> about/doing_business/legal/cri/
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
I think one of the main add-ons of the CLA over the BSD-3 license was
clause #3 (Grant of Patent License). As far as I can tell, it does not
appear in the new sign-off CLA.

On Wed, Oct 12, 2016 at 11:06 AM, Jeff Squyres (jsquyres) <
jsquy...@cisco.com> wrote:

> On Oct 12, 2016, at 9:02 AM, Pavel Shamis  wrote:
> >
> > You mentioned that such a change will block contributions.  Did you mean
> only temporarily, while individual Contributor/Member organization legal
> departments are reviewing the new terms?  If so, that one-time "cost" may
> be acceptable, since the goal of the new terms are designed to put us in a
> better place, long-term.
> >
> > I hope this will be a temporary freeze, unless legal folks will have
> some other concerns. This is a substantial change since you modify existing
> terms and conditions (as I read it).
>
> Not sure what you mean: the license under which Open MPI is distributed is
> still the same.
>
> What terms and conditions are you referring to, specifically?
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/
> about/doing_business/legal/cri/
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
>
> You mentioned that such a change will block contributions.  Did you mean
> only temporarily, while individual Contributor/Member organization legal
> departments are reviewing the new terms?  If so, that one-time "cost" may
> be acceptable, since the goal of the new terms are designed to put us in a
> better place, long-term.
>

I hope this will be a temporary freeze, unless the legal folks have some
other concerns. This is a substantial change, since it modifies existing terms
and conditions (as I read it). It is not an administrative change that merely
alters the way the CLA is submitted or signed.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
Regardless, I would have to notify legal teams about amendment of the
existing CLA. If organizations that already signed the agreement don't have
any say, then this conversation is pointless.

-Pasha

On Wed, Oct 12, 2016 at 9:29 AM, r...@open-mpi.org  wrote:

> The OMPI community members have had their respective legal offices review
> the changes, but we decided to provide notice and get input from others
> prior to the formal vote of acceptance. Once approved, there will no longer
> be a CLA at all. The only requirement for contribution will be the sign-off.
>
> Rationale: the open source world has evolved considerably since we first
> initiated the project. The sign-off method has become the most commonly
> used one for accepting contributions. The CLA was intended primarily to
> protect the contributor, not the project, as it ensured that the
> contributor had discussed their contribution with their employer prior to
> submitting it.
>
> This approach puts more responsibility on the contributor. It doesn’t
> impact the project very much - trying to “relicense” OMPI would be just as
> problematic today as under the revised bylaws, and quite frankly is
> something we would never envision attempting.
>
> The frequency with which OMPI is receiving pull requests from non-members
> is the driving force here. We have traditionally accepted such
> contributions “if they are small”, but that is too arbitrary. We either
> have to reject all such contributions, or move to a model that allows them.
> We collectively decided to pursue the latter approach, and hence the change
> to the bylaws.
>
> Just to be clear: only official OMPI members have a vote in this matter.
> If you are not a “member” (e.g., you are a “contributor” status), then this
> is only informational. We respect and want your input, but you don’t
> actually have a vote on this matter.
>
> HTH
> Ralph
>
>
> On Oct 12, 2016, at 7:24 AM, Pavel Shamis  wrote:
>
> Well, at least on my side I will not be able to provide the answer without
> legal involvement.
>
> On Wed, Oct 12, 2016 at 9:16 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
>> My understanding is there will be a vote, and the question will be
>> "Do we replace existing CLA with the new one ?"
>> If we vote to do so, then everyone will have to sign-off their commits,
>> regardless they previously had (or not) signed a CA
>>
>> Cheers,
>>
>> Gilles
>>
>>
>> On Wednesday, October 12, 2016, Pavel Shamis 
>> wrote:
>>
>>> a. As a developer I think it is a good idea to lower barriers for code
>>> contribution.
>>> b. IANAL, but this "signature/certification" is not identical to the
>>> existing CLA, which I think has special statement about patents. Seems like
>>> the new model is a bit more relaxed. Does it mean that OMPI amends existing
>>> CLA ? If not - what is the relation between the two. Most likely existing
>>> member would have to take the "new" CLA to the legal for a review.
>>>
>>> -Pasha
>>>
>>> On Wed, Oct 12, 2016 at 8:38 AM, George Bosilca 
>>> wrote:
>>>
>>>> Yes, my understanding is that unsystematic contributors will not have
>>>> to sign the contributor agreement, but instead will have to provide a
>>>> signed patch.
>>>>
>>>>   George.
>>>>
>>>>
>>>> On Wed, Oct 12, 2016 at 9:29 AM, Pavel Shamis 
>>>> wrote:
>>>>
>>>>> Does it mean that contributors don't have to sign contributor
>>>>> agreement ?
>>>>>
>>>>> On Tue, Oct 11, 2016 at 2:35 PM, Geoffrey Paulsen >>>> > wrote:
>>>>>
>>>>>> We have been discussing new Bylaws for the Open MPI Community.  The
>>>>>> primary motivator is to allow non-members to commit code.  Details in the
>>>>>> proposal (link below).
>>>>>>
>>>>>> Old Bylaws / Procedures:  https://github.com/open-mpi/om
>>>>>> pi/wiki/Admistrative-rules
>>>>>>
>>>>>> New Bylaws proposal: https://github.com/open-mpi/om
>>>>>> pi/wiki/Proposed-New-Bylaws
>>>>>>
>>>>>> Open MPI members will be voting on October 25th.  Please voice any
>>>>>> comments or concerns.
>>>>>>
>>>>>>
>>>>>> ___
>>>>>> devel mailing list
>>>>>> devel@lists.open

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
Well, at least on my side I will not be able to provide the answer without
legal involvement.

On Wed, Oct 12, 2016 at 9:16 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> My understanding is there will be a vote, and the question will be
> "Do we replace existing CLA with the new one ?"
> If we vote to do so, then everyone will have to sign-off their commits,
> regardless they previously had (or not) signed a CA
>
> Cheers,
>
> Gilles
>
>
> On Wednesday, October 12, 2016, Pavel Shamis 
> wrote:
>
>> a. As a developer I think it is a good idea to lower barriers for code
>> contribution.
>> b. IANAL, but this "signature/certification" is not identical to the
>> existing CLA, which I think has special statement about patents. Seems like
>> the new model is a bit more relaxed. Does it mean that OMPI amends existing
>> CLA ? If not - what is the relation between the two. Most likely existing
>> member would have to take the "new" CLA to the legal for a review.
>>
>> -Pasha
>>
>> On Wed, Oct 12, 2016 at 8:38 AM, George Bosilca 
>> wrote:
>>
>>> Yes, my understanding is that unsystematic contributors will not have to
>>> sign the contributor agreement, but instead will have to provide a signed
>>> patch.
>>>
>>>   George.
>>>
>>>
>>> On Wed, Oct 12, 2016 at 9:29 AM, Pavel Shamis 
>>> wrote:
>>>
>>>> Does it mean that contributors don't have to sign contributor agreement
>>>> ?
>>>>
>>>> On Tue, Oct 11, 2016 at 2:35 PM, Geoffrey Paulsen 
>>>> wrote:
>>>>
>>>>> We have been discussing new Bylaws for the Open MPI Community.  The
>>>>> primary motivator is to allow non-members to commit code.  Details in the
>>>>> proposal (link below).
>>>>>
>>>>> Old Bylaws / Procedures:  https://github.com/open-mpi/om
>>>>> pi/wiki/Admistrative-rules
>>>>>
>>>>> New Bylaws proposal: https://github.com/open-mpi/om
>>>>> pi/wiki/Proposed-New-Bylaws
>>>>>
>>>>> Open MPI members will be voting on October 25th.  Please voice any
>>>>> comments or concerns.
>>>>>
>>>>>
>>>>> ___
>>>>> devel mailing list
>>>>> devel@lists.open-mpi.org
>>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>>>
>>>>
>>>>
>>>> ___
>>>> devel mailing list
>>>> devel@lists.open-mpi.org
>>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>>
>>>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
>
> There might be no more contributor agreement at all...
> (See the discussion on the devel-core ML)
>

My concern (based on experience) is that this may prevent some organizations
from contributing. Obviously, people would have to take this back to legal,
which may lead to a "freeze" in terms of contributions.
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
a. As a developer I think it is a good idea to lower barriers for code
contribution.
b. IANAL, but this "signature/certification" is not identical to the
existing CLA, which I think has a special statement about patents. It seems
the new model is a bit more relaxed. Does it mean that OMPI amends the existing
CLA? If not, what is the relation between the two? Most likely, existing
members would have to take the "new" CLA to legal for a review.

-Pasha

On Wed, Oct 12, 2016 at 8:38 AM, George Bosilca  wrote:

> Yes, my understanding is that unsystematic contributors will not have to
> sign the contributor agreement, but instead will have to provide a signed
> patch.
>
>   George.
>
>
> On Wed, Oct 12, 2016 at 9:29 AM, Pavel Shamis 
> wrote:
>
>> Does it mean that contributors don't have to sign contributor agreement ?
>>
>> On Tue, Oct 11, 2016 at 2:35 PM, Geoffrey Paulsen 
>> wrote:
>>
>>> We have been discussing new Bylaws for the Open MPI Community.  The
>>> primary motivator is to allow non-members to commit code.  Details in the
>>> proposal (link below).
>>>
>>> Old Bylaws / Procedures:  https://github.com/open-mpi/om
>>> pi/wiki/Admistrative-rules
>>>
>>> New Bylaws proposal: https://github.com/open-mpi/om
>>> pi/wiki/Proposed-New-Bylaws
>>>
>>> Open MPI members will be voting on October 25th.  Please voice any
>>> comments or concerns.
>>>
>>>
>>> ___
>>> devel mailing list
>>> devel@lists.open-mpi.org
>>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>>
>>
>>
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>>
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] New Open MPI Community Bylaws to discuss

2016-10-12 Thread Pavel Shamis
Does it mean that contributors don't have to sign contributor agreement ?

On Tue, Oct 11, 2016 at 2:35 PM, Geoffrey Paulsen 
wrote:

> We have been discussing new Bylaws for the Open MPI Community.  The
> primary motivator is to allow non-members to commit code.  Details in the
> proposal (link below).
>
> Old Bylaws / Procedures:  https://github.com/open-mpi/
> ompi/wiki/Admistrative-rules
>
> New Bylaws proposal: https://github.com/open-mpi/
> ompi/wiki/Proposed-New-Bylaws
>
> Open MPI members will be voting on October 25th.  Please voice any
> comments or concerns.
>
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] OSHMEM and shmem_ptr

2016-09-29 Thread Pavel Shamis
Hi Nick,

> I agree that symmetric objects in the data segment are a challenging issue.
> Could dynamically-allocated symmetric objects be supported much more easily?

Indeed. At least on the technical level you don't depend on a 3rd
party kernel module.

> Does Open MPI / OSHMEM already have cross-process mappings for symmetric
> allocations (e.g., for local put/get via memcpy)?

I just went over the code. It will depend on the underlying
transport/configuration, but the overall answer is yes.
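
For illustration, a minimal user-level sketch of what such a cross-process mapping enables through shmem_ptr(). It assumes the OpenSHMEM 1.2+ names (shmem_malloc/shmem_ptr); whether a usable address comes back depends on the transport/configuration, as noted, so the NULL check matters:

#include <stdio.h>
#include <shmem.h>

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    int *sym = shmem_malloc(sizeof(int));   /* dynamically allocated symmetric object */
    *sym = me;
    shmem_barrier_all();

    if (npes > 1 && me == 0) {
        /* Ask for a direct load/store pointer to PE 1's copy of "sym". */
        int *remote = (int *) shmem_ptr(sym, 1);
        if (remote != NULL) {
            printf("PE 0 reads PE 1's value via a plain load: %d\n", *remote);
        } else {
            printf("shmem_ptr() is not available for PE 1 on this transport\n");
        }
    }

    shmem_barrier_all();
    shmem_free(sym);
    shmem_finalize();
    return 0;
}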

Thanks,
Pasha
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] OSHMEM and shmem_ptr

2016-09-28 Thread Pavel Shamis
Nick,

IMHO, the major issue here is the support for "static" data objects.
This requirement translates directly into XPMEM support for the
platform; otherwise it is very challenging to implement.
Open MPI/OSHMEM provides support for XPMEM (actually there is more than
one transport), but you have to make sure that the underlying system
has the XPMEM kernel module installed. There is an open-source version
of XPMEM (https://github.com/hjelmn/xpmem) which went through numerous
updates and bugfixes during the last year, even though it is still not a
formally QA'ed product.

-Pasha
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-23 Thread Pavel Shamis (Pasha)

~2300 KB -- is that the difference per machine or per MPI process?
In OMPI XRC mode we allocate some additional resources that may consume
some memory (the hash table), but even so, ~2 MB sounds like too much to me.
When I have time I will try to calculate the "reasonable" difference.
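
To make the question concrete, a back-of-envelope sketch of the two possible readings (hypothetical arithmetic only, using the ~2300 KB figure and the 32 nodes x 32 cores job quoted below):

#include <stdio.h>

/* Hypothetical illustration of why "per machine" vs. "per MPI process"
 * matters: the same ~2300 KB figure means very different things depending
 * on how it is counted.  Numbers come from the report quoted below. */
int main(void)
{
    const double diff_kb = 2300.0;  /* reported X vs. S difference */
    const int ranks = 1024;         /* 32 nodes x 32 cores */
    const int ppn   = 32;           /* ranks per node */

    printf("If aggregate over the job: %.2f KB per rank, %.1f KB per node\n",
           diff_kb / ranks, diff_kb / ranks * ppn);
    printf("If per rank:               %.1f MB per node, %.1f MB over the job\n",
           diff_kb * ppn / 1024.0, diff_kb * ranks / 1024.0);
    return 0;
}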


Pasha

Sylvain Jeaugey wrote:

On Mon, 17 May 2010, Pavel Shamis (Pasha) wrote:


Sylvain Jeaugey wrote:
The XRC protocol seems to create shared receive queues, which is a 
good thing. However, comparing memory used by an "X" queue versus 
and "S" queue, we can see a large difference. Digging a bit into 
the code, we found some

So, do you see that X consumes more than S? This is really odd.

Yes, but that's what we see. At least after MPI_Init.

What is the difference (in Kb)?
At 32 nodes x 32 cores (1024 MPI processes), I get a difference of 
~2300 KB in favor of "S,65536,16,4,1" versus "X,65536,16,4,1".


The proposed patch doesn't seem to solve the problem however, there's 
still something that's taking more memory than expected.


Sylvain





Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Pavel Shamis (Pasha)

Sylvain Jeaugey wrote:
The XRC protocol seems to create shared receive queues, which is a 
good thing. However, comparing memory used by an "X" queue versus 
and "S" queue, we can see a large difference. Digging a bit into the 
code, we found some

So, do you see that X consumes more than S? This is really odd.

Yes, but that's what we see. At least after MPI_Init.

What is the difference (in Kb)?


strange things, like the completion queue size not being the same as 
"S" queues (the patch below would fix it, but the root of the 
problem may be elsewhere).


Is anyone able to comment on this ?

The fix looks ok, please submit it to trunk.
I don't have an account to do this, so I'll let maintainers push it 
into SVN.

Ok, I will push it.


BTW do you want to prepare the patch for send queue size factor ? It 
should be quite simple.
Maybe we can do this. However, we are playing a little with parameters 
and code without really knowing the deep consequences of what we do. 
Therefore, I would feel more comfortable if someone who knows the 
openib btl well confirms it's not breaking everything.
Well, please feel free to submit a patch for review. Also if you see any 
other issues with XRC, MLNX will be happy to help.


Regards,
Pasha.


Re: [OMPI devel] Infiniband memory usage with XRC

2010-05-17 Thread Pavel Shamis (Pasha)

Please see below.



When using XRC queues, Open MPI is indeed creating only one XRC queue 
per remote node (instead of one per remote process). The problem is that 
the number of send elements in this queue is multiplied by the number of 
processes on the remote host.


So, what are we getting from this ? Not much, except that we can 
reduce the sd_max parameter to 1 element, and still have 8 elements in 
the send queue (on 8-core machines), which may still be ok on the 
performance side.
Don't forget that the QP object itself consumes some memory at the 
driver/verbs level.
But I agree that we need to provide more flexibility. It would be nice if 
the default multiply coefficient were smaller, and I also think we need to 
make it a user-tunable parameter (yep, one more parameter).


Send queues are created lazily, so having a lot of memory for send 
queues is not necessarily blocking. What's


blocking is the receive queues, because they are created during 
MPI_Init, so in a way, they are the "basic fare" of MPI.
BTW, SRQ resources are also allocated on demand. We start with a very small 
SRQ and it is increased on the SRQ limit event.




The XRC protocol seems to create shared receive queues, which is a 
good thing. However, comparing memory used by an "X" queue versus and 
"S" queue, we can see a large difference. Digging a bit into the code, 
we found some

So, do you see that X consumes more than S? This is really odd.
strange things, like the completion queue size not being the same as 
"S" queues (the patch below would fix it, but the root of the problem 
may be elsewhere).


Is anyone able to comment on this ?

The fix looks ok, please submit it to trunk.
BTW do you want to prepare the patch for send queue size factor ? It 
should be quite simple.


Regards,
Pasha


Re: [OMPI devel] v1.4 broken

2010-02-17 Thread Pavel Shamis (Pasha)

I'm checking this issue.

Pasha

Joshua Hursey wrote:

I just noticed that the nightly tarball of v1.4 failed to build in the OpenIB 
BTL last night. The error was:

-
btl_openib_component.c: In function 'init_one_device':
btl_openib_component.c:2089: error: 'mca_btl_openib_component_t' has no member 
named 'default_recv_qps'
-

It looks like CMR #2251 is the problem.

-- Josh
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] #2163: 1.5 blocker

2010-01-14 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:

Note that I just moved #2163 into "blocker" state because the bug breaks any 
non-SRQ-capable OpenFabrics device (e.g., both Chelsio and Neteffect RNICs).  The bug came in 
recently with the "resize SRQ" patches.

Mellanox / IBM: we are branching for 1.5 tomorrow.  Can this bug be fixed 
before then?

https://svn.open-mpi.org/trac/ompi/ticket/2163
  
Vasily will submit the patch in a few minutes. Can you please test it on 
iWARP devices?

Pasha.



Re: [OMPI devel] Question about ompi_proc_t

2009-12-08 Thread Pavel Shamis (Pasha)



Both of these types (mca_pml_endpoint_t and mca_pml_base_endpoint_t) are 
meaningless; they can safely be replaced by void*. We have them clearly typed 
just for the sake of understanding, so one can easily figure out what 
is supposed to be stored in this specific field. As such, we can remove one of 
them (mca_pml_base_endpoint_t) and use the other one (mca_pml_endpoint_t) 
everywhere.
  
George, thank you for the clarification! To me it sounds like a good idea to 
leave only one of them.

I wonder what is exactly the reason that drives your questions?
  

The question was raised during an internal top-down code review.

Regards,
Pasha

  

George,
Actually, my original question was correct.

In the ompi code base I found ONLY two places where we "use" the structure.
Actually we only assign values for the pointer in DR and CM PML:

ompi/mca/pml/cm/pml_cm.c:145:procs[i]->proc_pml = (struct 
mca_pml_base_endpoint_t*) endpoints[i];
ompi/mca/pml/dr/pml_dr.c:264:procs[i]->proc_pml = (struct 
mca_pml_base_endpoint_t*) endpoint;

I do not see any struct definition/declaration of mca_pml_base_endpoint_t in the 
OMPI code at all.

But I do see the "struct mca_pml_endpoint_t;" declaration in pml.h. As well, there is a comment 
that says: "A pointer to an mca_pml_endpoint_t is maintained on each ompi_proc_t". So it 
looks like the idea was to use mca_pml_endpoint_t on the ompi_proc_t and not 
mca_pml_base_endpoint_t, is it not?
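
To illustrate the point, a hypothetical sketch (not the actual pml.h/proc.h code): a forward declaration like the one in pml.h gives an opaque, typed handle that the rest of the code can store but never dereference, which is why it is functionally a void*. The example_* names below are made up for illustration.

#include <stdio.h>

struct mca_pml_endpoint_t;                    /* declaration only -- no definition here */

struct example_proc {
    struct mca_pml_endpoint_t *proc_pml;      /* opaque per-PML endpoint handle */
};

/* Only the PML that actually defines the struct can look inside it;
 * everyone else just stores/copies the pointer, exactly as with a void*. */
static void example_set_endpoint(struct example_proc *proc, void *pml_private)
{
    proc->proc_pml = (struct mca_pml_endpoint_t *) pml_private;
}

int main(void)
{
    struct example_proc proc = { NULL };
    int pml_private_state = 42;               /* stand-in for a PML's own endpoint data */

    example_set_endpoint(&proc, &pml_private_state);
    printf("endpoint pointer stored: %p\n", (void *) proc.proc_pml);
    return 0;
}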

Thanks !

Pasha

George Bosilca wrote:


Actually your answer is correct. The endpoint is defined down below in the PML. 
In addition, I think only the MTL and the DR PML use it, all OB1 derivative 
completely ignore it.

 george.

On Dec 7, 2009, at 08:30 , Timothy Hayes wrote:

 
  

Sorry, I think I read your question too quickly. Ignore me. :-)

2009/12/7 Timothy Hayes 
Is it not a forward definition and then defined in the PML components 
individually based on their own requirements?

2009/12/7 Pavel Shamis (Pasha) 

In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep pointer to proc_pml - "struct 
mca_pml_base_endpoint_t* proc_pml" . I tired to find definition for "struct 
mca_pml_base_endpoint_t" , but I failed. Does somebody know where is it defined ?

Regards,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
   

 
  



  




Re: [OMPI devel] Question about ompi_proc_t

2009-12-08 Thread Pavel Shamis (Pasha)

George,
Actually, my original question was correct.

In the ompi code base I found ONLY two places where we "use" the structure.
Actually we only assign values for the pointer in DR and CM PML:

ompi/mca/pml/cm/pml_cm.c:145:procs[i]->proc_pml = (struct 
mca_pml_base_endpoint_t*) endpoints[i];
ompi/mca/pml/dr/pml_dr.c:264:procs[i]->proc_pml = (struct 
mca_pml_base_endpoint_t*) endpoint;


I do not see any struct definition/declaration of mca_pml_base_endpoint_t in 
the OMPI code at all.


But I do see the "struct mca_pml_endpoint_t;" declaration in pml.h. As 
well, there is a comment that says: "A pointer to an mca_pml_endpoint_t is 
maintained on each ompi_proc_t". So it looks like the idea was to use 
mca_pml_endpoint_t on the ompi_proc_t and not 
mca_pml_base_endpoint_t, is it not?


Thanks !

Pasha

George Bosilca wrote:

Actually your answer is correct. The endpoint is defined down below in the PML. 
In addition, I think only the MTL and the DR PML use it, all OB1 derivative 
completely ignore it.

  george.

On Dec 7, 2009, at 08:30 , Timothy Hayes wrote:

  

Sorry, I think I read your question too quickly. Ignore me. :-)

2009/12/7 Timothy Hayes 
Is it not a forward definition and then defined in the PML components 
individually based on their own requirements?

2009/12/7 Pavel Shamis (Pasha) 

In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep pointer to proc_pml - "struct 
mca_pml_base_endpoint_t* proc_pml" . I tired to find definition for "struct 
mca_pml_base_endpoint_t" , but I failed. Does somebody know where is it defined ?

Regards,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




  




Re: [OMPI devel] Question about ompi_proc_t

2009-12-07 Thread Pavel Shamis (Pasha)

Sorry, I wanted to ask about the

struct mca_pml_base_endpoint_t

declaration, not the definition.

Pasha.

George Bosilca wrote:

Actually your answer is correct. The endpoint is defined down below in the PML. 
In addition, I think only the MTL and the DR PML use it, all OB1 derivative 
completely ignore it.

  george.

On Dec 7, 2009, at 08:30 , Timothy Hayes wrote:

  

Sorry, I think I read your question too quickly. Ignore me. :-)

2009/12/7 Timothy Hayes 
Is it not a forward definition and then defined in the PML components 
individually based on their own requirements?

2009/12/7 Pavel Shamis (Pasha) 

In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep pointer to proc_pml - "struct 
mca_pml_base_endpoint_t* proc_pml" . I tired to find definition for "struct 
mca_pml_base_endpoint_t" , but I failed. Does somebody know where is it defined ?

Regards,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




  




[OMPI devel] Question about ompi_proc_t

2009-12-07 Thread Pavel Shamis (Pasha)
In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep a pointer to 
proc_pml - "struct mca_pml_base_endpoint_t* proc_pml". I tried to find 
the definition of "struct mca_pml_base_endpoint_t", but I failed. Does 
somebody know where it is defined?


Regards,
Pasha


Re: [OMPI devel] [PATCH] Not optimal SRQ resource allocation

2009-12-06 Thread Pavel Shamis (Pasha)

Jeff,
During the original code review we found that, by default, we allocate the 
SRQ with size "rd_num + sd_max", but
on the SRQ we post only rd_num receive entries. It means that we do not 
fill the queue completely. Looks like a bug.
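
For reference, a compilable function-level sketch (not the actual BTL code; the variable names just follow the discussion) of how such an SRQ is created at the verbs level with capacity rd_num + sd_max, and how the low watermark is armed so that IBV_EVENT_SRQ_LIMIT_REACHED can drive a lazy resize like the one in the patch below:

#include <stddef.h>
#include <infiniband/verbs.h>

/* Sketch only: create an SRQ sized for rd_num + sd_max work requests and
 * arm the low watermark so the HCA raises IBV_EVENT_SRQ_LIMIT_REACHED once
 * fewer than srq_limit receives remain posted. */
struct ibv_srq *create_srq_with_limit(struct ibv_pd *pd,
                                      int rd_num, int sd_max, int srq_limit)
{
    struct ibv_srq_init_attr init_attr = {
        .srq_context = NULL,
        .attr = {
            .max_wr    = rd_num + sd_max,  /* capacity asked of the HCA */
            .max_sge   = 1,
            .srq_limit = 0,                /* armed separately below */
        },
    };

    struct ibv_srq *srq = ibv_create_srq(pd, &init_attr);
    if (NULL == srq) {
        return NULL;
    }

    /* Arm the limit: the async event fires when posted WQEs drop below it. */
    struct ibv_srq_attr attr = { .srq_limit = srq_limit };
    if (0 != ibv_modify_srq(srq, &attr, IBV_SRQ_LIMIT)) {
        ibv_destroy_srq(srq);
        return NULL;
    }
    return srq;
}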


Pasha

Jeff Squyres wrote:

SRQ hardware vendors -- please review and reply...

More below.


On Dec 2, 2009, at 10:20 AM, Vasily Philipov wrote:

  

diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib.c
--- a/ompi/mca/btl/openib/btl_openib.c  Mon Nov 23 19:00:16 2009 -0800
+++ b/ompi/mca/btl/openib/btl_openib.c  Wed Dec 02 16:24:55 2009 +0200
@@ -214,6 +214,7 @@
static int create_srq(mca_btl_openib_module_t *openib_btl)
{
int qp;
+int32_t rd_num, rd_curr_num; 


/* create the SRQ's */
for(qp = 0; qp < mca_btl_openib_component.num_qps; qp++) {
@@ -242,6 +243,24 @@
   
ibv_get_device_name(openib_btl->device->ib_dev));
return OMPI_ERROR;
}
+
+rd_num = mca_btl_openib_component.qp_infos[qp].rd_num;
+rd_curr_num = openib_btl->qps[qp].u.srq_qp.rd_curr_num = 
mca_btl_openib_component.qp_infos[qp].u.srq_qp.rd_init;
+
+if(true == mca_btl_openib_component.enable_srq_resize) {
+if(0 == rd_curr_num) {
+openib_btl->qps[qp].u.srq_qp.rd_curr_num = 1;
+}
+
+openib_btl->qps[qp].u.srq_qp.rd_low_local = rd_curr_num - 
(rd_curr_num >> 2);
+openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = true;
+} else {
+openib_btl->qps[qp].u.srq_qp.rd_curr_num = rd_num;
+openib_btl->qps[qp].u.srq_qp.rd_low_local = 
mca_btl_openib_component.qp_infos[qp].rd_low;
+/* Not used in this case, but we don't need a garbage */
+mca_btl_openib_component.qp_infos[qp].u.srq_qp.srq_limit = 0;
+openib_btl->qps[qp].u.srq_qp.srq_limit_event_flag = false;
+}
}
}

diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib.h
--- a/ompi/mca/btl/openib/btl_openib.h  Mon Nov 23 19:00:16 2009 -0800
+++ b/ompi/mca/btl/openib/btl_openib.h  Wed Dec 02 16:24:55 2009 +0200
@@ -87,6 +87,12 @@

struct mca_btl_openib_srq_qp_info_t {
int32_t sd_max;
+/* The init value for rd_curr_num variables of all SRQs */
+int32_t rd_init;
+/* The watermark, threshold - if the number of WQEs in SRQ is less then this 
value =>
+   the SRQ limit event (IBV_EVENT_SRQ_LIMIT_REACHED) will be generated on 
corresponding SRQ.
+   As result the maximal number of pre-posted WQEs on the SRQ will be 
increased */
+int32_t srq_limit;
}; typedef struct mca_btl_openib_srq_qp_info_t mca_btl_openib_srq_qp_info_t;

struct mca_btl_openib_qp_info_t {
@@ -254,6 +260,8 @@
ompi_free_list_t recv_user_free;
/**< frags for coalesced massages */
ompi_free_list_t send_free_coalesced;
+/**< Whether we want a dynamically resizing srq, enabled by default */
+bool enable_srq_resize;



/**< means that the comment refers to the field above.  I think you mean /* or /** 
here (although we haven't used doxygen for a long, long time).  I see that other 
fields are incorrectly marked /**<, but please don't propagate the badness. ;-)


  

}; typedef struct mca_btl_openib_component_t mca_btl_openib_component_t;

OMPI_MODULE_DECLSPEC extern mca_btl_openib_component_t mca_btl_openib_component;
@@ -348,6 +356,16 @@
int32_t sd_credits;  /* the max number of outstanding sends on a QP when 
using SRQ */
 /*  i.e. the number of frags that  can be outstanding 
(down counter) */
opal_list_t pending_frags[2];/**< list of high/low prio frags */
+/**< The number of max rd that we can post in the current time.
+ The value may be increased in the IBV_EVENT_SRQ_LIMIT_REACHED
+ event handler. The value starts from (rd_num / 4) and increased up to 
rd_num */
+int32_t rd_curr_num;



The comment says "max", but the field name is "curr"[ent]?  Seems a little odd.

  

+/**< We post additional WQEs only if a number of WQEs (in specific SRQ) is 
less of this value.
+ The value increased together with rd_curr_num. The value is unique 
for every SRQ. */
+int32_t rd_low_local;
+/**< The flag points if we want to get the 
+ IBV_EVENT_SRQ_LIMIT_REACHED events for dynamically resizing SRQ */

+bool srq_limit_event_flag;



Can you explain how this is different than enable_srq_resize?

  

}; typedef struct mca_btl_openib_module_srq_qp_t mca_btl_openib_module_srq_qp_t;

struct mca_btl_openib_module_qp_t {
diff -r a5938d9dcada ompi/mca/btl/openib/btl_openib_async.c
--- a/ompi/mca/btl/openib/btl_openib_async.cMon Nov 23 19:00:16 2009 -0800
+++ b/ompi/mca/btl/openib/btl_openib_async.cWed Dec 02 16:24:55 2009 +0200
@@ -1,5 +1,5 @@
/*
- * Copyright (c) 2008 Mellanox Technologies. All rights reserved.
+ * Copyright (c) 2008-2009 Mellanox Technologies. All r

Re: [OMPI devel] [Fwd: strong ordering for data registered memory]

2009-11-13 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:

On Nov 11, 2009, at 8:13 AM, Terry Dontje wrote:


Sun's IB group has asked me to forward the following email to see if
anyone has any comments on this email.



Tastes great / less filling. :-)

I think (assume) we'll be happy to implement changes like this that 
come from the upstream OpenFabrics verbs API (I see that this mail was 
first directed to the linux-rdma list, which is where the OF verbs API 
is discussed).


Sounds good. Now all that is left is to convince the OFA community to make 
these changes.


Pasha


Re: [OMPI devel] Adding support for RDMAoE devices.

2009-11-01 Thread Pavel Shamis (Pasha)

Jeff,
Can you please check that we do not break iWARP device functionality?

Pasha

Vasily Philipov wrote:
The attached patch adds support for RDMAoE (RDMA over Ethernet) 
devices to Openib BTL. The code changes are very minimal, actually we 
only modified the RDMACM code to provide better support for IB and 
RDMAoE devices. Please let me know if you have any comments.


Regards,Vasily.



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Trunk is brokem ?

2009-10-21 Thread Pavel Shamis (Pasha)

It was broken :-(
I fixed it - r22119

Pasha

Pavel Shamis (Pasha) wrote:

On my systems I see the following error:

gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare 
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo 
-MD -MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c -fPIC -DPIC -o 
.libs/sensor_pru.o

sensor_pru_component.c: In function 'orte_sensor_pru_open':
sensor_pru_component.c:77: error: implicit declaration of function 
'opal_output'


Looks like the sensor code is broken.

Thanks,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





[OMPI devel] Trunk is brokem ?

2009-10-21 Thread Pavel Shamis (Pasha)

On my systems I see the following error:

gcc -DHAVE_CONFIG_H -I. -I../../../../opal/include 
-I../../../../orte/include -I../../../../ompi/include 
-I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../.. 
-O3 -DNDEBUG -Wall -Wundef -Wno-long-long -Wsign-compare 
-Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -fvisibility=hidden -MT sensor_pru.lo -MD 
-MP -MF .deps/sensor_pru.Tpo -c sensor_pru.c  -fPIC -DPIC -o 
.libs/sensor_pru.o

sensor_pru_component.c: In function 'orte_sensor_pru_open':
sensor_pru_component.c:77: error: implicit declaration of function 
'opal_output'


Looks like the sensor code is broken.

Thanks,
Pasha


Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)



I have not, but there should be no difference.  The failover code only 
gets triggered when an error happens.  Otherwise, there are no 
differences in the code paths while everything is functioning normally.
Sounds good. I still have not had time to review the code. I will try to 
do it during this week.


Pasha


Rolf

On 08/03/09 11:14, Pavel Shamis (Pasha) wrote:

Rolf,
Did you compare latency/bw for failover-enabled code VS trunk ?

Pasha.

Rolf Vandevaart wrote:

Hi folks:

As some of you know, I have also been looking into implementing 
failover as well.  I took a different approach as I am solving the 
problem within the openib BTL itself.  This of course means that 
this only works for failing from one openib BTL to another but that 
was our area of interest.  This also means that we do not need to 
keep track of fragments as we get them back from the completion 
queue upon failure. We then extract the relevant information and 
repost on the other working endpoint.


My work has been progressing at 
http://bitbucket.org/rolfv/ompi-failover.


This only currently works for send semantics so you have to run with 
-mca btl_openib_flags 1.


Rolf

On 07/31/09 05:49, Mouhamed Gueye wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than 
dr and we now have a working prototype on top of ob1. The approach 
is to store btl descriptors sent to peers and delete them when we 
receive proof of delivery. So far, we rely on completion callback 
functions, assuming that the message is delivered when the 
completion function is called, that is the case of openib. When a 
btl module fails, it is removed from the endpoint's btl list and 
the next one is used to retransmit stored descriptors. No 
extra-message is transmitted, it only consists in additions to the 
header. It has been mainly tested with two IB modules, in both 
multi-rail (two separate networks) and multi-path (a big unique 
network).


You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define 
--enable-device-failover at configure. You can then run a 
benchmark, disconnect a port and see the failover operate.


A little latency increase (~ 2%) is induced by the failover layer 
when no failover occurs. To accelerate the failover process on 
openib, you can try to lower the btl_openib_ib_timeout openib 
parameter to 15 for example instead of 20 (default value).


Mouhamed
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel












Re: [OMPI devel] Device failover on ob1

2009-08-03 Thread Pavel Shamis (Pasha)

Rolf,
Did you compare latency/bw for failover-enabled code VS trunk ?

Pasha.

Rolf Vandevaart wrote:

Hi folks:

As some of you know, I have also been looking into implementing 
failover as well.  I took a different approach as I am solving the 
problem within the openib BTL itself.  This of course means that this 
only works for failing from one openib BTL to another but that was our 
area of interest.  This also means that we do not need to keep track 
of fragments as we get them back from the completion queue upon 
failure. We then extract the relevant information and repost on the 
other working endpoint.


My work has been progressing at http://bitbucket.org/rolfv/ompi-failover.

This only currently works for send semantics so you have to run with 
-mca btl_openib_flags 1.


Rolf

On 07/31/09 05:49, Mouhamed Gueye wrote:

Hi list,

Here is an update on our work concerning device failover.

As many of you suggested, we reoriented our work on ob1 rather than 
dr and we now have a working prototype on top of ob1. The approach is 
to store btl descriptors sent to peers and delete them when we 
receive proof of delivery. So far, we rely on completion callback 
functions, assuming that the message is delivered when the completion 
function is called, that is the case of openib. When a btl module 
fails, it is removed from the endpoint's btl list and the next one is 
used to retransmit stored descriptors. No extra-message is 
transmitted, it only consists in additions to the header. It has been 
mainly tested with two IB modules, in both multi-rail (two separate 
networks) and multi-path (a big unique network).


You can grab and test the patch here (applies on top of the trunk) :
http://bitbucket.org/gueyem/ob1-failover/

To compile with failover support, just define 
--enable-device-failover at configure. You can then run a benchmark, 
disconnect a port and see the failover operate.


A little latency increase (~ 2%) is induced by the failover layer 
when no failover occurs. To accelerate the failover process on 
openib, you can try to lower the btl_openib_ib_timeout openib 
parameter to 15 for example instead of 20 (default value).


Mouhamed
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] selectively bind MPI to one HCA out of available ones

2009-07-16 Thread Pavel Shamis (Pasha)

Hi,
You can select the IB device used by the openib BTL with the following parameters:
MCA btl: parameter "btl_openib_if_include" (current value: , data 
source: default value)
 Comma-delimited list of devices/ports to be 
used (e.g. "mthca0,mthca1:2"; empty value means to
 use all ports found).  Mutually exclusive with 
btl_openib_if_exclude.
MCA btl: parameter "btl_openib_if_exclude" (current value: , data 
source: default value)
 Comma-delimited list of device/ports to be 
excluded (empty value means to not exclude any
 ports).  Mutually exclusive with 
btl_openib_if_include.


For example, if you want to use the first port on mthca0, your command line 
will look like:


mpirun -np. --mca btl_openib_if_include mthca0:1 

Pasha

nee...@crlindia.com wrote:


Hi all,

I have a cluster where both HCAs of a blade are active, but 
connected to different subnets.
Is there an option in MPI to select one HCA out of the available 
ones? I know it can be done by making changes in the Open MPI code, but I 
need a clean interface, like an option at MPI launch time, to select 
mthca0 or mthca1.


Any help is appreciated. BTW, I just checked MVAPICH and the 
feature is there.


Regards

Neeraj Chourasia (MTS)
Computational Research Laboratories Ltd.
(A wholly Owned Subsidiary of TATA SONS Ltd)
B-101, ICC Trade Towers, Senapati Bapat Road
Pune 411016 (Mah) INDIA
(O) +91-20-6620 9863  (Fax) +91-20-6620 9862
M: +91.9225520634





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Multi-rail on openib

2009-06-14 Thread Pavel Shamis (Pasha)

Nifty Tom Mitchell wrote:

On Tue, Jun 09, 2009 at 04:33:51PM +0300, Pavel Shamis (Pasha) wrote:
  
Open MPI currently needs to have connected fabrics, but maybe that's  
something we will like to change in the future, having two separate  
rails. (Btw Pasha, will your current work enable this ?)
  
I do not completely understand what you mean here by two separate 
rails ...
Already today you may connect each port to a different subnet, and ports 
in the same

subnet may talk to each other.




Subnet?  (subnet .vs. fabric)
  

About subnet id definition you may read here.
http://www.open-mpi.org/faq/?category=openfabrics#ofa-set-subnet-id


Does this imply tcp/ip
What IB protocols are involved and
Is there any agent that notices the disconnect and will trigger the switch?

  
In Open MPI we use the RC (reliable connected) protocol. On a connection failure we get 
an error and handle it. If APM is enabled, Open MPI will try to migrate to the other 
path; otherwise we will fail.


Pasha.


Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Pavel Shamis (Pasha)


Open MPI currently needs to have connected fabrics, but maybe that's 
something we will like to change in the future, having two separate 
rails. (Btw Pasha, will your current work enable this ?)
I do not completely understand what you mean here by two separate 
rails ...
Already today you may connect each port to a different subnet, and ports 
in the same

subnet may talk to each other.

Pasha.


Re: [OMPI devel] Multi-rail on openib

2009-06-09 Thread Pavel Shamis (Pasha)




Most of the IB protocols used by MPI target a LID.   There is no
existing notification path I know of that can replace LID-xyz with
LID-123.  The subnet manager might be able to do this but begs
security issues.

Interesting problem.
  

That is not exactly correct. For migration between ports on the same HCA 
you may use the IB APM feature; it is already implemented in Open MPI 1.3.x.

In the case of port migration between different HCAs we need to do it in 
software, and I guess that is what Sylvain is doing.
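
As a rough illustration of the HCA-local case, APM is armed at the verbs level 
along these lines (a minimal sketch only; the alternate-path fields would 
normally be filled in from real path information, and error handling is omitted):

/* Minimal sketch: arm Automatic Path Migration (APM) on an RC QP.
 * Assumes the QP is already connected (RTS) and that alt_dlid/alt_port
 * describe a valid alternate path; real code also sets alt_pkey_index,
 * SL, etc., and checks for errors. */
#include <infiniband/verbs.h>
#include <string.h>

static int arm_apm(struct ibv_qp *qp, uint16_t alt_dlid, uint8_t alt_port)
{
    struct ibv_qp_attr attr;
    memset(&attr, 0, sizeof(attr));

    attr.alt_ah_attr.dlid     = alt_dlid;      /* alternate destination LID */
    attr.alt_ah_attr.port_num = alt_port;
    attr.alt_port_num         = alt_port;      /* local port for the alternate path */
    attr.alt_timeout          = 14;
    attr.path_mig_state       = IBV_MIG_REARM; /* ask the HCA to re-arm migration */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
}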

Pasha.



Re: [OMPI devel] Open MPI mirrors list

2009-04-26 Thread Pavel Shamis (Pasha)

Thanks Jeff !

Jeff Squyres wrote:
We just added a mirror web site in Israel.  That one was fun because 
they requested to have their web tagline be in Hebrew.


Fun fact: with this addition, Open MPI now has 14 mirror web sites 
around the world.


http://www.open-mpi.org/community/mirrors/

Interested in becoming an Open MPI mirror?  See this web page:

http://www.open-mpi.org/community/mirrors/become-a-mirror.php






Re: [OMPI devel] ***SPAM*** Re: [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Pavel Shamis (Pasha)

Steve,
If you compile the OMPI code with CFLAGS="-g", generate a segfault 
core file, and send the core + the IMB-MPI1 binary to me, I will be able to 
understand the problem better.


Regards,
Pasha

Steve Wise wrote:


Hey Pasha,


I just applied r20872 and retested, and I still hit this seg fault.  
So I think this is a new bug.


Lemme pull the trunk and try that.



Pavel Shamis (Pasha) wrote:
I think your problem is related to this bug: 
https://svn.open-mpi.org/trac/ompi/ticket/1823


And it is resolved on the ompi-trunk.

Pasha.

Steve Wise wrote:
When this happens, that node logs this type of message also in 
/var/log/messages:


IMB-MPI1[8859]: segfault at 0018 rip 2b7bfc880800 
rsp 7fffb1021330 error 4


Steve Wise wrote:

Hey Jeff,

Have you seen this?  I'm hitting this regularly running on 
ofed-1.4.1-rc2.


Test:
[ompi@vic12 ~]$ cat doit-ompi
#!/bin/sh
while : ; do
   mpirun -np 16 --host vic12-10g,vic20-10g,vic9-10g,vic21-10g 
--mca btl openib,self,sm  --mca btl_openib_max_btls 1 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1 -npmin 16  
bcast scatter sendrecv exchange 
done


Seg Fault output:

[vic21:04047] *** Process received signal ***
[vic21:04047] Signal: Segmentation fault (11)
[vic21:04047] Signal code: Address not mapped (1)
[vic21:04047] Failing at address: 0x18
[vic21:04047] [ 0] /lib64/libpthread.so.0 [0x3dde20e4c0]
[vic21:04047] [ 1] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc33800]
[vic21:04047] [ 2] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc38c2d]
[vic21:04047] [ 3] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc33fcb]
[vic21:04047] [ 4] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc22af8]
[vic21:04047] [ 5] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libopen-pal.so.0(mca_base_components_close+0x83) 
[0x2b911933da33]
[vic21:04047] [ 6] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libmpi.so.0(mca_btl_base_close+0xe0) 
[0x2b9118ea3fb0]
[vic21:04047] [ 7] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_bml_r2.so 
[0x2b911ba1938f]
[vic21:04047] [ 8] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_pml_ob1.so 
[0x2b911b601cde]
[vic21:04047] [ 9] /usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libmpi.so.0 
[0x2b9118e7241b]
[vic21:04047] [10] 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1(main+0x178) 
[0x403498]
[vic21:04047] [11] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x3ddd61d974]
[vic21:04047] [12] 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1 [0x403269]

[vic21:04047] *** End of error message ***

___
ewg mailing list
e...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel








Re: [OMPI devel] [ewg] Seg fault running OpenMPI-1.3.1rc4

2009-03-30 Thread Pavel Shamis (Pasha)
I think your problem is related to this bug: 
https://svn.open-mpi.org/trac/ompi/ticket/1823


And it is resolved on the ompi-trunk.

Pasha.

Steve Wise wrote:
When this happens, that node logs this type of message also in 
/var/log/messages:


IMB-MPI1[8859]: segfault at 0018 rip 2b7bfc880800 rsp 
7fffb1021330 error 4


Steve Wise wrote:

Hey Jeff,

Have you seen this?  I'm hitting this regularly running on 
ofed-1.4.1-rc2.


Test:
[ompi@vic12 ~]$ cat doit-ompi
#!/bin/sh
while : ; do
   mpirun -np 16 --host vic12-10g,vic20-10g,vic9-10g,vic21-10g 
--mca btl openib,self,sm  --mca btl_openib_max_btls 1 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1 -npmin 16  bcast 
scatter sendrecv exchange 
done


Seg Fault output:

[vic21:04047] *** Process received signal ***
[vic21:04047] Signal: Segmentation fault (11)
[vic21:04047] Signal code: Address not mapped (1)
[vic21:04047] Failing at address: 0x18
[vic21:04047] [ 0] /lib64/libpthread.so.0 [0x3dde20e4c0]
[vic21:04047] [ 1] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc33800]
[vic21:04047] [ 2] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc38c2d]
[vic21:04047] [ 3] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc33fcb]
[vic21:04047] [ 4] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_btl_openib.so 
[0x2b911bc22af8]
[vic21:04047] [ 5] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libopen-pal.so.0(mca_base_components_close+0x83) 
[0x2b911933da33]
[vic21:04047] [ 6] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libmpi.so.0(mca_btl_base_close+0xe0) 
[0x2b9118ea3fb0]
[vic21:04047] [ 7] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_bml_r2.so 
[0x2b911ba1938f]
[vic21:04047] [ 8] 
/usr/mpi/gcc/openmpi-1.3.1rc4/lib64/openmpi/mca_pml_ob1.so 
[0x2b911b601cde]
[vic21:04047] [ 9] /usr/mpi/gcc/openmpi-1.3.1rc4/lib64/libmpi.so.0 
[0x2b9118e7241b]
[vic21:04047] [10] 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1(main+0x178) 
[0x403498]
[vic21:04047] [11] /lib64/libc.so.6(__libc_start_main+0xf4) 
[0x3ddd61d974]
[vic21:04047] [12] 
/usr/mpi/gcc/openmpi-1.3.1rc4/tests/IMB-3.1/IMB-MPI1 [0x403269]

[vic21:04047] *** End of error message ***

___
ewg mailing list
e...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





Re: [OMPI devel] Meta Question -- Open MPI: Is it a dessert toppingor is it a floor wax?

2009-03-13 Thread Pavel Shamis (Pasha)

I would like to add my 0.02 to this discussion.
The "code stability" it is not something that should stop project from
development and progress. During MPI Project live we already saw some
pretty critical changes (pml/btl/etc...) and as result after all we have
more stable and more optimized MPI. OMPI already have defined
workflow that allow to us handle big code changes without loosing  code
stability for a long period of time.  As I understand the btl movement
will make the code more modular and will open window for new features
and optimizations.
So if the BTL movement will not effect negatively on MPI performance and
code stability (for a long period of time, temporary unstability may be
reasonable) I do not see any constructive reason to block this
modification. If I understand correct the code will go to the
feature_oriented branch and we will have enough time to run QA cycles
before it will be moved to super_stable mode.
And after all I hope that we will get more stable,modular and feature
rich MPI implementation.

Thanks,
Pasha


Brian W. Barrett wrote:
I'm going to stay out of the debate about whether Andy correctly 
characterized the two points you brought up as a distributed OS or not.


Sandia's position on these two points remains the same as I previously 
stated when the question was distributed OS or not.  The primary goal 
of the Open MPI project was and should remain to be the best MPI 
project available.  Low-cost items to support different run-times or 
different non-MPI communication contexts are worth the work.  But 
high-cost items should be avoided, as they degrade our ability to 
provide the best MPI project available (of course, others, including 
OMPI developers, can take the source and do what they wish outside the 
primary development tree).


High performance is a concern, but so is code maintainability.  If it 
takes twices as long to implement feature A because I have to worry 
about it's impact not only on MPI, but also on projects X, Y, Z, as an 
MPI developer, I've lost something important.


Brian

On Thu, 12 Mar 2009, Richard Graham wrote:

I am assuming that by distributed OS you are referring to the changes 
that

we (not just ORNL) are trying to do.  If this is the case, this is a
mischaracterization of our intentions.  We have two goals:

 - To be able to use a different run-time than ORTE to drive Open MPI
 - To use the communication primitives outside the context of MPI 
(with or

without ORTE)

High performance is critical, and at NO time have we ever said anything
about sacrificing performance - these have been concerns that others
(rightfully) have expressed.

Rich


On 3/12/09 8:24 AM, "Jeff Squyres"  wrote:


I think I have to agree with Terry.

I love to inspire and see new, original, and unintended uses for Open
MPI.  But our primary focus must remain to create, maintain, and
continue to deliver a high performance MPI implementation.

We have a long history of adding "small" things to Open MPI that are
useful to 3rd parties because it helps them, helps further Open MPI
adoption/usefulness, and wasn't difficult for us to do ("small" can
have varying definitions).  I'm in favor of such things, as long as we
maintain a policy of "in cases of conflict, OMPI/high performance MPI
wins".


On Mar 12, 2009, at 9:01 AM, Terry Dontje wrote:


Sun's participation in this community was to obtain a stable and
performant MPI implementation that had some research work done on the
side to improve those goals and the introduction of new features.   We
don't have problems with others using and improving on the OMPI code
base but we need to make sure such usage doesn't detract from our
primary goal of performant MPI implementation.

However, changes to the OMPI code base to allow it to morph or even
support a distributed OS does cause for some concern.  That is are we
opening the door to having more interfaces to support?  If so is this
wise in the fact that it seems to me we have a hard enough time trying
to focus on the MPI items?  Not to mention this definitely starts
detracting from the original goals.

--td

Andrew Lumsdaine wrote:

Hi all -- There is a meta question that I think is underlying some

of

the discussion about what to do with BTLs etc.  Namely, is Open

MPI an

MPI implementation with a portable run time system -- or is it a
distributed OS with an MPI interface?  It seems like some of the
changes being asked for (e.g., with the BTLs) reflect the latter --
but perhaps not everyone shares that view and hence the impedance
mismatch.

I doubt this is the last time that tensions will come up because of
differing views on this question.

I suggest that we come to some kind of common understanding of the
question (and answer) and structure development and administration
accordingly.

Best Regards,
Andrew Lumsdaine

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


__

Re: [OMPI devel] VT compile error: Fwd: [ofa-general] OFED 1.4.1 (rc1) is available

2009-03-05 Thread Pavel Shamis (Pasha)
Adding pointer to OFED bugzilla ticket for more information: 
https://bugs.openfabrics.org/show_bug.cgi?id=1537


Jeff Squyres wrote:

VT guys --

It looks like we still have a compile bug in OMPI 1.3.1rc4...  See below.

Do you think you can get a fix ASAP for OMPI 1.3.1final?


Begin forwarded message:


*From: *"PN" mailto:pok...@gmail.com>>
*Date: *March 5, 2009 12:51:28 AM EST
*To: *"Tziporet Koren" >
*Cc: *mailto:e...@lists.openfabrics.org>>, 
mailto:gene...@lists.openfabrics.org>>

*Subject:** Re: [ofa-general] OFED 1.4.1 is available*

HI,

I have a build error of OFED-1.4.1-rc1 under CentOS 5.2:

.
Build openmpi_gcc RPM
Running  rpmbuild --rebuild  --define '_topdir /var/tmp/OFED_topdir' 
--define 'dist %{nil}' --target x86_64 --define '_name openmpi_gcc' 
--define 'mpi_selector /usr/bin/mpi-selector' --define 
'use_mpi_selector 1' --define 'install_shell_scripts 1' --define 
'shell_scripts_basename mpivars' --define '_usr /usr' --define 'ofed 
0' --define '_prefix /usr/mpi/gcc/openmpi-1.3.1rc4' --define 
'_defaultdocdir /usr/mpi/gcc/openmpi-1.3.1rc4' --define '_mandir 
%{_prefix}/share/man' --define 'mflags -j 4' --define 
'configure_options   --with-openib=/usr 
--with-openib-libdir=/usr/lib64  CC=gcc CXX=g++ F77=gfortran 
FC=gfortran --enable-mpirun-prefix-by-default' --define 
'use_default_rpm_opt_flags 1' 
/opt/software/packages/ofed/OFED-1.4.1-rc1/OFED-1.4.1-rc1/SRPMS/openmpi-1.3.1rc4-1.src.rpm

Failed to build openmpi RPM
See /tmp/OFED.28377.logs/openmpi.rpmbuild.log


In /tmp/OFED.28377.logs/openmpi.rpmbuild.log:
.
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib 
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE 
-DBINDIR=\"/usr/mpi/gcc/openmpi-1.3.1rc4/bin\" 
-DDATADIR=\"/usr/mpi/gcc/openmpi-1.3.1rc4/share\" -DRFG  -DVT_MEMHOOK 
-DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT 
vt_iowrap_helper.o -MD -MP -MF .deps/vt_iowrap_helper.Tpo -c -o 
vt_iowrap_helper.o vt_iowrap_helper.c

mv -f .deps/vt_memhook.Tpo .deps/vt_memhook.Po
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib 
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE 
-DBINDIR=\"/usr/mpi/gcc/openmpi-1.3.1rc4/bin\" 
-DDATADIR=\"/usr/mpi/gcc/openmpi-1.3.1rc4/share\" -DRFG  -DVT_MEMHOOK 
-DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT 
rfg_regions.o -MD -MP -MF .deps/rfg_regions.Tpo -c -o rfg_regions.o 
rfg_regions.c
vt_iowrap.c:1242: error: expected declaration specifiers or '...' 
before numeric constant

vt_iowrap.c:1243: error: conflicting types for '__fprintf_chk'
mv -f .deps/vt_iowrap_helper.Tpo .deps/vt_iowrap_helper.Po
make[5]: *** [vt_iowrap.o] Error 1
make[5]: *** Waiting for unfinished jobs
mv -f .deps/vt_comp_gnu.Tpo .deps/vt_comp_gnu.Po
mv -f .deps/rfg_regions.Tpo .deps/rfg_regions.Po
make[5]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt/vtlib'

make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt'

make[3]: *** [all] Error 2
make[3]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt'

make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi'

make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.40739 (%build)


RPM build errors:
user jsquyres does not exist - using root
group eng10 does not exist - using root
user jsquyres does not exist - using root
group eng10 does not exist - using root
Bad exit status from /var/tmp/rpm-tmp.40739 (%build)


The error seems similar to 
http://www.open-mpi.org/community/lists/devel/2009/01/5230.php


Regards,
PN


2009/3/5 Tziporet Koren >


Hi,

OFED-1.4.1-rc1 release is available on

_http://www.openfabrics.org/downloads/OFED/ofed-1.4.1/OFED-1.4.1-rc1.tgz_


To get BUILD_ID run ofed_info

Please report any issues in bugzilla
_https://bugs.openfabrics.org/_ for OFED 1.4.1

Vladimir & Tziporet




Release information:

--

Linux Operating Systems:

   - RedHat EL4 up4:  2.6.9-42.ELsmp  *

   - RedHat EL4 up5:  2.6.9-55.ELsmp

   - RedHat EL4 up6:  2.6.9-67.ELsmp

   - RedHat EL4 up7:2.6.9-78.ELsmp

   - RedHat EL5:2.6.18-8.el5

   - RedHat EL5 up1:  2.6.18-53.el5

   - RedHat EL5 up2:  2.6.18-92.el5

   - RedHat EL5 up3:  2.6.18-128.el5

   - OEL 4.5:   2.6.9-55.ELsmp

   - OEL 5.2: 

Re: [OMPI devel] 1.3.1rc3 was borked; 1.3.1rc4 is out

2009-03-05 Thread Pavel Shamis (Pasha)
I tried to build the latest OFED with the new OMPI rc4, but it looks like the VT 
code is broken again:


gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib 
-I../extlib/otf/otflib -D_GNU_SOURCE 
-DBINDIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/bin\" 
-DDATADIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/share\" -DRFG  
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 
-march=i386 -mtune=generic -fasynchronous-unwind-tables -MT 
vt_iowrap_helper.o -MD -MP -MF .deps/vt_iowrap_helper.Tpo -c -o 
vt_iowrap_helper.o vt_iowrap_helper.c
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib 
-I../extlib/otf/otflib -D_GNU_SOURCE 
-DBINDIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/bin\" 
-DDATADIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/share\" -DRFG  
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 
-march=i386 -mtune=generic -fasynchronous-unwind-tables -MT 
rfg_regions.o -MD -MP -MF .deps/rfg_regions.Tpo -c -o rfg_regions.o 
rfg_regions.c
vt_iowrap.c:1242: error: expected declaration specifiers or '...' 
before numeric constant

vt_iowrap.c:1243: error: conflicting types for '__fprintf_chk'
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib 
-I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE 
-DBINDIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/bin\" 
-DDATADIR=\"/usr/local/mpi/gcc/openmpi-1.3.1rc4/share\" -DRFG  
-DVT_MEMHOOK -DVT_IOWRAP  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 
-fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 
-march=i386 -mtune=generic -fasynchronous-unwind-tables -MT rfg_filter.o 
-MD -MP -MF .deps/rfg_filter.Tpo -c -o rfg_filter.o rfg_filter.c

make[5]: *** [vt_iowrap.o] Error 1
make[5]: *** Waiting for unfinished jobs
mv -f .deps/vt_iowrap_helper.Tpo .deps/vt_iowrap_helper.Po
mv -f .deps/rfg_filter.Tpo .deps/rfg_filter.Po
mv -f .deps/rfg_regions.Tpo .deps/rfg_regions.Po
make[5]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt/vtlib'

make[4]: *** [all-recursive] Error 1
make[4]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt'

make[3]: *** [all] Error 2
make[3]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt/vt'

make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi/contrib/vt'

make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/var/tmp/OFED_topdir/BUILD/openmpi-1.3.1rc4/ompi'

make: *** [all-recursive] Error 1
error: Bad exit status from /var/tmp/rpm-tmp.43206 (%build)



Ralph H. Castain wrote:

Looks okay to me Brian - I went ahead and filed the CMR and sent it on to
Brad for approval.

Ralph


  

On Tue, 3 Mar 2009, Brian W. Barrett wrote:



On Tue, 3 Mar 2009, Jeff Squyres wrote:

  

1.3.1rc3 had a race condition in the ORTE shutdown sequence.  The only
difference between rc3 and rc4 was a fix for that race condition.
Please
test ASAP:

  http://www.open-mpi.org/software/ompi/v1.3/


I'm sorry, I've failed to test rc1 & rc2 on Catamount.  I'm getting a
compile
failure in the ORTE code.  I'll do a bit more testing and send Ralph an
e-mail this afternoon.
  

Attached is a patch against v1.3 branch that makes it work on Red Storm.
I'm not sure it's right, so I'm just e-mailing it rather than committing..
Sorry Ralph, but can you take a look? :(

Brian___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Fwd: [OMPI users] OpenMPI with openib partitions

2008-10-07 Thread Pavel Shamis (Pasha)

Matt,
For all 1.2.X versions you should use btl_openib_ib_pkey_val
In ongoing 1.3 version the parameter was renamed to btl_openib_of_pkey_val.

BTW we plan to release 1.2.8 version very soon and it will include the 
partition bug fix.


Regards,
Pasha

Matt Burgess wrote:

Pasha,

With your patch and parameter suggestion, it works! So to be clear 
btl_openib_ib_pkey_val is for 1.2.6 and btl_openib_of_pkey_val is for 
1.2.7?


Thanks again,
Matt

On Tue, Oct 7, 2008 at 12:24 PM, Pavel Shamis (Pasha) 
mailto:pa...@dev.mellanox.co.il>> wrote:


Matt,
Can you please run " cat
/sys/class/infiniband/mlx4_0/ports/1/pkeys/* " on your d2-ib,d3-ib.
I would like to check the partition configuration.

    Oh, BTW, I see that the command line in the previous email was wrong.
    Please use the following command line (the parameter name should be
    "btl_openib_ib_pkey_val" for OMPI 1.2.6, and my patch accepts
    HEX/DEC values):
/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -H d2-ib,d3-ib -mca btl
openib,self -mca btl_openib_ib_pkey_val 0x8109
/cluster/pallas/x86_64-ib/IMB-MPI1

Ompi 1.2.6 version should work ok with this patch.


Thanks,
Pasha

Matt Burgess wrote:

Pasha,

Thanks for the patch. Unfortunately, it doesn't seem like that
fixed the problem. I realized earlier I didn't mention what
version of OpenMPI I was trying - it's 1.2.6. Should I be trying 1.2.7 with this patch?

    Thanks,
Matt

2008/10/7 Pavel Shamis (Pasha) mailto:pa...@dev.mellanox.co.il>
<mailto:pa...@dev.mellanox.co.il
<mailto:pa...@dev.mellanox.co.il>>>


   Matt,
   Can you please try attached patch ? I guess it will resolve
this
   issue.

   Thanks,
   Pasha

   Matt Burgess wrote:

   Lenny,

    Thanks for the info. It still doesn't seem to be
    working.
   My command line is:

   /opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -H d2-ib,d3-ib
-mca btl
   openib,self -mca btl_openib_of_pkey_val 33033
   /cluster/pallas/x86_64-ib/IMB-MPI1

   I don't have a
"/sys/class/infiniband/mthca0/ports/1/pkeys/"
   but I do have
"/sys/class/infiniband/mlx4_0/ports/1/pkeys/".
   It's contents are:

    0 through 127  (one file per pkey table index)
   We aren't using the opensm, but voltaire's SM on a 2012
switch.

   Thanks again,
   Matt


   On Tue, Oct 7, 2008 at 9:37 AM, Lenny Verkhovsky
   mailto:lenny.verkhov...@gmail.com>
   <mailto:lenny.verkhov...@gmail.com
<mailto:lenny.verkhov...@gmail.com>>
   <mailto:lenny.verkhov...@gmail.com
<mailto:lenny.verkhov...@gmail.com>
   <mailto:lenny.verkhov...@gmail.com
<mailto:lenny.verkhov...@gmail.com>>>> wrote:

  Hi Matt,

   It seems that the right way to do it is the following:

  -mca btl openib,self -mca btl_openib_ib_pkey_val 33033

  when the value is a decimal number of the pkey, in
your case
  0x8109 = 33033, and no need for
btl_openib_ib_pkey_ix value.

  ex.
  mpirun -np 2 -H witch2,witch3 -mca btl openib,self -mca
  btl_openib_ib_pkey_val 32769 ./mpi_p1_4_1_2 -t lt
  LT (2) (size min max avg) 1 3.511429 3.511429 3.511429

  if it's not working check cat
  /sys/class/infiniband/mthca0/ports/1/pkeys/* for
pkeys ans SM,
  maybe it's a setup.

  Pasha is currently checking this issue.

  Best regards,

  Lenny.





  On 10/7/08, *Jeff Squyres* mailto:jsquy...@cisco.com>
   <mailto:jsquy...@cisco.c

[OMPI devel] OpenIB BTL - removing btl_openib_ib_pkey_ix parameter

2008-10-07 Thread Pavel Shamis (Pasha)

Hi,

I would like to remove the btl_openib_ib_pkey_ix parameter.
The partition key index (pkey_ix) is defined locally by 
each HCA, so in most cases each host will have a different pkey index and
the user has no control over this value. Direct pkey_ix specification is therefore 
not possible, because each HCA will have a different value.
If the user wants to use a specific partition, he should specify only the 
btl_openib_ib_pkey_val parameter, and Open MPI translates it to the 
corresponding pkey_ix.
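
For reference, the value-to-index translation is essentially a scan of the 
port's pkey table. A minimal sketch (the 0x8000 bit only marks full membership, 
so it is masked off before comparing, as the openib BTL does):

/* Minimal sketch: translate a pkey value into the local pkey table index.
 * The top bit (0x8000) only encodes full/limited membership, so it is
 * masked off before comparing.  Returns the index, or -1 if not found. */
#include <infiniband/verbs.h>
#include <arpa/inet.h>

static int pkey_val_to_index(struct ibv_context *ctx, uint8_t port,
                             int max_pkeys, uint16_t wanted_val)
{
    uint16_t pkey;
    int ix;

    for (ix = 0; ix < max_pkeys; ix++) {
        if (ibv_query_pkey(ctx, port, ix, &pkey) != 0)
            continue;                          /* could not read this slot */
        if ((ntohs(pkey) & 0x7fff) == (wanted_val & 0x7fff))
            return ix;
    }
    return -1;
}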


I think the btl_openib_ib_pkey_ix is useless and I would like to remove it.

Comments ?

--
Pavel Shamis (Pasha)
Mellanox Technologies LTD.



Re: [OMPI devel] Fwd: [OMPI users] OpenMPI with openib partitions

2008-10-07 Thread Pavel Shamis (Pasha)

Matt,
Can you please run " cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/* " 
on your d2-ib,d3-ib.

I would like to check the partition configuration.

Oh, BTW, I see that the command line in the previous email was wrong.
Please use the following command line (the parameter name should be 
"btl_openib_ib_pkey_val" for OMPI 1.2.6, and my patch accepts HEX/DEC 
values):
/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -H d2-ib,d3-ib -mca btl 
openib,self -mca btl_openib_ib_pkey_val 0x8109 
/cluster/pallas/x86_64-ib/IMB-MPI1


Ompi 1.2.6 version should work ok with this patch.

Thanks,
Pasha

Matt Burgess wrote:

Pasha,

Thanks for the patch. Unfortunately, it doesn't seem like that fixed 
the problem. I realized earlier I didn't mention what version of 
OpenMPI I was trying - it's 1.2.6. Should I be trying 
1.2.7 with this patch?


Thanks,
Matt

2008/10/7 Pavel Shamis (Pasha) <mailto:pa...@dev.mellanox.co.il>>


Matt,
Can you please try attached patch ? I guess it will resolve this
issue.

Thanks,
Pasha

Matt Burgess wrote:

Lenny,

Thanks for the info. It still doesn't seem to be working.
My command line is:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -H d2-ib,d3-ib -mca btl
openib,self -mca btl_openib_of_pkey_val 33033
/cluster/pallas/x86_64-ib/IMB-MPI1

I don't have a "/sys/class/infiniband/mthca0/ports/1/pkeys/"
but I do have "/sys/class/infiniband/mlx4_0/ports/1/pkeys/".
It's contents are:

0 through 127  (one file per pkey table index)
We aren't using the opensm, but voltaire's SM on a 2012 switch.

Thanks again,
Matt


On Tue, Oct 7, 2008 at 9:37 AM, Lenny Verkhovsky
mailto:lenny.verkhov...@gmail.com>
<mailto:lenny.verkhov...@gmail.com
<mailto:lenny.verkhov...@gmail.com>>> wrote:

   Hi Matt,

    It seems that the right way to do it is the following:

   -mca btl openib,self -mca btl_openib_ib_pkey_val 33033

   when the value is a decimal number of the pkey, in your case
   0x8109 = 33033, and no need for btl_openib_ib_pkey_ix value.

   ex.
   mpirun -np 2 -H witch2,witch3 -mca btl openib,self -mca
   btl_openib_ib_pkey_val 32769 ./mpi_p1_4_1_2 -t lt
   LT (2) (size min max avg) 1 3.511429 3.511429 3.511429

    if it's not working, check cat
    /sys/class/infiniband/mthca0/ports/1/pkeys/* for the pkeys and the SM;
    maybe it's a setup issue.

   Pasha is currently checking this issue.

   Best regards,

   Lenny.





   On 10/7/08, *Jeff Squyres* mailto:jsquy...@cisco.com>
   <mailto:jsquy...@cisco.com <mailto:jsquy...@cisco.com>>> wrote:

   FWIW, if this configuration is for all of your users, you
   might want to specify these MCA params in the default MCA
   param file, or the environment, ...etc.  Just so that you
   don't have to specify it on every mpirun command line.

   See
 
 http://www.open-mpi.org/faq/?category=tuning#setting-mca-params.




   On Oct 7, 2008, at 5:43 AM, Lenny Verkhovsky wrote:

   Sorry, misunderstood the question,

   thanks for Pasha the right command line will be

   -mca btl openib,self -mca btl_openib_of_pkey_val 0x8109
   -mca btl_openib_of_pkey_ix 1

   ex.

   #mpirun -np 2 -H witch2,witch3 -mca btl openib,self
-mca
   btl_openib_of_pkey_val 0x8001 -mca
btl_openib_of_pkey_ix 1
   ./mpi_p1_4_TRUNK -t lt
   LT (2) (size min max avg) 1 3.443480 3.443480 3.443480


   Best regards

   Lenny.


   On 10/6/08, Jeff Squyres mailto:jsquy...@cisco.com>
   <mailto:jsquy...@cisco.com
<mailto:jsquy...@cisco.com>>> wrote: On Oct 5, 2008, at

   1:22 PM, Lenny Verkhovsky wrote:

   you should probably use -mca tcp,self  -m

Re: [OMPI devel] Fwd: [OMPI users] OpenMPI with openib partitions

2008-10-07 Thread Pavel Shamis (Pasha)
 parameter I am missing?

I was successful using tcp only:

/opt/openmpi-ib/1.2.6/bin/mpirun -np 2 -machinefile
machinefile -mca btl tcp,self -mca btl_openib_max_btls 1
-mca btl_openib_ib_pkey_val 0x8109
/cluster/pallas/x86_64-ib/IMB-MPI1



Thanks,
Matt Burgess

___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users

___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres

Cisco Systems


___
users mailing list
us...@open-mpi.org <mailto:us...@open-mpi.org>
http://www.open-mpi.org/mailman/listinfo.cgi/users



-- 
Jeff Squyres

Cisco Systems






_______
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
--
Pavel Shamis (Pasha)
Mellanox Technologies LTD.

Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 19490)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -558,7 +558,7 @@ static int init_one_hca(opal_list_t *btl
  goto dealloc_pd;
 }

-ret = OMPI_SUCCESS; 
+ret = OMPI_SUCCESS;
 /* Note ports are 1 based hence j = 1 */
 for(i = 1; i <= hca->ib_dev_attr.phys_port_cnt; i++){
 struct ibv_port_attr ib_port_attr;
@@ -580,7 +580,7 @@ static int init_one_hca(opal_list_t *btl
 uint16_t pkey,j;
 for (j=0; j < hca->ib_dev_attr.max_pkeys; j++) {
 ibv_query_pkey(hca->ib_dev_context, i, j, &pkey);
-pkey=ntohs(pkey);
+pkey=ntohs(pkey) & 0x7fff;
 if(pkey == mca_btl_openib_component.ib_pkey_val){
 ret = init_one_port(btl_list, hca, i, j, 
&ib_port_attr);
 break;
Index: ompi/mca/btl/openib/btl_openib_ini.c
===
--- ompi/mca/btl/openib/btl_openib_ini.c(revision 19490)
+++ ompi/mca/btl/openib/btl_openib_ini.c(working copy)
@@ -90,8 +90,6 @@ static int parse_line(parsed_section_val
 static void reset_section(bool had_previous_value, parsed_section_values_t *s);
 static void reset_values(ompi_btl_openib_ini_values_t *v);
 static int save_section(parsed_section_values_t *s);
-static int intify(char *string);
-static int intify_list(char *str, uint32_t **values, int *len);
 static inline void show_help(const char *topic);


@@ -364,14 +362,14 @@ static int parse_line(parsed_section_val
all whitespace at the beginning and ending of the value. */

 if (0 == strcasecmp(key_buffer, "vendor_id")) {
-if (OMPI_SUCCESS != (ret = intify_list(value, &sv->vendor_ids, 
+if (OMPI_SUCCESS != (ret = ompi_btl_openib_ini_intify_list(value, 
&sv->vendor_ids, 
&sv->vendor_ids_len))) {
 return ret;
 }
 }

 else if (0 == strcasecmp(key_buffer, "vendor_part_id")) {
-if (OMPI_SUCCESS != (ret = intify_list(value, &sv->vendor_part_ids,
+if (OMPI_SUCCESS != (ret = ompi_btl_openib_ini_intify_list(value, 
&sv->vendor_part_ids,
&sv->vendor_part_ids_len))) {
 return ret;
 }
@@ -379,13 +377,13 @@ static int parse_line(parsed_section_val

 else if (0 == strcasecmp(key_buffer, "mtu")) {
 /* Single value */
-sv->values.mtu = (uint32_t) intify(value);
+sv->values.mtu = (uint32_t) ompi_btl_openib_ini_intify(value);
 sv->values.mtu_set = true;
 }

 else if (0 == strcasecmp(key_buffer, "use_eager_rdma")) {
 /* Single value */
-sv->values.use_eager_rdma = (uint32_t) intify(value);
+sv->values.use_eager_rdma = (uint32_t) 
ompi_btl_openib_ini_intify(value);
 sv->values.use_eager_rdma_set = true;
 }

@@ -547,7 +545,7 @@ static int save_section(parsed_section_v
 /*
  * Do string-to-integer conversion, for both hex and decimal numbers
  */
-static int intify(char *str)
+int ompi_btl_openib_ini_intify(char *str)
 {
 while (isspace(*str)) {
 ++str;
@@ -568,7 +566,7 @@ static int intify(char *str)
 /*
  * Take a comma-delimite
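
For context, the intify helper that the patch above renames does the usual 
hex/decimal string conversion. A standalone sketch of that idiom (not the 
exact OMPI code) is simply strtol with base 0:

/* Standalone sketch of an "intify"-style helper: parse a decimal or
 * 0x-prefixed hexadecimal string.  Not the exact OMPI implementation,
 * just the usual strtol idiom it is based on. */
#include <stdio.h>
#include <stdlib.h>

static long intify(const char *str)
{
    /* base 0 lets strtol autodetect "33033", "0x8109", etc. */
    return strtol(str, NULL, 0);
}

int main(void)
{
    printf("%ld %ld\n", intify("33033"), intify("0x8109"));  /* prints: 33033 33033 */
    return 0;
}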

Re: [OMPI devel] openib component lock

2008-08-06 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:
In working on https://svn.open-mpi.org/trac/ompi/ticket/1434, I see 
fairly inconsistent use of the mca_openib_btl_component.ib_lock, such 
as within btl_openib_proc.c.


In fixing #1434, do I need to be mindful doing all the locking 
properly, or is it already so broken that it doesn't really matter, 
and all I need to do is ensure that I don't put in any bozo deadlocks?


Hmm, good question... I never tested a threaded build of the openib BTL, so I 
don't know if it really works. But we try to keep it thread safe.

(All the threading code in the openib BTL requires serious review.)


Re: [OMPI devel] IBCM error

2008-08-03 Thread Pavel Shamis (Pasha)

Thanks for update.

Sean Hefty wrote:

I've committed a patch to my libibcm git tree with the values

IB_CM_ASSIGN_SERVICE_ID
IB_CM_ASSIGN_SERVICE_ID_MASK

these will be in libibcm release 1.0.3, which will shortly...

- Sean


  




Re: [OMPI devel] trunk hangs since r19010

2008-07-29 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:


This used to be true, but I think we changed it a while ago (Pasha: do 
you remember?) because Mellanox HCAs are capable of send-to-self 
(process) and there were no code changes necessary to enable it.  So 
it allowed a slightly simpler command line.  This was quite a while 
ago, IIRC.

Yep, correct.

FYI, in my MTT testing I also see a lot of killed tests.


Re: [OMPI devel] IBCM error

2008-07-17 Thread Pavel Shamis (Pasha)

Sean Hefty wrote:

It is not zero, it should be:
#define IB_CM_ASSIGN_SERVICE_ID __cpu_to_be64(0x0200ULL)

Unfortunately, the value is defined in the kernel-level IBCM and is not
exposed to user level.
Can you please expose it at user level (infiniband/cm.h)?



Oops - good catch.  I will add the assign ID and mask values to the header file
for the next release.  Until then, can you try using the values given in the
kernel header file and let me know if it solves the problem?
  

I already prepared a patch for OMPI that defines the value.
A few people have already reported that the patch is OK ( 
https://svn.open-mpi.org/trac/ompi/ticket/1388 )


Pasha


Re: [OMPI devel] IBCM error

2008-07-17 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:

On Jul 16, 2008, at 11:07 AM, Don Kerr wrote:


Pasha added configure switches for this about a week ago:

   --en|disable-openib-ibcm
   --en|disable-openib-rdmacm
I like these flags but I thought there was going to be a run time 
check for cases where Open MPI is built on a system that has ibcm 
support but is later run on a system without ibcm support.



Yes, there are.

- if the /dev/infiniband/ucm* files aren't there, we silently return 
"not supported" and skip ibcm
- if ib_cm_open_device() (the first function call) fails, we assume 
that IBCM simply isn't supported on this platform and silently return 
"not supported" and skip ibcm


Right now we are skipping IBCM all the time. IBCM will be used only if the user 
specifies it explicitly via the include/exclude interface.


Pasha


Re: [OMPI devel] IBCM error

2008-07-17 Thread Pavel Shamis (Pasha)

Sean Hefty wrote:

If you don't care what the service ID is, you can specify 0, and the kernel will
assign one.  The assigned value can be retrieved by calling ib_cm_attr_id().
(I'm assuming that you communicate the IDs out of band somehow.)
  

It is not zero, it should be:
#define IB_CM_ASSIGN_SERVICE_ID __cpu_to_be64(0x0200ULL)

Unfortunately, the value is defined in the kernel-level IBCM and is not 
exposed to user level.

Can you please expose it at user level (infiniband/cm.h)?
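
Until the constant is exported by infiniband/cm.h, a userspace fallback 
definition along these lines can be used. This is a sketch only: htobe64 from 
<endian.h> stands in for the kernel's __cpu_to_be64, and the full 64-bit value 
shown is assumed to match the kernel header, so verify it against your 
kernel's ib_cm.h before using it:

/* Sketch of a userspace fallback for the "let the kernel assign a service ID"
 * flag, for use until infiniband/cm.h exports it.  htobe64() (<endian.h>)
 * replaces the kernel's __cpu_to_be64(); the constant below is assumed to
 * match the kernel's ib_cm.h definition; verify against your kernel. */
#include <endian.h>
#include <stdint.h>

#ifndef IB_CM_ASSIGN_SERVICE_ID
#define IB_CM_ASSIGN_SERVICE_ID ((uint64_t) htobe64(0x0200000000000000ULL))
#endif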

Regards,
Pasha



Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)

I opened a ticket for the bug:
https://svn.open-mpi.org/trac/ompi/ticket/1389

Ralph Castain wrote:

It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and I the other day. From what I can see, it looks
like you app was a simple "hello" - correct?

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent; it takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] IBCM error

2008-07-15 Thread Pavel Shamis (Pasha)



Guess what - we don't always put them out there because - tada - we don't
use them! What goes out on the backend is a stripped down version of
libraries we require. Given the huge number of libraries people provide
(looking at the bigger, beyond OMPI picture), it consumes a lot of limited
disk space to install every library on every node. So sometimes we build our
own rpm's to pick up only what we need.

As long as --without-rdmacm --without-ibcm are present, then we are happy.

  

FYI
I recently added options that allow enable/disable all the *cm stuff:

 --enable-openib-ibcmEnable Open Fabrics IBCM support in openib BTL
 (default: enabled)
 --enable-openib-rdmacm  Enable Open Fabrics RDMACM support in openib BTL
 (default: enabled)




Re: [OMPI devel] IBCM error

2008-07-15 Thread Pavel Shamis (Pasha)


I need to check on this.  You may want to look at section A3.2.3 of 
the spec.
If you set the first byte (network order) to 0x00, and the 2nd byte 
to 0x01,
then you hit a 'reserved' range that probably isn't being used 
currently.


If you don't care what the service ID is, you can specify 0, and the 
kernel will
assign one.  The assigned value can be retrieved by calling 
ib_cm_attr_id().

(I'm assuming that you communicate the IDs out of band somehow.)



Ok; we'll need to check into this.  I don't remember the ordering -- 
we might actually be communicating the IDs before calling 
ib_cm_listen() (since we were simply using the PIDs, we could do that).


Thanks for the tip!  Pasha -- can you look into this?
It looks like we prepare the modex message during the query stage, so 
the order looks OK.
Unfortunately, on my machines the ibcm module does not create 
"/dev/infiniband/ucm*", so I cannot test the functionality.


Regards,
Pasha.



Re: [OMPI devel] Segfault in 1.3 branch

2008-07-15 Thread Pavel Shamis (Pasha)



It looks like a new issue to me, Pasha. Possibly a side consequence of the
IOF change made by Jeff and I the other day. From what I can see, it looks
like you app was a simple "hello" - correct?
  

Yep, it is a simple hello application.

If you look at the error, the problem occurs when mpirun is trying to route
a message. Since the app is clearly running at this time, the problem is
probably in the IOF. The error message shows that mpirun is attempting to
route a message to a jobid that doesn't exist. We have a test in the RML
that forces an "abort" if that occurs.

I would guess that there is either a race condition or memory corruption
occurring somewhere, but I have no idea where.

This may be the "new hole in the dyke" I cautioned about in earlier notes
regarding the IOF... :-)

Still, given that this hits rarely, it probably is a more acceptable bug to
leave in the code than the one we just fixed (duplicated stdin)...
  
It is not such a rare issue: 19 failures in my MTT run 
(http://www.open-mpi.org/mtt/index.php?do_redir=765).


Pasha

Ralph



On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)"  wrote:

  

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent; it takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] IBCM error

2008-07-14 Thread Pavel Shamis (Pasha)




Should we not even build support for it?
I think the IBCM CPC build should be enabled by default. IBCM is 
supplied with OFED, so it should not be any problem during install.


PRO: don't even allow the possibility of running with it, because we 
know that there are issues with the ibcm userspace library (i.e., 
reduce problem reports from users)


PRO: users don't have to have libibcm installed on compute nodes 
(we've actually gotten some complaints about this)
We got complaints only for the case when OMPI was built on a platform with 
IBCM and then run on a platform without IBCM.  Also, we did not 
have an option to disable
IBCM during compilation, so there was actually no way to install OMPI 
on the compute node. We added the option and the problem was resolved.
In most cases the OFED install is the same on all nodes, and it should 
not be any problem to build IBCM support by default.


Pasha




Re: [OMPI devel] IBCM error

2008-07-14 Thread Pavel Shamis (Pasha)

I can add at the head of the query function something like:

if (!mca_btl_openib_component.cpc_explicitly_defined)
   return OMPI_ERR_NOT_SUPPORTED;


Jeff Squyres wrote:

On Jul 14, 2008, at 3:59 AM, Lenny Verkhovsky wrote:


Seems to be fixed.


Well, it's "fixed" in that Pasha turned off the error message.  But 
the same issue is undoubtedly happening.


I was asking for something a little stronger: perhaps we should 
actually have IBCM not try to be used unless it's specifically asked 
for.  Or maybe it shouldn't even build itself unless specifically 
asked for (which would obviously take care of the run-time issues as 
well).


The whole point of doing IBCM was to have a simple and fast mechanism 
for IB wireup.  But with these two problems (IBCM not properly 
installed on all systems, and ib_cm_listen() fails periodically), it 
more or less makes it unusable.  Therefore we shouldn't recommend it 
to production customers, and per precedent elsewhere in the code base, 
we should either not build it by default and/or not use it unless 
specifically asked for.






[OMPI devel] Segfault in 1.3 branch

2008-07-14 Thread Pavel Shamis (Pasha)

Please see http://www.open-mpi.org/mtt/index.php?do_redir=764

The error is not consistent; it takes a lot of iterations to reproduce it.
In my MTT testing I have seen it a few times.

Is it a known issue?

Regards,
Pasha


Re: [OMPI devel] IBCM error

2008-07-13 Thread Pavel Shamis (Pasha)

Fixed in https://svn.open-mpi.org/trac/ompi/changeset/18897

Are there any other known IBCM issues?

Regards,
Pasha

Jeff Squyres wrote:
I think you said opposite things: Lenny's command line did not 
specifically ask for ibcm, but it was used anyway.  Lenny -- did you 
explicitly request it somewhere else (e.g., env var or MCA param file)?


I suspect that you did not; I suspect (without looking at the code 
again) that ibcm tried to select itself and failed on the 
ibcm_listen() call, so it fell back to oob.  This might have to be 
another workaround in OMPI, perhaps something like this:


if (ibcm_listen() fails)
   if (ibcm explicitly requested)
   print_warning()
   fail to use ibcm

Has this been filed as a bug at openfabrics.org?  I don't think that I 
filed it when Brad and I were testing on RoadRunner -- it would 
probably be good if someone filed it.




On Jul 13, 2008, at 8:56 AM, Lenny Verkhovsky wrote:


Pasha is right, I didn't disable it.

On 7/13/08, Pavel Shamis (Pasha)  wrote: 
Jeff Squyres wrote:
Brad and I did some scale testing of IBCM and saw this error 
sometimes.  It seemed to happen with higher frequency when you 
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a 
definitive cause or solution.  My best guess is that there is 
something wonky about multiple processes simultaneously interacting 
with the IBCM kernel driver from userspace; but I don't know jack 
about kernel stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten 
about it.  :-(  Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.





On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:

Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] 
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] IBCM error

2008-07-13 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:
Brad and I did some scale testing of IBCM and saw this error 
sometimes.  It seemed to happen with higher frequency when you 
increased the number of processes on a single node.


I talked to Sean Hefty about it, but we never figured out a definitive 
cause or solution.  My best guess is that there is something wonky 
about multiple processes simultaneously interacting with the IBCM 
kernel driver from userspace; but I don't know jack about kernel 
stuff, so that's a total SWAG.


Thanks for reminding me of this issue; I admit that I had forgotten 
about it.  :-(  Pasha -- should IBCM not be the default?

It is not the default. I guess Lenny configured it explicitly, didn't he?

Pasha.





On Jul 13, 2008, at 7:08 AM, Lenny Verkhovsky wrote:


Hi,

I am getting this error sometimes.

/home/USERS/lenny/OMPI_COMP_PATH/bin/mpirun -np 100 -hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hostfile 
/home/USERS/lenny/TESTS/COMPILERS/hello
[witch24][[32428,1],96][../../../../../ompi/mca/btl/openib/connect/btl_openib_connect_ibcm.c:769:ibcm_component_query] 
failed to ib_cm_listen 10 times: rc=-1, errno=22

Hello world! I'm 0 of 100 on witch2


Best Regards

Lenny.


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] open ib dependency question

2008-07-10 Thread Pavel Shamis (Pasha)

FYI the issue was resolved - https://svn.open-mpi.org/trac/ompi/ticket/1376
Regards,
Pasha

Bogdan Costescu wrote:

On Thu, 3 Jul 2008, Pavel Shamis (Pasha) wrote:

I had a similar issue recently. It would be nice to have an option to 
enable/disable the *CM support via configure flags.


Not sure if this is related... I am looking at using a nightly 1.3 
snapshot and I get this type of error messages when running:


[n020205][[36506,1],0][connect/btl_openib_connect_ibcm.c:723:ibcm_component_query] 
failed to open IB CM device: /dev/infiniband/ucm0


which is actually right, as /dev/infiniband on the nodes doesn't 
contain ucm0. On the same cluster, OpenMPI 1.2.7rc2 works fine; the 
configure options for building them are the same.


The output of ldd shows for the binary linked with 1.3a:

libibcm.so.1 => /opt/ofed/1.2.5.4/lib64/libibcm.so.1

while this is missing from the binary linked with 1.2. Even after 
printing these messages, the binary linked with 1.3a works; it works 
even when I specify "--mca btl openib,self" so I think that the IB 
stack is still being used (there is also a TCP/GigE network which 
could be chosen otherwise).


I don't know whether this is caused by a somehow inconsistent setup of 
the system, but I would welcome an option to make 1.3a behave like 1.2.






Re: [OMPI devel] open ib dependency question

2008-07-08 Thread Pavel Shamis (Pasha)




Pasha -- do you want to open a ticket?


Done. https://svn.open-mpi.org/trac/ompi/ticket/1376
Pasha.


Re: [OMPI devel] open ib dependency question

2008-07-06 Thread Pavel Shamis (Pasha)

I added a patch to the ticket; please review.

Pasha.
Jeff Squyres wrote:

Ok:

https://svn.open-mpi.org/trac/ompi/ticket/1375

I think any of us could do this -- it's pretty straightforward.  No 
guarantees on when I can get to it; my 1.3 list is already pretty long...



On Jul 3, 2008, at 6:20 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

Do you need configury to disable building ibcm / rdmacm support?

The more I think about it, the more I think that these would be good 
features to have for v1.3...
I had a similar issue recently. It would be nice to have an option to 
enable/disable the *CM support via configure flags.



On Jul 3, 2008, at 2:52 AM, Don Kerr wrote:

I did not think it was required but it hung me up when I built ompi 
on one system which had the ibcm libraries and then ran on a system 
without the ibcm libs. I had another issue on the system without 
ibcm libs which prevented my building there but I will go down that 
path again. Thanks.


Jeff Squyres wrote:
That is the IBCM library for the IBCM CPC -- IB connection manager 
stuff.


It's not *necessary*; you could use the OOB CPC if you want to.

That being said, OMPI currently builds support for it (and links 
it in) if it finds the right headers and library files. We don't 
currently have configury to disable this behavior (and *not* build 
RDMACM and/or IBCM support).


Do you have a problem / need to disable building support for IBCM?



On Jul 2, 2008, at 12:02 PM, Don Kerr wrote:



It appears that the mca_btl_openib.so has a dependency on 
libibcm.so. Is this necessary?



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] open ib dependency question

2008-07-06 Thread Pavel Shamis (Pasha)
I see the same issue on my Mellanox OFED 1.3 setup: the IBCM module is loaded, but 
there is no such device in the system.

Jeff, this looks like some bug in the IBCM stuff (not OMPI).
I think we should print the error only if ibcm was explicitly 
selected by the user, but at the CPC level there is no way to know

about the explicit selection... Maybe just hide the print?

Bogdan Costescu wrote:

On Thu, 3 Jul 2008, Pavel Shamis (Pasha) wrote:

I had a similar issue recently. It would be nice to have an option to 
enable/disable the *CM support via configure flags.


Not sure if this is related... I am looking at using a nightly 1.3 
snapshot and I get this type of error messages when running:


[n020205][[36506,1],0][connect/btl_openib_connect_ibcm.c:723:ibcm_component_query] 
failed to open IB CM device: /dev/infiniband/ucm0


which is actually right, as /dev/infiniband on the nodes doesn't 
contain ucm0. On the same cluster, OpenMPI 1.2.7rc2 works fine; the 
configure options for building them are the same.


The output of ldd shows for the binary linked with 1.3a:

libibcm.so.1 => /opt/ofed/1.2.5.4/lib64/libibcm.so.1

while this is missing from the binary linked with 1.2. Even after 
printing these messages, the binary linked with 1.3a works; it works 
even when I specify "--mca btl openib,self" so I think that the IB 
stack is still being used (there is also a TCP/GigE network which 
could be chosen otherwise).


I don't know whether this is caused by a somehow inconsistent setup of 
the system, but I would welcome an option to make 1.3a behave like 1.2.






Re: [OMPI devel] open ib dependency question

2008-07-03 Thread Pavel Shamis (Pasha)



Ok:

https://svn.open-mpi.org/trac/ompi/ticket/1375

I think any of us could do this -- it's pretty straightforward.  No 
guarantees on when I can get to it; my 1.3 list is already pretty long...

No problem. I will take this one.
Pasha.



On Jul 3, 2008, at 6:20 AM, Pavel Shamis (Pasha) wrote:


Jeff Squyres wrote:

Do you need configury to disable building ibcm / rdmacm support?

The more I think about it, the more I think that these would be good 
features to have for v1.3...
I had a similar issue recently. It would be nice to have an option to 
enable/disable the *CM support via configure flags.



On Jul 3, 2008, at 2:52 AM, Don Kerr wrote:

I did not think it was required but it hung me up when I built ompi 
on one system which had the ibcm libraries and then ran on a system 
without the ibcm libs. I had another issue on the system without 
ibcm libs which prevented my building there but I will go down that 
path again. Thanks.


Jeff Squyres wrote:
That is the IBCM library for the IBCM CPC -- IB connection manager 
stuff.


It's not *necessary*; you could use the OOB CPC if you want to.

That being said, OMPI currently builds support for it (and links 
it in) if it finds the right headers and library files. We don't 
currently have configury to disable this behavior (and *not* build 
RDMACM and/or IBCM support).


Do you have a problem / need to disable building support for IBCM?



On Jul 2, 2008, at 12:02 PM, Don Kerr wrote:



It appears that the mca_btl_openib.so has a dependency on 
libibcm.so. Is this necessary?



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] open ib dependency question

2008-07-03 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:

Do you need configury to disable building ibcm / rdmacm support?

The more I think about it, the more I think that these would be good 
features to have for v1.3...
I had a similar issue recently. It would be nice to have an option to 
disable/enable *CM via configure flags.



On Jul 3, 2008, at 2:52 AM, Don Kerr wrote:

I did not think it was required but it hung me up when I built ompi 
on one system which had the ibcm libraries and then ran on a system 
without the ibcm libs. I had another issue on the system without ibcm 
libs which prevented my building there but I will go down that path 
again. Thanks.


Jeff Squyres wrote:
That is the IBCM library for the IBCM CPC -- IB connection manager 
stuff.


It's not *necessary*; you could use the OOB CPC if you want to.

That being said, OMPI currently builds support for it (and links it 
in) if it finds the right headers and library files. We don't 
currently have configury to disable this behavior (and *not* build 
RDMACM and/or IBCM support).


Do you have a problem / need to disable building support for IBCM?



On Jul 2, 2008, at 12:02 PM, Don Kerr wrote:



It appears that the mca_btl_openib.so has a dependency on 
libibcm.so. Is this necessary?



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel







Re: [OMPI devel] Is trunk broken ?

2008-06-19 Thread Pavel Shamis (Pasha)

I did a fresh checkout and everything works well.
So it looks like some 'svn up' screwed up my working copy.
Ralph, thanks for the help!

Ralph H Castain wrote:

Hmmm...something isn't right, Pasha. There is simply no way you should be
encountering this error. You are picking up the wrong grpcomm module.

I went ahead and fixed the grpcomm/basic module, but as I note in the commit
message, that is now an experimental area. The grpcomm/bad module is the
default for that reason.

Check to ensure you have the orte/mca/grpcomm/bad directory, and that it is
getting built. My guess is that you have a corrupted checkout or build and
that the component is either missing or not getting built.


On 6/19/08 1:37 PM, "Pavel Shamis (Pasha)"  wrote:

  

Ralph H Castain wrote:


I can't find anything wrong so far. I'm waiting in a queue on Odin to try
there since Jeff indicated you are using rsh as a launcher, and that's the
only access I have to such an environment. Guess Odin is being pounded
because the queue isn't going anywhere.
  
  

 I use ssh., here is command line:
./bin/mpirun -np 2 -H sw214,sw214 -mca btl openib,sm,self
./osu_benchmarks-3.0/osu_latency


Meantime, I'm building on RoadRunner and will test there (TM enviro).


On 6/19/08 1:18 PM, "Pavel Shamis (Pasha)"  wrote:

  
  

You'll have to tell us something more than that, Pasha. What kind of
environment, what rev level were you at, etc.
  
  
  

Ahh, sorry :) I run on Linux x86_64 Sles10 sp1. (Open MPI) 1.3a1r18682M
, OFED 1.3.1
Pasha.



So far as I know, the trunk is fine.


On 6/19/08 12:01 PM, "Pavel Shamis (Pasha)" 
wrote:

  
  
  

I tried to run trunk on my machines and I got follow error:

[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/grpcomm_base_modex.c at line 451
[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file grpcomm_basic_module.c at line 560
[sw214:04365]
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead
of "Success" (0)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  
  

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Is trunk broken ?

2008-06-19 Thread Pavel Shamis (Pasha)

Ralph H Castain wrote:

Ha! I found it - you left out one very important detail. You are specifying
the use of the grpcomm basic module instead of the default "bad" one.
  

Hmm, I did not specify any "grpcomm" module.
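
For the record, explicitly selecting a grpcomm component would look 
something like the line below; the component name "bad" is taken from 
Ralph's note above, and the rest just mirrors my earlier command line:

./bin/mpirun -mca grpcomm bad -np 2 -H sw214,sw214 -mca btl openib,sm,self ./osu_benchmarks-3.0/osu_latency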

I just checked and that module is indeed showing a problem. I'll see what I
can do.

For now, though, just use the default grpcomm and it will work fine.


On 6/19/08 1:18 PM, "Pavel Shamis (Pasha)"  wrote:

  

You'll have to tell us something more than that, Pasha. What kind of
environment, what rev level were you at, etc.
  
  

Ahh, sorry :) I run on Linux x86_64 Sles10 sp1. (Open MPI) 1.3a1r18682M
, OFED 1.3.1
Pasha.


So far as I know, the trunk is fine.


On 6/19/08 12:01 PM, "Pavel Shamis (Pasha)" 
wrote:

  
  

I tried to run trunk on my machines and I got follow error:

[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/grpcomm_base_modex.c at line 451
[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file grpcomm_basic_module.c at line 560
[sw214:04365] 
--

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead
of "Success" (0)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Is trunk broken ?

2008-06-19 Thread Pavel Shamis (Pasha)

Ralph H Castain wrote:

I can't find anything wrong so far. I'm waiting in a queue on Odin to try
there since Jeff indicated you are using rsh as a launcher, and that's the
only access I have to such an environment. Guess Odin is being pounded
because the queue isn't going anywhere.
  

I use ssh; here is the command line:
./bin/mpirun -np 2 -H sw214,sw214 -mca btl openib,sm,self   
./osu_benchmarks-3.0/osu_latency

Meantime, I'm building on RoadRunner and will test there (TM enviro).


On 6/19/08 1:18 PM, "Pavel Shamis (Pasha)"  wrote:

  

You'll have to tell us something more than that, Pasha. What kind of
environment, what rev level were you at, etc.
  
  

Ahh, sorry :) I run on Linux x86_64 Sles10 sp1. (Open MPI) 1.3a1r18682M
, OFED 1.3.1
Pasha.


So far as I know, the trunk is fine.


On 6/19/08 12:01 PM, "Pavel Shamis (Pasha)" 
wrote:

  
  

I tried to run trunk on my machines and I got follow error:

[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/grpcomm_base_modex.c at line 451
[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file grpcomm_basic_module.c at line 560
[sw214:04365] 
--

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead
of "Success" (0)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  
  

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Is trunk broken ?

2008-06-19 Thread Pavel Shamis (Pasha)



You'll have to tell us something more than that, Pasha. What kind of
environment, what rev level were you at, etc.
  
Ahh, sorry :) I run on Linux x86_64 SLES10 SP1, Open MPI 1.3a1r18682M, 
OFED 1.3.1.

Pasha.

So far as I know, the trunk is fine.


On 6/19/08 12:01 PM, "Pavel Shamis (Pasha)" 
wrote:

  

I tried to run trunk on my machines and I got follow error:

[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file base/grpcomm_base_modex.c at line 451
[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past
end of buffer in file grpcomm_basic_module.c at line 560
[sw214:04365] 
--

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_grpcomm_modex failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead
of "Success" (0)

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




[OMPI devel] Is trunk broken ?

2008-06-19 Thread Pavel Shamis (Pasha)

I tried to run the trunk on my machines and got the following error:

[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past 
end of buffer in file base/grpcomm_base_modex.c at line 451
[sw214:04367] [[16563,1],1] ORTE_ERROR_LOG: Data unpack would read past 
end of buffer in file grpcomm_basic_module.c at line 560
[sw214:04365] 
--

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

 orte_grpcomm_modex failed
 --> Returned "Data unpack would read past end of buffer" (-26) instead 
of "Success" (0)




Re: [OMPI devel] Memory hooks change testing

2008-06-12 Thread Pavel Shamis (Pasha)

In my MTT testing it looks ok, too.

Brad Benton wrote:

Brian,

This is working smoothly now:  both the configuration/build and 
execution.  So,

from my standpoint, it looks good for inclusion into the trunk.

--brad



On Wed, Jun 11, 2008 at 4:50 PM, Brian W. Barrett 
mailto:brbar...@open-mpi.org>> wrote:


Brad unfortunately figured out I had done something to annoy the
gods of
mercurial and the repository below didn't contain all the changes
advertised (and in fact didn't work).  I've since rebuilt the
repository
and verified it works now.  I'd recommend deleting your existing
clones of
the repository below and starting over.

Sorry about that,

Brian


On Wed, 11 Jun 2008, Brian Barrett wrote:

> Did anyone get a chance to test (or think about testing) this?
 I'd like to
> commit the changes on Friday evening, if I haven't heard any
negative
> feedback.
>
> Brian
>
> On Jun 9, 2008, at 8:56 PM, Brian Barrett wrote:
>
>> Hi all -
>>
>> Per the RFC I sent out last week, I've implemented a revised
behavior of
>> the memory hooks for high speed networks.  It's a bit different
than the
>> RFC proposed, but still very minor and fairly straight foward.
>>
>> The default is to build ptmalloc2 support, but as an almost
complete
>> standalone library.  If the user wants to use ptmalloc2, he
only has to add
>> -lopenmpi-malloc to his link line.  Even when standalone and
openmpi-malloc
>> is not linked in, we'll still intercept munmap as it's needed
for mallopt
>> (below) and we've never had any trouble with that part of
ptmalloc2 (it's
>> easy to intercept).
>>
>> As a *CHANGE* in behavior, if leave_pinned support is turned on
and there's
>> no ptmalloc2 support, we will automatically enable mallopt.  As
a *CHANGE*
>> in behavior, if the user disables mallopt or mallopt is not
available and
>> leave pinned is requested, we'll abort.  I think these both
make sense and
>> are closest to expected behavior, but wanted to point them out.
 It is
>> possible for the user to disable mallopt and enable
leave_pinned, but that
>> will *only* work if there is some other mechanism for
intercepting free
>> (basically, it allows a way to ensure your using ptmalloc2
instead of
>> mallopt).
>>
>> There is also a new memory component, mallopt, which only
intercepts munmap
>> and exists only to allow users to enable mallopt while not
building the
>> ptmalloc2 component at all.  Previously, our mallopt support
was lacking in
>> that it didn't cover the case where users explicitly called
munmap in their
>> applications.  Now, it does.
>>
>> The changes are fairly small and can be seen/tested in the HG
repository
>> bwb/mem-hooks, URL below.  I think this would be a good thing
to push to
>> 1.3, as it will solve an ongoing problem on Linux (basically,
users getting
>> screwed by our ptmalloc2 implementation).
>>
>>   http://www.open-mpi.org/hg/hgwebdir.cgi/bwb/mem-hooks/
>>
>> Brian
>>
>> --
>> Brian Barrett
>> Open MPI developer
>> http://www.open-mpi.org/
>>
>>
>
>
___
devel mailing list
de...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/devel




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Pallas fails

2008-06-12 Thread Pavel Shamis (Pasha)

With 1.3a1r18643 the Pallas tests pass on my machine.
But I see new failures (an assertion) in the Intel tests: 
http://www.open-mpi.org/mtt/index.php?do_redir=733


PI_Type_struct_types_c: btl_sm.c:684: mca_btl_sm_sendi: Assertion `max_data == 
payload_size'
failed.
[sw216:32013] *** Process received signal ***
[sw216:32013] Signal: Aborted (6)
[sw216:32013] Signal code:  (-6)
[sw216:32013] [ 0] /lib64/libpthread.so.0 [0x2aba5e51ec10]
[sw216:32013] [ 1] /lib64/libc.so.6(gsignal+0x35) [0x2aba5e657b95]
[sw216:32013] [ 2] /lib64/libc.so.6(abort+0x110) [0x2aba5e658f90]
[sw216:32013] [ 3] /lib64/libc.so.6(__assert_fail+0xf6) [0x2aba5e651256]
[sw216:32013] [ 4]




Pavel Shamis (Pasha) wrote:

Last conf. call Jeff mentioned that he see some collectives failures.
In my MTT testing I also see that Pallas collectives failed - 
http://www.open-mpi.org/mtt/index.php?do_redir=682


 Alltoall

#
# Benchmarking Alltoall 
# #processes = 20 
#

   #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
0 1000 0.03 0.05 0.04
1 1000   179.15   179.22   179.18
2 1000   155.96   156.02   155.98
4 1000   156.93   156.98   156.95
8 1000   163.63   163.67   163.65
   16 1000   115.04   115.08   115.07
   32 1000   123.57   123.62   123.59
   64 1000   129.78   129.82   129.80
  128 1000   141.45   141.49   141.48
  256 1000   960.11   960.24   960.20
  512 1000   900.95   901.11   901.04
 1024 1000   921.95   922.05   922.00
 2048 1000   862.50   862.72   862.60
 4096 1000  1044.90  1044.95  1044.92
 8192 1000  1458.59  1458.77  1458.69
*** An error occurred in MPI_Alltoall
*** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0
*** An error occurred in MPI_Alltoall
*** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Memory hooks change testing

2008-06-11 Thread Pavel Shamis (Pasha)

Brian Barrett wrote:
Did anyone get a chance to test (or think about testing) this?  I'd  
like to commit the changes on Friday evening, if I haven't heard any  
negative feedback.
  

I will run it tomorrow on my cluster.

Brian

On Jun 9, 2008, at 8:56 PM, Brian Barrett wrote:

  

Hi all -

Per the RFC I sent out last week, I've implemented a revised  
behavior of the memory hooks for high speed networks.  It's a bit  
different than the RFC proposed, but still very minor and fairly  
straight foward.


The default is to build ptmalloc2 support, but as an almost complete  
standalone library.  If the user wants to use ptmalloc2, he only has  
to add -lopenmpi-malloc to his link line.  Even when standalone and  
openmpi-malloc is not linked in, we'll still intercept munmap as  
it's needed for mallopt (below) and we've never had any trouble with  
that part of ptmalloc2 (it's easy to intercept).


As a *CHANGE* in behavior, if leave_pinned support is turned on and  
there's no ptmalloc2 support, we will automatically enable mallopt.   
As a *CHANGE* in behavior, if the user disables mallopt or mallopt  
is not available and leave pinned is requested, we'll abort.  I  
think these both make sense and are closest to expected behavior,  
but wanted to point them out.  It is possible for the user to  
disable mallopt and enable leave_pinned, but that will *only* work  
if there is some other mechanism for intercepting free (basically,  
it allows a way to ensure your using ptmalloc2 instead of mallopt).


There is also a new memory component, mallopt, which only intercepts  
munmap and exists only to allow users to enable mallopt while not  
building the ptmalloc2 component at all.  Previously, our mallopt  
support was lacking in that it didn't cover the case where users  
explicitly called munmap in their applications.  Now, it does.


The changes are fairly small and can be seen/tested in the HG  
repository bwb/mem-hooks, URL below.  I think this would be a good  
thing to push to 1.3, as it will solve an ongoing problem on Linux  
(basically, users getting screwed by our ptmalloc2 implementation).


   http://www.open-mpi.org/hg/hgwebdir.cgi/bwb/mem-hooks/

Brian

--
 Brian Barrett
 Open MPI developer
 http://www.open-mpi.org/





  




[OMPI devel] Pallas fails

2008-06-04 Thread Pavel Shamis (Pasha)

On the last conf. call Jeff mentioned that he sees some collective failures.
In my MTT testing I also see that the Pallas collectives fail - 
http://www.open-mpi.org/mtt/index.php?do_redir=682


Alltoall

#
# Benchmarking Alltoall 
# #processes = 20 
#

  #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
   0 1000 0.03 0.05 0.04
   1 1000   179.15   179.22   179.18
   2 1000   155.96   156.02   155.98
   4 1000   156.93   156.98   156.95
   8 1000   163.63   163.67   163.65
  16 1000   115.04   115.08   115.07
  32 1000   123.57   123.62   123.59
  64 1000   129.78   129.82   129.80
 128 1000   141.45   141.49   141.48
 256 1000   960.11   960.24   960.20
 512 1000   900.95   901.11   901.04
1024 1000   921.95   922.05   922.00
2048 1000   862.50   862.72   862.60
4096 1000  1044.90  1044.95  1044.92
8192 1000  1458.59  1458.77  1458.69
*** An error occurred in MPI_Alltoall
*** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0
*** An error occurred in MPI_Alltoall
*** on communicator MPI COMMUNICATOR 4 SPLIT FROM 0



Re: [OMPI devel] r18551 - breaks ob1 compilation on SLES10

2008-06-03 Thread Pavel Shamis (Pasha)

The compilation passed on SLES 10 SP1.
So SP1 resolves the gcc/binutils issue.
We need to add a notice somewhere that "OMPI 1.3 cannot be compiled on 
SLES10, please update ..."


Regards,
Pasha

Pavel Shamis (Pasha) wrote:

Ralf Wildenhues wrote:

* Pavel Shamis (Pasha) wrote on Mon, Jun 02, 2008 at 02:25:13PM CEST:
 

r18551 breaks ompi compilation on SLES10 gcc 4.1.0.

I got follow error on my systems  
(http://www.open-mpi.org/mtt/index.php?do_redir=672 ):



[...]
 
/usr/lib64/gcc/x86_64-suse-linux/4.1.0/../../../../x86_64-suse-linux/bin/ld: 
.libs/pml_ob1_sendreq.o: relocation R_X86_64_PC32 against  
`mca_pml_ob1_rndv_completion' can not be used when making a shared  
object; recompile with -fPIC



The build log shows that your clock isn't set properly, so I'd first fix
that and do a complete rebuild afterwards.  

I fixed the clock and the problem still was there.

The log also shows that
.libs/pml_ob1_sendreq.o is compiled with -fPIC, so second I'd assume a
compiler or binutils bug.  The GCC bugzilla lists
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30153
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28781
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21382
as possible starting points for further investigation.  Maybe your
distributor has fixed or newer binutils packages for you.
  

I will check for SLES 10 updates.

Cheers,
Ralf

  







Re: [OMPI devel] r18551 - breaks ob1 compilation on SLES10

2008-06-02 Thread Pavel Shamis (Pasha)

Ralf Wildenhues wrote:

* Pavel Shamis (Pasha) wrote on Mon, Jun 02, 2008 at 02:25:13PM CEST:
  

r18551 breaks ompi compilation on SLES10 gcc 4.1.0.

I got follow error on my systems  
(http://www.open-mpi.org/mtt/index.php?do_redir=672 ):



[...]
  
/usr/lib64/gcc/x86_64-suse-linux/4.1.0/../../../../x86_64-suse-linux/bin/ld: 
.libs/pml_ob1_sendreq.o: relocation R_X86_64_PC32 against  
`mca_pml_ob1_rndv_completion' can not be used when making a shared  
object; recompile with -fPIC



The build log shows that your clock isn't set properly, so I'd first fix
that and do a complete rebuild afterwards.  

I fixed the clock and the problem still was there.

The log also shows that
.libs/pml_ob1_sendreq.o is compiled with -fPIC, so second I'd assume a
compiler or binutils bug.  The GCC bugzilla lists
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30153
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=28781
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21382
as possible starting points for further investigation.  Maybe your
distributor has fixed or newer binutils packages for you.
  

I will check for SLES 10 updates.

Cheers,
Ralf

  




[OMPI devel] r18551 - breaks ob1 compilation on SLES10

2008-06-02 Thread Pavel Shamis (Pasha)

r18551 breaks ompi compilation on SLES10 gcc 4.1.0.

I got the following error on my systems 
(http://www.open-mpi.org/mtt/index.php?do_redir=672):
make[2]: Entering directory 
`/.autodirect/hpc/work/pasha/tmp/mtt-8/installs/5VHm/src/openmpi-1.3a1r18553/ompi/mca/pml/ob1'
/bin/sh ../../../../libtool --tag=CC   --mode=link gcc  -g -pipe -Wall 
-Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes 
-Wstrict-prototypes -Wcomment -pedantic 
-Werror-implicit-function-declaration -finline-functions 
-fno-strict-aliasing -pthread -fvisibility=hidden -module -avoid-version 
-export-dynamic   -o mca_pml_ob1.la -rpath 
/.autodirect/hpc/work/pasha/tmp/mtt-8/installs/5VHm/install/lib/openmpi 
pml_ob1.lo pml_ob1_comm.lo pml_ob1_component.lo pml_ob1_iprobe.lo 
pml_ob1_irecv.lo pml_ob1_isend.lo pml_ob1_progress.lo pml_ob1_rdma.lo 
pml_ob1_rdmafrag.lo pml_ob1_recvfrag.lo pml_ob1_recvreq.lo 
pml_ob1_sendreq.lo pml_ob1_start.lo  -lnsl -lutil  -lm
libtool: link: gcc -shared  .libs/pml_ob1.o .libs/pml_ob1_comm.o 
.libs/pml_ob1_component.o .libs/pml_ob1_iprobe.o .libs/pml_ob1_irecv.o 
.libs/pml_ob1_isend.o .libs/pml_ob1_progress.o .libs/pml_ob1_rdma.o 
.libs/pml_ob1_rdmafrag.o .libs/pml_ob1_recvfrag.o 
.libs/pml_ob1_recvreq.o .libs/pml_ob1_sendreq.o .libs/pml_ob1_start.o   
-lnsl -lutil -lm  -pthread   -pthread -Wl,-soname -Wl,mca_pml_ob1.so -o 
.libs/mca_pml_ob1.so
/usr/lib64/gcc/x86_64-suse-linux/4.1.0/../../../../x86_64-suse-linux/bin/ld: 
.libs/pml_ob1_sendreq.o: relocation R_X86_64_PC32 against 
`mca_pml_ob1_rndv_completion' can not be used when making a shared 
object; recompile with -fPIC
/usr/lib64/gcc/x86_64-suse-linux/4.1.0/../../../../x86_64-suse-linux/bin/ld: 
final link failed: Bad value

collect2: ld returned 1 exit status

Removing inline from some of the functions (see attached file) resolves 
the problem.


Thanks,
Pasha

--- pml_ob1_sendreq.c   2008-06-01 10:59:51.094063000 +0300
+++ pml_ob1_sendreq.c.new   2008-06-02 15:07:02.612983000 +0300
@@ -192,7 +192,7 @@
 MCA_PML_OB1_PROGRESS_PENDING(bml_btl);
 }

-static inline void
+static void
 mca_pml_ob1_match_completion_free( struct mca_btl_base_module_t* btl,  
struct mca_btl_base_endpoint_t* ep,
struct mca_btl_base_descriptor_t* des,
@@ -235,7 +235,7 @@
  *  Completion of the first fragment of a long message that 
  *  requires an acknowledgement
  */
-static inline void
+static void
 mca_pml_ob1_rndv_completion( mca_btl_base_module_t* btl,
  struct mca_btl_base_endpoint_t* ep,
  struct mca_btl_base_descriptor_t* des,
@@ -269,7 +269,7 @@
  * Completion of a get request.
  */

-static inline void
+static void
 mca_pml_ob1_rget_completion( mca_btl_base_module_t* btl,
  struct mca_btl_base_endpoint_t* ep,
  struct mca_btl_base_descriptor_t* des,
@@ -295,7 +295,7 @@
  * Completion of a control message - return resources.
  */

-static inline void
+static void
 mca_pml_ob1_send_ctl_completion( mca_btl_base_module_t* btl,
  struct mca_btl_base_endpoint_t* ep,
  struct mca_btl_base_descriptor_t* des,
@@ -312,7 +312,7 @@
  * to schedule additional fragments.
  */

-static inline void
+static void
 mca_pml_ob1_frag_completion( mca_btl_base_module_t* btl,
  struct mca_btl_base_endpoint_t* ep,
  struct mca_btl_base_descriptor_t* des,


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-29 Thread Pavel Shamis (Pasha)


I got some more feedback from Roland off-list explaining that if /sys/ 
class/infiniband does exist and is non-empty and /sys/class/ 
infiniband_verbs/abi_version does not exist, then this is definitely a  
case where we want to warn because it implies that config is screwed  
up -- RDMA devices are present but not usable.
  
Is it possible that the /sys/class/infiniband directory exists but is 
empty? In which cases?


Re: [OMPI devel] New HCA vendor part ID

2008-05-26 Thread Pavel Shamis (Pasha)


it seems that Open MPI doen't have HCA parameters for the following 
InfiniBand adapter

which using the compute nodes of our SGI Altix ICE 8200EX:

Mellanox Technologies MT26418 ConnectX

I assume that the vendor part ID 26418 should be listed in the section 
"Mellanox Hermon"

of the HCA parameter file. Is that correct?
Thanks for pointing it out. I fixed it on the trunk: 
https://svn.open-mpi.org/trac/ompi/changeset/18495

Regards,
Pasha



Matthias

--
Matthias Jurenz,
Center for Information Services and
High Performance Computing (ZIH), TU Dresden,
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773



___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Memory hooks stuff

2008-05-25 Thread Pavel Shamis (Pasha)

Jeff Squyres wrote:
Who would be interested in discussing this stuff?  (me, Brian, ? 
someone from Sun?, ...?)
  

Me.



Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-22 Thread Pavel Shamis (Pasha)




1. The driver doesn't support the HCA - if I remember correctly, RH40 by 
default doesn't support the ConnectX HCA. The device list will be empty. 
It is a very exotic case.

2. The driver version doesn't correspond with the FW version.
3. The FW was broken.
4. The driver was broken and failed to start - this is not a very exotic 
case. Sometimes users make modifications (upgrade/install/etc.) and 
that breaks the driver.


 In such cases, the ibv_devinfo(1) and ibv_devices(1) commands would 
show the same error.

Yep these utilities will show the same error.

Cases 1-3 we can cover pretty simply. The OpenIB driver creates 
"/dev/infiniband" during its startup. So if /dev/infiniband exists 
and ibv_get_device_list() is empty, we may print a warning.


Ok, that seems reasonable.


I don't know how we can cover case 4 :-(


If the user makes modifications to the driver and breaks it, I don't 
think we can be held responsible for that -- prudence declares that 
you should verify that your [self-modified] driver is not broken first 
before blaming Open MPI.  I'm not that concerned about #4; most of my 
customers do not modify the drivers.

Agree about #4.

The check for /dev/infiniband should be simple and I think we can add it 
to 1.3 .


Pasha.


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-22 Thread Pavel Shamis (Pasha)





I'm not sure I follow this logic -- can you explain more?

Sure


Why does this only apply to binary distribution?  If libibverbs is 
installed by default, then OMPI will still build the openib BTL (and 
therefore warn if it's not used).  Granted, some distros will only 
install libibverbs if either explicitly or implicitly requested (e.g., 
via dependency).  What if some other dependency pulls in libibverbs, 
even if OMPI was built from a source tarball?
My point is that it is not correct to say that libibverbs is installed by 
default on all Linuxes and that all users who install OMPI will see 
this problem.
It is possible that libibverbs may be installed implicitly as a 
dependency package. But usually (not always) it will be installed as 
part of some native IB (openib only!) application.


If a user decides to upgrade his OMPI + libibverbs rpm/deb package 
install, he will need to do a lot of other "annoying" steps, like 
downloading the source code, installing all required *-dev.rpm packages, 
and compiling.
And I guess that disabling the default warning messages will be the 
simplest step along the way :-)


I don't want to say that the current solution is the best one.
But I would like to find something better than disabling the warning by 
default only for openib.




Let me ask another question: is it common to have the verbs stack / 
hardware so hosed up that ibv_get_device_list() returns an empty list 
when there really is a device there?  My assumption is that this is 
quite uncommon; that ibv_get_device_list() will usually return that 
there *are* devices and errors show up later during initialization, 
etc.  Never say "never", of course; I'm sure that there are degenerate 
corner cases where a badly hosed device will cause 
ibv_get_device_list() to return an empty list -- but I'm assuming that 
those cases are very few and far between.

I cannot say that it is a very uncommon case.
For example:
1. The driver doesn't support the HCA - if I remember correctly, RH40 by 
default doesn't support the ConnectX HCA. The device list will be empty. 
It is a very exotic case.

2. The driver version doesn't correspond with the FW version.
3. The FW was broken.
4. The driver was broken and failed to start - this is not a very exotic 
case. Sometimes users make modifications (upgrade/install/etc.) and 
that breaks the driver.


  In such cases, the ibv_devinfo(1) and ibv_devices(1) commands would 
show the same error.

Yep, these utilities will show the same error.

Cases 1-3 we can cover pretty simply. The OpenIB driver creates 
"/dev/infiniband" during its startup. So if /dev/infiniband exists and 
ibv_get_device_list() is empty, we may print a warning.
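
Something along these lines is what I have in mind (an untested sketch; 
the only assumptions beyond the discussion above are the libibverbs calls 
ibv_get_device_list()/ibv_free_device_list() and a stat() on 
/dev/infiniband):

#include <stdio.h>
#include <sys/stat.h>
#include <infiniband/verbs.h>

/* Untested sketch: warn only when the kernel driver appears loaded
 * (/dev/infiniband exists) but libibverbs reports no usable devices. */
static void check_openib_devices(void)
{
    struct stat st;
    int num_devs = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devs);

    if (0 == num_devs && 0 == stat("/dev/infiniband", &st)) {
        fprintf(stderr, "WARNING: /dev/infiniband exists but "
                "ibv_get_device_list() returned no devices; "
                "the OpenFabrics stack may be misconfigured\n");
    }
    if (NULL != devs) {
        ibv_free_device_list(devs);
    }
}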

I don't know how we can cover case 4 :-(

BTW, I think this problem is relevant for all BTLs, not only openib, 
and maybe we need to look for some global solution.


Pasha.


Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-22 Thread Pavel Shamis (Pasha)


Brian and I chatted a bit about this off-list, and I think we're in  
agreement now:


- do not change the default value or meaning of  
btl_base_warn_component_unused.


- major point of confusion: the openib BTL is actually fairly unique  
in that it can (and does) tell the difference between "there are no  
devices present" and "there are devices, but something went wrong".   
Other BTL's have network interfaces that can't tell the difference and  
can *only* call the no_nics function, regardless of whether there are  
no relevant network interfaces or some error occurred during  
initialization.


- so a reasonable solution would be an openib-BTL-specific mechanism  
that doesn't call the no_nics function (to display that  
btl_base_warn_component_unused) if there are no verbs-capable devices  
found because of the fact that mainline Linuxes are starting to ship  
libibverbs.  Specific mechanism TBD; likely to be an openib MCA param.
  
OK, we will have our own warning mechanism. But there is still an open 
question: will we show (by default) an error message in the case 
when libibverbs exists but there is no HCA in the HCA list?
I think we should show the error. The problem of the default libibverbs 
install is relevant only for binary distributions that install all OMPI 
dependencies with the OMPI package. In that case the distribution will 
have an openib MCA parameter that allows it to disable the warning 
message by default during the OMPI package install (or build).
I guess that most people still install OMPI from sources. And in this 
case it sounds reasonable to me to print this "no HCA" warning if the 
openib BTL was built.

Pasha



On May 21, 2008, at 9:56 PM, Jeff Squyres wrote:

  

On May 21, 2008, at 5:02 PM, Brian W. Barrett wrote:



If this is true (for some reason I thought it wasn't), then I think
we'd
actually be ok with your proposal, but you're right, you'd need
something
new in the IB btl.  I'm not concerned about the dual rail issue -- if
you're smart enough to configure dual rail IB, you're smart enough to
figure out OMPI mca params.  I'm not sure the same is true for a
simple
delivered from the white box vendor IB setup that barely works on a
good
day (and unfortunately, there seems to be evidence that these exist).
  

I'm not sure I understand what you're saying -- you agree, but what
"new" do you think we need in the openib BTL?  The MCA params saying
which ports you expect to be ACTIVE?

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




  




Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Pavel Shamis (Pasha)
As far as I know, only the OpenIB kernel drivers are installed by default 
with the distribution.
The user-level pieces - libibverbs and other OpenIB stuff - are not 
installed by default. The user needs to go to the package manager and 
explicitly select libibverbs. So if a user decided to install libibverbs 
he had reasons for it, and I think it is OK to show this warning.


Pasha.

Jeff Squyres wrote:

On May 21, 2008, at 11:14 AM, Brian W. Barrett wrote:

  
I think having a parameter to turn off the warning is a great idea.   
So
great in fact, that it already exists in the trunk and v1.2 :)!   
Setting
the default value for the btl_base_warn_component_unused flag from 0  
to 1

will have the desired effect.



Ah, ok.  I either didn't know about this flag or forgot about it.  :-)

I just tested this myself and see that there are actually *two* error  
messages (on a machine where I installed libibverbs, but with no  
OpenFabrics hardware, with OMPI 1.2.6):


% mpirun -np 1 hello
libibverbs: Fatal: couldn't read uverbs ABI version.
--
[0,1,0]: OpenIB on host eddie.osl.iu.edu was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--

So the MCA param takes care of the OMPI message; I'll contact the  
libibverbs authors about their message.


  
I'm not sure I agree with setting the default to 0, however.  The  
warning
has proven extremely useful for diagnosing that IB (or less often GM  
or
MX) isn't properly configured on a compute node due to some random  
error.

It's trivially easy for any packaging group to have the line

  btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install  
phase of

the package build (indeed, our simple build scripts at LANL used to do
this on a regular bases due to our need to tweek the OOB to keep IPoIB
happier at scale).

I think keeping the Debian guys happy is a good thing.  Giving them an
easy way to turn off silly warnings is a good thing.  Removing a known
useful warning to help them doesn't seem like a good thing.



I guess that this is what I am torn about.  Yes, it's a useful message  
-- in some cases.  But now that libibverbs is shipping in Debain and  
other Linuxes, the number of machines out there with verbs-capable  
hardware is far, far smaller than the number of machines without verbs- 
capable hardware.  Specifically:


1. The number of cases where seeing the message by default is *not*  
useful is now potentially [much] larger than the number of cases where  
the default message is useful.


2. An out-of-the-box "mpirun a.out" will print warning messages in  
perfectly valid/good configurations (no verbs-capable hardware, but  
just happen to have libibverbs installed).  This is a Big Deal.


3. Problems with HCA hardware and/or verbs stack are uncommon  
(nowadays).  I'd be ok asking someone to enable a debug flag to get  
more information on configuration problems or hardware faults.


Shouldn't we be optimizing for the common case?

In short: I think it's no longer safe to assume that machines with  
libibverbs installed must also have verbs-capable hardware.


  




Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-21 Thread Pavel Shamis (Pasha)
I agree with Brian. We may add to the warning message a detailed 
description of how to disable it.
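
For example, the warning could end with a hint along these lines (the 
wording is just an illustration; the parameter itself already exists, as 
Brian notes below):

  This warning can be disabled by adding the following line to
  $prefix/etc/openmpi-mca-params.conf (or by passing it via --mca):

      btl_base_warn_component_unused = 0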


Pasha

Brian W. Barrett wrote:
I think having a parameter to turn off the warning is a great idea.  So 
great in fact, that it already exists in the trunk and v1.2 :)!  Setting 
the default value for the btl_base_warn_component_unused flag from 0 to 1 
will have the desired effect.


I'm not sure I agree with setting the default to 0, however.  The warning 
has proven extremely useful for diagnosing that IB (or less often GM or 
MX) isn't properly configured on a compute node due to some random error. 
It's trivially easy for any packaging group to have the line


   btl_base_warn_component_unused = 0

added to $prefix/etc/openmpi-mca-params.conf during the install phase of 
the package build (indeed, our simple build scripts at LANL used to do 
this on a regular bases due to our need to tweek the OOB to keep IPoIB 
happier at scale).


I think keeping the Debian guys happy is a good thing.  Giving them an 
easy way to turn off silly warnings is a good thing.  Removing a known 
useful warning to help them doesn't seem like a good thing.



Brian


On Wed, 21 May 2008, Jeff Squyres wrote:

  

What: Change default in openib BTL to not complain if no OpenFabrics
devices are found

Why: Many linuxes are shipping libibverbs these days, but most users
still don't have OpenFabrics hardware

Where: btl_openib_component.c

When: For v1.3

Timeout: Teleconf, 27 May 2008

Short version
=

Many major linuxes are shipping libibverbs by default these days.
OMPI will therefore build the openib BTL by default, but then
complains at run time when there's no OpenFabrics hardware.

We should change the default in v1.3 to not complain if there is no
OpenFabrics devices found (perhaps have an MCA param to enable the
warning if desired).

Longer version
==

I just got a request from the Debian Open MPI package maintainers to
include the following in the default openmpi-mca-params.conf for the
OMPI v1.2 package:

# Disable the use of InfiniBand
#   btl = ^openib

Having this in the openmpi-mca-params.conf gives Debian an easy
documentation path for users to shut up these warnings when they build
on machines with libibverbs present but no OpenFabrics hardware.

I think that this is fine for the v1.2 series (and will file a CMR for
it).  But for v1.3, I think we should change the default.

The vast majority of users will not have OpenFabrics devices, and we
should therefore not complain if we can't find any at run-time.  We
can/should still complain if we find OpenFabrics devices but no active
ports (i.e., don't change this behavior).

But for optimizing the common case: I think we should (by default) not
print a warning if no OpenFabrics devices are found.  We can also
[easily] have an MCA parameter that *will* display a warning if no
OpenFabrics devices are found.




___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] Threaded progress for CPCs

2008-05-20 Thread Pavel Shamis (Pasha)


Is it possible to have sane SRQ implementation without HW flow  
control?



It seems pretty unlikely if the only available HW flow control is to  
terminate the connection.  ;-)


  

Even if we can get the iWARP semantics to work, this feels kinda
icky.  Perhaps I'm overreacting and this isn't a problem that needs  
to

be fixed -- after all, this situation is no different than what
happens after the initial connection, but it still feels icky.
  
What is so icky about it? Sender is faster than a receiver so flow  
control

kicks in.



My point is that we have no real flow control for SRQ.

  

2. The CM progress thread posts its own receive buffers when creating
a QP (which is a necessary step in both CMs).  However, this is
problematic in two cases:

  

[skip]

I don't like 1,2 and 3. :(



4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.
  

This is slightly better, but still...



Agreed; my reactions were pretty much the same as yours.

  

5. ...?
  

What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need  
to
prepost one buffer on the active side and two buffers on the passive  
side.




This is probably the most viable alternative -- it would be easiest if  
we did this for all CPC's, not just for IBCM:


- for PPRQ: CPCs only post a small number of receive buffers, suitable  
for another handshake that will run in the upper-level openib BTL
- for SRQ: CPCs don't post anything (because the SRQ already "belongs"  
to the upper level openib BTL)
  
Currently iWARP does not have SRQ at all, and IMHO SRQ is not 
possible without HW flow control.

So let's resolve the problem only for PPRQ?

Do we have a BSRQ restriction that there *must* be at least one PPRQ?   
  

No, there is no such restriction.
If so, we could always run the upper-level openib BTL really-post-the- 
buffers handshake over the smallest buffer size BSRQ RC PPRQ (i.e.,  
have the CPC post a single receive on this QP -- see below), which  
would make things much easier.  If we don't already have this  
restriction, would we mind adding it?  We have one PPRQ in our default  
receive_queues value, anyway.
  

I don't see a reason to add such a restriction, at least for IB.
We may add it for iWARP only (actually we already have it for iWARP).
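
For reference, a receive_queues value whose first entry is a per-peer RQ, 
as Jeff mentions above, would look roughly like the line below; the P/S 
prefixes follow the btl_openib_receive_queues syntax and the numbers are 
purely illustrative:

mpirun --mca btl_openib_receive_queues P,128,256:S,65536,256 ...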



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Pavel Shamis (Pasha)




What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()).   
It still doesn't guaranty that we will not see RNR (as I understand 
we trying to resolve this problem  for iwarp?!)




I don't think that iwarp has SRQ at all. And if it has then it should
have HW flow control for it too. I don't see what advantage SRQ without
flow control can provide over PPRQ.   

I agree that without HW flow control there is no reason for SRQ.
 
So this solution will cost 1 buffer on each srq ... sounds 
acceptable for me. But I don't see too much
difference compared to #1, as I understand  we anyway will be need 
the pipe for communication with main thread.

so why don't use #1 ?


What communication? No communication at all. Just don't prepost buffers
to SRQ during connection establishment. Problem solved (only for SRQ of
cause).  
As far as I know, Jeff uses the pipe for some status updates (Jeff, 
please correct me if I am wrong).

If we still need the pipe for communication, I prefer #1.
If we don't have the pipe, I prefer your solution.

Pasha



Re: [OMPI devel] Threaded progress for CPCs

2008-05-19 Thread Pavel Shamis (Pasha)


1. When CM progress thread completes an incoming connection, it sends  
a command down a pipe to the main thread indicating that a new  
endpoint is ready to use.  The pipe message will be noticed by  
opal_progress() in the main thread and will run a function to do all  
necessary housekeeping (sets the endpoint state to CONNECTED, etc.).   
But it is possible that the receiver process won't dip into the MPI  
layer for a long time (and therefore not call opal_progress and the  
housekeeping function).  Therefore, it is possible that with an active  
sender and a slow receiver, the sender can overwhelm an SRQ.  On IB,  
this will just generate RNRs and be ok (we configure SRQs to have  
infinite RNRs), but I don't understand the semantics of what will  
happen on iWARP (it may terminate?  I sent an off-list question to  
Steve Wise to ask for detail -- we may have other issues with SRQ on  
iWARP if this is the case, but let's skip that discussion for now).




Is it possible to have sane SRQ implementation without HW flow control?
Anyway the described problem exists with SRQ right now too. If receiver
doesn't enter progress for a long time sender can overwhelm an SRQ.
I don't see how this can be fixed without progress thread (and I am not
even sure that this is the problem that has to be fixed).
  
It may be partially resolved by srq_limit_event (this event is 
generated when the number of posted receive buffers drops below a 
predefined watermark).
But I'm not sure that we want to move the RNR problem from the sender 
side to the receiver.


The full solution would be a progress thread + srq_limit_event.

  
Even if we can get the iWARP semantics to work, this feels kinda  
icky.  Perhaps I'm overreacting and this isn't a problem that needs to  
be fixed -- after all, this situation is no different than what  
happens after the initial connection, but it still feels icky.


What is so icky about it? Sender is faster than a receiver so flow control
kicks in.

  
2. The CM progress thread posts its own receive buffers when creating  
a QP (which is a necessary step in both CMs).  However, this is  
problematic in two cases:




[skip]
 
I don't like 1,2 and 3. :(
  

If iWARP can handle RNR, #1 sounds OK to me, at least for 1.3.
  

4. Have a separate mpool for drawing initial receive buffers for the
CM-posted RQs.  We'd probably want this mpool to be always empty (or
close to empty) -- it's ok to be slow to allocate / register more
memory when a new connection request arrives.  The memory obtained
from this mpool should be able to be returned to the "main" mpool
after it is consumed.



This is slightly better, but still...

  

5. ...?


What about moving posting of receive buffers into main thread. With
SRQ it is easy: don't post anything in CPC thread. Main thread will
prepost buffers automatically after first fragment received on the
endpoint (in btl_openib_handle_incoming()). 
It still doesn't guarantee that we will not see RNR (as I understand it, 
we are trying to resolve this problem for iWARP?!).


So this solution will cost 1 buffer on each SRQ... sounds acceptable 
to me. But I don't see much difference compared to #1; as I understand 
it, we will need the pipe for communication with the main thread anyway.

So why not use #1?

With PPRQ it's more
complicated. What if we'll prepost dummy buffers (not from free list)
during IBCM connection stage and will run another three way handshake
protocol using those buffers, but from the main thread. We will need to
prepost one buffer on the active side and two buffers on the passive side.

--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  




Re: [OMPI devel] heterogeneous OpenFabrics adapters

2008-05-13 Thread Pavel Shamis (Pasha)

Jeff,
Your proposal for 1.3 sounds OK to me.

For 1.4 we need to review this point again. The QP information is split 
across 3 different structs:
mca_btl_openub_module_qp_t (used by the module), mca_btl_openib_qp_t (used 
by the component) and mca_btl_openib_endpoint_qp_t (used by the endpoint).
We need to see how we will resolve the issue for each of them. Let's put 
it on the 1.4 todo list.


Pasha.
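
P.S. For reference, the INI-file idea Jeff describes below would 
presumably end up looking something like the snippet here; the key names 
follow the existing vendor/part-ID entries in the HCA parameter file, 
while the section name, IDs and receive_queues value are made up purely 
for illustration:

[Some iWARP HCA]
vendor_id = 0x1234
vendor_part_id = 5678
receive_queues = P,65536,256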


Jeff Squyres wrote:

Short version:
--

I propose that we should disallow multiple different  
mca_btl_openib_receive_queues values (or receive_queues values from  
the INI file) to be used in a single MPI job for the v1.3 series.


More details:
-

The reason I'm looking into this heterogeneity stuff is to help  
Chelsio support their iWARP NIC in OMPI.  Their NIC needs a specific  
value for mca_btl_openib_receive_queues (specifically: it does not  
support SRQ and it has the wireup race condition that we discussed  
before).


The major problem is that all the BSRQ information is currently stored  
in on the openib component -- it is *not* maintained on a per-HCA (or  
per port) basis.  We *could* move all the BSRQ info to live on the  
hca_t struct (or even the openib module struct), but it has at least 3  
big consequences:


1. It would touch a lot of code.  But touching all this code is  
relatively low risk; it will be easy to check for correctness because  
the changes will either compile or not.


2. There are functions (some of which are static inline) that read the  
BSRQ data.  These functions would have to take an additional (hca_t*)  
(or (btl_openib_module_t*)) parameter.


3. Getting to the BSRQ info will take at least 1 or 2 more  
dereferences (e.g., module->hca->bsrq_info or module->bsrq_info...).


I'm not too concerned about #1 (it's grunt work), but I am a bit  
concerned about #2 and #3 since at least some of these places are in  
the critical performance path.


Given these concerns, I propose the following v1.3:

- Add a "receive_queues" field to the INI file so that the Chelsio  
adapter can run out of the box (i.e., "mpirun -np 4 a.out" with hosts  
containing Chelsio NICs will get a value for btl_openib_receive_queues  
that will work).


- NetEffect NICs will also require overriding  
btl_openib_receive_queues, but will likely have a different value than  
Chelsio NICs (they don't have the wireup race condition that Chelsio  
does).


- Because the BSRQ info is on the component (i.e., global), we should  
detect when multiple different receive_queues values are specified and  
gracefully abort.


I think it'll be quite uncommon to have a need for two different  
receive_queues values, and that this proposal will be fine for v1.3


Comments?



On May 12, 2008, at 6:44 PM, Jeff Squyres wrote:

  

After looking at the code a bit, I realized that I completely forgot
that the INI file was invented to solve at least the heterogeneous-
adapters-in-a-host problem.

So I amended the ticket to reflect that that problem is already
solved.  The other part is not, though -- consider two MPI procs on
different hosts, each with an iWARP NIC, but one NIC supports SRQs and
one does not.


On May 12, 2008, at 5:36 PM, Jeff Squyres wrote:



I think that this issue has come up before, but I filed a ticket
about it because at least one developer (Jon) has a system with both
IB and iWARP adapters:

  https://svn.open-mpi.org/trac/ompi/ticket/1282

My question: do we care about the heterogeneous adapter scenarios?
For v1.3?  For v1.4?  For ...some version in the future?

I think the first issue I identified in the ticket is grunt work to
fix (annoying and tedious, but not difficult), but the second one
will be a little dicey -- it has scalability issues (e.g., sending
around all info in the modex, etc.).

--
Jeff Squyres
Cisco Systems

  

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel




  



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Merging in the CPC work

2008-04-24 Thread Pavel Shamis (Pasha)

The trivial tests pass and now I'm running full testing.
In the non-XRC tests I got:

libibcm: unable to open /dev/infiniband/ucm0
libibcm: couldn't read ABI version

But the tests PASS successfully. So as I understand it, they use OOB. Can 
we prevent the message somehow?


Jeff Squyres wrote:
Thanks!  That's a result of some [helpful] error messages and handling 
that I added yesterday when ibcm is not configured on the host.


Fixed in r18273.


On Apr 24, 2008, at 8:22 AM, Pavel Shamis (Pasha) wrote:


The patch below resolves the segfault:

--- btl_openib_connect_ibcm.c.orig  2008-04-24 15:14:28.500676000 +0300
+++ btl_openib_connect_ibcm.c   2008-04-24 15:15:08.961168000 +0300
@@ -328,7 +328,7 @@
{
   int rc;
   modex_msg_t *msg;
-ibcm_module_t *m;
+ibcm_module_t *m = NULL;
   opal_list_item_t *item;
   ibcm_listen_cm_id_t *cmh;
   ibcm_module_list_item_t *imli;


Jeff Squyres wrote:
I had a linker error with the rdmacm library yesterday that I fixed 
later, sorry.


Could you try it again?  You'll need to svn up, re-autogen, etc.  It 
should be obvious whether I fixed it -- even trivial apps will work 
or not work.


Thanks.


On Apr 24, 2008, at 6:24 AM, Gleb Natapov wrote:


On Thu, Apr 24, 2008 at 11:50:10AM +0300, Pavel Shamis (Pasha) wrote:

Jeff,
All my tests fail.
XRC disabled tests failed with:
mtt/installs/Zq_9/install/lib/openmpi/mca_btl_openib.so: undefined
symbol: rdma_create_event_channel
XRC enabled failed with segfault , I will take a look later today.

Well it is a little bit better for me. I compiled only OOB connection
manager and ompi passes simple testing.



Pasha

Jeff Squyres wrote:

As we discussed yesterday, I have started the merge from the /tmp-
public/openib-cpc2 branch.  "oob" is currently the default.

Unfortunately, it caused quite a few conflicts when I merged with 
the
trunk, so I created a new temp branch and put all the work there: 
/tmp-

public/openib-cpc3.

Could all the IB and iWARP vendors and any other interested parties
please try this branch before we bring it back to the trunk?  Please
test all functionality that you care about -- XRC, etc.  I'd like to
bring it back to the trunk COB Thursday.  Please let me know if this
is too soon.

You can force the selection of a different CPC with the
btl_openib_cpc_include MCA param:

   mpirun --mca btl_openib_cpc_include oob ...
   mpirun --mca btl_openib_cpc_include xoob ...
   mpirun --mca btl_openib_cpc_include rdma_cm ...
   mpirun --mca btl_openib_cpc_include ibcm ...

You might want to concentrate on testing oob and xoob to ensure that
we didn't cause any regressions.  The ibcm and rdma_cm CPCs probably
still have some rough edges (and the IBCM package in OFED itself may
not be 100% -- that's one of the things we're evaluating.  It's 
known

to not install properly on RHEL4U4, for example -- you have to
manually mknod and chmod a device in /dev/infiniband for every 
HCA in

the host).

Thanks.





--
Pavel Shamis (Pasha)
Mellanox Technologies

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
   Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






--
Pavel Shamis (Pasha)
Mellanox Technologies







--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Merging in the CPC work

2008-04-24 Thread Pavel Shamis (Pasha)

The patch below resolves the segfault:

--- btl_openib_connect_ibcm.c.orig  2008-04-24 15:14:28.500676000 +0300
+++ btl_openib_connect_ibcm.c   2008-04-24 15:15:08.961168000 +0300
@@ -328,7 +328,7 @@
{
int rc;
modex_msg_t *msg;
-ibcm_module_t *m;
+ibcm_module_t *m = NULL;
opal_list_item_t *item;
ibcm_listen_cm_id_t *cmh;
ibcm_module_list_item_t *imli;
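
A minimal standalone sketch of why this one-line change matters (hypothetical
names, not the actual ibcm CPC code): if an error path runs before the pointer
is ever assigned, the cleanup code frees whatever garbage happens to be on the
stack unless the pointer starts out as NULL.

#include <stdlib.h>

struct module { int dummy; };

static int create_module(int fail_early, struct module **out)
{
    struct module *m = NULL;    /* without "= NULL" this holds garbage */

    if (fail_early) {
        goto error;             /* taken before m is ever assigned */
    }
    m = malloc(sizeof(*m));
    if (NULL == m) {
        goto error;
    }
    *out = m;
    return 0;

error:
    free(m);                    /* safe only because m was initialized */
    return -1;
}

int main(void)
{
    struct module *mod;
    return (-1 == create_module(1, &mod)) ? 0 : 1;
}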


Jeff Squyres wrote:
I had a linker error with the rdmacm library yesterday that I fixed 
later, sorry.


Could you try it again?  You'll need to svn up, re-autogen, etc.  It 
should be obvious whether I fixed it -- even trivial apps will work or 
not work.


Thanks.


On Apr 24, 2008, at 6:24 AM, Gleb Natapov wrote:


On Thu, Apr 24, 2008 at 11:50:10AM +0300, Pavel Shamis (Pasha) wrote:

Jeff,
All my tests fail.
XRC disabled tests failed with:
mtt/installs/Zq_9/install/lib/openmpi/mca_btl_openib.so: undefined
symbol: rdma_create_event_channel
The XRC-enabled tests failed with a segfault; I will take a look later today.

Well, it is a little bit better for me. I compiled only the OOB connection
manager and OMPI passes simple testing.



Pasha

Jeff Squyres wrote:

As we discussed yesterday, I have started the merge from the /tmp-
public/openib-cpc2 branch.  "oob" is currently the default.

Unfortunately, it caused quite a few conflicts when I merged with the
trunk, so I created a new temp branch and put all the work there:
/tmp-public/openib-cpc3.

Could all the IB and iWARP vendors and any other interested parties
please try this branch before we bring it back to the trunk?  Please
test all functionality that you care about -- XRC, etc.  I'd like to
bring it back to the trunk COB Thursday.  Please let me know if this
is too soon.

You can force the selection of a different CPC with the
btl_openib_cpc_include MCA param:

mpirun --mca btl_openib_cpc_include oob ...
mpirun --mca btl_openib_cpc_include xoob ...
mpirun --mca btl_openib_cpc_include rdma_cm ...
mpirun --mca btl_openib_cpc_include ibcm ...

You might want to concentrate on testing oob and xoob to ensure that
we didn't cause any regressions.  The ibcm and rdma_cm CPCs probably
still have some rough edges (and the IBCM package in OFED itself may
not be 100% -- that's one of the things we're evaluating.  It's known
to not install properly on RHEL4U4, for example -- you have to
manually mknod and chmod a device in /dev/infiniband for every HCA in
the host).

Thanks.





--
Pavel Shamis (Pasha)
Mellanox Technologies

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Gleb.
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Merging in the CPC work

2008-04-24 Thread Pavel Shamis (Pasha)

Jeff,
All my tests fail.
XRC disabled tests failed with:
mtt/installs/Zq_9/install/lib/openmpi/mca_btl_openib.so: undefined 
symbol: rdma_create_event_channel

The XRC-enabled tests failed with a segfault; I will take a look later today.

Pasha

Jeff Squyres wrote:
As we discussed yesterday, I have started the merge from the /tmp- 
public/openib-cpc2 branch.  "oob" is currently the default.


Unfortunately, it caused quite a few conflicts when I merged with the  
trunk, so I created a new temp branch and put all the work there: /tmp- 
public/openib-cpc3.


Could all the IB and iWARP vendors and any other interested parties  
please try this branch before we bring it back to the trunk?  Please  
test all functionality that you care about -- XRC, etc.  I'd like to  
bring it back to the trunk COB Thursday.  Please let me know if this  
is too soon.


You can force the selection of a different CPC with the  
btl_openib_cpc_include MCA param:


 mpirun --mca btl_openib_cpc_include oob ...
 mpirun --mca btl_openib_cpc_include xoob ...
 mpirun --mca btl_openib_cpc_include rdma_cm ...
 mpirun --mca btl_openib_cpc_include ibcm ...

You might want to concentrate on testing oob and xoob to ensure that  
we didn't cause any regressions.  The ibcm and rdma_cm CPCs probably  
still have some rough edges (and the IBCM package in OFED itself may  
not be 100% -- that's one of the things we're evaluating.  It's known  
to not install properly on RHEL4U4, for example -- you have to  
manually mknod and chmod a device in /dev/infiniband for every HCA in  
the host).


Thanks.

  



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Trunk launch scaling

2008-04-02 Thread Pavel Shamis (Pasha)

Ralph,
If you plan to compare OMPI to MVAPICH, make sure to take version 1.0.0
(or above). In 1.0.0 OSU introduced a new launcher that works much faster
than the previous one.

Regards,
Pasha

Ralph H Castain wrote:

Per this morning's telecon, I have added the latest scaling test results to
the wiki:

https://svn.open-mpi.org/trac/ompi/wiki/ORTEScalabilityTesting

As you will see upon review, the trunk is scaling about an order of
magnitude better than 1.2.x, both in terms of sheer speed and in the
strength of the non-linear components of the scaling law. Those of us
working on scaling issues expect to make additional improvements over the
next few weeks.

Update results will be posted to the wiki as they become available.

Ralph


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Setting CQ depth

2008-02-26 Thread Pavel Shamis (Pasha)

Please apply. It should be cqe.
Pasha.

Jon Mason wrote:

A quick sanity check.

When setting the cq depth in the openib btl, it checks the calculated
depth against the maximum cq depth allowed and sets the minimum of those
two.  However, I think it is checking the wrong variable.  If I
understand correctly, ib_dev_attr.max_cq represents the maximum number
of cqs while ib_dev_attr.max_cqe represents the max depth allowed in
each individual cq.  Is this correct?

If the above is true, then I'll apply the patch below.

Thanks,
Jon

Index: ompi/mca/btl/openib/btl_openib.c
===
--- ompi/mca/btl/openib/btl_openib.c(revision 17472)
+++ ompi/mca/btl/openib/btl_openib.c(working copy)
@@ -140,8 +140,8 @@
  if(cq_size < mca_btl_openib_component.ib_cq_size[cq])
 cq_size = mca_btl_openib_component.ib_cq_size[cq];

-if(cq_size > (uint32_t)hca->ib_dev_attr.max_cq)
-cq_size = hca->ib_dev_attr.max_cq;
+if(cq_size > (uint32_t)hca->ib_dev_attr.max_cqe)
+cq_size = hca->ib_dev_attr.max_cqe;

 if(NULL == hca->ib_cq[cq]) {
 hca->ib_cq[cq] = ibv_create_cq_compat(hca->ib_dev_context, cq_size,
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  



--
Pavel Shamis (Pasha)
Mellanox Technologies
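
A minimal standalone sketch of the distinction discussed above (illustrative
only; it uses plain libibverbs calls rather than the openib BTL code, and the
starting depth of 65536 is an arbitrary example value): max_cq bounds how many
CQs a device supports in total, while max_cqe bounds the depth of a single CQ,
so the requested depth has to be clamped against max_cqe.

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **devs = ibv_get_device_list(&num_devices);
    if (NULL == devs || 0 == num_devices) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_device_attr attr;
    if (NULL == ctx || ibv_query_device(ctx, &attr)) {
        fprintf(stderr, "failed to open/query device\n");
        return 1;
    }

    int cq_size = 65536;                /* whatever depth was computed */
    if (cq_size > attr.max_cqe) {
        cq_size = attr.max_cqe;         /* clamp to per-CQ limit, not max_cq */
    }

    struct ibv_cq *cq = ibv_create_cq(ctx, cq_size, NULL, NULL, 0);
    printf("device allows %d CQs of up to %d entries; created CQ of %d\n",
           attr.max_cq, attr.max_cqe, cq_size);

    if (NULL != cq) ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}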



Re: [OMPI devel] 1.3 Release schedule and contents

2008-02-21 Thread Pavel Shamis (Pasha)

Brad,
The APM code was committed to the trunk.
So you may mark it as done.

Thanks,
Pasha.

Brad Benton wrote:

All:

The latest scrub of the 1.3 release schedule and contents is ready for 
review and comment.  Please use the following links:
  1.3 milestones:  
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
  1.3.1 milestones: 
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1


In order to try and keep the dates for 1.3 in, I've pushed a bunch of 
stuff (particularly ORTE things) to 1.3.1.  Even though there will be 
new functionality slated for 1.3.1, the goal is to not have any 
interface changes between the phases.


Please look over the list and schedules and let me or my fellow 1.3 
co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any issues, errors, 
suggestions, omissions, heartburn, etc.


Thanks,
--Brad

Brad Benton
IBM


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] open ib btl and xrc

2008-01-20 Thread Pavel Shamis (Pasha)


It is easy to see the benefit of fewer QPs (per node instead of per peer)
and less resource consumption, but I am curious about the actual
percentage of memory footprint decrease. I am thinking that the largest
portion of the footprint comes from the fragments.
BTW, here is a link to another paper that talks about more efficient
usage of receive buffers:
http://www.cs.sandia.gov/~rbbrigh/papers/ompi-ib-pvmmpi07.pdf
Do you have any numbers showing the actual memory footprint savings 
when using xrc?

I don't have any.

Pasha.


-DON

Pavel Shamis (Pasha) wrote:
Here is a paper from OpenIB:
http://www.openib.org/archives/nov2007sc/XRC.pdf
and here is an MVAPICH presentation:
http://mvapich.cse.ohio-state.edu/publications/ofa_nov07-mvapich-xrc.pdf

Bottom line: XRC decreases the number of QPs that OMPI opens and as a
result decreases OMPI's memory footprint.
In the OpenIB paper you can find more details about XRC. If you need more
details about the XRC implementation in the openib BTL, please let me know.


Don Kerr wrote:
 

Hi,

After searching, about the only thing I can find on XRC is what it
stands for. Can someone explain the benefits of Open MPI's use of
XRC, maybe point me to a paper, or both?


TIA
-DON

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  



  





--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] open ib btl and xrc

2008-01-17 Thread Pavel Shamis (Pasha)

Here is a paper from OpenIB: http://www.openib.org/archives/nov2007sc/XRC.pdf
and here is an MVAPICH presentation:
http://mvapich.cse.ohio-state.edu/publications/ofa_nov07-mvapich-xrc.pdf

Bottom line: XRC decreases the number of QPs that OMPI opens and as a
result decreases OMPI's memory footprint.
In the OpenIB paper you can find more details about XRC. If you need more
details about the XRC implementation in the openib BTL, please let me know.
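
A simplified back-of-the-envelope model of the QP-count argument (illustrative
numbers only, not measurements; it ignores the multiple QPs per peer that the
openib BTL can be configured to use, and assumes on-node peers go through
shared memory):

#include <stdio.h>

int main(void)
{
    const int nodes = 64;
    const int ppn   = 8;                 /* processes per node */
    const long procs = (long)nodes * ppn;

    /* RC: one QP per remote process; XRC: roughly one send QP per
     * remote node, since receive contexts are shared on each node. */
    long rc_qps_per_proc  = procs - ppn;
    long xrc_qps_per_proc = nodes - 1;

    printf("RC : %ld QPs per process, %ld total\n",
           rc_qps_per_proc, rc_qps_per_proc * procs);
    printf("XRC: %ld QPs per process, %ld total\n",
           xrc_qps_per_proc, xrc_qps_per_proc * procs);
    return 0;
}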


Don Kerr wrote:

Hi,

After searching, about the only thing I can find on XRC is what it
stands for. Can someone explain the benefits of Open MPI's use of XRC,
maybe point me to a paper, or both?


TIA
-DON

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] Open IB BTL development question

2008-01-17 Thread Pavel Shamis (Pasha)

I plan to add IB APM support  (not something specific to OFED)

Don Kerr wrote:
Looking at the list of new features for OFED 1.3 and seeing that support
for XRC went into the trunk, I am curious whether support for additional
OFED 1.3 features will be, or is planned to be, included in Open MPI.

I am looking at the list of features here: 
http://64.233.167.104/search?q=cache:RXXOrY36QHcJ:www.openib.org/archives/nov2007sc/OFED%25201.3%2520status.ppt+ofed+1.3+feature&hl=en&ct=clnk&cd=3&gl=us&client=firefox-a
but I do not have any specific feature in mind; I just wanted to get an
idea of what others are planning.


Thanks
-DON
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

  



--
Pavel Shamis (Pasha)
Mellanox Technologies



Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement

2008-01-14 Thread Pavel Shamis (Pasha)

Jon Mason wrote:
  

I have a few machines with ConnectX and I will try to run MTT on Sunday.



Awesome!  I appreciate it.

  
After fixing the compilation problem in the XRC part of the code I was
able to run MTT. Most of the tests pass, but one test failed:
mpi2c++_dynamics_test. The test passes without XRC, but I also see the
test failing on the trunk. The last version where it works is 1.3a1r17085.

Strange.
Pasha.


Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement

2008-01-10 Thread Pavel Shamis (Pasha)

Jon Mason wrote:

The new cpc selection framework is now in place.  The patch below allows
for dynamic selection of cpc methods based on what is available.  It
also allows for inclusion/exclusion of methods.  It even further allows
for modifying the priorities of certain cpc methods to better determine
the optimal cpc method.

This patch also contains XRC compile time disablement (per Jeff's
patch).
  
We need to make sure that the CM stuff is disabled at compile time if it
is not installed on the machine.

At a high level, the cpc selections works by walking through each cpc
and allowing it to test to see if it is permissible to run on this
mpirun.  It returns a priority if it is permissible or a -1 if not.  All
of the cpc names and priorities are rolled into a string.  This string
is then encapsulated in a message and passed around all the ompi
processes.  Once received and unpacked, the list received is compared
to a local copy of the list.  The connection method is chosen by
comparing the lists passed around to all nodes via modex with the list
generated locally.  Any non-negative number is a potentially valid
connection method.  The method below of determining the optimal
connection method is to take the cross-section of the two lists.  The
highest single value (and the other side being non-negative) is selected
as the cpc method.

Please test it out.  The tree can be found at
https://svn.open-mpi.org/svn/ompi/tmp-public/openib-cpc/

This patch has been tested with IB and iWARP adapters on a 2 node system
(with it correctly choosing to use oob and happily ignoring iWARP
adapters).  It needs XRC testing and testing of larger node systems.
  

Did you run MTT over all these changes?
I have a few machines with ConnectX and I will try to run MTT on Sunday.


--
Pavel Shamis (Pasha)
Mellanox Technologies
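
A minimal sketch of the cross-section selection described in the quoted text
above (hypothetical names and priorities, not the actual openib BTL code):
entries where either side reports a negative priority are discarded, and among
the remaining common methods the one with the highest single priority wins.

#include <stdio.h>
#include <string.h>

struct cpc { const char *name; int priority; };   /* priority < 0: unusable */

static const char *select_cpc(const struct cpc *local, int nl,
                              const struct cpc *remote, int nr)
{
    const char *best = NULL;
    int best_prio = -1;

    for (int i = 0; i < nl; ++i) {
        if (local[i].priority < 0) continue;
        for (int j = 0; j < nr; ++j) {
            if (0 != strcmp(local[i].name, remote[j].name)) continue;
            if (remote[j].priority < 0) continue;
            /* "highest single value" on either side, per the description */
            int prio = local[i].priority > remote[j].priority ?
                       local[i].priority : remote[j].priority;
            if (prio > best_prio) {
                best_prio = prio;
                best = local[i].name;
            }
        }
    }
    return best;
}

int main(void)
{
    struct cpc local[]  = { {"oob", 1}, {"rdma_cm", 30}, {"ibcm", 40} };
    struct cpc remote[] = { {"oob", 1}, {"rdma_cm", 30}, {"ibcm", -1} };

    const char *chosen = select_cpc(local, 3, remote, 3);
    printf("selected cpc: %s\n", chosen ? chosen : "(none)");  /* rdma_cm */
    return 0;
}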


