Re: [OMPI devel] [OMPI bugs] [Open MPI] #3489: Move r27954 to v1.7 branch

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 7:30 PM, George Bosilca  wrote:

> Ralph,
> 
> What if I say it wasn't a "stale" option nobody cares about. You just
> removed one of the critical pieces of the configury, completely
> disabling the work of other people.

Well, I would say that (a) the code that these options enabled doesn't even 
compile any more, and (b) Josh is no longer able to maintain it and agreed with 
removing the options. And since he was the author, I follow his opinions.

> 
> I am absolutely sorry that I didn't make it in the 27 minutes you
> generously provided for comments. Removing it from the trunk and pushing
> it into the 1.7 branch in absolute agreement with yourself, and all that
> in a mere 27 minutes, is an absolute feat (and not your first). For
> some obscure reason I had the feeling we had some level of protection
> (gk, rm, a reasonable amount of time for people to comment), but I
> guess those rules are for weaklings.

No, but since you make no effort to contribute, I generally don't worry about 
waiting for your input. Josh wrote the code, attends the weekly telecons, and 
participates in the decisions far more than you do. Hence, I pay attention to 
his input.


> 
>  George.
> 
> PS: I have so much fun reading a barely 3-week-old thread on our
> mailing list. Absolutely terrific:
> http://www.open-mpi.org/community/lists/devel/2013/01/11901.php.
> 
> 
> 
> On Mon, Jan 28, 2013 at 9:22 PM, Open MPI  wrote:
>> #3489: Move r27954 to v1.7 branch
>> ---+---
>> Reporter:  rhc |   Owner:  rhc
>>Type:  changeset move request  |  Status:  closed
>> Priority:  major   |   Milestone:  Open MPI 1.7
>> Version:  trunk   |  Resolution:  fixed
>> Keywords:  |
>> ---+---
>> Changes (by rhc):
>> 
>> * status:  new => closed
>> * resolution:   => fixed
>> 
>> 
>> Comment:
>> 
>> (In [27957]) Fixes #3489: Move r27954 to v1.7 branch
>> 
>> ---svn-pre-commit-ignore-below---
>> 
>> r27954 [[BR]]
>> Remove stale ft options.
>> 
>> cmr:v1.7
>> 
>> --
>> Ticket URL: 
>> Open MPI 
>> 
>> ___
>> bugs mailing list
>> b...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/bugs
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI bugs] [Open MPI] #3489: Move r27954 to v1.7 branch

2013-01-28 Thread George Bosilca
Ralph,

What if I say it wasn't a "stale" option nobody cares about. You just
removed one of the critical pieces of the configury, completely
disabling the work of other people.

I am absolutely sorry that I didn't make it in the 27 minutes you
generously provided for comments. Removing it from the trunk and pushing
it into the 1.7 branch in absolute agreement with yourself, and all that
in a mere 27 minutes, is an absolute feat (and not your first). For
some obscure reason I had the feeling we had some level of protection
(gk, rm, a reasonable amount of time for people to comment), but I
guess those rules are for weaklings.

  George.

PS: I have so much fun reading a barely 3-week-old thread on our
mailing list. Absolutely terrific:
http://www.open-mpi.org/community/lists/devel/2013/01/11901.php.



On Mon, Jan 28, 2013 at 9:22 PM, Open MPI  wrote:
> #3489: Move r27954 to v1.7 branch
> ---+---
> Reporter:  rhc |   Owner:  rhc
> Type:  changeset move request  |  Status:  closed
> Priority:  major   |   Milestone:  Open MPI 1.7
>  Version:  trunk   |  Resolution:  fixed
> Keywords:  |
> ---+---
> Changes (by rhc):
>
>  * status:  new => closed
>  * resolution:   => fixed
>
>
> Comment:
>
>  (In [27957]) Fixes #3489: Move r27954 to v1.7 branch
>
>  ---svn-pre-commit-ignore-below---
>
>  r27954 [[BR]]
>  Remove stale ft options.
>
>  cmr:v1.7
>
> --
> Ticket URL: 
> Open MPI 
>
> ___
> bugs mailing list
> b...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/bugs


Re: [OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread Paul Hargrove
Ralph and Nathan,

As I said, the results I see fail to match the actual ALPS header locations
on both CLE4 and CLE5 systems at NERSC.
However, the CLE4 system "just works" because the actual location
(/usr/include) gets searched no matter what value configure picks for
$orte_check_alps_dir.  I suspect that this is why you didn't see any errors
on LANL's system.

Regardless of the defaults, there is still an additional issue with
orte_check_alps.m4 that occurs when I give an explicit
with-alps=/opt/cray/alps/default in the platform file, which the following
bit of config.log confirms:

> configure:99227: checking --with-alps value
> configure:99247: result: sanity check ok (/opt/cray/alps/default)
> configure:99329: checking for alps libraries in
> "/opt/cray/alps/default/lib64"
> configure:99334: result: found



However, when trying to configure the ras:alps component, the value of
ras_alps_CPPFLAGS does not contain "-I/opt/cray/alps/default/include" as I
would have expected from reading the relevant .m4 files and the generated
configure script:

> configure:113697: checking for MCA component ras:alps compile mode
> configure:113703: result: static
> configure:113871: checking alps/apInfo.h usability
> configure:113871: gcc -std=gnu99 -c -O3 -DNDEBUG -march=amdfam10
> -finline-functions -fno-strict-aliasing -fexceptions -pthread
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.9a1r27905/opal/mca/hwloc/hwloc151/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.9a1r27905/BUILD-edison-gcc/opal/mca/hwloc/hwloc151/hwloc/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.9a1r27905/opal/mca/event/libevent2019/libevent
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.9a1r27905/opal/mca/event/libevent2019/libevent/include
> -I/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.9a1r27905/BUILD-edison-gcc/opal/mca/event/libevent2019/libevent/include
> -I/opt/cray/pmi/default/include -I/opt/cray/pmi/default/include
> -I/opt/cray/pmi/default/include -I/opt/cray/pmi/default/include  conftest.c
> >&5
> conftest.c:640:25: fatal error: alps/apInfo.h: No such file or directory
> compilation terminated.
> configure:113871: $? = 1


While only 95% certain, I think that this logic
in config/orte_check_alps.m4 is to blame:

> if test "$with_alps" = "no" -o -z "$with_alps" ; then
> orte_check_alps_happy="no"
> else
># Only need to do these tests once (this macro is invoked
># from multiple different components' configure.m4 scripts


Specifically, the setting of "$1_CPPFLAGS" appears to be ERRONEOUSLY placed
within the else-clause of the logic above.  So, when
orte/mca/ess/alps/configure.m4 is run BEFORE
orte/mca/ras/alps/configure.m4, the variable "with_alps" gets set and the
"$1_CPPFLAGS=..." is then unreachable when the ORTE_CHECK_ALPS macro is run
later from config/orte_check_alps.m4.

Though it leaves the indentation sloppy, I believe the following might fix
the problem, but I lack the autotools versions to test this myself:

--- config/orte_check_alps.m4   (revision 27954)
+++ config/orte_check_alps.m4   (working copy)
@@ -80,6 +80,7 @@
 [orte_check_alps_dir="/opt/cray/alps/default"],
 [orte_check_alps_dir="$with_alps"])
fi
+fi

$1_CPPFLAGS="-I$orte_check_alps_dir/include"
$1_LDFLAGS="-L$orte_check_alps_libdir"
@@ -106,7 +107,6 @@
   AC_MSG_ERROR([Cannot continue])])
fi
fi
-fi
 fi

 AS_IF([test "$orte_check_alps_happy" = "yes"],


-Paul




On Mon, Jan 28, 2013 at 6:30 PM, Ralph Castain  wrote:

> Like I said, I didn't write this code - all I can say for certain is that
> it gets the right answer on the LANL Crays. I'll talk to Nathan (the
> author) about it tomorrow.
>
> On Jan 28, 2013, at 6:23 PM, Paul Hargrove  wrote:
>
> Ralph writes
>
>> ?? It looks correct to me - if with_alps is "yes", then no path was given
>> and we have to look at a default location. If it isn't yes, then a path was
>> given and we use it.
>> Am I missing something?
>
>
> Maybe *I* am the one missing something, but the way I read it the
> following defaults are applied
>
> CLE4:
>orte_check_alps_libdir="/usr/lib/alps"
>orte_check_alps_dir="/opt/cray/alps/default"
> CLE5:
>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>orte_check_alps_dir="/usr"
>
> Unless I am mistaken, the defaults for orte_check_alps_dir should be
> exchanged to yield:
>
> CLE4:
>orte_check_alps_libdir="/usr/lib/alps"
>orte_check_alps_dir="/usr"
> CLE5:
>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>orte_check_alps_dir="/opt/cray/alps/default"
>
> -Paul
>
>
> On Mon, Jan 28, 2013 at 6:14 PM, Ralph Castain  wrote:
>
>>
>> On Jan 28, 2013, at 6:10 PM, Paul Hargrove  wrote:
>>
>> The following 2 fragments from 

Re: [OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread Ralph Castain
Like I said, I didn't write this code - all I can say for certain is that it 
gets the right answer on the LANL Crays. I'll talk to Nathan (the author) about 
it tomorrow.

On Jan 28, 2013, at 6:23 PM, Paul Hargrove  wrote:

> Ralph writes
> ?? It looks correct to me - if with_alps is "yes", then no path was given and 
> we have to look at a default location. If it isn't yes, then a path was given 
> and we use it.
> Am I missing something?
> 
> Maybe *I* am the one missing something, but the way I read it the following 
> defaults are applied
> 
> CLE4:
>orte_check_alps_libdir="/usr/lib/alps"
>orte_check_alps_dir="/opt/cray/alps/default"
> CLE5:
>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>orte_check_alps_dir="/usr"
> 
> Unless I am mistaken, the defaults for orte_check_alps_dir should be 
> exchanged to yield:
> 
> CLE4:
>orte_check_alps_libdir="/usr/lib/alps"
>orte_check_alps_dir="/usr"
> CLE5:
>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>orte_check_alps_dir="/opt/cray/alps/default"
> 
> -Paul
> 
> 
> On Mon, Jan 28, 2013 at 6:14 PM, Ralph Castain  wrote:
> 
> On Jan 28, 2013, at 6:10 PM, Paul Hargrove  wrote:
> 
>> The following 2 fragments from config/orte_check_alps.m4 appear to be 
>> contradictory.
>> By that I mean the first appears to mean that "--with-alps" with no argument 
>> means /opt/cray/alps/default/... for CLE5 and /usr/... for CLE4, while the 
>> second fragment appears to be doing the opposite:
>> 
>>if test "$using_cle5_install" = "yes"; then
>>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>>else
>>orte_check_alps_libdir="/usr/lib/alps"
>>fi
>> 
>> 
>>if test "$using_cle5_install" = "yes" ; then
>>   AS_IF([test "$with_alps" = "yes"],
>> [orte_check_alps_dir="/usr"],
>> [orte_check_alps_dir="$with_alps"])
>>else   
>>   AS_IF([test "$with_alps" = "yes"],
>> [orte_check_alps_dir="/opt/cray/alps/default"],
>> [orte_check_alps_dir="$with_alps"])
>>fi
>> 
>> At least based on header and lib locations on NERSC's XC30 (CLE 5.0.15) and 
>> XE6 (CLE 4.1.40), the first fragment is correct while the second fragment is 
>> "backwards" (the two calls to AS_IF should be exchanged, or the initial 
>> "test" should be inverted).
> 
> ?? It looks correct to me - if with_alps is "yes", then no path was given and 
> we have to look at a default location. If it isn't yes, then a path was given 
> and we use it.
> 
> Am I missing something?
> 
>> 
>> Note this same logic is present in both trunk and v1.7 (in SVN - I am not 
>> looking at tarballs this time).
>> 
>> -Paul
>> 
>> 
>> 
>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 6:23 PM, George Bosilca  wrote:

> What Paul is saying is that there is a path mismatch between the two
> cases. A few lines above, using_cle5_install is only set to yes if
> /usr/lib/alps/libalps.a exists. Then in the snippet pasted in Paul's
> email, if using_cle5_install is yes, you set the
> orte_check_alps_libdir to something in /opt/cray/.

Good question - I'll leave that to Nathan.

> Why not to /usr/, as
> in the test a few lines above?

Because the default path at LANL is /opt/cray/alps/default, and they wrote this 
code! :-)

I will talk to them and see if they are comfortable about making a change, 
putting their path in their platform files, and resolving the apparent 
conflict. All I can say for sure is that it works on their Crays.


> 
> On Mon, Jan 28, 2013 at 9:14 PM, Ralph Castain  wrote:
>> 
>> On Jan 28, 2013, at 6:10 PM, Paul Hargrove  wrote:
>> 
>> The following 2 fragments from config/orte_check_alps.m4 appear to be
>> contradictory.
>> By that I mean the first appears to mean that "--with-alps" with no argument
>> means /opt/cray/alps/default/... for CLE5 and /usr/... for CLE4, while the
>> second fragment appears to be doing the opposite:
>> 
>>   if test "$using_cle5_install" = "yes"; then
>>   orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>>   else
>>   orte_check_alps_libdir="/usr/lib/alps"
>>   fi
>> 
>> 
>>   if test "$using_cle5_install" = "yes" ; then
>>  AS_IF([test "$with_alps" = "yes"],
>>[orte_check_alps_dir="/usr"],
>>[orte_check_alps_dir="$with_alps"])
>>   else
>>  AS_IF([test "$with_alps" = "yes"],
>>[orte_check_alps_dir="/opt/cray/alps/default"],
>>[orte_check_alps_dir="$with_alps"])
>>   fi
>> 
>> At least based on header and lib locations on NERSC's XC30 (CLE 5.0.15) and
>> XE6 (CLE 4.1.40), the first fragment is correct while the second fragment is
>> "backwards" (the two calls to AS_IF should be exchanged, or the initial
>> "test" should be inverted).
>> 
>> 
>> ?? It looks correct to me - if with_alps is "yes", then no path was given
>> and we have to look at a default location. If it isn't yes, then a path was
>> given and we use it.
>> 
>> Am I missing something?
>> 
>> 
>> Note this same logic is present in both trunk and v1.7 (in SVN - I am not
>> looking at tarballs this time).
>> 
>> -Paul
>> 
>> 
>> 
>> 
>> 
>> 
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] 1.7rc6 build failure: bogus errmgr code

2013-01-28 Thread Ralph Castain
LOL - yeah, I've heard that term :-)

I removed the options. Thanks!

On Jan 28, 2013, at 6:18 PM, Paul Hargrove  wrote:

> You might say that I like to "push all the buttons and see which ones go 
> boom".
> See the commit message for r8099 (which I don't imagine Jeff or Brian ever 
> thought I'd read).
> 
> -Paul
> 
> 
> On Mon, Jan 28, 2013 at 5:43 PM, Ralph Castain  wrote:
> Yes, we need to make it absolutely clear that c/r is no longer supported - 
> I'll remove that configure option.
> 
> Thanks
> Ralph
> 
> On Jan 28, 2013, at 5:38 PM, Paul Hargrove  wrote:
> 
>> When configured using --with-ft=cr on linux/x86 I see the following build 
>> failure:
>> 
>> Making all in mca/errmgr
>> make[2]: Entering directory 
>> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
>>   CC   base/errmgr_base_close.lo
>>   CC   base/errmgr_base_select.lo
>>   CC   base/errmgr_base_open.lo
>>   CC   base/errmgr_base_fns.lo
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
>>  In function 'orte_errmgr_base_proc_state_notify':
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:331:
>>  error: parse error before ',' token
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
>>  In function 'orte_errmgr_base_restart_job':
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>>  error: 'orte_errmgr_base_module_t' has no member named 'update_state'
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>>  error: 'ORTE_JOB_STATE_RESTART' undeclared (first use in this function)
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>>  error: (Each undeclared identifier is reported only once
>> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>>  error: for each function it appears in.)
>> make[2]: *** [base/errmgr_base_fns.lo] Error 1
>> make[2]: Leaving directory 
>> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
>> make[1]: *** [all-recursive] Error 1
>> make[1]: Leaving directory 
>> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte'
>> make: *** [all-recursive] Error 1
>> 
>> Both errors appear to be absent from the trunk, suggesting there is at 
>> least one CMR needed.
>> 
>> These errors were fixed on the trunk by changesets 26773 and 26770, 
>> respectively, which also make numerous changes in other files.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread Paul Hargrove
Ralph writes

> ?? It looks correct to me - if with_alps is "yes", then no path was given
> and we have to look at a default location. If it isn't yes, then a path was
> given and we use it.
> Am I missing something?


Maybe *I* am the one missing something, but the way I read it the following
defaults are applied

CLE4:
   orte_check_alps_libdir="/usr/lib/alps"
   orte_check_alps_dir="/opt/cray/alps/default"
CLE5:
   orte_check_alps_libdir="/opt/cray/alps/default/lib64"
   orte_check_alps_dir="/usr"

Unless I am mistaken, the defaults for orte_check_alps_dir should be
exchanged to yield:

CLE4:
   orte_check_alps_libdir="/usr/lib/alps"
   orte_check_alps_dir="/usr"
CLE5:
   orte_check_alps_libdir="/opt/cray/alps/default/lib64"
   orte_check_alps_dir="/opt/cray/alps/default"

-Paul


On Mon, Jan 28, 2013 at 6:14 PM, Ralph Castain  wrote:

>
> On Jan 28, 2013, at 6:10 PM, Paul Hargrove  wrote:
>
> The following 2 fragments from config/orte_check_alps.m4 appear to be
> contradictory.
> By that I mean the first appears to mean that "--with-alps" with no
> argument means /opt/cray/alps/default/... for CLE5 and /usr/... for CLE4,
> while the second fragment appears to be doing the opposite:
>
>if test "$using_cle5_install" = "yes"; then
>
>  orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>else
>orte_check_alps_libdir="/usr/lib/alps"
>fi
>
>
>if test "$using_cle5_install" = "yes" ; then
>   AS_IF([test "$with_alps" = "yes"],
> [orte_check_alps_dir="/usr"],
> [orte_check_alps_dir="$with_alps"])
>else
>   AS_IF([test "$with_alps" = "yes"],
> [orte_check_alps_dir="/opt/cray/alps/default"],
> [orte_check_alps_dir="$with_alps"])
>fi
>
> At least based on header and lib locations on NERSC's XC30 (CLE 5.0.15)
> and XE6 (CLE 4.1.40), the first fragment is correct while the second
> fragment is "backwards" (the two calls to AS_IF should be exchanged, or the
> initial "test" should be inverted).
>
>
> ?? It looks correct to me - if with_alps is "yes", then no path was given
> and we have to look at a default location. If it isn't yes, then a path was
> given and we use it.
>
> Am I missing something?
>
>
> Note this same logic is present in both trunk and v1.7 (in SVN - I am not
> looking at tarballs this time).
>
> -Paul
>
>
>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread George Bosilca
What Paul is saying is that there is a path mismatch between the two
cases. A few lines above, using_cle5_install is only set to yes if
/usr/lib/alps/libalps.a exists. Then in the snippet pasted in Paul's
email, if using_cle5_install is yes, you set the
orte_check_alps_libdir to something in /opt/cray/. Why not to /usr/, as
in the test a few lines above?

On Mon, Jan 28, 2013 at 9:14 PM, Ralph Castain  wrote:
>
> On Jan 28, 2013, at 6:10 PM, Paul Hargrove  wrote:
>
> The following 2 fragments from config/orte_check_alps.m4 appear to be
> contradictory.
> By that I mean the first appears to mean that "--with-alps" with no argument
> means /opt/cray/alps/default/... for CLE5 and /usr/... for CLE4, while the
> second fragment appears to be doing the opposite:
>
>if test "$using_cle5_install" = "yes"; then
>orte_check_alps_libdir="/opt/cray/alps/default/lib64"
>else
>orte_check_alps_libdir="/usr/lib/alps"
>fi
>
>
>if test "$using_cle5_install" = "yes" ; then
>   AS_IF([test "$with_alps" = "yes"],
> [orte_check_alps_dir="/usr"],
> [orte_check_alps_dir="$with_alps"])
>else
>   AS_IF([test "$with_alps" = "yes"],
> [orte_check_alps_dir="/opt/cray/alps/default"],
> [orte_check_alps_dir="$with_alps"])
>fi
>
> At least based on header and lib locations on NERSC's XC30 (CLE 5.0.15) and
> XE6 (CLE 4.1.40), the first fragment is correct while the second fragment is
> "backwards" (the two calls to AS_IF should be exchanged, or the initial
> "test" should be inverted).
>
>
> ?? It looks correct to me - if with_alps is "yes", then no path was given
> and we have to look at a default location. If it isn't yes, then a path was
> given and we use it.
>
> Am I missing something?
>
>
> Note this same logic is present in both trunk and v1.7 (in SVN - I am not
> looking at tarballs this time).
>
> -Paul
>
>
>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


Re: [OMPI devel] 1.7rc6 build failure: bogus errmgr code

2013-01-28 Thread Paul Hargrove
You might say that I like to "push all the buttons and see which ones go
boom".
See the commit message for
r8099 (which
I don't imagine Jeff or Brian ever thought I'd read).

-Paul


On Mon, Jan 28, 2013 at 5:43 PM, Ralph Castain  wrote:

> Yes, we need to make it absolutely clear that c/r is no longer supported -
> I'll remove that configure option.
>
> Thanks
> Ralph
>
> On Jan 28, 2013, at 5:38 PM, Paul Hargrove  wrote:
>
> When configured using --with-ft=cr on linux/x86 I see the following build
> failure:
>
> Making all in mca/errmgr
> make[2]: Entering directory
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
>   CC   base/errmgr_base_close.lo
>   CC   base/errmgr_base_select.lo
>   CC   base/errmgr_base_open.lo
>   CC   base/errmgr_base_fns.lo
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
> In function 'orte_errmgr_base_proc_state_notify':
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:331:
> error: parse error before ',' token
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
> In function 'orte_errmgr_base_restart_job':
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
> error: 'orte_errmgr_base_module_t' has no member named 'update_state'
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
> error: 'ORTE_JOB_STATE_RESTART' undeclared (first use in this function)
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
> error: (Each undeclared identifier is reported only once
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
> error: for each function it appears in.)
> make[2]: *** [base/errmgr_base_fns.lo] Error 1
> make[2]: Leaving directory
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte'
> make: *** [all-recursive] Error 1
>
> Both errors appear to be absent from the trunk, suggesting there is at
> least one CMR needed.
>
> These errors were fixed on the trunk by changesets 26773 and 26770,
> respectively, which also make numerous changes in other files.
>
> -Paul
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] Open MPI on Cray XC30 - suspicious configury

2013-01-28 Thread Paul Hargrove
The following 2 fragments from config/orte_check_alps.m4 appear to be
contradictory.
By that I mean the first appears to mean that "--with-alps" with no
argument means /opt/cray/alps/default/... for CLE5 and /usr/... for CLE4,
while the second fragment appears to be doing the opposite:

   if test "$using_cle5_install" = "yes"; then
   orte_check_alps_libdir="/opt/cray/alps/default/lib64"
   else
   orte_check_alps_libdir="/usr/lib/alps"
   fi


   if test "$using_cle5_install" = "yes" ; then
  AS_IF([test "$with_alps" = "yes"],
[orte_check_alps_dir="/usr"],
[orte_check_alps_dir="$with_alps"])
   else
  AS_IF([test "$with_alps" = "yes"],
[orte_check_alps_dir="/opt/cray/alps/default"],
[orte_check_alps_dir="$with_alps"])
   fi

At least based on header and lib locations on NERSC's XC30 (CLE 5.0.15) and
XE6 (CLE 4.1.40), the first fragment is correct while the second fragment is
"backwards" (the two calls to AS_IF should be exchanged, or the initial
"test" should be inverted).

Note this same logic is present in both trunk and v1.7 (in SVN - I am not
looking at tarballs this time).

-Paul






-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise


On 1/28/2013 7:32 PM, Ralph Castain wrote:

Out of curiosity, could you tell us how you configured OMPI?


./configure --enable-debug --enable-mpirun-prefix-by-default 
--prefix=/usr/mpi/gcc/openmpi-1.6.4rc2-dbg





On Jan 28, 2013, at 12:46 PM, Steve Wise  wrote:


On 1/28/2013 2:04 PM, Ralph Castain wrote:

On Jan 28, 2013, at 11:55 AM, Steve Wise  wrote:


Do you know if the rdmacm CPC is really being used for your connection setup 
(vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  Maybe that's 
the difference?

Dunno for certain, but I expect it is using the OOB cm since I didn't direct it 
to do anything different. Like I said, I suspect the problem is that the 
cluster doesn't have iWARP on it.

Definitely, or it could be the different CPC used for IW vs IB is tickling the 
issue.


Steve.

On 1/28/2013 1:47 PM, Ralph Castain wrote:

Nope - still works just fine. I didn't receive that warning at all, and it ran 
to completion without problem.

I suspect the problem is that the system I can use just isn't configured like 
yours, and so I can't trigger the problem. Afraid I can't be of help after 
all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise  wrote:


On 1/28/2013 12:48 PM, Ralph Castain wrote:

Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.

I still hit it on 1.6.4rc2.

Note iWARP != IB so you may not have this issue on IB systems for various 
reasons.  Did you use the same mpirun line? Namely using this:

--mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

Because if I don't use ipaddr_include, then I don't see this issue on my setup.

Also, did you see these logged:

Right after starting the job:

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   hpc-hn1.ogc.int
  Local device: cxgb4_0
  Local port:   2
  CPCs attempted:   oob, xoob, rdmacm
--
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
help-mpi-btl-openib-cpc-base.txt / no cpcs for port


I think these are benign, but prolly indicate a bug: the mpirun is restricting 
the job to use port 1 only, so the CPCs shouldn't be attempting port 2...

Steve.



On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:


On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:


On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
chelsio iwarp/rdma setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.
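For illustration, here is a minimal stand-alone sketch (in C, not Open MPI code) of
the ordering constraint described above: a function pointer obtained from a
dlopen()ed component is only valid while that component stays loaded, so any
deregistration callback has to run before the component is dlclose()d.  The plugin
name "libexample_btl.so" and the symbol "example_dereg_mr" are hypothetical.

    /* sketch: call the plugin's dereg callback BEFORE unloading the plugin */
    #include <dlfcn.h>
    #include <stdio.h>

    typedef int (*dereg_fn_t)(void *registration);

    int main(void)
    {
        void *plugin = dlopen("libexample_btl.so", RTLD_NOW);  /* hypothetical plugin */
        if (NULL == plugin) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        /* the mpool-style caller keeps a pointer into the plugin's code */
        dereg_fn_t dereg = (dereg_fn_t) dlsym(plugin, "example_dereg_mr");

        if (NULL != dereg) {
            dereg(NULL);    /* release registrations while the plugin is still loaded */
        }
        dlclose(plugin);    /* only unload it afterwards */

        /* calling dereg() here, after dlclose(), is exactly the stale-pointer
         * dereference that segfaults during finalization */
        return 0;
    }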


That definitely sounds like a bug


Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Great!  If you want me to try out a fix or gather more debug, just hollar.

Thanks,

Steve.





Re: [OMPI devel] 1.7rc6 build failure: bogus errmgr code

2013-01-28 Thread Ralph Castain
Yes, we need to make it absolutely clear that c/r is no longer supported - I'll 
remove that configure option.

Thanks
Ralph

On Jan 28, 2013, at 5:38 PM, Paul Hargrove  wrote:

> When configured using --with-ft=cr on linux/x86 I see the following build 
> failure:
> 
> Making all in mca/errmgr
> make[2]: Entering directory 
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
>   CC   base/errmgr_base_close.lo
>   CC   base/errmgr_base_select.lo
>   CC   base/errmgr_base_open.lo
>   CC   base/errmgr_base_fns.lo
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
>  In function 'orte_errmgr_base_proc_state_notify':
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:331:
>  error: parse error before ',' token
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
>  In function 'orte_errmgr_base_restart_job':
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>  error: 'orte_errmgr_base_module_t' has no member named 'update_state'
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>  error: 'ORTE_JOB_STATE_RESTART' undeclared (first use in this function)
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>  error: (Each undeclared identifier is reported only once
> /home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
>  error: for each function it appears in.)
> make[2]: *** [base/errmgr_base_fns.lo] Error 1
> make[2]: Leaving directory 
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory 
> `/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte'
> make: *** [all-recursive] Error 1
> 
> Both errors appear to be absent from the trunk, suggesting there is at least 
> one CMR needed.
> 
> These errors were fixed on the trunk by changesets 26773 and 26770, 
> respectively, which also make numerous changes in other files.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] 1.7rc6 build failure: bogus errmgr code

2013-01-28 Thread Paul Hargrove
When configured using --with-ft=cr on linux/x86 I see the following build
failure:

Making all in mca/errmgr
make[2]: Entering directory
`/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
  CC   base/errmgr_base_close.lo
  CC   base/errmgr_base_select.lo
  CC   base/errmgr_base_open.lo
  CC   base/errmgr_base_fns.lo
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
In function 'orte_errmgr_base_proc_state_notify':
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:331:
error: parse error before ',' token
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:
In function 'orte_errmgr_base_restart_job':
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
error: 'orte_errmgr_base_module_t' has no member named 'update_state'
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
error: 'ORTE_JOB_STATE_RESTART' undeclared (first use in this function)
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
error: (Each undeclared identifier is reported only once
/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/openmpi-1.7rc6/orte/mca/errmgr/base/errmgr_base_fns.c:622:
error: for each function it appears in.)
make[2]: *** [base/errmgr_base_fns.lo] Error 1
make[2]: Leaving directory
`/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte/mca/errmgr'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory
`/home/pcp1/phargrov/OMPI/openmpi-1.7rc6-linux-x86-blcr/BLD/orte'
make: *** [all-recursive] Error 1

Both errors appear to be absent from the trunk, suggesting there is at
least one CMR needed.

These errors were fixed on the trunk by changesets 26773 and 26770,
respectively, which also make numerous changes in other files.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Ralph Castain
Out of curiosity, could you tell us how you configured OMPI?


On Jan 28, 2013, at 12:46 PM, Steve Wise  wrote:

> On 1/28/2013 2:04 PM, Ralph Castain wrote:
>> On Jan 28, 2013, at 11:55 AM, Steve Wise  wrote:
>> 
>>> Do you know if the rdmacm CPC is really being used for your connection 
>>> setup (vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  
>>> Maybe that's the difference?
>> Dunno for certain, but I expect it is using the OOB cm since I didn't direct 
>> it to do anything different. Like I said, I suspect the problem is that the 
>> cluster doesn't have iWARP on it.
> 
> Definitely, or it could be the different CPC used for IW vs IB is tickling the 
> issue.
> 
>>> Steve.
>>> 
>>> On 1/28/2013 1:47 PM, Ralph Castain wrote:
 Nope - still works just fine. I didn't receive that warning at all, and it 
 ran to completion without problem.
 
 I suspect the problem is that the system I can use just isn't configured 
 like yours, and so I can't trigger the problem. Afraid I can't be of help 
 after all... :-(
 
 
 On Jan 28, 2013, at 11:25 AM, Steve Wise  
 wrote:
 
> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 
>> branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>> 
>> Can you try it with a 1.6.4 tarball and see if you still see the 
>> problem? Could be someone already fixed it.
> I still hit it on 1.6.4rc2.
> 
> Note iWARP != IB so you may not have this issue on IB systems for various 
> reasons.  Did you use the same mpirun line? Namely using this:
> 
> --mca btl_openib_ipaddr_include "192.168.170.0/24"
> 
> (adjusted to your network config).
> 
> Because if I don't use ipaddr_include, then I don't see this issue on my 
> setup.
> 
> Also, did you see these logged:
> 
> Right after starting the job:
> 
> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:   hpc-hn1.ogc.int
>  Local device: cxgb4_0
>  Local port:   2
>  CPCs attempted:   oob, xoob, rdmacm
> --
> ...
> 
> At the end of the job:
> 
> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
> 
> 
> I think these are benign, but prolly indicate a bug: the mpirun is 
> restricting the job to use port 1 only, so the CPCs shouldn't be 
> attempting port 2...
> 
> Steve.
> 
> 
>> On Jan 28, 2013, at 10:03 AM, Steve Wise  
>> wrote:
>> 
>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
 On Jan 28, 2013, at 9:12 AM, Steve Wise  
 wrote:
 
> On 1/25/2013 12:19 PM, Steve Wise wrote:
>> Hello,
>> 
>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command 
>> on my chelsio iwarp/rdma setup causes a seg fault every time:
>> 
>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host 
>> hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca 
>> btl_openib_ipaddr_include "192.168.170.0/24" 
>> /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>> 
>> The segfault is during finalization, and I've debugged this to the 
>> point where I see a call to dereg_mem() after the openib btl is 
>> unloaded via dlclose().  dereg_mem() dereferences a function pointer 
>> to call the btl-specific dereg function, in this case it is 
>> openib_dereg_mr().  However, since that btl has already been 
>> unloaded, the deref causes a seg fault.  Happens every time with the 
>> above mpi job.
>> 
>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't 
>> see the seg fault, and I don't see a call to dereg_mem() after the 
>> openib btl is unloaded.  That's all well and good. :)  But I'd like to 
>> get this fix pushed into 1.6 since that is the current stable 
>> release.
>> 
>> Question:  Can someone point me to the fix in 1.7?
>> 
>> Thanks,
>> 
>> Steve.
> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is 
> called which unloads the openib btl.  Then further down in 
> ompi_mpi_finalize(), mca_mpool_base_close() is called which ends up 
> calling dereg_mem() which seg 

Re: [OMPI devel] Looking for a replacement call for repeated call to MPI_IPROBE

2013-01-28 Thread Jeff Squyres (jsquyres)
Is there a reason you're using buffered sends?  They're generally pretty evil:

 
http://blogs.cisco.com/performance/top-10-reasons-why-buffered-sends-are-evil/
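If buffered sends do stay in the picture, the usual attach-buffer sizing (the
MPI_BSEND_OVERHEAD bookkeeping asked about below) looks roughly like this minimal
sketch in C.  This is not code from the thread; MAX_PENDING and the 4-int payload
are hypothetical stand-ins for the application's real message traffic.

    #include <mpi.h>
    #include <stdlib.h>

    #define MAX_PENDING 16   /* hypothetical bound on simultaneously pending bsends */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int packed = 0;                                  /* packed bytes per payload */
        MPI_Pack_size(4, MPI_INT, MPI_COMM_WORLD, &packed);

        int bufsize = MAX_PENDING * (packed + MPI_BSEND_OVERHEAD);
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        /* ... MPI_Bsend(update, 4, MPI_INT, dest, tag, MPI_COMM_WORLD) calls ... */

        /* detach blocks until every buffered message has been delivered */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);

        MPI_Finalize();
        return 0;
    }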

FWIW, you can probably install Open MPI 1.6.3 yourself -- you can just install 
it under $HOME, or some other directory that is available on all compute nodes. 
 Then just set your PATH and LD_LIBRARY_PATH to point to your install instead 
of the system install.

This would at least let you know if upgrading to 1.6.3 will fix your issue.



On Jan 28, 2013, at 12:22 PM, Jeremy McCaslin  wrote:

> Thank you for the feedback.  I actually just changed the repeated probing for 
> a message to a blocking MPI_RECV, as the processor waiting to receive does 
> nothing but repeatedly probe until the message is there anyway.  This also 
> works, and it makes more sense to do it this way.  However, this did not fix 
> my hanging issue.  I am wondering if it has something to do with the size of 
> my buffer used in MPI_BUFFER_ATTACH.  I believe I am following the proper 
> MPI_BSEND_OVERHEAD protocol.  I am waiting on the admins to install 
> openmpi-1.6.3, and hoping that maybe this will fix my issue.
> 
> On Sat, Jan 26, 2013 at 7:32 AM, Jeff Squyres (jsquyres)  
> wrote:
> First off, 1.4.4 is fairly ancient.  You might want to try upgrading to 1.6.3.
> 
> Second, you might want to use non-blocking receives for B such that you can 
> MPI_WAITALL, or perhaps MPI_WAITSOME or MPI_WAITANY to wait for some/all of 
> the values to arrive in B.  This keeps any looping down in MPI (i.e., as 
> close to the hardware as possible).
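To make that suggestion concrete, here is a minimal sketch (in C, not code from this
thread) of pre-posting non-blocking receives for the neighbor updates and letting
MPI_Waitany do the waiting instead of an MPI_Iprobe loop.  The tag, the
one-update-per-peer pattern, and the 4-int (i, j, k, value) payload are made-up
stand-ins for the application's real message layout.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define UPDATE_TAG 42                    /* hypothetical tag for cell updates */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (0 == rank && size > 1) {
            int expected = size - 1;                     /* one update per peer */
            int (*msgs)[4] = malloc(expected * sizeof *msgs);
            MPI_Request *reqs = malloc(expected * sizeof *reqs);

            /* pre-post one receive per expected update */
            for (int n = 0; n < expected; n++) {
                MPI_Irecv(msgs[n], 4, MPI_INT, MPI_ANY_SOURCE, UPDATE_TAG,
                          MPI_COMM_WORLD, &reqs[n]);
            }

            /* handle updates as they complete; the waiting happens inside MPI */
            for (int done = 0; done < expected; done++) {
                int idx;
                MPI_Waitany(expected, reqs, &idx, MPI_STATUS_IGNORE);
                printf("update for cell (%d,%d,%d) = %d\n",
                       msgs[idx][0], msgs[idx][1], msgs[idx][2], msgs[idx][3]);
            }
            free(msgs);
            free(reqs);
        } else if (rank != 0) {
            int update[4] = { rank, rank, 0, 7 * rank };  /* dummy (i,j,k,value) */
            MPI_Send(update, 4, MPI_INT, 0, UPDATE_TAG, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }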
> 
> 
> On Jan 25, 2013, at 3:21 PM, Jeremy McCaslin  wrote:
> 
> > Hello,
> >
> > I am trying to figure out the most appropriate MPI calls for a certain 
> > portion of my code.  I will describe the situation here:
> >
> > Each cell (i,j) of my array A is being updated by a calculation that 
> > depends on the values of 1 or 2 of the 4 possible neighbors A(i+1,j), 
> > A(i-1,j), A(i,j+1), and A(i,j-1).  Say, for example, 
> > A(i,j)=A(i-1,j)*A(i,j-1).  The thing is, the values of the neighbors 
> > A(i-1,j) and A(i,j-1) cannot be used until an auxiliary array B has been 
> > updated from 0 to 1.  The values B(i-1,j) and B(i,j-1) are changed from 0 
> > -> 1 after the values A(i-1,j) and A(i,j-1) have been communicated to the 
> > proc that contains cell (i,j), as cells (i-1,j) and (i,j-1) belong to 
> > different procs.  Here is pseudocode for how I have the algorithm 
> > implemented (in fortran):
> >
> > do while (B(ii,jj,kk).eq.0)
> >  if (probe_for_message(i0,j0,k0,this_sc)) then
> >   my_ibuf(1)=my_ibuf(1)+1
> >   A(i0,j0,k0)=this_sc
> >   B(i0,j0,k0)=1
> >  end if
> > end do
> >
> > The function 'probe_for_message' uses an 'MPI_IPROBE' to see if 
> > 'MPI_ANY_SOURCE' has a message for my current proc.  If there is a message, 
> > the function returns a true logical and calls 'MPI_RECV', receiving 
> > (i0,j0,k0,this_sc) from the proc that has the message.  This works!  My 
> > concern is that I am probing repeatedly inside the while loop until I 
> > receive a message from a proc such that ii=i0, jj=j0, kk=k0.  I could 
> > potentially call MPI_IPROBE many many times before this happens... and I'm 
> > worried that this is a messy way of doing this.  Could I "break" the mpi 
> > probe call?  Are there MPI routines that would allow me to accomplish the 
> > same thing in a more formal or safer way?  Maybe a persistent communication 
> > or something?  For very large computations with many procs, I am observing 
> > a hanging situation which I suspect may be due to this.  I observe it when 
> > using openmpi-1.4.4, and the hanging seems to disappear if I use mvapich.  
> > Any suggestions/comments would be greatly appreciated.  Thanks so much!
> >
> > --
> > JM ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> JM ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Jeff Squyres (jsquyres)
You're basically telling your build system to use a C++ compiler as the linker 
when creating libtorque.  This probably does more-or-less what I suggested: 
rpath'ing in whatever dependencies you need such that when we link against 
libtorque, all of the (C++) dependencies that you need are automatically pulled 
along.

Glad it got resolved!


On Jan 28, 2013, at 4:55 PM, David Beer 
 wrote:

> 
> On Mon, Jan 28, 2013 at 12:14 PM, Barrett, Brian W  wrote:
> On 1/28/13 11:54 AM, "David Beer"  wrote:
> 
> checking for tm_init in -ltorque... no
> configure: error: TM support requested but not found.  Aborting
> 
> Oddly enough, if you have already configured with an older version of TORQUE, 
> you can build open-mpi with TORQUE 4.2 installed, so it can find the function 
> definitions when compiling; it's just that for some reason it doesn't find them in 
> the configure script. This is why I think that something in the configure 
> script is assuming that libtorque was compiled with gcc.
> 
> Right, the configure output to stdout/stderr isn't very useful in diagnosing 
> why a test failed.  The config.log file generated by configure will have much 
> more information.
> 
> All,
> 
> Thanks for your help. I found a way to resolve this by changing OpenMPI's 
> configure script, but then someone who knows a bit more about these things 
> showed me that we can solve this by defining some more things on our end, 
> namely adding:
> 
> +LT_LANG([C++])
> +AC_SUBST([LIBTOOL_DEPS])
> 
> and 
> 
> +CCLD="$CXX"
> +AC_SUBST([CCLD])
> +LIBTOOLFLAGS="--tag=CXX"
> +AC_SUBST([LIBTOOLFLAGS])
> 
> to our configure.ac and 
> 
> +LIBTOOL_DEPS = @LIBTOOL_DEPS@
> +libtool: $(LIBTOOL_DEPS)
> + $(SHELL) ./config.status --recheck
> 
> to our Makefile.am. I'm going to try to look some things up and see why this 
> makes a difference, but I'm guessing that we previously had an incomplete 
> definition that confused OpenMPI's configure script. Thanks for all of your 
> help and I'm glad we could resolve this by fixing TORQUE.
> 
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread David Beer
On Mon, Jan 28, 2013 at 12:14 PM, Barrett, Brian W wrote:

> On 1/28/13 11:54 AM, "David Beer"  wrote:
>
> checking for tm_init in -ltorque... no
> configure: error: TM support requested but not found.  Aborting
>
> Oddly enough, if you have already configured with an older version of
> TORQUE, you can build open-mpi with TORQUE 4.2 installed, so it can find
> the function definitions when compiling; it's just that for some reason it
> doesn't find them in the configure script. This is why I think that
> something in the configure script is assuming that libtorque was compiled
> with gcc.
>
>
> Right, the configure output to stdout/stderr isn't very useful in
> diagnosing why a test failed.  The config.log file generated by configure
> will have much more information.
>

All,

Thanks for your help. I found a way to resolve this by changing OpenMPI's
configure script, but then someone who knows a bit more about these things
showed me that we can solve this by defining some more things on our end,
namely adding:

+LT_LANG([C++])
+AC_SUBST([LIBTOOL_DEPS])

and

+CCLD="$CXX"
+AC_SUBST([CCLD])
+LIBTOOLFLAGS="--tag=CXX"
+AC_SUBST([LIBTOOLFLAGS])

to our configure.ac and

+LIBTOOL_DEPS = @LIBTOOL_DEPS@
+libtool: $(LIBTOOL_DEPS)
+ $(SHELL) ./config.status --recheck

to our Makefile.am. I'm going to try to look some things up and see why
this makes a difference, but I'm guessing that we previously had an
incomplete definition that confused OpenMPI's configure script. Thanks for
all of your help and I'm glad we could resolve this by fixing TORQUE.

-- 
David Beer | Senior Software Engineer
Adaptive Computing


Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise

On 1/28/2013 2:04 PM, Ralph Castain wrote:

On Jan 28, 2013, at 11:55 AM, Steve Wise  wrote:


Do you know if the rdmacm CPC is really being used for your connection setup 
(vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  Maybe that's 
the difference?

Dunno for certain, but I expect it is using the OOB cm since I didn't direct it 
to do anything different. Like I said, I suspect the problem is that the 
cluster doesn't have iWARP on it.


Definitely, or it could be the different CPC used for IW vs IB is 
tickling the issue.



Steve.

On 1/28/2013 1:47 PM, Ralph Castain wrote:

Nope - still works just fine. I didn't receive that warning at all, and it ran 
to completion without problem.

I suspect the problem is that the system I can use just isn't configured like 
yours, and so I can't trigger the problem. Afraid I can't be of help after 
all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise  wrote:


On 1/28/2013 12:48 PM, Ralph Castain wrote:

Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.

I still hit it on 1.6.4rc2.

Note iWARP != IB so you may not have this issue on IB systems for various 
reasons.  Did you use the same mpirun line? Namely using this:

--mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

Because if I don't use ipaddr_include, then I don't see this issue on my setup.

Also, did you see these logged:

Right after starting the job:

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   hpc-hn1.ogc.int
  Local device: cxgb4_0
  Local port:   2
  CPCs attempted:   oob, xoob, rdmacm
--
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
help-mpi-btl-openib-cpc-base.txt / no cpcs for port


I think these are benign, but prolly indicate a bug: the mpirun is restricting 
the job to use port 1 only, so the CPCs shouldn't be attempting port 2...

Steve.



On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:


On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:


On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
chelsio iwarp/rdma setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.


That definitely sounds like a bug


Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Great!  If you want me to try out a fix or gather more debug, just hollar.

Thanks,

Steve.





Re: [OMPI devel] Open MPI (not quite) on Cray XC30

2013-01-28 Thread Paul Hargrove
I will be happy to retest on both the XC30 and XE6 at NERSC from a nightly
tarball with the fixes.
Please give me a heads up when that is available.

-Paul


On Mon, Jan 28, 2013 at 7:52 AM, Ralph Castain  wrote:

> The key was to --enable-static --disable-shared. That's the only way to
> generate the problem.
>
> Brian was already aware of it and fixed it this weekend. I tested the fix
> and it works fine. Waiting for Jeff to review it before committing to the
> trunk.
>
>
> On Jan 28, 2013, at 7:45 AM, Nathan Hjelm  wrote:
>
> > Try building static. Lots of errors due to missing libraries in
> libs_static.
> >
> > -Nathan
> >
> > On Fri, Jan 25, 2013 at 04:09:16PM -0800, Ralph Castain wrote:
> >> FWIW: I can build it fine without setting any of the CC... flags on
> LANL's Cray XE6, and mpicc worked just fine for me once built that way.
> >>
> >> So I'm not quite sure I understand the "mpicc is completely borked in
> the trunk". Can you elaborate?
> >>
> >> On Jan 25, 2013, at 3:59 PM, Paul Hargrove  wrote:
> >>
> >>> Nathan,
> >>>
> >>> The 2nd and 3rd non-blank lines of my original post:
> >>> Given that it is INTENDED to be API-compatible with the XE series, I
> began configuring with
> >>>CC=cc CXX=CC FC=ftn
> --with-platform=lanl/cray_xe6/optimized-nopanasas
> >>>
> >>> So, I am surprised that nobody objected before now to my use of the
> Cray-provided wrapper compilers.
> >>> I mistakenly believed that if I don't use them then I wouldn't get
> through configure w/ ugni and alps support.
> >>> However, I've just now completed configure w/o setting CC, CXX, FC and
> see the expected components.
> >>> I'll report more from this build later ("make all" is running now).
> >>>
> >>> I would appreciate (perhaps off-list) receiving any module or platform
> file or additional instructions that may be appropriate for building on a
> Cray XE, XK or XC system.
> >>>
> >>> Getting OMPI running on our XC30 is of exactly ZERO importance beyond
> my own edification.
> >>> So, I am likely to stop fighting this battle soon.
> >>>
> >>> -Paul
> >>>
> >>>
> >>> On Fri, Jan 25, 2013 at 3:21 PM, Nathan Hjelm  wrote:
> >>> Hmm, I see mpicc in there. It will use the compiler directly instead
> of Cray's wrappers. We didn't want Open MPI's wrapper linking in MPT
> after all ;). mpicc is completely borked in the trunk.
> >>>
> >>> If you want to use the Cray wrappers with Open MPI I can give you a
> module file that sets up the environment correctly (link against -lmpi not
> -lmpich, etc).
> >>>
> >>> -Nathan
> >>>
> >>> On Fri, Jan 25, 2013 at 03:10:37PM -0800, Paul Hargrove wrote:
>  Nathan,
> 
>  Cray's "cc" wrapper is adding xpmem, ugni, pmi, alps and others
> already:
> 
>  $ cc -v hello.c 2>&1 | grep collect
> > /opt/gcc/4.7.2/snos/libexec/gcc/x86_64-suse-linux/4.7.2/collect2
> > --sysroot= -m elf_x86_64 -static -u pthread_mutex_trylock -u
> > pthread_mutex_destroy -u pthread_create /usr/lib/../lib64/crt1.o
> > /usr/lib/../lib64/crti.o
> > /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtbeginT.o
> > -L/opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64
> > -L/opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64
> > -L/opt/cray/pmi/4.0.0-1..9282.69.4.ari/lib64
> > -L/opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64
> > -L/opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64
> > -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> > -L/opt/cray/rca/1.0.0-2.0500.37705.3.12.ari/lib64
> > -L/opt/cray/mpt/5.6.0/gni/mpich2-gnu/47/lib
> > -L/opt/cray/mpt/5.6.0/gni/sma/lib64
> > -L/opt/cray/libsci/12.0.00/gnu/47/sandybridge/lib
> > -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> > -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2
> >
> -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../../../lib64
> > -L/lib/../lib64 -L/usr/lib/../lib64
> > -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../..
> > /scratch1/scratchdirs/hargrove/ccQ1f0sx.o -lrca
> -L/opt/cray/atp/1.6.0/lib/
> > --undefined=_ATP_Data_Globals --undefined=__atpHandlerInstall
> > -lAtpSigHCommData -lAtpSigHandler --start-group -lgfortran
> -lscicpp_gnu
> > -lsci_gnu_mp -lstdc++ -lgfortran -lmpich_gnu_47 -lmpl -lrt -lsma
> -lxpmem
> > -ldmapp -lugni -lpmi -lalpslli -lalpsutil -lalps -ludreg -lpthread
> -lm
> > --end-group -lgomp -lpthread --start-group -lgcc -lgcc_eh -lc
> --end-group
> > /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtend.o
> > /usr/lib/../lib64/crtn.o
> 
> 
>  -Paul
> 
> 
>  On Fri, Jan 25, 2013 at 2:46 PM, Nathan Hjelm 
> wrote:
> 
> > Something is wrong with the wrappers. A number of libraries (-lxpmem,
> > -lugni, etc) are missing from libs_static. Might be a similar issue
> with the
> > missing -llustreapi. Going to create a critical bug to track this
> issue.
> 

Re: [OMPI devel] 1.6.4rc2 released

2013-01-28 Thread Paul Hargrove
I am pleased to say that 1.6.4rc2 builds and runs (single node, sm btl) on
my BSD menagerie:
   freebsd6-amd64
   freebsd7-amd64
   freebsd8-amd64
   freebsd8-i386
   freebsd9-amd64
   freebsd9-i386
   netbsd6-amd64
   netbsd6-i386
   openbsd5-amd64
   openbsd5-i386

The {Free,Net,Open}BSD platforms have all been updated this month to their
latest respective stable versions.

I can also confirm testing of the following less common components:
   solaris-11 (snv_151a) on x86 and x64:  btl:udapl (compile only)
   linux on x86:    mtl:mx, btl:mx, and btl:elan (compile only)
   linux on x86-64: mtl:psm (compile and run)

As time allows I will be scanning the logs from those builds to see if any
"alarming" warnings appear.

-Paul



On Sat, Jan 26, 2013 at 4:25 AM, Jeff Squyres (jsquyres)  wrote:

> In the usual location:
>
> http://www.open-mpi.org/software/ompi/v1.6/
>
> Changes since rc1:
>
> - Automatically provide compiler flags that compile properly on some
>   types of ARM systems.
> - Fix slot_list behavior when multiple sockets are specified.  Thanks
>   to Siegmar Gross for reporting the problem.
> - Fixed memory leak in one-sided operations.  Thanks to Victor
>   Vysotskiy for letting us know about this one.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Jeff Squyres (jsquyres)
I'll +1 what Brian said: we *really* don't want to have to link Open MPI with a 
C++ compiler.

Can't you rpath in whatever support libraries you need (e.g., the g++ libraries 
with the cxx_personality symbol), such that when we -ltorque, it just pulls in 
whatever other dependencies it needs?

(I'm assuming that you're extern "C"'ing all the tm_*() function calls so that 
they can be called from C code, not C++ code)
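
A sketch of that guard, with the prototypes written the way the PBS tm.h usually
declares them (treat the exact signatures as illustrative and defer to the
installed header):

  /* illustrative only -- the real prototypes live in the installed tm.h */
  #ifdef __cplusplus
  extern "C" {
  #endif

  struct tm_roots;                                  /* defined in tm.h    */

  int tm_init(void *info, struct tm_roots *roots);
  int tm_finalize(void);

  #ifdef __cplusplus
  }  /* extern "C" */
  #endif

With that guard compiled into libtorque, a g++ build still exports plain-C
symbols for tm_init() and friends, so both configure's link probe and Open MPI's
C code can resolve them.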


On Jan 28, 2013, at 2:14 PM, "Barrett, Brian W"  wrote:

> On 1/28/13 11:54 AM, "David Beer"  wrote:
> 
>> checking for tm_init in -ltorque... no
>> configure: error: TM support requested but not found.  Aborting
>> 
>> Oddly enough, if you have already configured with an older version of 
>> TORQUE, you can build open-mpi with TORQUE 4.2 installed, so it can find the 
>> function definitions when compiling; it's just that for some reason it doesn't 
>> find them in the configure script. This is why I think that something in the 
>> configure script is assuming that libtorque was compiled with gcc.
> 
> Right, the configure output to stdout/stderr isn't very useful in diagnosing 
> why a test failed.  The config.log file generated by configure will have much 
> more information.
> 
> Brian
> 
> --
>   Brian W. Barrett
>   Scalable System Software Group
>   Sandia National Laboratories
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 11:55 AM, Steve Wise  wrote:

> Do you know if the rdmacm CPC is really being used for your connection setup 
> (vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  Maybe 
> that's the difference?

Dunno for certain, but I expect it is using the OOB cm since I didn't direct it 
to do anything different. Like I said, I suspect the problem is that the 
cluster doesn't have iWARP on it.

> 
> Steve.
> 
> On 1/28/2013 1:47 PM, Ralph Castain wrote:
>> Nope - still works just fine. I didn't receive that warning at all, and it 
>> ran to completion without problem.
>> 
>> I suspect the problem is that the system I can use just isn't configured 
>> like yours, and so I can't trigger the problem. Afraid I can't be of help 
>> after all... :-(
>> 
>> 
>> On Jan 28, 2013, at 11:25 AM, Steve Wise  wrote:
>> 
>>> On 1/28/2013 12:48 PM, Ralph Castain wrote:
 Hmmm...afraid I cannot replicate this using the current state of the 1.6 
 branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
 
 Can you try it with a 1.6.4 tarball and see if you still see the problem? 
 Could be someone already fixed it.
>>> I still hit it on 1.6.4rc2.
>>> 
>>> Note iWARP != IB so you may not have this issue on IB systems for various 
>>> reasons.  Did you use the same mpirun line? Namely using this:
>>> 
>>> --mca btl_openib_ipaddr_include "192.168.170.0/24"
>>> 
>>> (adjusted to your network config).
>>> 
>>> Because if I don't use ipaddr_include, then I don't see this issue on my 
>>> setup.
>>> 
>>> Also, did you see these logged:
>>> 
>>> Right after starting the job:
>>> 
>>> --
>>> No OpenFabrics connection schemes reported that they were able to be
>>> used on a specific port.  As such, the openib BTL (OpenFabrics
>>> support) will be disabled for this port.
>>> 
>>>  Local host:   hpc-hn1.ogc.int
>>>  Local device: cxgb4_0
>>>  Local port:   2
>>>  CPCs attempted:   oob, xoob, rdmacm
>>> --
>>> ...
>>> 
>>> At the end of the job:
>>> 
>>> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
>>> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
>>> 
>>> 
>>> I think these are benign, but prolly indicate a bug: the mpirun is 
>>> restricting the job to use port 1 only, so the CPCs shouldn't be attempting 
>>> port 2...
>>> 
>>> Steve.
>>> 
>>> 
 On Jan 28, 2013, at 10:03 AM, Steve Wise  
 wrote:
 
> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:12 AM, Steve Wise  
>> wrote:
>> 
>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
 Hello,
 
 I'm tracking an issue I see in openmpi-1.6.3.  Running this command on 
 my chelsio iwarp/rdma setup causes a seg fault every time:
 
 /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host 
 hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca 
 btl_openib_ipaddr_include "192.168.170.0/24" 
 /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
 
 The segfault is during finalization, and I've debugged this to the 
 point where I see a call to dereg_mem() after the openib btl is 
 unloaded via dlclose().  dereg_mem() dereferences a function pointer 
 to call the btl-specific dereg function, in this case it is 
 openib_dereg_mr().  However, since that btl has already been unloaded, 
 the deref causes a seg fault.  Happens every time with the above mpi 
 job.
 
 Now, I tried this same experiment with openmpi-1.7rc6 and I don't see 
 the seg fault, and I don't see a call to dereg_mem() after the openib 
 btl is unloaded.  That's all well and good. :)  But I'd like to get this 
 fix pushed into 1.6 since that is the current stable release.
 
 Question:  Can someone point me to the fix in 1.7?
 
 Thanks,
 
 Steve.
>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called 
>>> which unloads the openib btl.  Then further down in 
>>> ompi_mpi_finalize(), mca_mpool_base_close() is called which ends up 
>>> calling dereg_mem() which seg faults trying to call into the unloaded 
>>> openib btl.
>>> 
>> That definitely sounds like a bug
>> 
>>> Anybody have thoughts?  Anybody care? :)
>> I care! It needs to be fixed - I'll take a look. Probably something that 
>> forgot to be cmr'd.
> Great!  If you want me to try out a fix or gather more debug, just holler.
> 
> Thanks,
> 
> Steve.
> 
> 




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise
Do you know if the rdmacm CPC is really being used for your connection 
setup (vs other CPCs supported by IB)?  Cuz iwarp only supports rdmacm.  
Maybe that's the difference?


Steve.
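
One way to confirm is to pin the connection manager and raise the BTL
verbosity; a sketch using the usual openib CPC selection parameter
(btl_openib_cpc_include -- the verbosity level below is a guess, adjust as
needed):

  mpirun --np 2 --host hpc-hn1,hpc-cn2 \
         --mca btl openib,sm,self \
         --mca btl_openib_cpc_include rdmacm \
         --mca btl_openib_ipaddr_include "192.168.170.0/24" \
         --mca btl_base_verbose 100 \
         /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

If the job only comes up when rdmacm is the sole CPC allowed, that is the one
in use; the verbose output should also name the CPC chosen for each port.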

On 1/28/2013 1:47 PM, Ralph Castain wrote:

Nope - still works just fine. I didn't receive that warning at all, and it ran 
to completion without problem.

I suspect the problem is that the system I can use just isn't configured like 
yours, and so I can't trigger the problem. Afraid I can't be of help after 
all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise  wrote:


On 1/28/2013 12:48 PM, Ralph Castain wrote:

Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.

I still hit it on 1.6.4rc2.

Note iWARP != IB so you may not have this issue on IB systems for various 
reasons.  Did you use the same mpirun line? Namely using this:

--mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

Because if I don't use ipaddr_include, then I don't see this issue on my setup.

Also, did you see these logged:

Right after starting the job:

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   hpc-hn1.ogc.int
  Local device: cxgb4_0
  Local port:   2
  CPCs attempted:   oob, xoob, rdmacm
--
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
help-mpi-btl-openib-cpc-base.txt / no cpcs for port


I think these are benign, but prolly indicate a bug: the mpirun is restricting 
the job to use port 1 only, so the CPCs shouldn't be attempting port 2...

Steve.



On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:


On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:


On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
chelsio iwarp/rdma setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.


That definitely sounds like a bug


Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Great!  If you want me to try out a fix or gather more debug, just holler.

Thanks,

Steve.





Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Ralph Castain
Nope - still works just fine. I didn't receive that warning at all, and it ran 
to completion without problem.

I suspect the problem is that the system I can use just isn't configured like 
yours, and so I can't trigger the problem. Afraid I can't be of help after 
all... :-(


On Jan 28, 2013, at 11:25 AM, Steve Wise  wrote:

> On 1/28/2013 12:48 PM, Ralph Castain wrote:
>> Hmmm...afraid I cannot replicate this using the current state of the 1.6 
>> branch (which is the 1.6.4rcN) on the only IB-based cluster I can access.
>> 
>> Can you try it with a 1.6.4 tarball and see if you still see the problem? 
>> Could be someone already fixed it.
> 
> I still hit it on 1.6.4rc2.
> 
> Note iWARP != IB so you may not have this issue on IB systems for various 
> reasons.  Did you use the same mpirun line? Namely using this:
> 
> --mca btl_openib_ipaddr_include "192.168.170.0/24"
> 
> (adjusted to your network config).
> 
> Because if I don't use ipaddr_include, then I don't see this issue on my 
> setup.
> 
> Also, did you see these logged:
> 
> Right after starting the job:
> 
> --
> No OpenFabrics connection schemes reported that they were able to be
> used on a specific port.  As such, the openib BTL (OpenFabrics
> support) will be disabled for this port.
> 
>  Local host:   hpc-hn1.ogc.int
>  Local device: cxgb4_0
>  Local port:   2
>  CPCs attempted:   oob, xoob, rdmacm
> --
> ...
> 
> At the end of the job:
> 
> [hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
> help-mpi-btl-openib-cpc-base.txt / no cpcs for port
> 
> 
> I think these are benign, but prolly indicate a bug: the mpirun is 
> restricting the job to use port 1 only, so the CPCs shouldn't be attempting 
> port 2...
> 
> Steve.
> 
> 
>> 
>> On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:
>> 
>>> On 1/28/2013 11:48 AM, Ralph Castain wrote:
 On Jan 28, 2013, at 9:12 AM, Steve Wise  
 wrote:
 
> On 1/25/2013 12:19 PM, Steve Wise wrote:
>> Hello,
>> 
>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on 
>> my chelsio iwarp/rdma setup causes a seg fault every time:
>> 
>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include 
>> "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 
>> pingpong
>> 
>> The segfault is during finalization, and I've debugged this to the point 
>> where I see a call to dereg_mem() after the openib btl is unloaded via 
>> dlclose().  dereg_mem() dereferences a function pointer to call the 
>> btl-specific dereg function, in this case it is openib_dereg_mr().  
>> However, since that btl has already been unloaded, the deref causes a 
>> seg fault.  Happens every time with the above mpi job.
>> 
>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see 
>> the seg fault, and I don't see a call to dereg_mem() after the openib 
>> btl is unloaded.  That's all well and good. :)  But I'd like to get this fix 
>> pushed into 1.6 since that is the current stable release.
>> 
>> Question:  Can someone point me to the fix in 1.7?
>> 
>> Thanks,
>> 
>> Steve.
> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called 
> which unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
> mca_mpool_base_close() is called which ends up calling dereg_mem() which 
> seg faults trying to call into the unloaded openib btl.
> 
 That definitely sounds like a bug
 
> Anybody have thoughts?  Anybody care? :)
 I care! It needs to be fixed - I'll take a look. Probably something that 
 forgot to be cmr'd.
>>> Great!  If you want me to try out a fix or gather more debug, just holler.
>>> 
>>> Thanks,
>>> 
>>> Steve.
>>> 
> 




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise

On 1/28/2013 12:48 PM, Ralph Castain wrote:

Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.


I still hit it on 1.6.4rc2.

Note iWARP != IB so you may not have this issue on IB systems for 
various reasons.  Did you use the same mpirun line? Namely using this:


 --mca btl_openib_ipaddr_include "192.168.170.0/24"

(adjusted to your network config).

Because if I don't use ipaddr_include, then I don't see this issue on my 
setup.


Also, did you see these logged:

Right after starting the job:

--
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:   hpc-hn1.ogc.int
  Local device: cxgb4_0
  Local port:   2
  CPCs attempted:   oob, xoob, rdmacm
--
...

At the end of the job:

[hpc-hn1.ogc.int:07850] 5 more processes have sent help message 
help-mpi-btl-openib-cpc-base.txt / no cpcs for port



I think these are benign, but prolly indicate a bug: the mpirun is 
restricting the job to use port 1 only, so the CPCs shouldn't be 
attempting port 2...


Steve.




On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:


On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:


On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
chelsio iwarp/rdma setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.


That definitely sounds like a bug


Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Great!  If you want me to try out a fix or gather more debug, just holler.

Thanks,

Steve.





Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Barrett, Brian W
On 1/28/13 11:54 AM, "David Beer"  wrote:

> checking for tm_init in -ltorque... no
> configure: error: TM support requested but not found.  Aborting
> 
> Oddly enough, if you have already configured with an older version of TORQUE,
> you can build open-mpi with TORQUE 4.2 installed, so it can find the function
> definitions when compiling; it's just that for some reason it doesn't find them in
> the configure script. This is why I think that something in the configure
> script is assuming that libtorque was compiled with gcc.

Right, the configure output to stdout/stderr isn't very useful in diagnosing
why a test failed.  The config.log file generated by configure will have
much more information.

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories






Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Ralph Castain
I don't see anything in the config script that checks for gcc - you might take 
a look at it to check. It's in config/orte_check_tm.m4 on our developer's trunk

On Jan 28, 2013, at 10:54 AM, David Beer  wrote:

> 
> On Mon, Jan 28, 2013 at 10:54 AM, Barrett, Brian W  wrote:
> 
> We assume that we can link libtorque into a C application (if this is a 
> problem for you, it's a huge deal breaker for us, since OMPI is a C library). 
>  What does config.log say when checking for tm_init?
> 
> Brian
> 
> 
> Brian,
> 
> libtorque can still be linked into C applications. In testing with a simple 
> C program, we did have to add 
> 
> void *__gxx_personality_v0;
> 
> to the C program. Here is the error reported by the configure script:
> 
> checking for pbs-config... /usr/local/bin/pbs-config
> checking tm.h usability... yes
> checking tm.h presence... yes
> checking for tm.h... yes
> checking for tm_finalize... no
> looking for header without includes
> checking tm.h usability... yes
> checking tm.h presence... yes
> checking for tm.h... yes
> looking for library without search path
> checking for tm_init in -lpbs... no
> looking for library in lib
> checking for tm_init in -lpbs... no
> looking for library in lib64
> checking for tm_init in -lpbs... no
> looking for library without search path
> checking for tm_init in -ltorque... no
> looking for library in lib
> checking for tm_init in -ltorque... no
> looking for library in lib64
> checking for tm_init in -ltorque... no
> configure: error: TM support requested but not found.  Aborting
> 
> Oddly enough, if you have already configured with an older version of TORQUE, 
> you can build open-mpi with TORQUE 4.2 installed, so it can find the function 
> definitions when compiling; it's just that for some reason it doesn't find them in 
> the configure script. This is why I think that something in the configure 
> script is assuming that libtorque was compiled with gcc.
> 
> David
>  
> --
>   Brian W. Barrett
>   Scalable System Software Group
>   Sandia National Laboratories
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> David Beer | Senior Software Engineer
> Adaptive Computing
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread David Beer
On Mon, Jan 28, 2013 at 10:54 AM, Barrett, Brian W wrote:
>
>
> We assume that we can link libtorque into a C application (if this is a
> problem for you, it's a huge deal breaker for us, since OMPI is a C
> library).  What does config.log say when checking for tm_init?
>
> Brian
>
>
Brian,

libtorque can still be linked into C applications. In testing with a
simple C program, we did have to add

void *__gxx_personality_v0;

to the C program. Here is the error reported by the configure script:

checking for pbs-config... /usr/local/bin/pbs-config
checking tm.h usability... yes
checking tm.h presence... yes
checking for tm.h... yes
checking for tm_finalize... no
looking for header without includes
checking tm.h usability... yes
checking tm.h presence... yes
checking for tm.h... yes
looking for library without search path
checking for tm_init in -lpbs... no
looking for library in lib
checking for tm_init in -lpbs... no
looking for library in lib64
checking for tm_init in -lpbs... no
looking for library without search path
checking for tm_init in -ltorque... no
looking for library in lib
checking for tm_init in -ltorque... no
looking for library in lib64
checking for tm_init in -ltorque... no
configure: error: TM support requested but not found.  Aborting

Oddly enough, if you have already configured with an older version of
TORQUE, you can build open-mpi with TORQUE 4.2 installed, so it can find
the function definitions when compiling; it's just that for some reason it
doesn't find them in the configure script. This is why I think that
something in the configure script is assuming that libtorque was compiled
with gcc.

David


> --
>   Brian W. Barrett
>   Scalable System Software Group
>   Sandia National Laboratories
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
David Beer | Senior Software Engineer
Adaptive Computing
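
For what it's worth, the probe configure runs here is essentially an
AC_CHECK_LIB-style link test for tm_init using the C compiler. A rough
out-of-tree reproducer (file name and commands are illustrative) can help tell
a missing C++ runtime apart from a genuinely missing symbol:

  /* conftest-tm.c -- approximates what the configure probe links */
  char tm_init ();                 /* C-linkage declaration, no tm.h */
  int main (void) { return tm_init (); }

Linked two ways:

  cc conftest-tm.c -ltorque             # roughly what the configure probe does
  cc conftest-tm.c -ltorque -lstdc++    # does adding the C++ runtime fix it?

If the first link fails with an undefined __gxx_personality_v0 (or a mangled
tm_init), config.log will show the same failure, and the cure is either
extern "C" on the tm_*() declarations or having libtorque.so itself depend on
the C++ runtime, rather than teaching Open MPI's configure about g++.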


Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Ralph Castain
Hmmm...afraid I cannot replicate this using the current state of the 1.6 branch 
(which is the 1.6.4rcN) on the only IB-based cluster I can access.

Can you try it with a 1.6.4 tarball and see if you still see the problem? Could 
be someone already fixed it.


On Jan 28, 2013, at 10:03 AM, Steve Wise  wrote:

> On 1/28/2013 11:48 AM, Ralph Castain wrote:
>> On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:
>> 
>>> On 1/25/2013 12:19 PM, Steve Wise wrote:
 Hello,
 
 I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
 chelsio iwarp/rdma setup causes a seg fault every time:
 
 /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
 --mca btl openib,sm,self --mca btl_openib_ipaddr_include 
 "192.168.170.0/24" /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 
 pingpong
 
 The segfault is during finalization, and I've debugged this to the point 
 where I see a call to dereg_mem() after the openib btl is unloaded via 
 dlclose().  dereg_mem() dereferences a function pointer to call the 
 btl-specific dereg function, in this case it is openib_dereg_mr().  
 However, since that btl has already been unloaded, the deref causes a seg 
 fault.  Happens every time with the above mpi job.
 
 Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the 
 seg fault, and I don't see a call to dereg_mem() after the openib btl is 
 unloaded.  That's all well and good. :)  But I'd like to get this fix pushed 
 into 1.6 since that is the current stable release.
 
 Question:  Can someone point me to the fix in 1.7?
 
 Thanks,
 
 Steve.
>>> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called 
>>> which unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
>>> mca_mpool_base_close() is called which ends up calling dereg_mem() which 
>>> seg faults trying to call into the unloaded openib btl.
>>> 
>> That definitely sounds like a bug
>> 
>>> Anybody have thoughts?  Anybody care? :)
>> I care! It needs to be fixed - I'll take a look. Probably something that 
>> forgot to be cmr'd.
> 
> Great!  If you want me to try out a fix or gather more debug, just holler.
> 
> Thanks,
> 
> Steve.
> 




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise

On 1/28/2013 11:48 AM, Ralph Castain wrote:

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:


On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
chelsio iwarp/rdma setup causes a seg fault every time:

/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 --mca btl 
openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong

The segfault is during finalization, and I've debugged this to the point where I 
see a call to dereg_mem() after the openib btl is unloaded via dlclose().  
dereg_mem() dereferences a function pointer to call the btl-specific dereg 
function, in this case it is openib_dereg_mr().  However, since that btl has 
already been unloaded, the deref causes a seg fault.  Happens every time with 
the above mpi job.

Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the seg 
fault, and I don't see a call to dereg_mem() after the openib btl is unloaded.  
That's all well and good. :)  But I'd like to get this fix pushed into 1.6 since 
that is the current stable release.

Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.

It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
faults trying to call into the unloaded openib btl.


That definitely sounds like a bug


Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.


Great!  If you want me to try out a fix or gather more debug, just holler.

Thanks,

Steve.



Re: [OMPI devel] [EXTERNAL] Open MPI Configure Script

2013-01-28 Thread Barrett, Brian W
On 1/28/13 10:50 AM, "David Beer"  wrote:

> By way of introduction, I'm a TORQUE developer and I probably should've joined
> this list - even if only to keep myself informed - years ago.
> 
> At any rate, we're in the process of changing TORQUE so that it compiles using
> g++ instead of gcc. We're starting to use some C++ constructs to make our
> lives easier. In doing this, we've noticed that OpenMPI doesn't like TORQUE's
> libraries when built by g++: it fails at configure time, claiming it can't
> find tm_init() in libtorque.so. I've been trying to track down exactly why,
> and where I am now is making me think that something in the configure script
> is assuming that TORQUE's libraries are compiled by gcc. Is there someone who
> could advise me on how to resolve this issue?

We assume that we can link libtorque into a C application (if this is a
problem for you, it's a huge deal breaker for us, since OMPI is a C
library).  What does config.log say when checking for tm_init?

Brian

--
  Brian W. Barrett
  Scalable System Software Group
  Sandia National Laboratories






[OMPI devel] Open MPI Configure Script

2013-01-28 Thread David Beer
All,

By way of introduction, I'm a TORQUE developer and I probably should've
joined this list - even if only to keep myself informed - years ago.

At any rate, we're in the process of changing TORQUE so that it compiles
using g++ instead of gcc. We're starting to use some C++ constructs to make
our lives easier. In doing this, we've noticed that OpenMPI doesn't like
TORQUE's libraries when built by g++: it fails at configure time,
claiming it can't find tm_init() in libtorque.so. I've been trying to track
down exactly why, and where I am now is making me think that something in
the configure script is assuming that TORQUE's libraries are compiled by
gcc. Is there someone who could advise me on how to resolve this issue?

-- 
David Beer | Senior Software Engineer
Adaptive Computing


Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Ralph Castain

On Jan 28, 2013, at 9:12 AM, Steve Wise  wrote:

> On 1/25/2013 12:19 PM, Steve Wise wrote:
>> Hello,
>> 
>> I'm tracking an issue I see in openmpi-1.6.3.  Running this command on my 
>> chelsio iwarp/rdma setup causes a seg fault every time:
>> 
>> /usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host hpc-hn1,hpc-cn2 
>> --mca btl openib,sm,self --mca btl_openib_ipaddr_include "192.168.170.0/24" 
>> /usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong
>> 
>> The segfault is during finalization, and I've debugged this to the point 
>> where I see a call to dereg_mem() after the openib btl is unloaded via 
>> dlclose().  dereg_mem() dereferences a function pointer to call the 
>> btl-specific dereg function, in this case it is openib_dereg_mr().  However, 
>> since that btl has already been unloaded, the deref causes a seg fault.  
>> Happens every time with the above mpi job.
>> 
>> Now, I tried this same experiment with openmpi-1.7rc6 and I don't see the 
>> seg fault, and I don't see a call to dereg_mem() after the openib btl is 
>> unloaded.  That's all well and good. :)  But I'd like to get this fix pushed 
>> into 1.6 since that is the current stable release.
>> 
>> Question:  Can someone point me to the fix in 1.7?
>> 
>> Thanks,
>> 
>> Steve.
> 
> It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called which 
> unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
> mca_mpool_base_close() is called which ends up calling dereg_mem() which seg 
> faults trying to call into the unloaded openib btl.
> 

That definitely sounds like a bug

> Anybody have thoughts?  Anybody care? :)

I care! It needs to be fixed - I'll take a look. Probably something that forgot 
to be cmr'd.

Thanks
Ralph

> 
> Steve.
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] openib unloaded before last mem dereg

2013-01-28 Thread Steve Wise

On 1/25/2013 12:19 PM, Steve Wise wrote:

Hello,

I'm tracking an issue I see in openmpi-1.6.3.  Running this command on 
my chelsio iwarp/rdma setup causes a seg fault every time:


/usr/mpi/gcc/openmpi-1.6.3-dbg/bin/mpirun --np 2 --host 
hpc-hn1,hpc-cn2 --mca btl openib,sm,self --mca 
btl_openib_ipaddr_include "192.168.170.0/24" 
/usr/mpi/gcc/openmpi-1.6.3/tests/IMB-3.2/IMB-MPI1 pingpong


The segfault is during finalization, and I've debugged this to the 
point where I see a call to dereg_mem() after the openib btl is 
unloaded via dlclose().  dereg_mem() dereferences a function pointer 
to call the btl-specific dereg function, in this case it is 
openib_dereg_mr().  However, since that btl has already been unloaded, 
the deref causes a seg fault.  Happens every time with the above mpi job.


Now, I tried this same experiment with openmpi-1.7rc6 and I don't see 
the seg fault, and I don't see a call to dereg_mem() after the openib 
btl is unloaded.  That's all well and good. :)  But I'd like to get this 
fix pushed into 1.6 since that is the current stable release.


Question:  Can someone point me to the fix in 1.7?

Thanks,

Steve.


It appears that in ompi_mpi_finalize(), mca_pml_base_close() is called 
which unloads the openib btl.  Then further down in ompi_mpi_finalize(), 
mca_mpool_base_close() is called which ends up calling dereg_mem() which 
seg faults trying to call into the unloaded openib btl.


Anybody have thoughts?  Anybody care? :)

Steve.



Re: [OMPI devel] Open MPI (not quite) on Cray XC30

2013-01-28 Thread Ralph Castain
The key was to --enable-static --disable-shared. That's the only way to 
generate the problem.

Brian was already aware of it and fixed it this weekend. I tested the fix and 
it works fine. Waiting for Jeff to review it before committing to the trunk.
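
For anyone else trying to reproduce this, a sketch of the recipe as pieced
together from the thread (CC/CXX/FC left unset per the discussion below; the
--prefix and test program follow Paul's earlier report and are assumptions):

  ./configure --prefix=$PWD/INSTALL --enable-static --disable-shared \
      --with-platform=lanl/cray_xe6/optimized-nopanasas
  make all install
  ./INSTALL/bin/mpicc -o ring_c examples/ring_c.c   # the link step where the
                                                    # missing libs_static entries bite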


On Jan 28, 2013, at 7:45 AM, Nathan Hjelm  wrote:

> Try building static. Lots of errors due to missing libraries in libs_static.
> 
> -Nathan
> 
> On Fri, Jan 25, 2013 at 04:09:16PM -0800, Ralph Castain wrote:
>> FWIW: I can build it fine without setting any of the CC... flags on LANL's 
>> Cray XE6, and mpicc worked just fine for me once built that way.
>> 
>> So I'm not quite sure I understand the "mpicc is completely borked in the 
>> trunk". Can you elaborate?
>> 
>> On Jan 25, 2013, at 3:59 PM, Paul Hargrove  wrote:
>> 
>>> Nathan,
>>> 
>>> The 2nd and 3rd non-blank lines of my original post:
>>> Given that it is INTENDED to be API-compatible with the XE series, I began 
>>> configuring with
>>>CC=cc CXX=CC FC=ftn --with-platform=lanl/cray_xe6/optimized-nopanasas
>>> 
>>> So, I am surprised that nobody objected before now to my use of the 
>>> Cray-provided wrapper compilers.
>>> I mistakenly believed that if I don't use them then I wouldn't get through 
>>> configure w/ ugni and alps support.
>>> However, I've just now completed configure w/o setting CC, CXX, FC and see 
>>> the expected components.
>>> I'll report more from this build later ("make all" is running now).
>>> 
>>> I would appreciate (perhaps off-list) receiving any module or platform file 
>>> or additional instructions that may be appropriate for building on a Cray XE, 
>>> XK or XC system.
>>> 
>>> Getting OMPI running on our XC30 is of exactly ZERO importance beyond my 
>>> own edification.
>>> So, I am likely to stop fighting this battle soon.
>>> 
>>> -Paul
>>> 
>>> 
>>> On Fri, Jan 25, 2013 at 3:21 PM, Nathan Hjelm  wrote:
>>> Hmm, I see mpicc in there. It will use the compiler directly instead of 
>>> Cray's wrappers. We didn't want Open MPI's wrapper linking in MPT after all 
>>> ;). mpicc is completely borked in the trunk.
>>> 
>>> If you want to use the Cray wrappers with Open MPI I can give you a module 
>>> file that sets up the environment correctly (link against -lmpi not 
>>> -lmpich, etc).
>>> 
>>> -Nathan
>>> 
>>> On Fri, Jan 25, 2013 at 03:10:37PM -0800, Paul Hargrove wrote:
 Nathan,
 
 Cray's "cc" wrapper is adding xpmem, ugni, pmi, alps and others already:
 
 $ cc -v hello.c 2>&1 | grep collect
> /opt/gcc/4.7.2/snos/libexec/gcc/x86_64-suse-linux/4.7.2/collect2
> --sysroot= -m elf_x86_64 -static -u pthread_mutex_trylock -u
> pthread_mutex_destroy -u pthread_create /usr/lib/../lib64/crt1.o
> /usr/lib/../lib64/crti.o
> /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtbeginT.o
> -L/opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64
> -L/opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64
> -L/opt/cray/pmi/4.0.0-1..9282.69.4.ari/lib64
> -L/opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64
> -L/opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64
> -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> -L/opt/cray/rca/1.0.0-2.0500.37705.3.12.ari/lib64
> -L/opt/cray/mpt/5.6.0/gni/mpich2-gnu/47/lib
> -L/opt/cray/mpt/5.6.0/gni/sma/lib64
> -L/opt/cray/libsci/12.0.00/gnu/47/sandybridge/lib
> -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2
> -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../../../lib64
> -L/lib/../lib64 -L/usr/lib/../lib64
> -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../..
> /scratch1/scratchdirs/hargrove/ccQ1f0sx.o -lrca -L/opt/cray/atp/1.6.0/lib/
> --undefined=_ATP_Data_Globals --undefined=__atpHandlerInstall
> -lAtpSigHCommData -lAtpSigHandler --start-group -lgfortran -lscicpp_gnu
> -lsci_gnu_mp -lstdc++ -lgfortran -lmpich_gnu_47 -lmpl -lrt -lsma -lxpmem
> -ldmapp -lugni -lpmi -lalpslli -lalpsutil -lalps -ludreg -lpthread -lm
> --end-group -lgomp -lpthread --start-group -lgcc -lgcc_eh -lc --end-group
> /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtend.o
> /usr/lib/../lib64/crtn.o
 
 
 -Paul
 
 
 On Fri, Jan 25, 2013 at 2:46 PM, Nathan Hjelm  wrote:
 
> Something is wrong with the wrappers. A number of libraries (-lxpmem,
> -lugni, etc) are missing from libs_static. Might be a similar issue with 
> the
> missing -llustreapi. Going to create a critical bug to track this issue.
> 
> Works in 1.7 :-/ ... If you add -lnuma to libs_static in
> mpicc-wrapper-data.txt.
> 
> -Nathan
> HPC-3, LANL
> 
> On Fri, Jan 25, 2013 at 02:13:41PM -0800, Paul Hargrove wrote:
>> Still having problems on the Cray XC30, but now they are when linking an
>> MPI app:
>> 
>> $ ./INSTALL/bin/mpicc -o ring_c 

Re: [OMPI devel] Open MPI (not quite) on Cray XC30

2013-01-28 Thread Nathan Hjelm
Try building static. Lots of errors due to missing libraries in libs_static.

-Nathan
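
For reference, the static link line the wrapper produces comes from the
libs_static entry in the wrapper data file (share/openmpi/mpicc-wrapper-data.txt,
mentioned further down this thread). A hedged sketch of the kind of entry being
discussed -- the key names follow that file's format, and the -l list is only
illustrative, drawn from the Cray libraries named in this thread:

  # share/openmpi/mpicc-wrapper-data.txt (excerpt, illustrative values)
  libs=-lmpi -lopen-rte -lopen-pal -lm
  libs_static=-lmpi -lopen-rte -lopen-pal -lm -lnuma -lxpmem -ldmapp -lugni -lpmi -lalpslli -lalpsutil -lalps -ludreg -llustreapi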

On Fri, Jan 25, 2013 at 04:09:16PM -0800, Ralph Castain wrote:
> FWIW: I can build it fine without setting any of the CC... flags on LANL's 
> Cray XE6, and mpicc worked just fine for me once built that way.
> 
> So I'm not quite sure I understand the "mpicc is completely borked in the 
> trunk". Can you elaborate?
> 
> On Jan 25, 2013, at 3:59 PM, Paul Hargrove  wrote:
> 
> > Nathan,
> > 
> > The 2nd and 3rd non-blank lines of my original post:
> > Given that it is INTENDED to be API-compatible with the XE series, I began 
> > configuring with
> > CC=cc CXX=CC FC=ftn --with-platform=lanl/cray_xe6/optimized-nopanasas
> > 
> > So, I am surprised that nobody objected before now to my use of the 
> > Cray-provided wrapper compilers.
> > I mistakenly believed that if I don't use them then I wouldn't get through 
> > configure w/ ugni and alps support.
> > However, I've just now completed configure w/o setting CC, CXX, FC and see 
> > the expected components.
> > I'll report more from this build later ("make all" is running now).
> > 
> > I would appreciate (perhaps off-list) receiving any module or platform file 
> > or additional instructions that may be appropriate for building on a Cray XE, 
> > XK or XC system.
> > 
> > Getting OMPI running on our XC30 is of exactly ZERO importance beyond my 
> > own edification.
> > So, I am likely to stop fighting this battle soon.
> > 
> > -Paul
> > 
> > 
> > On Fri, Jan 25, 2013 at 3:21 PM, Nathan Hjelm  wrote:
> > Hmm, I see mpicc in there. It will use the compiler directly instead of 
> > Cray's wrappers. We didn't want Open MPI's wrapper linking in MPT after all 
> > ;). mpicc is completely borked in the trunk.
> > 
> > If you want to use the Cray wrappers with Open MPI I can give you a module 
> > file that sets up the environment correctly (link against -lmpi not 
> > -lmpich, etc).
> > 
> > -Nathan
> > 
> > On Fri, Jan 25, 2013 at 03:10:37PM -0800, Paul Hargrove wrote:
> > > Nathan,
> > >
> > > Cray's "cc" wrapper is adding xpmem, ugni, pmi, alps and others already:
> > >
> > > $ cc -v hello.c 2>&1 | grep collect
> > > >  /opt/gcc/4.7.2/snos/libexec/gcc/x86_64-suse-linux/4.7.2/collect2
> > > > --sysroot= -m elf_x86_64 -static -u pthread_mutex_trylock -u
> > > > pthread_mutex_destroy -u pthread_create /usr/lib/../lib64/crt1.o
> > > > /usr/lib/../lib64/crti.o
> > > > /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtbeginT.o
> > > > -L/opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64
> > > > -L/opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64
> > > > -L/opt/cray/pmi/4.0.0-1..9282.69.4.ari/lib64
> > > > -L/opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64
> > > > -L/opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64
> > > > -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> > > > -L/opt/cray/rca/1.0.0-2.0500.37705.3.12.ari/lib64
> > > > -L/opt/cray/mpt/5.6.0/gni/mpich2-gnu/47/lib
> > > > -L/opt/cray/mpt/5.6.0/gni/sma/lib64
> > > > -L/opt/cray/libsci/12.0.00/gnu/47/sandybridge/lib
> > > > -L/opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64
> > > > -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2
> > > > -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../../../lib64
> > > > -L/lib/../lib64 -L/usr/lib/../lib64
> > > > -L/opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/../../..
> > > > /scratch1/scratchdirs/hargrove/ccQ1f0sx.o -lrca 
> > > > -L/opt/cray/atp/1.6.0/lib/
> > > > --undefined=_ATP_Data_Globals --undefined=__atpHandlerInstall
> > > > -lAtpSigHCommData -lAtpSigHandler --start-group -lgfortran -lscicpp_gnu
> > > > -lsci_gnu_mp -lstdc++ -lgfortran -lmpich_gnu_47 -lmpl -lrt -lsma -lxpmem
> > > > -ldmapp -lugni -lpmi -lalpslli -lalpsutil -lalps -ludreg -lpthread -lm
> > > > --end-group -lgomp -lpthread --start-group -lgcc -lgcc_eh -lc 
> > > > --end-group
> > > > /opt/gcc/4.7.2/snos/lib/gcc/x86_64-suse-linux/4.7.2/crtend.o
> > > > /usr/lib/../lib64/crtn.o
> > >
> > >
> > > -Paul
> > >
> > >
> > > On Fri, Jan 25, 2013 at 2:46 PM, Nathan Hjelm  wrote:
> > >
> > > > Something is wrong with the wrappers. A number of libraries (-lxpmem,
> > > > -lugni, etc) are missing from libs_static. Might be a similar issue 
> > > > with the
> > > > missing -llustreapi. Going to create a critical bug to track this issue.
> > > >
> > > > Works in 1.7 :-/ ... If you add -lnuma to libs_static in
> > > > mpicc-wrapper-data.txt.
> > > >
> > > > -Nathan
> > > > HPC-3, LANL
> > > >
> > > > On Fri, Jan 25, 2013 at 02:13:41PM -0800, Paul Hargrove wrote:
> > > > > Still having problems on the Cray XC30, but now they are when linking 
> > > > > an
> > > > > MPI app:
> > > > >
> > > > > $ ./INSTALL/bin/mpicc -o ring_c examples/ring_c.c
> > > > > > fs_lustre_file_open.c:(.text+0x130): undefined reference to
> > > > > > `llapi_file_create'
> > > > > > fs_lustre_file_open.c:(.text+0x17e): undefined reference to
> > > > > > `llapi_file_get_stripe'
> 

Re: [OMPI devel] New ARM patch

2013-01-28 Thread Leif Lindholm

On 26/01/13 00:05, Jeff Squyres (jsquyres) wrote:

Here's what I have done:

1. Committed your patch to v1.6.  George's patch was not committed to
v1.6.


Many thanks.


2. I opened https://svn.open-mpi.org/trac/ompi/ticket/3481 to track
your proposal of re-implementing/revamping the ARM ASM code.

Do you have a timeline for when that can be done?


As I have mentioned to you off list, I have (very) recently been 
seconded into the Linaro Enterprise Group, focusing on improving the
ARM server software ecosystem.
As such, I am potentially in a bit of legal limbo, until Linaro signs
a contribution agreement. This is however in the works.
But giving some flexibility for roadblocks, can we say "this quarter"?


3. Since no one is currently MTT testing Open MPI on ARM, I added the
following statement in the v1.6 README file under "Other systems have
been lightly (but not fully tested):"

- ARM4, ARM5, ARM6, ARM7 (when using non-inline assembly; only ARM7
is fully supported when -DOMPI_DISABLE_INLINE_ASM is used).

--> Is this correct?


Apart from our *cough* convoluted architecture vs. processor naming 
scheme... It should be ARMv4, ARMv5, ARMv6 and ARMv7.
(since ARM4 and ARM5 were skipped and ARM6 and ARM7 were processors 
implementing the ARMv3 and ARMv4 architectures :)



--> Do you think you'll be able to setup some MTT on ARM platforms?


I hope so.


4. I also added the following to v1.6 NEWS:

- Automatically provide compiler flags that compile properly on some
types of ARM systems.


Sounds good.

/
Leif