Re: [OMPI users] Bad Infiniband latency with subounce

2010-02-15 Thread Ralph Castain

On Feb 15, 2010, at 8:44 PM, Terry Frankcombe wrote:

> On Mon, 2010-02-15 at 20:18 -0700, Ralph Castain wrote:
>> Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
>> set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
>> improved performance.
>> 
>> IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.
> 
> 
> Is this sensible?  Won't most users want processes bound?  OMPI's
> supposed to "do the right thing" out of the box, right?

Well, that depends on how you look at it. It's been the subject of a lot of debate 
within the devel community. If you bind by default and it is a shared-node 
cluster, then you can really mess people up. On the other hand, if you don't 
bind by default, then people who run benchmarks without looking at the options 
can get bad numbers. Unfortunately, there is no automated way to tell whether a 
cluster is configured for shared use or dedicated nodes.

I honestly don't know that "most users want processes bound". One installation 
I was at set binding by default using the system mca param file, and got yelled 
at by a group of users that had threaded apps - and most definitely did -not- 
want their processes bound. After a while, it became clear that nothing we 
could do would make everyone happy :-/

I doubt there is a right/wrong answer - at least, we sure can't find one. So we 
don't bind by default, in order to "do no harm", and put out FAQs, man pages, 
mpirun option help messages, etc. that explain the situation and tell you 
when/how to bind.

> 
> 
> 




Re: [OMPI users] Bad Infiniband latency with subounce

2010-02-15 Thread Terry Frankcombe
On Mon, 2010-02-15 at 20:18 -0700, Ralph Castain wrote:
> Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
> set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
> improved performance.
> 
> IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.


Is this sensible?  Won't most users want processes bound?  OMPI's
supposed to "do the right thing" out of the box, right?





Re: [OMPI users] Bad Infiniband latency with subounce

2010-02-15 Thread Ralph Castain
Did you run it with -mca mpi_paffinity_alone 1? Given this is 1.4.1, you can 
set the bindings to -bind-to-socket or -bind-to-core. Either will give you 
improved performance.

IIRC, MVAPICH defaults to -bind-to-socket. OMPI defaults to no binding.


On Feb 15, 2010, at 6:51 PM, Repsher, Stephen J wrote:

> Hello again,
> 
> Hopefully this is an easier question
> 
> My cluster uses Infiniband interconnects (Mellanox Infinihost III and some 
> ConnectX).  I'm seeing terrible and sporadic latency (order ~1000 
> microseconds)  as measured by the subounce code 
> (http://sourceforge.net/projects/subounce/), but the bandwidth is as 
> expected.  I'm used to seeing only 1-2 microseconds with MVAPICH and 
> wondering why OpenMPI either isn't performing as well or doesn't play well 
> with how bounce is measuring latency (by timing 0 byte messages).  I've tried 
> to play with a few parameters with no success.  Here's how the build is 
> configured:
> 
> myflags="-O3 -xSSE2"
> ./configure --prefix=/part0/apps/MPI/intel/openmpi-1.4.1 \
>--disable-dlopen --with-wrapper-ldflags="-shared-intel" \
>--enable-orterun-prefix-by-default \
>--with-openib --enable-openib-connectx-xrc --enable-openib-rdmacm \
>CC=icc CXX=icpc F77=ifort FC=ifort \
>    CFLAGS="$myflags" FFLAGS="$myflags" CXXFLAGS="$myflags" FCFLAGS="$myflags" \
>OBJC=gcc OBJCFLAGS="-O3"
> Any ideas?
> 
> Thanks,
> Steve
> 
> 




[OMPI users] Bad Infiniband latency with subounce

2010-02-15 Thread Repsher, Stephen J
Hello again,

Hopefully this is an easier question

My cluster uses Infiniband interconnects (Mellanox Infinihost III and some 
ConnectX).  I'm seeing terrible and sporadic latency (order ~1000 microseconds) 
 as measured by the subounce code (http://sourceforge.net/projects/subounce/), 
but the bandwidth is as expected.  I'm used to seeing only 1-2 microseconds 
with MVAPICH and wondering why OpenMPI either isn't performing as well or 
doesn't play well with how bounce is measuring latency (by timing 0 byte 
messages).  I've tried to play with a few parameters with no success.  Here's 
how the build is configured:

myflags="-O3 -xSSE2"
./configure --prefix=/part0/apps/MPI/intel/openmpi-1.4.1 \
--disable-dlopen --with-wrapper-ldflags="-shared-intel" \
--enable-orterun-prefix-by-default \
--with-openib --enable-openib-connectx-xrc --enable-openib-rdmacm \
CC=icc CXX=icpc F77=ifort FC=ifort \
CFLAGS="$myflags" FFLAGS="$myflags" CXXFLAGS="$myflags" 
FCFLAGS="$myflags" \
OBJC=gcc OBJCFLAGS="-O3"
Any ideas?
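
In case it helps, the 0-byte latency number essentially comes from a two-rank 
ping-pong; a stripped-down sketch of the idea (not the actual subounce source, 
just a generic illustration) is:

#include <mpi.h>
#include <stdio.h>

/* 0-byte ping-pong between ranks 0 and 1; reports half the average
 * round-trip time.  Illustrative only - not the subounce code itself. */
int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    char buf[1];
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 0, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 0, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("0-byte latency ~ %.2f usec\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}

Built with mpicc and run across two nodes, if something like this also shows 
~1000 microseconds then the issue is presumably in which BTL or binding gets 
picked rather than in the benchmark itself.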

Thanks,
Steve




Re: [OMPI users] Seg fault with PBS Pro 10.2

2010-02-15 Thread Ralph Castain
Could you please ask them about this:

OMPI makes the following call to connect to the mother superior:

struct tm_roots tm_root;
ret = tm_init(NULL, &tm_root);

Could they tell us why this segfaults in PBS Pro? It works correctly with all 
releases of Torque.
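
If it helps them reproduce it outside of Open MPI, a minimal standalone sketch of 
the call would be something like the following (the TM_SUCCESS macro and tm_nnodes 
field are as they appear in the Torque tm.h we have; their header may differ):

#include <stdio.h>
#include "tm.h"   /* the PBS Pro / Torque task-management header */

int main(void)
{
    struct tm_roots tm_root;
    int ret = tm_init(NULL, &tm_root);   /* the same call mpirun makes */

    if (ret != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed with %d\n", ret);
        return 1;
    }
    printf("tm_init succeeded, tm_nnodes = %d\n", tm_root.tm_nnodes);
    tm_finalize();
    return 0;
}

Compiled against their tm.h/libpbs and run from inside a multi-node job, this 
would isolate whether tm_init itself is what segfaults.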

Thanks
Ralph

On Feb 15, 2010, at 12:06 PM, Joshua Bernstein wrote:

> Well,
> 
>   We all wish the Altair guys would at least try to maintain backwards 
> compatibility with the community, but they have a big habit of breaking 
> things. This isn't the first time they've broken a more customer-facing 
> function like tm_spawn. (They also like breaking pbs_statjob too!)
> 
>   I have access to PBS Pro and I can raise the issue with Altair if it 
> would help. Just let me know how I can be helpful.
> 
> -Joshua Bernstein
> Senior Software Engineer
> Penguin Computing
> 
> On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:
> 
>> Bummer!
>> 
>> If it helps, could you put us in touch with the PBS Pro people?  We usually 
>> only have access to Torque when developing the TM-launching stuff (PBS Pro 
>> and Torque supposedly share the same TM interface, but we don't have access 
>> to PBS Pro, so we don't know if it has diverged over time).
>> 
>> 
>> On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:
>> 
>>> Ralph,
>>> 
>>> This is my first build of OpenMPI so I haven't had this working before.  
>>> I'm pretty confident that PATH and LD_LIBRARY_PATH issues are not the 
>>> cause, otherwise launches outside of PBS would fail too.  Also, I tried 
>>> compiling everything statically with the same result.
>>> 
>>> Some additional info...  (1) I did a diff on tm.h for PBS 10.2 and from 
>>> version 8.0 that we had - they are identical, and (2) I've tried this with 
>>> both the Intel 11.1 and GCC compilers and gotten the exact same run-time 
>>> errors.
>>> 
>>> For now, I've got a work-around setup that launches over ssh and still 
>>> attaches the processes to PBS.
>>> 
>>> Thanks for your help.
>>> 
>>> Steve
>>> 
>>> 
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of Ralph Castain
>>> Sent: Friday, February 12, 2010 8:29 PM
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>>> 
>>> Afraid compilers don't help when the param is a void*...
>>> 
>>> It looks like this is consistent, but I've never tried it under that 
>>> particular environment. Did prior versions of OMPI work, or are you trying 
>>> this for the first time?
>>> 
>>> One thing you might check is that you have the correct PATH and 
>>> LD_LIBRARY_PATH set to point to this version of OMPI and the corresponding 
>>> PBS Pro libs you used to build it. Most Linux distros come with OMPI 
>>> installed, and that can cause surprises.
>>> 
>>> We run under Torque at major installations every day, so it -should- 
>>> work...unless PBS Pro has done something unusual.
>>> 
>>> 
>>> On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
>>> 
 Yes, the failure seems to be in mpirun, it never even gets to my 
 application.
 
 The proto for tm_init looks like this:
 int tm_init(void *info, struct tm_roots *roots);
 
 where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id
 
 If the API was different, wouldn't the compiler most likely generate an 
 error at compile-time?
 
 Thanks!
 
 Steve
 
 
 From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
 Behalf Of Ralph Castain
 Sent: Friday, February 12, 2010 3:21 PM
 To: Open MPI Users
 Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
 
 I'm a tad confused - this trace would appear to indicate that mpirun is 
 failing, yes? Not your application?
 
 The reason it works for local procs is that tm_init isn't called for that 
 case - mpirun just fork/exec's the procs directly. When remote nodes are 
 required, mpirun must connect to Torque. This is done with a call to:
 
   ret = tm_init(NULL, &tm_root);
 
 My guess is that something changed in PBS Pro 10.2 to that API. Can you 
 check the tm header file and see? I have no access to PBS any more, so 
 I'll have to rely on your eyes to see a diff.
 
 Thanks
 Ralph
 
 On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
 
> Hello,
> 
> I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've 
> configured and built OpenMPI 1.4.1 with the Intel 11.1 compiler on Linux 
> and with --with-tm support and the build runs fine.  I've also built with 
> static libraries per the FAQ suggestion since libpbs is static.  However, 
> my test application keeps failing with a segmentation fault, but ONLY when 
> trying to select more than 1 node.  Running on a single node within PBS 
> works fine.  Also, running outside of PBS via ssh runs fine as well, even 
> across multiple nodes.  OpenIB 

Re: [OMPI users] Seg fault with PBS Pro 10.2

2010-02-15 Thread Joshua Bernstein

Well,

	We all wish the Altair guys would at least try to maintain backwards 
compatibility with the community, but they have a big habit of breaking 
things. This isn't the first time they've broken a more customer-facing 
function like tm_spawn. (They also like breaking pbs_statjob too!)


	I have access to PBS Pro and I can raise the issue with Altair if it  
would help. Just let me know how I can be helpful.


-Joshua Bernstein
Senior Software Engineer
Penguin Computing

On Feb 15, 2010, at 8:23 AM, Jeff Squyres wrote:


Bummer!

If it helps, could you put us in touch with the PBS Pro people?  We  
usually only have access to Torque when developing the TM-launching  
stuff (PBS Pro and Torque supposedly share the same TM interface,  
but we don't have access to PBS Pro, so we don't know if it has  
diverged over time).



On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:


Ralph,

This is my first build of OpenMPI so I haven't had this working  
before.  I'm pretty confident that PATH and LD_LIBRARY_PATH issues  
are not the cause, otherwise launches outside of PBS would fail  
too.  Also, I tried compiling everything statically with the same  
result.


Some additional info...  (1) I did a diff on tm.h for PBS 10.2 and  
from version 8.0 that we had - they are identical, and (2) I've  
tried this with both the Intel 11.1 and GCC compilers and gotten  
the exact same run-time errors.


For now, I've got a work-around setup that launches over ssh and 
still attaches the processes to PBS.


Thanks for your help.

Steve


From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain

Sent: Friday, February 12, 2010 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

Afraid compilers don't help when the param is a void*...

It looks like this is consistent, but I've never tried it under  
that particular environment. Did prior versions of OMPI work, or  
are you trying this for the first time?


One thing you might check is that you have the correct PATH and  
LD_LIBRARY_PATH set to point to this version of OMPI and the  
corresponding PBS Pro libs you used to build it. Most Linux distros  
come with OMPI installed, and that can cause surprises.


We run under Torque at major installations every day, so it -should- 
work...unless PBS Pro has done something unusual.



On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:

Yes, the failure seems to be in mpirun, it never even gets to my  
application.


The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);

where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x  
tm_task_id


If the API was different, wouldn't the compiler most likely  
generate an error at compile-time?


Thanks!

Steve


From: users-boun...@open-mpi.org [mailto:users-bounces@open-mpi.org] On Behalf Of Ralph Castain

Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

I'm a tad confused - this trace would appear to indicate that  
mpirun is failing, yes? Not your application?


The reason it works for local procs is that tm_init isn't called  
for that case - mpirun just fork/exec's the procs directly. When  
remote nodes are required, mpirun must connect to Torque. This is  
done with a call to:


   ret = tm_init(NULL, &tm_root);

My guess is that something changed in PBS Pro 10.2 to that API.  
Can you check the tm header file and see? I have no access to 
PBS any more, so I'll have to rely on your eyes to see a diff.


Thanks
Ralph

On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:


Hello,

I'm having problems running Open MPI jobs under PBS Pro 10.2.   
I've configured and built OpenMPI 1.4.1 with the Intel 11.1  
compiler on Linux and with --with-tm support and the build runs  
fine.  I've also built with static libraries per the FAQ  
suggestion since libpbs is static.  However, my test application  
keeps failing with a segmentation fault, but ONLY when trying to 
select more than 1 node.  Running on a single node within PBS 
works fine.  Also, running outside of PBS via ssh runs fine as 
well, even across multiple nodes.  OpenIB support is also  
enabled, but that doesn't seem to affect the error because I've  
also tried running with the --mca btl tcp,self flag and it still  
doesn't work.  Here is the error I'm getting:


[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
[n34:26892] [ 4] /part0/apps/MPI/intel/openmpi-1.4.1/bin/ 

Re: [OMPI users] Seg fault with PBS Pro 10.2

2010-02-15 Thread Jeff Squyres
Bummer!

If it helps, could you put us in touch with the PBS Pro people?  We usually 
only have access to Torque when developing the TM-launching stuff (PBS Pro and 
Torque supposedly share the same TM interface, but we don't have access to PBS 
Pro, so we don't know if it has diverged over time).


On Feb 15, 2010, at 8:13 AM, Repsher, Stephen J wrote:

> Ralph,
>  
> This is my first build of OpenMPI so I haven't had this working before.  I'm 
> pretty confident that PATH and LD_LIBRARY_PATH issues are not the cause, 
> otherwise launches outside of PBS would fail too.  Also, I tried compiling 
> everything statically with the same result.
>  
> Some additional info...  (1) I did a diff on tm.h for PBS 10.2 and from 
> version 8.0 that we had - they are identical, and (2) I've tried this with 
> both the Intel 11.1 and GCC compilers and gotten the exact same run-time 
> errors.
>  
> For now, I've got a work-around setup that launches over ssh and still 
> attaches the processes to PBS.
>  
> Thanks for your help.
>  
> Steve
>  
> 
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Ralph Castain
> Sent: Friday, February 12, 2010 8:29 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
> 
> Afraid compilers don't help when the param is a void*...
> 
> It looks like this is consistent, but I've never tried it under that 
> particular environment. Did prior versions of OMPI work, or are you trying 
> this for the first time?
> 
> One thing you might check is that you have the correct PATH and 
> LD_LIBRARY_PATH set to point to this version of OMPI and the corresponding 
> PBS Pro libs you used to build it. Most Linux distros come with OMPI 
> installed, and that can cause surprises.
> 
> We run under Torque at major installations every day, so it -should- 
> work...unless PBS Pro has done something unusual.
> 
> 
> On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:
> 
>> Yes, the failure seems to be in mpirun, it never even gets to my application.
>>  
>> The proto for tm_init looks like this:
>> int tm_init(void *info, struct tm_roots *roots);
>>  
>> where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id
>>  
>> If the API was different, wouldn't the compiler most likely generate an 
>> error at compile-time?
>>  
>> Thanks!
>>  
>> Steve
>>  
>> 
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Ralph Castain
>> Sent: Friday, February 12, 2010 3:21 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2
>> 
>> I'm a tad confused - this trace would appear to indicate that mpirun is 
>> failing, yes? Not your application?
>> 
>> The reason it works for local procs is that tm_init isn't called for that 
>> case - mpirun just fork/exec's the procs directly. When remote nodes are 
>> required, mpirun must connect to Torque. This is done with a call to:
>> 
>> ret = tm_init(NULL, &tm_root);
>> 
>> My guess is that something changed in PBS Pro 10.2 to that API. Can you 
>> check the tm header file and see? I have no access to PBS any more, so 
>> I'll have to rely on your eyes to see a diff.
>> 
>> Thanks
>> Ralph
>> 
>> On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:
>> 
>>> Hello,
>>> 
>>> I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've 
>>> configured and built OpenMPI 1.4.1 with the Intel 11.1 compiler on Linux 
>>> and with --with-tm support and the build runs fine.  I've also built with 
>>> static libraries per the FAQ suggestion since libpbs is static.  However, 
>>> my test application keeps failing with a segmentation fault, but ONLY when 
>>> trying to select more than 1 node.  Running on a single node within PBS 
>>> works fine.  Also, running outside of PBS via ssh runs fine as well, even 
>>> across multiple nodes.  OpenIB support is also enabled, but that doesn't 
>>> seem to affect the error because I've also tried running with the --mca btl 
>>> tcp,self flag and it still doesn't work.  Here is the error I'm getting:
>>> 
>>> [n34:26892] *** Process received signal ***
>>> [n34:26892] Signal: Segmentation fault (11)
>>> [n34:26892] Signal code: Address not mapped (1)
>>> [n34:26892] Failing at address: 0x3f
>>> [n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
>>> [n34:26892] [ 1] 
>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
>>> [n34:26892] [ 2] 
>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
>>> [n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>> [0x471d0c]
>>> [n34:26892] [ 4] 
>>> /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
>>> [n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>> [0x43f580]
>>> [n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>> [0x413921]
>>> [n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun 
>>> 

Re: [OMPI users] Seg fault with PBS Pro 10.2

2010-02-15 Thread Repsher, Stephen J
Ralph,

This is my first build of OpenMPI so I haven't had this working before.  I'm 
pretty confident that PATH and LD_LIBRARY_PATH issues are not the cause, 
otherwise launches outside of PBS would fail too.  Also, I tried compiling 
everything statically with the same result.

Some additional info...  (1) I did a diff on tm.h for PBS 10.2 and from version 
8.0 that we had - they are identical, and (2) I've tried this with both the 
Intel 11.1 and GCC compilers and gotten the exact same run-time errors.

For now, I've got a work-around setup that launches over ssh and still 
attaches the processes to PBS.

Thanks for your help.

Steve



From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Friday, February 12, 2010 8:29 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

Afraid compilers don't help when the param is a void*...

It looks like this is consistent, but I've never tried it under that particular 
environment. Did prior versions of OMPI work, or are you trying this for the 
first time?

One thing you might check is that you have the correct PATH and LD_LIBRARY_PATH 
set to point to this version of OMPI and the corresponding PBS Pro libs you 
used to build it. Most Linux distros come with OMPI installed, and that can 
cause surprises.

We run under Torque at major installations every day, so it -should- 
work...unless PBS Pro has done something unusual.


On Feb 12, 2010, at 1:41 PM, Repsher, Stephen J wrote:

Yes, the failure seems to be in mpirun, it never even gets to my application.

The proto for tm_init looks like this:
int tm_init(void *info, struct tm_roots *roots);

where the struct has 6 elements: 2 x tm_task_id + 3 x int + 1 x tm_task_id

If the API was different, wouldn't the compiler most likely generate an error 
at compile-time?

Thanks!

Steve



From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, February 12, 2010 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Seg fault with PBS Pro 10.2

I'm a tad confused - this trace would appear to indicate that mpirun is 
failing, yes? Not your application?

The reason it works for local procs is that tm_init isn't called for that case 
- mpirun just fork/exec's the procs directly. When remote nodes are required, 
mpirun must connect to Torque. This is done with a call to:

ret = tm_init(NULL, &tm_root);

My guess is that something changed in PBS Pro 10.2 to that API. Can you check 
the tm header file and see? I have no access to PBS any more, so I'll have to 
rely on your eyes to see a diff.

Thanks
Ralph

On Feb 12, 2010, at 8:50 AM, Repsher, Stephen J wrote:

Hello,

I'm having problems running Open MPI jobs under PBS Pro 10.2.  I've configured 
and built OpenMPI 1.4.1 with the Intel 11.1 compiler on Linux and with 
--with-tm support and the build runs fine.  I've also built with static 
libraries per the FAQ suggestion since libpbs is static.  However, my test 
application keeps failing with a segmentation fault, but ONLY when trying to 
select more than 1 node.  Running on a single node within PBS works fine.  
Also, running outside of PBS via ssh runs fine as well, even across multiple 
nodes.  OpenIB support is also enabled, but that doesn't seem to affect the 
error because I've also tried running with the --mca btl tcp,self flag and it 
still doesn't work.  Here is the error I'm getting:

[n34:26892] *** Process received signal ***
[n34:26892] Signal: Segmentation fault (11)
[n34:26892] Signal code: Address not mapped (1)
[n34:26892] Failing at address: 0x3f
[n34:26892] [ 0] /lib64/libpthread.so.0 [0x7fc0309d6a90]
[n34:26892] [ 1] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(discui_+0x84) [0x476a50]
[n34:26892] [ 2] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(diswsi+0xc3) [0x474063]
[n34:26892] [ 3] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x471d0c]
[n34:26892] [ 4] 
/part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun(tm_init+0x1fe) [0x471ff8]
[n34:26892] [ 5] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x43f580]
[n34:26892] [ 6] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x413921]
[n34:26892] [ 7] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412b78]
[n34:26892] [ 8] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7fc03068d586]
[n34:26892] [ 9] /part0/apps/MPI/intel/openmpi-1.4.1/bin/pbs_mpirun [0x412ac9]
[n34:26892] *** End of error message ***
Segmentation fault

(NOTE: pbs_mpirun = orterun, mpirun, etc.)

Has anyone else seen errors like this within PBS?


Steve Repsher
Boeing Defense, Space, & Security - Rotorcraft
Aerodynamics/CFD
Phone: (610) 591-1510
Fax: (610) 591-6263
stephen.j.reps...@boeing.com


