[OMPI users] Segfault when using valgrind

2009-07-07 Thread Justin
atherv.c:121)
==22736==by 0x646CCF7: Uintah::Level::setBCTypes() (Level.cc:728)
==22736==by 0x646D823: Uintah::Level::finalizeLevel() (Level.cc:537)
==22736==by 0x6465457: 
Uintah::Grid::problemSetup(Uintah::Handle const&, 
Uintah::ProcessorGroup const*, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup() 
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:117)

==22736==by 0x4089AE: main (sus.cc:629)


Are these problems with Open MPI, and are there any known workarounds?

Thanks,
Justin


[OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin

Hi,

We are currently using Open MPI 1.3 on Ranger for large processor jobs
(8K+).  Our code appears to be occasionally deadlocking at random within
point-to-point communication (see stack trace below).  This code has been
tested on many different MPI versions and, as far as we know, it does not
contain a deadlock.  However, in the past we have run into problems with
shared memory optimizations within MPI causing deadlocks.  We can
usually avoid these by setting a few environment variables to either
increase the size of the shared memory buffers or disable the shared
memory optimizations altogether.  Does Open MPI have any known deadlocks
that might be causing ours?  If so, are there any workarounds?  Also, how
do we disable shared memory within Open MPI?


Here is an example of where processors are hanging:

#0  0x2b2df3522683 in mca_btl_sm_component_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1  0x2b2df2cb46bf in mca_bml_r2_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2  0x2b2df0032ea4 in opal_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3  0x2b2ded0d7622 in ompi_request_default_wait_some () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4  0x2b2ded109e34 in PMPI_Waitsome () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0



Thanks,
Justin


Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin
Thank you for this info.  I should add that our code tends to post a lot
of sends prior to the other side posting receives.  This causes a lot of
unexpected messages to exist.  Our code explicitly matches up all tags
and processors (that is, we do not use MPI wildcards).  If we had a
deadlock I would think we would see it regardless of whether or not we
cross the rendezvous threshold.  I guess one way to test this would be to
set this threshold to 0.  If it then deadlocks we would likely be able
to track down the deadlock.  Are there any other parameters we can pass
to MPI that will turn off buffering?
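For concreteness, the kind of pattern that depends on eager buffering
looks roughly like this minimal sketch (illustrative code only, not our
actual application; it assumes exactly two ranks):

/* Both ranks send before they receive.  Small messages are typically
   buffered eagerly, so the sends return and the receives get posted;
   once the payload crosses the rendezvous threshold, both ranks block
   in MPI_Send and the program hangs. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, n = 256;                /* raise n past the eager limit to expose the hang */
    if (argc > 1) n = atoi(argv[1]);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = 1 - rank;              /* assumes exactly 2 ranks */

    double *sbuf = calloc(n, sizeof(double));
    double *rbuf = calloc(n, sizeof(double));

    MPI_Send(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d done (n=%d)\n", rank, n);
    free(sbuf); free(rbuf);
    MPI_Finalize();
    return 0;
}

Setting the eager limits to 0 (or raising n) would turn a case like this
from "works because of buffering" into a hang that reproduces every time,
which is exactly what we want for debugging.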


Thanks,
Justin

Brock Palen wrote:
Whenever this happens we found the code to have a deadlock.  Users
never saw it until they crossed the eager->rendezvous threshold.


Yes you can disable shared memory with:

mpirun --mca btl ^sm

Or you can try increasing the eager limit.

ompi_info --param btl sm

MCA btl: parameter "btl_sm_eager_limit" (current value:
  "4096")

You can modify this limit at run time,  I think (can't test it right 
now) it is just:


mpirun --mca btl_sm_eager_limit 40960

I think that when tweaking these values you can also use environment
variables in place of putting it all on the mpirun line:


export OMPI_MCA_btl_sm_eager_limit=40960

See:
http://www.open-mpi.org/faq/?category=tuning


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Dec 5, 2008, at 12:22 PM, Justin wrote:


Hi,

We are currently using Open MPI 1.3 on Ranger for large processor jobs
(8K+).  Our code appears to be occasionally deadlocking at random
within point-to-point communication (see stack trace below).  This
code has been tested on many different MPI versions and, as far as we
know, it does not contain a deadlock.  However, in the past we have
run into problems with shared memory optimizations within MPI causing
deadlocks.  We can usually avoid these by setting a few environment
variables to either increase the size of the shared memory buffers or
disable the shared memory optimizations altogether.  Does Open MPI have
any known deadlocks that might be causing ours?  If so, are there any
workarounds?  Also how do we disable shared memory within
Open MPI?


Here is an example of where processors are hanging:

#0  0x2b2df3522683 in mca_btl_sm_component_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1  0x2b2df2cb46bf in mca_bml_r2_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2  0x2b2df0032ea4 in opal_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3  0x2b2ded0d7622 in ompi_request_default_wait_some () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4  0x2b2ded109e34 in PMPI_Waitsome () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0



Thanks,
Justin
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Deadlock on large numbers of processors

2008-12-05 Thread Justin
The reason I'd like to disable these eager buffers is to help detect the
deadlock better.  I would not run with this for a normal run, but it
would be useful for debugging.  If the deadlock is indeed due to our
code, then disabling any shared buffers or eager sends would make that
deadlock reproducible.  In addition we might be able to lower the
number of processors.  Right now, determining which processor is
deadlocked when we are using 8K cores and each processor has hundreds of
messages sent out would be quite difficult.


Thanks for your suggestions,
Justin
Brock Palen wrote:
Open MPI has different eager limits for all of the network types; on your
system run:


ompi_info --param btl all

and look for the eager_limits
You can set these values to 0 using the syntax I showed you before. 
That would disable eager messages.

There might be a better way to disable eager messages.
Not sure why you would want to disable them, they are there for 
performance.


Maybe you would still see a deadlock even if every message was below the
threshold.  I think there is a limit on the number of eager messages a
receiving CPU will accept.  Not sure about that, though; I still kind
of doubt it.


Try tweaking your buffer sizes: make the openib BTL eager limit the
same as shared memory, and see if you get lockups between hosts and
not just over shared memory.


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Dec 5, 2008, at 2:10 PM, Justin wrote:

Thank you for this info.  I should add that our code tends to post a
lot of sends prior to the other side posting receives.  This causes a
lot of unexpected messages to exist.  Our code explicitly matches up
all tags and processors (that is, we do not use MPI wildcards).  If
we had a deadlock I would think we would see it regardless of
whether or not we cross the rendezvous threshold.  I guess one way to
test this would be to set this threshold to 0.  If it then
deadlocks we would likely be able to track down the deadlock.  Are there
any other parameters we can pass to MPI that will turn off buffering?


Thanks,
Justin

Brock Palen wrote:
Whenever this happens we found the code to have a deadlock.  Users
never saw it until they crossed the eager->rendezvous threshold.


Yes you can disable shared memory with:

mpirun --mca btl ^sm

Or you can try increasing the eager limit.

ompi_info --param btl sm

MCA btl: parameter "btl_sm_eager_limit" (current value:
  "4096")

You can modify this limit at run time,  I think (can't test it right 
now) it is just:


mpirun --mca btl_sm_eager_limit 40960

I think that when tweaking these values you can also use environment
variables in place of putting it all on the mpirun line:


export OMPI_MCA_btl_sm_eager_limit=40960

See:
http://www.open-mpi.org/faq/?category=tuning


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Dec 5, 2008, at 12:22 PM, Justin wrote:


Hi,

We are currently using Open MPI 1.3 on Ranger for large processor
jobs (8K+).  Our code appears to be occasionally deadlocking at
random within point-to-point communication (see stack trace below).
This code has been tested on many different MPI versions and, as far
as we know, it does not contain a deadlock.  However, in the past we
have run into problems with shared memory optimizations within MPI
causing deadlocks.  We can usually avoid these by setting a few
environment variables to either increase the size of the shared memory
buffers or disable the shared memory optimizations altogether.  Does
Open MPI have any known deadlocks that might be causing ours?  If so,
are there any workarounds?  Also how do we disable
shared memory within Open MPI?


Here is an example of where processors are hanging:

#0  0x2b2df3522683 in mca_btl_sm_component_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_btl_sm.so
#1  0x2b2df2cb46bf in mca_bml_r2_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/openmpi/mca_bml_r2.so
#2  0x2b2df0032ea4 in opal_progress () from 
/opt/apps/intel10_1/openmpi/1.3/lib/libopen-pal.so.0
#3  0x2b2ded0d7622 in ompi_request_default_wait_some () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0
#4  0x2b2ded109e34 in PMPI_Waitsome () from 
/opt/apps/intel10_1/openmpi/1.3//lib/libmpi.so.0



Thanks,
Justin
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Deadlock on large numbers of processors

2008-12-09 Thread Justin
I have tried disabling the shared memory by running with the following 
parameters to mpirun


--mca btl openib,self --mca btl_openib_ib_timeout 23 --mca 
btl_openib_use_srq 1 --mca btl_openib_use_rd_max 2048


Unfortunately this did not get rid of any hangs, and seems to have made
them more common.  I have now been able to reproduce the deadlock at 32
processors.  I am now working with an MPI deadlock detection research
code which will hopefully be able to tell me if there are any deadlocks
in our code.  At the same time, if any of you have any suggestions of
Open MPI parameters that might alleviate these deadlocks I would be
grateful.



Thanks,
Justin




Rolf Vandevaart wrote:


The current version of Open MPI installed on Ranger is 1.3a1r19685,
which is from early October.  This version has a fix for ticket
#1378.  Ticket #1449 is not an issue in this case because each node
has 16 processors and #1449 is for larger SMPs.


However, I am wondering if this is because of ticket 
https://svn.open-mpi.org/trac/ompi/ticket/1468 which was not yet fixed 
in the version running on ranger.


As was suggested earlier, running without the sm btl would be a clue 
if this is the problem.


mpirun --mca btl ^sm a.out

Another way to potentially work around the issue is to increase the 
size of the shared memory backing file.


mpirun --mca 1073741824 -mca mpool_sm_max_size 1073741824 a.out

We will also work with TACC to get an upgraded version of Open MPI 1.3 
on there.


Let us know what you find.

Rolf


On 12/09/08 08:05, Lenny Verkhovsky wrote:

also see https://svn.open-mpi.org/trac/ompi/ticket/1449



On 12/9/08, *Lenny Verkhovsky* <lenny.verkhov...@gmail.com 
<mailto:lenny.verkhov...@gmail.com>> wrote:


maybe it's related to 
https://svn.open-mpi.org/trac/ompi/ticket/1378  ??



On 12/5/08, *Justin* <luitj...@cs.utah.edu
<mailto:luitj...@cs.utah.edu>> wrote:

The reason i'd like to disable these eager buffers is to help
detect the deadlock better.  I would not run with this for a
normal run but it would be useful for debugging.  If the
deadlock is indeed due to our code then disabling any shared
buffers or eager sends would make that deadlock 
reproducible.  In addition we might be able to lower the 
number of processors

down.  Right now determining which processor is deadlocked when
we are using 8K cores and each processor has hundreds of
messages sent out would be quite difficult.

Thanks for your suggestions,
Justin

Brock Palen wrote:

Open MPI has different eager limits for all of the network types,
on your system run:

ompi_info --param btl all

and look for the eager_limits
You can set these values to 0 using the syntax I showed you
before. That would disable eager messages.
There might be a better way to disable eager messages.
Not sure why you would want to disable them, they are there
for performance.

Maybe you would still see a deadlock even if every message was
below the threshold. I think there is a limit on the number
of eager messages a receiving CPU will accept. Not sure
about that, though; I still kind of doubt it.

Try tweaking your buffer sizes: make the openib BTL eager
limit the same as shared memory, and see if you get lockups
between hosts and not just over shared memory.

Brock Palen
www.umich.edu/~brockp <http://www.umich.edu/~brockp>
Center for Advanced Computing
bro...@umich.edu <mailto:bro...@umich.edu>
(734)936-1985



    On Dec 5, 2008, at 2:10 PM, Justin wrote:

Thank you for this info.  I should add that our code
tends to post a lot of sends prior to the other side
posting receives.  This causes a lot of unexpected
messages to exist.  Our code explicitly matches up all
tags and processors (that is, we do not use MPI wild
cards).  If we had a deadlock I would think we would
see it regardless of whether or not we cross the
rendezvous threshold.  I guess one way to test this
would be to set this threshold to 0.  If it then
deadlocks we would likely be able to track down the
deadlock.  Are there any other parameters we can pass
to MPI that will turn off buffering?

    Thanks,
Justin

Brock Palen wrote:

Whenever this happens we found the code to have a
deadlock.  Users never saw it until they crossed the
eager->rendezvous threshold.

Yes you can disable shared m

Re: [OMPI users] Deadlock on large numbers of processors

2008-12-11 Thread Justin
The more I look at this bug the more I'm convinced it is with Open MPI
and not our code.  Here is why: our code generates a
communication/execution schedule.  At each timestep this schedule is
executed and all communication and execution is performed.  Our problem
is AMR, which means the communication schedule may change from time to
time.  In this case the schedule has not changed in many timesteps,
meaning the same communication schedule has been used for the last X (X
being around 20 in this case) timesteps.

Our code does have a very large communication problem.  I have been able
to reduce the hang down to 16 processors, and it seems to me the hang
occurs when we have lots of work per processor.  Meaning if I add more
processors it may not hang, but reducing processors makes it more likely
to hang.


What is the status on the fix for this particular freelist deadlock?

Thanks,
Justin

Jeff Squyres wrote:

George --

Is this the same issue that you're working on?

(we have a "blocker" bug for v1.3 about deadlock at heavy messaging 
volume -- on Tuesday, it looked like a bug in our freelist...)



On Dec 9, 2008, at 10:28 AM, Justin wrote:

I have tried disabling the shared memory by running with the 
following parameters to mpirun


--mca btl openib,self --mca btl_openib_ib_timeout 23 --mca 
btl_openib_use_srq 1 --mca btl_openib_use_rd_max 2048


Unfortunately this did not get rid of any hangs, and seems to have
made them more common.  I have now been able to reproduce the
deadlock at 32 processors.  I am now working with an MPI deadlock
detection research code which will hopefully be able to tell me if
there are any deadlocks in our code.  At the same time, if any of you
have any suggestions of Open MPI parameters that might alleviate
these deadlocks I would be grateful.



Thanks,
Justin




Rolf Vandevaart wrote:


The current version of Open MPI installed on Ranger is 1.3a1r19685,
which is from early October.  This version has a fix for ticket
#1378.  Ticket #1449 is not an issue in this case because each node
has 16 processors and #1449 is for larger SMPs.


However, I am wondering if this is because of ticket 
https://svn.open-mpi.org/trac/ompi/ticket/1468 which was not yet 
fixed in the version running on ranger.


As was suggested earlier, running without the sm btl would be a clue 
if this is the problem.


mpirun --mca btl ^sm a.out

Another way to potentially work around the issue is to increase the 
size of the shared memory backing file.


mpirun --mca 1073741824 -mca mpool_sm_max_size 1073741824 a.out

We will also work with TACC to get an upgraded version of Open MPI 
1.3 on there.


Let us know what you find.

Rolf


On 12/09/08 08:05, Lenny Verkhovsky wrote:

also see https://svn.open-mpi.org/trac/ompi/ticket/1449



On 12/9/08, *Lenny Verkhovsky* <lenny.verkhov...@gmail.com 
<mailto:lenny.verkhov...@gmail.com>> wrote:


   maybe it's related to 
https://svn.open-mpi.org/trac/ompi/ticket/1378  ??



   On 12/5/08, *Justin* <luitj...@cs.utah.edu
   <mailto:luitj...@cs.utah.edu>> wrote:

   The reason i'd like to disable these eager buffers is to help
   detect the deadlock better.  I would not run with this for a
   normal run but it would be useful for debugging.  If the
   deadlock is indeed due to our code then disabling any shared
   buffers or eager sends would make that deadlock 
reproducible.  In addition we might be able to lower the 
number of processors

   down.  Right now determining which processor is deadlocked when
   we are using 8K cores and each processor has hundreds of
   messages sent out would be quite difficult.

   Thanks for your suggestions,
   Justin

   Brock Palen wrote:

   Open MPI has different eager limits for all of the network 
types,

   on your system run:

   ompi_info --param btl all

   and look for the eager_limits
   You can set these values to 0 using the syntax I showed you
   before. That would disable eager messages.
   There might be a better way to disable eager messages.
   Not sure why you would want to disable them, they are there
   for performance.

   Maybe you would still see a deadlock even if every message was
   below the threshold. I think there is a limit on the number
   of eager messages a receiving CPU will accept. Not sure
   about that, though; I still kind of doubt it.

   Try tweaking your buffer sizes: make the openib BTL eager
   limit the same as shared memory, and see if you get
   lockups between hosts and not just over shared memory.

   Brock Palen
   www.umich.edu/~brockp <http://www.umich.edu/~brockp>
   Center for Advanced Computing
   bro...@umich.edu <mailto:bro...@umich.edu>
   (734)936-1985



   On Dec 5, 2008, at 2:

Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin

Hi,  has this deadlock been fixed in the 1.3 source yet?

Thanks,

Justin


Jeff Squyres wrote:

On Dec 11, 2008, at 5:30 PM, Justin wrote:

The more I look at this bug the more I'm convinced it is with openMPI 
and not our code.  Here is why:  Our code generates a 
communication/execution schedule.  At each timestep this schedule is 
executed and all communication and execution is performed.  Our 
problem is AMR which means the communication schedule may change from 
time to time.  In this case the schedule has not changed in many 
timesteps meaning the same communication schedule is being used as 
the last X (x being around 20 in this case) timesteps.
Our code does have a very large communication problem.  I have been 
able to reduce the hang down to 16 processors and it seems to me the 
hang occurs when we have lots of work per processor.  Meaning if I 
add more processors it may not hang but reducing processors makes it 
more likely to hang.

What is the status on the fix for this particular freelist deadlock?



George is actively working on it because it is the "last" issue 
blocking us from releasing v1.3.  I fear that if he doesn't get it 
fixed by tonight, we'll have to push v1.3 to next year (see 
http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and 
http://www.open-mpi.org/community/lists/users/2008/12/7499.php).






Re: [OMPI users] Deadlock on large numbers of processors

2009-01-12 Thread Justin
In order for me to test this out I need to wait for TACC to install this
version on Ranger.  Right now they have version 1.3a1r19685 installed.
I'm guessing this is probably an older version.  I'm not sure when TACC
will get around to updating their Open MPI version.  I could request that
they update it, but it would be a lot easier to request an actual release.
What is the current schedule for the 1.3 release?


Justin

Jeff Squyres wrote:

Justin --

Could you actually give your code a whirl with 1.3rc3 to ensure that 
it fixes the problem for you?


http://www.open-mpi.org/software/ompi/v1.3/


On Jan 12, 2009, at 1:30 PM, Tim Mattox wrote:


Hi Justin,
I applied the fixes for this particular deadlock to the 1.3 code base
late last week, see ticket #1725:
https://svn.open-mpi.org/trac/ompi/ticket/1725

This should fix the described problem, but I personally have not tested
to see if the deadlock in question is now gone.  Everyone should give
thanks to George for his efforts in tracking down the problem
and finding a solution.
 -- Tim Mattox, the v1.3 gatekeeper

On Mon, Jan 12, 2009 at 12:46 PM, Justin <luitj...@cs.utah.edu> wrote:

Hi,  has this deadlock been fixed in the 1.3 source yet?

Thanks,

Justin


Jeff Squyres wrote:


On Dec 11, 2008, at 5:30 PM, Justin wrote:

The more I look at this bug the more I'm convinced it is with 
openMPI and
not our code.  Here is why:  Our code generates a 
communication/execution
schedule.  At each timestep this schedule is executed and all 
communication

and execution is performed.  Our problem is AMR which means the
communication schedule may change from time to time.  In this case 
the
schedule has not changed in many timesteps meaning the same 
communication

schedule is being used as the last X (x being around 20 in this case)
timesteps.
Our code does have a very large communication problem.  I have 
been able
to reduce the hang down to 16 processors and it seems to me the 
hang occurs
when he have lots of work per processor.  Meaning if I add more 
processors

it may not hang but reducing processors makes it more likely to hang.
What is the status on the fix for this particular freelist deadlock?



George is actively working on it because it is the "last" issue 
blocking
us from releasing v1.3.  I fear that if he doesn't get it fixed by 
tonight,

we'll have to push v1.3 to next year (see
http://www.open-mpi.org/community/lists/devel/2008/12/5029.php and
http://www.open-mpi.org/community/lists/users/2008/12/7499.php).



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
   I'm a bright... http://www.the-brights.net/
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users







Re: [OMPI users] MPI_Send over 2 GB

2009-02-18 Thread Justin
My guess would be that your count argument is overflowing.  Is the count
a signed 32-bit integer?  If so, it will overflow around 2 GB.  Try
printing the size that you are sending and see whether you get a large
negative number.


Justin

Vittorio wrote:
Hi! I'm doing a test to measure the transfer rates and latency of ompi 
over infiniband


starting from 1 kB everything was doing fine until i wanted to 
transfer 2 GB and i received this error


[tatami:02271] *** An error occurred in MPI_Recv
[tatami:02271] *** on communicator MPI_COMM_WORLD
[tatami:02271] *** MPI_ERR_COUNT: invalid count argument
[tatami:02271] *** MPI_ERRORS_ARE_FATAL (goodbye)
[randori:12166] *** An error occurred in MPI_Send
[randori:12166] *** on communicator MPI_COMM_WORLD
[randori:12166] *** MPI_ERR_COUNT: invalid count argument
[randori:12166] *** MPI_ERRORS_ARE_FATAL (goodbye)


this error appears if i run the program either on the same node or both
is 2 GB the intrinsic limit of MPI_Send/MPI_Recv?

thanks a lot
Vittorio


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] valgrind problems

2009-02-26 Thread Justin
Are there any tricks to getting it to work?  When we run with valgrind we
get segfaults, and valgrind reports errors in different MPI functions, for
example:


==3629== Invalid read of size 4
==3629==at 0x1CF7AEEC: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3629==by 0x1D9C23F4: mca_btl_sm_component_progress (in 
/usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==3629==by 0x1D17F14A: mca_bml_r2_progress (in 
/usr/lib/openmpi/lib/openmpi/mca_bml_r2.so)
==3629==by 0x151FCCD9: opal_progress (in 
/usr/lib/openmpi/lib/libopen-pal.so.0.0.0)
==3629==by 0xD09FA94: ompi_request_wait_all (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==3629==by 0x1E3E47C1: ompi_coll_tuned_sendrecv_actual (in 
/usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so)
==3629==by 0x1E3E9105: 
ompi_coll_tuned_barrier_intra_recursivedoubling (in 
/usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so)
==3629==by 0xD0B42FF: PMPI_Barrier (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==3629==by 0x7EA025E: 
Uintah::DataArchiver::initializeOutput(Uintah::Handle 
const&) (DataArchiver.cc:400)
==3629==by 0x899DDDF: 
Uintah::SimulationController::postGridSetup(Uintah::Handle&, 
double&) (SimulationController.cc:352)
==3629==by 0x89A8568: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:126)

==3629==by 0x408B9F: main (sus.cc:622)

This is then followed by a segfault.

Justin

Jeff Squyres wrote:

On Feb 26, 2009, at 7:03 PM, Justin wrote:

I'm trying to use valgrind to check if we have any memory problems in
our code when running with parallel processors.  However, when I run
using MPI and valgrind it crashes in various places.  For example, some
of the time it will crash with a segfault within MPI_Allgatherv
despite the fact that all the arguments to the allgather on all
processors are completely valid.  If we don't use valgrind the
program runs just fine.
This is on a Debian(lenny) 64 bit machine using the stock mpi 
package.  The command used to launch the job is: mpirun -np 8 
valgrind -v --log-file=valgrind.%p executable.  Are valgrind and 
openmpi compatible?  Are there any special tricks to getting them to
work together?



We use valgrind internally to track down leaks and other debugging 
kinds of things.  So yes, it should work.


I do try to keep up with the latest latest latest valgrind, though.





Re: [OMPI users] valgrind problems

2009-02-26 Thread Justin
Also, the stable version of Open MPI on Debian is 1.2.7rc2.  Are there any
known issues with this version and valgrind?


Thanks,
Justin

Justin wrote:
Are there any tricks to getting it to work?  When we run with valgrind
we get segfaults, and valgrind reports errors in different MPI functions,
for example:


==3629== Invalid read of size 4
==3629==at 0x1CF7AEEC: (within 
/usr/lib/openmpi/lib/openmpi/mca_pml_ob1.so)
==3629==by 0x1D9C23F4: mca_btl_sm_component_progress (in 
/usr/lib/openmpi/lib/openmpi/mca_btl_sm.so)
==3629==by 0x1D17F14A: mca_bml_r2_progress (in 
/usr/lib/openmpi/lib/openmpi/mca_bml_r2.so)
==3629==by 0x151FCCD9: opal_progress (in 
/usr/lib/openmpi/lib/libopen-pal.so.0.0.0)
==3629==by 0xD09FA94: ompi_request_wait_all (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==3629==by 0x1E3E47C1: ompi_coll_tuned_sendrecv_actual (in 
/usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so)
==3629==by 0x1E3E9105: 
ompi_coll_tuned_barrier_intra_recursivedoubling (in 
/usr/lib/openmpi/lib/openmpi/mca_coll_tuned.so)
==3629==by 0xD0B42FF: PMPI_Barrier (in 
/usr/lib/openmpi/lib/libmpi.so.0.0.0)
==3629==by 0x7EA025E: 
Uintah::DataArchiver::initializeOutput(Uintah::Handle 
const&) (DataArchiver.cc:400)
==3629==by 0x899DDDF: 
Uintah::SimulationController::postGridSetup(Uintah::Handle&, 
double&) (SimulationController.cc:352)
==3629==by 0x89A8568: Uintah::AMRSimulationController::run() 
(AMRSimulationController.cc:126)

==3629==by 0x408B9F: main (sus.cc:622)

This is then followed by a segfault.

Justin

Jeff Squyres wrote:

On Feb 26, 2009, at 7:03 PM, Justin wrote:

I'm trying to use valgrind to check if we have any memory problems
in our code when running with parallel processors.  However, when I
run using MPI and valgrind it crashes in various places.  For example,
some of the time it will crash with a segfault within
MPI_Allgatherv despite the fact that all the arguments to the
allgather on all processors are completely valid.  If we don't use
valgrind the program runs just fine.
This is on a Debian(lenny) 64 bit machine using the stock mpi 
package.  The command used to launch the job is: mpirun -np 8 
valgrind -v --log-file=valgrind.%p executable.  Are valgrind and 
openmpi compatible?  Are there any special tricks to getting them to
work together?



We use valgrind internally to track down leaks and other debugging 
kinds of things.  So yes, it should work.


I do try to keep up with the latest latest latest valgrind, though.



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Test without deallocation

2009-03-25 Thread Justin

Have you tried MPI_Probe?

Justin

Shaun Jackman wrote:
Is there a function similar to MPI_Test that doesn't deallocate the 
MPI_Request object? I would like to test if a message has been 
received (MPI_Irecv), check its tag, and dispatch the MPI_Request to 
another function based on that tag.


Cheers,
Shaun
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] MPI_Test without deallocation

2009-03-25 Thread Justin
There are two versions of probe (MPI_Probe and MPI_Iprobe), but I can't
tell you their details off hand.  From looking at them in the past, the
basic understanding that I took away was that MPI_Probe is like
MPI_Test but it doesn't actually receive or deallocate the message.


From 

http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-1.1/node50.htm

/The MPI_PROBE and MPI_IPROBE operations allow incoming messages to be 
checked for, without actually receiving them. The user can then decide 
how to receive them, based on the information returned by the probe 
(basically, the information returned by status). In particular, the user 
may allocate memory for the receive buffer, according to the length of 
the probed message./
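In code, the probe-then-receive pattern looks roughly like this minimal
sketch (the tag values, element type, and handler names are placeholders,
not anything from Shaun's code):

/* sketch: peek at an incoming message, then receive it and dispatch on its tag */
#include <mpi.h>
#include <stdlib.h>

static void poll_and_dispatch(MPI_Comm comm)
{
    int flag = 0;
    MPI_Status status;

    /* Non-blocking probe: flag becomes nonzero if a message (any source,
       any tag) is pending; the message itself is NOT received or freed. */
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
    if (!flag)
        return;                                  /* nothing waiting */

    int count;
    MPI_Get_count(&status, MPI_INT, &count);     /* size reported by the probe */
    int *buf = malloc(count * sizeof(int));

    /* now actually receive it, matching exactly what the probe saw */
    MPI_Recv(buf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
             comm, MPI_STATUS_IGNORE);

    switch (status.MPI_TAG) {                    /* tags are illustrative */
        case 1:  /* handle_work(buf, count); */  break;
        case 2:  /* handle_done(buf, count); */  break;
        default:                                 break;
    }
    free(buf);
}

Note, though, that a probe only reports messages that have not already
been matched by a posted receive, which is the wrinkle raised below.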


Shaun Jackman wrote:
If an MPI_Irecv has already been posted, and a single message is sent 
to the receiver, then will an MPI_Probe return that there is no 
message waiting to be received? The message has already been received 
by the MPI_Irecv. It's the MPI_Request object of the MPI_Irecv call 
that needs to be probed, but MPI_Test has the side effect of also 
deallocating the MPI_Request object.


Cheers,
Shaun

Justin wrote:

Have you tried MPI_Probe?

Justin

Shaun Jackman wrote:
Is there a function similar to MPI_Test that doesn't deallocate the 
MPI_Request object? I would like to test if a message has been 
received (MPI_Irecv), check its tag, and dispatch the MPI_Request to 
another function based on that tag.


Cheers,
Shaun

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




[OMPI users] [PATCH] hooks: disable malloc override inside of Gentoo sandbox

2013-07-02 Thread Justin Bronder
As described in the comments in the source, Gentoo's own version of
fakeroot, sandbox, also runs into hangs when malloc is overridden.
Sandbox environments can easily be detected by looking for SANDBOX_PID
in the environment.  When detected, employ the same fix used for
fakeroot.

See https://bugs.gentoo.org/show_bug.cgi?id=462602
---
 opal/mca/memory/linux/hooks.c | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/opal/mca/memory/linux/hooks.c b/opal/mca/memory/linux/hooks.c
index 6a1646f..ce91e76 100644
--- a/opal/mca/memory/linux/hooks.c
+++ b/opal/mca/memory/linux/hooks.c
@@ -747,9 +747,16 @@ static void opal_memory_linux_malloc_init_hook(void)
"fakeroot" build environment that allocates memory during
stat() (see http://bugs.debian.org/531522).  It may not be
necessary any more since we're using access(), not stat().  But
-   we'll leave the check, anyway. */
+   we'll leave the check, anyway.
+
+   This is also an issue when using Gentoo's version of 'fakeroot',
+   sandbox v2.5.  Sandbox environments can also be detected fairly
+   easily by looking for SANDBOX_PID.
+*/
+
 if (getenv("FAKEROOTKEY") != NULL ||
-getenv("FAKED_MODE") != NULL) {
+getenv("FAKED_MODE") != NULL ||
+getenv("SANDBOX_PID") != NULL ) {
 return;
 }
 
-- 
1.8.1.5


-- 
Justin Bronder


signature.asc
Description: Digital signature


Re: [OMPI users] Seg fault with PBS Pro 10.4

2011-07-27 Thread Justin Wood
I heard back from my Altair contact this morning.  He told me that they 
did in fact make a change in some version of 10.x that broke this.  They 
don't have a workaround for v10, but he said it was fixed in v11.x.


I built OpenMPI 1.5.3 this morning with PBSPro v11.0, and it works fine. 
 I don't get any segfaults.


-Justin.

On 07/26/2011 05:49 PM, Ralph Castain wrote:

I don't believe we ever got anywhere with this due to lack of response. If you 
get some info on what happened to tm_init, please pass it along.

Best guess: something changed in a recent PBS Pro release. Since none of us 
have access to it, we don't know what's going on. :-(


On Jul 26, 2011, at 10:10 AM, Wood, Justin Contractor, SAIC wrote:


I'm having a problem using OpenMPI under PBS Pro 10.4.  I tried both 1.4.3 and 
1.5.3, both behave the same.  I'm able to run just fine if I don't use PBS and 
go direct to the nodes.  Also, if I run under PBS and use only 1 node, it works 
fine, but as soon as I span nodes, I get the following:

[a4ou-n501:07366] *** Process received signal ***
[a4ou-n501:07366] Signal: Segmentation fault (11)
[a4ou-n501:07366] Signal code: Address not mapped (1)
[a4ou-n501:07366] Failing at address: 0x3f
[a4ou-n501:07366] [ 0] /lib64/libpthread.so.0 [0x3f2b20eb10]
[a4ou-n501:07366] [ 1] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(discui_+0x84) 
[0x2affa453765c]
[a4ou-n501:07366] [ 2] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(diswsi+0xc3) 
[0x2affa4534c6f]
[a4ou-n501:07366] [ 3] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 
[0x2affa453290c]
[a4ou-n501:07366] [ 4] 
/opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(tm_init+0x1fe) [0x2affa4532bf8]
[a4ou-n501:07366] [ 5] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 
[0x2affa452691c]
[a4ou-n501:07366] [ 6] mpirun [0x404c17]
[a4ou-n501:07366] [ 7] mpirun [0x403e28]
[a4ou-n501:07366] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f2a61d994]
[a4ou-n501:07366] [ 9] mpirun [0x403d59]
[a4ou-n501:07366] *** End of error message ***
Segmentation fault

I searched the archives and found a similar issue from last year:

http://www.open-mpi.org/community/lists/users/2010/02/12084.php

The last update I saw was that someone was going to contact Altair and have 
them look at why it was failing to do the tm_init.  Does anyone have an update 
to this, and has anyone been able to run successfully using recent versions of 
PBSPro?  I've also contacted our rep at Altair, but he hasn't responded yet.

Thanks, Justin.

Justin Wood
Systems Engineer
FNMOC | SAIC
7 Grace Hopper, Stop 1
Monterey, CA
justin.g.wood@navy.mil
justin.g.w...@saic.com
office: 831.656.4671
mobile: 831.869.1576


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Justin Wood
Systems Engineer
FNMOC | SAIC
7 Grace Hopper, Stop 1
Monterey, CA
justin.g.wood@navy.mil
justin.g.w...@saic.com
office: 831.656.4671
mobile: 831.869.1576


[OMPI users] Cluster hangs/shows error while executing simple MPI program in C

2013-03-05 Thread Justin Joseph
Cluster hangs/shows error while executing simple MPI program in C
I am trying to run a simple MPI program (multiple array addition).  It
runs perfectly on my PC but simply hangs or shows the following error on
the cluster.
I am using Open MPI and the following command to execute:
mpirun -machinefile machine -np 4 ./array_sum 


error code: 
[[22877,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect] 
connect() to 192.168.122.1 failed: Connection refused (111)

#include
#include
#include
#include
#define group MPI_COMM_WORLD
#define root 0
#define size 100

int main(int argc, char *argv[])
{
    int no_tasks, task_id, i;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(group, &no_tasks);
    MPI_Comm_rank(group, &task_id);
    int arr1[size], arr2[size], local1[size], local2[size];
    if (task_id == root) {
        for (i = 0; i
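(The listing above is truncated in the archive.  For context, a
self-contained sketch of an MPI program of roughly this shape, scattering
two arrays, adding them element-wise, and gathering the result, might look
like the following.  It is only an illustration, not the original
array_sum code, and it assumes size is divisible by the number of ranks.)

/* illustrative sketch only -- not the poster's array_sum program */
#include <mpi.h>
#include <stdio.h>

#define SIZE 100

int main(int argc, char *argv[])
{
    int rank, ntasks;
    int arr1[SIZE], arr2[SIZE], result[SIZE];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    if (rank == 0)                              /* root fills the input arrays */
        for (int i = 0; i < SIZE; i++) { arr1[i] = i; arr2[i] = 2 * i; }

    int chunk = SIZE / ntasks;                  /* assumes SIZE % ntasks == 0 */
    int local1[SIZE], local2[SIZE], sum[SIZE];

    /* hand each rank an equal piece of both arrays */
    MPI_Scatter(arr1, chunk, MPI_INT, local1, chunk, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Scatter(arr2, chunk, MPI_INT, local2, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    for (int i = 0; i < chunk; i++)             /* element-wise addition */
        sum[i] = local1[i] + local2[i];

    /* collect the partial sums back on the root */
    MPI_Gather(sum, chunk, MPI_INT, result, chunk, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("result[%d] = %d\n", SIZE - 1, result[SIZE - 1]);

    MPI_Finalize();
    return 0;
}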

[OMPI users] Segfault when using valgrind

2009-07-06 Thread Justin Luitjens
, bool) (Grid.cc:866)
==22736==by 0x8345759: Uintah::SimulationController::gridSetup()
(SimulationController.cc:243)
==22736==by 0x834F418: Uintah::AMRSimulationController::run()
(AMRSimulationController.cc:117)
==22736==by 0x4089AE: main (sus.cc:629)


Are these problems with Open MPI, and are there any known workarounds?

Thanks,
Justin


Re: [OMPI users] MPI-Send for entire entire matrix when allocating memory dynamically

2009-10-29 Thread Justin Luitjens
Why not do something like this:

double **A=new double*[N];
double *A_data=new double[N*N];

for(int i=0;i<N;i++)
  A[i]=&A_data[i*N];

This way you have contiguous data (in A_data) but can access it as a 2D
array using A[i][j].

(I haven't compiled this but I know we represent our matrices this way).

On Thu, Oct 29, 2009 at 12:30 PM, Natarajan CS <csnata...@gmail.com> wrote:

> Hi
>thanks for the quick response. Yes, that is what I meant. I thought
> there was no other way around what I am doing but It is always good to ask a
> expert rather than assume!
>
> Cheers,
>
> C.S.N
>
>
> On Thu, Oct 29, 2009 at 11:25 AM, Eugene Loh  wrote:
>
>> Natarajan CS wrote:
>>
>>  Hello all,
>>>Firstly, My apologies for a duplicate post in LAM/MPI list I have
>>> the following simple MPI code. I was wondering if there was a workaround for
>>> sending a dynamically allocated 2-D matrix? Currently I can send the matrix
>>> row-by-row, however, since rows are not contiguous I cannot send the entire
>>> matrix at once. I realize one option is to change the malloc to act as one
>>> contiguous block but can I keep the matrix definition as below and still
>>> send the entire matrix in one go?
>>>
>>
>> You mean with one standard MPI call?  I don't think so.
>>
>> In MPI, there is a notion of derived datatypes, but I'm not convinced this
>> is what you want.  A derived datatype is basically a static template of data
>> and holes in memory.  E.g., 3 bytes, then skip 7 bytes, then another 2
>> bytes, then skip 500 bytes, then 1 last byte.  Something like that.  Your 2d
>> matrices differ in two respects.  One is that the pattern in memory is
>> different for each matrix you allocate.  The other is that your matrix
>> definition includes pointer information that won't be the same in every
>> process's address space.  I guess you could overcome the first problem by
>> changing alloc_matrix() to some fixed pattern in memory for some r and c,
>> but you'd still have pointer information in there that you couldn't blindly
>> copy from one process address space to another.
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>
>
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
>


Re: [OMPI users] MPI-Send for entire entire matrix when allocating memory dynamically

2009-10-31 Thread Justin Luitjens
Here is how you can do this without having to redescribe the data type all
the time.  This will also keep your data layout together and improve cache
coherency.


#include <iostream>
#include <cstdlib>
#include <mpi.h>
using namespace std;
int main()
{
  int N=2, M=3;
  //Allocate the matrix
  double **A=(double**)malloc(sizeof(double*)*N);
  double *A_data=(double*)malloc(sizeof(double)*N*M);

  //assign some values to the matrix
  for(int i=0;i<N;i++)
A[i]=&A_data[i*M];

  int j=0;
  for(int n=0;n<N;n++)
for(int m=0;m<M;m++)
  A[n][m]=j++;

  //print the matrix
  cout << "Matrix:\n";
  for(int n=0;n<N;n++)
  {
for(int m=0;m<M;m++)
{
  cout << A[n][m] << " ";
}
cout << endl;
  }

  //to send over mpi
  //MPI_Send(A_data,M*N,MPI_DOUBLE,dest,tag,MPI_COMM_WORLD);

  //delete the matrix
  free(A);
  free(A_data);

  return 0;
}


On Sat, Oct 31, 2009 at 11:32 AM, George Bosilca <bosi...@eecs.utk.edu>wrote:

> Eugene is right, every time you create a new matrix you will have to
> describe it with a new datatype (even when using MPI_BOTTOM).
>
> george.
>
>
> On Oct 30, 2009, at 18:11 , Natarajan CS wrote:
>
>  Thanks for the replies guys! Definitely two suggestions worth trying.
>> Definitely didn't consider a derived datatype. I wasn't really sure that the
>> MPI_Send call overhead was significant enough that increasing the buffer
>> size and decreasing the number of calls would cause any speed up. Will
>> change the code over the weekend and see what happens! Also, maybe if one
>> passes the absolute address maybe there is no need for creating multiple
>> definitions of the datatype? Haven't gone through the man pages yet, so
>> apologies for ignorance!
>>
>> On Fri, Oct 30, 2009 at 2:44 PM, Eugene Loh <eugene@sun.com> wrote:
>> Wouldn't you need to create a different datatype for each matrix instance?
>>  E.g., let's say you create twelve 5x5 matrices.  Wouldn't you need twelve
>> different derived datatypes?  I would think so because each time you create
>> a matrix, the footprint of that matrix in memory will depend on the whims of
>> malloc().
>>
>> George Bosilca wrote:
>>
>> Even with the original way to create the matrices, one can use
>>  MPI_Create_type_struct to create an MPI datatype (
>> http://web.mit.edu/course/13/13.715/OldFiles/build/mpich2-1.0.6p1/www/www3/MPI_Type_create_struct.html
>>  )
>> using MPI_BOTTOM as the original displacement.
>>
>> On Oct 29, 2009, at 15:31 , Justin Luitjens wrote:
>>
>> Why not do something like this:
>>
>> double **A=new double*[N];
>> double *A_data=new double[N*N];
>>
>> for(int i=0;i<N;i++)
>> A[i]=&A_data[i*N];
>>
>> This way you have contiguous data (in A_data) but can access it as a  2D
>> array using A[i][j].
>>
>> (I haven't compiled this but I know we represent our matrices this  way).
>>
>> On Thu, Oct 29, 2009 at 12:30 PM, Natarajan CS <csnata...@gmail.com>
>>  wrote:
>> Hi
>> thanks for the quick response. Yes, that is what I meant. I  thought there
>> was no other way around what I am doing but It is  always good to ask a
>> expert rather than assume!
>>
>> Cheers,
>>
>> C.S.N
>>
>>
>> On Thu, Oct 29, 2009 at 11:25 AM, Eugene Loh <eugene@sun.com>  wrote:
>> Natarajan CS wrote:
>>
>> Hello all,
>>Firstly, My apologies for a duplicate post in LAM/MPI list I  have the
>> following simple MPI code. I was wondering if there was a  workaround for
>> sending a dynamically allocated 2-D matrix? Currently  I can send the matrix
>> row-by-row, however, since rows are not  contiguous I cannot send the entire
>> matrix at once. I realize one  option is to change the malloc to act as one
>> contiguous block but  can I keep the matrix definition as below and still
>> send the entire  matrix in one go?
>>
>> You mean with one standard MPI call?  I don't think so.
>>
>> In MPI, there is a notion of derived datatypes, but I'm not  convinced
>> this is what you want.  A derived datatype is basically a  static template
>> of data and holes in memory.  E.g., 3 bytes, then  skip 7 bytes, then
>> another 2 bytes, then skip 500 bytes, then 1 last  byte.  Something like
>> that.  Your 2d matrices differ in two  respects.  One is that the pattern in
>> memory is different for each  matrix you allocate.  The other is that your
>> matrix definition  includes pointer information that won't be the same in
>> every  process's address space.  I guess you could overcome the first
>>  problem by changing alloc_matrix(

Re: [OMPI users] Wrappers should put include path *after* user args

2010-01-19 Thread Justin Bronder
On 04/12/09 16:20 -0500, Jeff Squyres wrote:
> Oy -- more specifically, we should not be putting -I/usr/include on the 
> command line *at all* (because it's special and already included by the 
> compiler search paths; similar for /usr/lib and /usr/lib64).  We should have 
> some special case code that looks for /usr/include and simply drops it.  Let 
> me check and see what's going on...
> 

I believe this was initially added here: 
https://svn.open-mpi.org/trac/ompi/ticket/870

> Can you send the contents of your 
> $prefix/share/openmpi/mpif90-wrapper-data.txt?  (it is *likely* in that 
> directory, but it could be somewhere else under prefix as well -- the 
> mpif90-wrapper-data.txt file is the important one)
> 
> 
> 
> On Dec 4, 2009, at 1:08 PM, Jed Brown wrote:
> 
> > Open MPI is installed by the distro with headers in /usr/include
> > 
> >   $ mpif90 -showme:compile -I/some/special/path
> >   -I/usr/include -pthread -I/usr/lib/openmpi -I/some/special/path
> > 
> > Here's why it's a problem:
> > 
> > HDF5 is also installed in /usr with modules at /usr/include/h5*.mod.  A
> > new HDF5 cannot be compiled using the wrappers because it will always
> > resolve the USE statements to /usr/include which is binary-incompatible
> > with the the new version (at a minimum, they "fixed" the size of an
> > argument to H5Lget_info_f between 1.8.3 and 1.8.4).
> > 
> > To build the library, the current choices are
> > 
> >   (a) get rid of the system copy before building
> >   (b) not use mpif90 wrapper
> > 
> > 
> > I just checked that MPICH2 wrappers consistently put command-line args
> > before the wrapper args.
> > 
> > Jed

Any news on this?  It doesn't look like it made it into the 1.4.1 release.
Also, it's not just /usr/include that is a problem, but the fact that the
wrappers are passing their paths before the user specified ones.  Here's an
example using mpich2 and openmpi with non-standard install paths.

Mpich2 (Some output stripped as mpicc -compile_info prints everything):
jbronder@mejis ~ $ which mpicc
/usr/lib64/mpi/mpi-mpich2/usr/bin/mpicc
jbronder@mejis ~ $ mpicc -compile_info -I/bleh
x86_64-pc-linux-gnu-gcc -I/bleh -I/usr/lib64/mpi/mpi-mpich2/usr/include 

OpenMPI:
jbronder@mejis ~ $ which mpicc
/usr/lib64/mpi/mpi-openmpi/usr/bin/mpicc
jbronder@mejis ~ $ mpicc -showme:compile -I/bleh
-I/usr/lib64/mpi/mpi-openmpi/usr/include/openmpi -pthread -I/bleh


Thanks,

-- 
Justin Bronder


pgpUpu5h4BdhJ.pgp
Description: PGP signature


[OMPI users] building OpenMPI on Windows XP 64 using Visual Studio 6 and Compaq Visual Fortran

2010-01-28 Thread Justin Watson
Hello all,

I am trying to build a 32-bit version of Open MPI on Windows XP 64
using Visual Studio 6 and Compaq Visual Fortran 6.6b.  I am using CMake to
configure the build, and I specify Visual Studio 6 as my generator for this
project.  I specify where my C (cl.exe) and Fortran (f90.exe) compilers are.
After I run configure for the first time, I select that I want f77 and f90
bindings.  The second time I run configure I get the following error:

Define it as 'long long'.
Define it as 'unsigned long long'.
Check alignment of long long in c...
Check alignment of long long in c...
Check C:/Program Files (x86)/Microsoft Visual Studio/DF98/BIN/F90.EXE external 
symbol convention...
CMake Error at 
contrib/platform/win32/CMakeModules/f77_find_ext_symbol_convention.cmake:96 
(MESSAGE):
unknow Fortran naming convertion.
Call Stack (most recent call first):
contrib/platform/win32/CMakeModules/setup_f77.cmake:26 
(OMPI_F77_FIND_EXT_SYMBOL_CONVENTION)
contrib/platform/win32/CMakeModules/ompi_configure.cmake:1113 (INCLUDE)
CMakeLists.txt:87 (INCLUDE)
Configuring incomplete, errors occurred!

Has anyone had success in building with a similar configuration?


Justin K. Watson
   Email: jkw...@arl.psu.edu
Research Assistant  
Phone: (814) 863-6754
Computational Methods Development Department   Fax: (814) 
865-3287


Applied Research Laboratory
The Pennsylvania State University
P.O. Box 30
State College, Pa 16804-0030



[OMPI users] Problem with private variables in modules

2010-03-10 Thread Justin Watson
Hello everyone,

I have come across a situation where I am trying to make 
private variables that are passed to subroutines using modules.  Here is the 
situation: the main program calls two different routines.  These routines are 
functionally different but utilize the same variable names for some global data 
which are contained in a module (this was done to make the passing of the data 
easy to various levels of subroutines; it is not needed outside the subroutine 
chain).  I am using workshare constructs to run each of these routines on its 
own thread.  I would like to make the data in the module private to that 
thread.  When I set the variable to private it still behaves as if it were 
shared.  If I pass the variable to the routines via an argument list everything 
is fine (but this will cause me to re-write a bunch of code).  The question is: 
shouldn't this work within the context of a module as well?

I have been getting different results using different compilers. 
I have tried Lahey and Intel, and they both show signs of not handling this 
properly.  I have attached a small test problem that mimics what I am doing in 
the large code.

Justin K. Watson
   Email: jkw...@arl.psu.edu
Research Assistant  
Phone: (814) 863-6754
Computational Methods Development Department   Fax: (814) 
865-3287


Applied Research Laboratory
The Pennsylvania State University
P.O. Box 30
State College, Pa 16804-0030



Hello.f90
Description: Hello.f90


PrintThreadM.f90
Description: PrintThreadM.f90


DataM.f90
Description: DataM.f90


Re: [OMPI users] Run-time problem

2009-03-06 Thread justin oppenheim
Please let me go over it again; maybe it will help clarify things a bit 
better.  All the OSes involved are Suse 10.3.

I have a place for the installed programs, say /programs.

In /programs I have installed openmpi and my mpi program, say my_mpi_program.  
When  I am in the working directory, my LD_LIBRARY_PATH does include both 

/programs/my_mpi_program/lib
/programs/openmpi/lib

And my PATH includes
/programs/my_mpi_program/bin
/programs/openmpi/bin

So, then I do

mpirun -machinefile machinefile  -np 20 my_mpi_program  

and I get

/programs/my_mpi_program: symbol lookup error: 
/programs/openmpi/lib/libmpi_cxx.so.0: undefined symbol: 
ompi_registered_datareps

When I configured openmpi, I did

./configure --prefix=/programs/openmpi
 
and then compiled it. Subsequently, I compiled my_mpi_program with the options:

MPI_CXX=/programs/openmpi/bin/mpicxx 
MPI_CC=/programs/openmpi/bin/mpicc 
MPI_INCLUDE=/programs/openmpi/include/
MPI_LIB=mpi 
MPI_LIBDIR=/programs/openmpi/lib/ 
MPI_LINKERFORPROGRAMS=/programs/openmpi/bin/mpicxx

Any clue?  The directory /programs is NFS-mounted on the nodes.

Many thanks again,

JO










--- On Thu, 3/5/09, justin oppenheim <jl09...@yahoo.com> wrote:
From: justin oppenheim <jl09...@yahoo.com>
Subject: Re: [OMPI users] Run-time problem
To: "Ralph Castain" <r...@lanl.gov>
List-Post: users@lists.open-mpi.org
Date: Thursday, March 5, 2009, 5:28 PM

Hi Ralph:

Sorry for my ignorance, but in your option 2: to what command should I add the 
option --prefix=path-to-install?  When I configure openmpi?  I already did that 
when I configured and compiled openmpi.  Also, in response to your option 1, I 
did add the paths to the openmpi libraries to the LD_LIBRARY_PATH in the .cshrc 
of the nodes.

Thank you,
JO

--- On Thu, 3/5/09, Ralph Castain <r...@lanl.gov> wrote:
From: Ralph Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Run-time problem
To: jl09...@yahoo.com
Cc: "Open MPI Users <us...@open-mpi.org>" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Thursday, March 5, 2009, 12:46 PM

First, you can add --launch-agent rsh to
 the command line and that will have OMPI use rsh.
It sounds like your remote nodes may not be seeing your OMPI install directory. 
Several ways you can resolve that - here are a couple:
1. add the install directory to your LD_LIBRARY_PATH in your .cshrc (or 
whatever shell rc you are using) - be sure this is being executed on the remote 
nodes
2. add --prefix=path-to-install on your cmd line - this will direct your remote 
procs to the proper libraries
Ralph

On Mar 5, 2009, at 10:18 AM, justin oppenheim wrote:
Maybe I should also add that the program
my_mpi_executable is locally installed under the same root directory as that 
under which openmpi-1.3 is installed.  This root directory is NFS-mounted on 
the working nodes.

Thanks,
JO

--- On Thu, 3/5/09, justin oppenheim <jl09...@yahoo.com> wrote:
From: justin oppenheim <jl09...@yahoo.com>
Subject: Re: [OMPI users] Run-time problem
To: "Ralph Castain" <r...@lanl.gov>
List-Post: users@lists.open-mpi.org
Date: Thursday, March 5, 2009, 12:04 PM

Hi Ralph:

Thanks for your prompt response. I am using openmpi-1.3, Suse 10.3. I installed 
openmpi-1.3 with the option

./configure --prefix=/where/to/install

and then just 

make all install

I thought the default connection mode is rsh, but I had to invoke ssh-agent in 
order not to have to enter passwords one by one.  How do I change to rsh?

Thanks,
JO

--- On Thu, 3/5/09, Ralph Castain <r...@lanl.gov> wrote:
From: Ralph Castain <r...@lanl.gov>
Subject: Re: [OMPI users] Run-time
 problem
To: jl09...@yahoo.com, "Open MPI Users" <us...@open-mpi.org>
List-Post: users@lists.open-mpi.org
Date: Thursday, March 5, 2009, 11:40 AM

Could you tell us what version of Open MPI you are using, a little about your 
system (I would assume you are using ssh?), and how this was configured?
ThanksRalph

On Mar 5, 2009, at 9:31 AM, justin oppenheim wrote:
Hi:

When I execute something like

mpirun
 -machinefile machinefile my_mpi_executable 

I get something like this 

my_mpi_executable symbol lookup error: remote_openmpi/lib/libmpi_cxx.so.0: 
undefined symbol: ompi_registered_datareps

where both my_mpi_executable and remote_openmpi are installed on NFS-mounted 
locations.

Any clue?

thanks

JO
   ___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

   
   





  


  

Re: [OMPI users] Run-time problem

2009-03-16 Thread justin oppenheim
Hi Jeff:

I managed to run it just recently.  It turns out that some libraries (libib*) 
were missing, as well as some others.  I learned this by trying to install an 
old version of openmpi that was in the repository of my Suse Linux; the 
"software manager" of Suse told me the missing libraries for the old openmpi. 
After installing these libraries, the already installed new openmpi (downloaded 
from open-mpi.org) works.  Maybe it is a good idea to spell this out on the 
open-mpi web site.  People might just install openmpi without knowing that 
there might be some missing libraries... 

Thanks!
JO 

--- On Sat, 3/14/09, Jeff Squyres <jsquy...@cisco.com> wrote:
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] Run-time problem
To: jl09...@yahoo.com, "Open MPI Users" <us...@open-mpi.org>
Cc: "Ralph Castain" <r...@lanl.gov>
List-Post: users@lists.open-mpi.org
Date: Saturday, March 14, 2009, 9:15 AM

Sorry for the delay in replying; this week unexpectedly turned exceptionally
hectic for several us...

On Mar 9, 2009, at 2:53 PM, justin oppenheim wrote:

> Yes. As I indicated earlier, I did use these options to compile my program
> 
> MPI_CXX=/programs/openmpi/bin/mpicxx
> MPI_CC=/programs/openmpi/bin/mpicc
> MPI_INCLUDE=/programs/openmpi/include/
> MPI_LIB=mpi /programs/openmpi/
> MPI_LIBDIR=/programs/openmpi/lib/
MPI_LINKERFORPROGRAMS=/programs/openmpi/bin/mpicxx

Ah; I think Ralph was asking because we don't know exactly how these
"environment variables" are being used to build your application.

> where /programs/openmpi/ is the chosen location for installing the openmpi
package (specifically, openmpi-1.3.tar.gz)  that I downloaded from 
www.open-mpi.org.

Can you ensure that you have exactly the same version of Open MPI installed on
all nodes in exactly the same location in the filesystem (it doesn't *have*
to be the same location on the filesystem on all the nodes, but it sure is
easier if it is).  Also be sure that when you mpirun across multiple nodes that
the same version of Open MPI (both executables and libraries) are being found on
all nodes.

> 
> Any clue? Again, my system is Suse 10.3 64-bit, which should be pretty
standard. Would another package openmpi-1.3-1.src.rpm work better for my system?
> 
> Thanks,
> 
> JO
> 
> 
> 
> 
> 
> --- On Mon, 3/9/09, Ralph Castain <r...@lanl.gov> wrote:
> From: Ralph Castain <r...@lanl.gov>
> Subject: Re: [OMPI users] Run-time problem
> To: jl09...@yahoo.com
> Cc: us...@open-mpi.org
> Date: Monday, March 9, 2009, 7:59 AM
> 
> Did you try compiling your program with the provided mpicc (or mpiCC,
mpif90, etc. - as appropriate) wrapper compiler? The wrapper compilers contain
all the required library definitions to make the application work.
> 
> Compiling without the wrapper compilers is a very bad idea...
> 
> Ralph
> 
> 
> On Mar 6, 2009, at 11:02 AM, justin oppenheim wrote:
> 
>> Please let me go over it again, and maybe it helps clarifying things a
bit better. All the OS involved are Suse 10.3.
>> 
>> I have a place for the the installed programs, say /programs.
>> 
>> In /programs I have installed openmpi and my mpi program, say
my_mpi_program.  When I am in the working directory, my LD_LIBRARY_PATH does
include both
>> 
>> /programs/my_mpi_program/lib
>> /programs/openmpi/lib
>> 
>> And my PATH includes
>> /programs/my_mpi_program/bin
>> /programs/openmpi/bin
>> 
>> So, then I do
>> 
>> mpirun -machinefile machinefile  -np 20 my_mpi_program

>> 
>> and I get
>> 
>> /programs/my_mpi_program: symbol lookup error:
/programs/openmpi/lib/libmpi_cxx.so.0: undefined symbol:
ompi_registered_datareps
>> 
>> When I configured openmpi, I did
>> 
>> ./configure --prefix=/programs/openmpi
>> 
>> and then compiled it. Subsequently, I compiled my_mpi_program with the
options:
>> 
>> MPI_CXX=/programs/openmpi/bin/mpicxx
>> MPI_CC=/programs/openmpi/bin/mpicc
>> MPI_INCLUDE=/programs/openmpi/include/
>> MPI_LIB=mpi
>> MPI_LIBDIR=/programs/openmpi/lib/
MPI_LINKERFORPROGRAMS=/programs/openmpi/bin/mpicxx
>> 
>> Any clue? The directory /programs is NFS-mounted on the nodes.
>> 
>> Many thanks again,
>> 
>> JO
>> 
>> --- On Thu, 3/5/09, justin oppenheim <jl09...@yahoo.com> wrote:
>> From: justin oppenheim <jl09...@yahoo.com>
>> Subject: Re: [OMPI users] Run-time problem
>> To: "Ralph Castain" <r...@lanl.gov>
>> Date: Thursday, March 5, 2009, 5:28 PM
>> 
>> Hi Ralph:
>> 
>> Sorry for my ignoranc

[OMPI users] CUDA IPC/RDMA Not Working

2016-03-30 Thread Justin Luitjens
 MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
   MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component
  v1.10.2)


Thanks,
Justin




Re: [OMPI users] CUDA IPC/RDMA Not Working

2016-03-30 Thread Justin Luitjens
We have figured this out.  It turns out that the first call to each 
MPI_Isend/Irecv is staged through the host but subsequent calls are not.
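
For anyone reproducing this, a minimal sketch of the warm-up pattern this
implies: do one throw-away exchange before the transfer you profile, so the
first, host-staged transfer is not the one nvprof records.  This is
illustrative only (buffer names and message size are made up) and assumes a
CUDA-aware Open MPI build with one GPU per rank:

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaSetDevice(rank);                  /* one GPU per rank, as in the K80 test */

    const int n = 1 << 20;                /* hypothetical message size */
    double *sbuf, *rbuf;
    cudaMalloc((void **)&sbuf, n * sizeof(double));
    cudaMalloc((void **)&rbuf, n * sizeof(double));

    int peer = 1 - rank;                  /* assumes exactly 2 ranks */
    MPI_Request req[2];

    /* warm-up exchange: the first transfer may be staged through the host */
    MPI_Irecv(rbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* measured exchange: subsequent transfers should go over CUDA IPC (D2D) */
    MPI_Irecv(rbuf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(sbuf, n, MPI_DOUBLE, peer, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    cudaFree(sbuf);
    cudaFree(rbuf);
    MPI_Finalize();
    return 0;
}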

Thanks,
Justin

From: Justin Luitjens
Sent: Wednesday, March 30, 2016 9:37 AM
To: us...@open-mpi.org
Subject: CUDA IPC/RDMA Not Working

Hello,

I have installed OpenMPI 1.10.2 with cuda support:

[jluitjens@dt03 repro]$ ompi_info --parsable --all | grep 
mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true


I'm trying to verify that GPUDirect is working and that messages aren't
traversing the host.  On a K80 I'm starting 2 MPI processes, where each takes
one of the GPUs of the K80.  They then do a send/receive of a certain size.

In addition, I'm recording a timeline with nvprof to visualize what is
happening.  What I'm expecting to happen is that there will be one D2D memcpy
on each device, corresponding to the send and the receive.  However, what I'm
seeing on each device is a D2H followed by an H2D copy, suggesting the data is
staged through the host.

Here is how I'm currently running the application:

mpirun --mca btl_smcuda_cuda_ipc_verbose 100 --mca btl_smcuda_use_cuda_ipc 1 
--mca btl smcuda,self --mca btl_openib_want_cuda_gdr 1 -np 2 nvprof -o 
profile.%p ./a.out



I'm getting the following diagnostic output:

[dt03:21732] Sending CUDA IPC REQ (try=1): myrank=1, mydev=1, peerrank=0
[dt03:21731] Sending CUDA IPC REQ (try=1): myrank=0, mydev=0, peerrank=1
[dt03:21731] Not sending CUDA IPC ACK because request already initiated
[dt03:21732] Analyzed CUDA IPC request: myrank=1, mydev=1, peerrank=0, 
peerdev=0 --> ACCESS=1
[dt03:21732] BTL smcuda: rank=1 enabling CUDA IPC to rank=0 on node=dt03
[dt03:21732] Sending CUDA IPC ACK:  myrank=1, mydev=1, peerrank=0, peerdev=0
[dt03:21731] Received CUDA IPC ACK, notifying PML: myrank=0, peerrank=1
[dt03:21731] BTL smcuda: rank=0 enabling CUDA IPC to rank=1 on node=dt03

Here it seems like IPC is correctly being enabled between ranks 0 and 1.

I have tried both very large and very small messages and they all seem to stage 
through the host.

What am I doing wrong?

For reference here is my ompi_info output:

[jluitjens@dt03 repro]$ ompi_info
 Package: Open MPI jluitjens@dt04 Distribution
Open MPI: 1.10.2
  Open MPI repo revision: v1.10.1-145-g799148f
   Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
  Open RTE repo revision: v1.10.1-145-g799148f
   Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
  OPAL repo revision: v1.10.1-145-g799148f
   OPAL release date: Jan 21, 2016
 MPI API: 3.0.0
Ident string: 1.10.2
  Prefix: 
/shared/devtechapps/mpi/gnu-4.7.3/openmpi-1.10.2/cuda-7.5
Configured architecture: x86_64-pc-linux-gnu
  Configure host: dt04
   Configured by: jluitjens
   Configured on: Tue Feb  9 10:56:22 PST 2016
  Configure host: dt04
Built by: jluitjens
Built on: Tue Feb  9 11:21:51 PST 2016
  Built host: dt04
  C bindings: yes
C++ bindings: yes
 Fort mpif.h: yes (all)
Fort use mpi: yes (limited: overloading)
   Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: no
Fort mpi_f08 compliance: The mpi_f08 module was not built
  Fort mpi_f08 subarrays: no
   Java bindings: no
  Wrapper compiler rpath: runpath
  C compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gcc
 C compiler absolute:
  C compiler family name: GNU
  C compiler version: 4.7.3
C++ compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/g++
  C++ compiler absolute: none
   Fort compiler: /shared/apps/rhel-6.2/tools/gcc-4.7.3/bin/gfortran
   Fort compiler abs:
 Fort ignore TKR: no
   Fort 08 assumed shape: no
  Fort optional args: no
  Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
   Fort STORAGE_SIZE: no
  Fort BIND(C) (all): no
  Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): no
   Fort TYPE,BIND(C): no
Fort T,BIND(C,name="a"): no
Fort PRIVATE: no
  Fort PROTECTED: no
   Fort ABSTRACT: no
   Fort ASYNCHRONOUS: no
  Fort PROCEDURE: no
 Fort USE...ONLY: no
   Fort C_FUNLOC: no
Fort f08 using wrappers: no
 Fort MPI_SIZEOF: no
 C profiling: yes
   C++ profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: no
  C++ exceptions: no
  Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
  OMPI progress: no, ORTE progress: yes, Event lib:
  yes)
   Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
 MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
   

Re: [OMPI users] Ssh launch code

2016-07-14 Thread Justin Cinkelj

Fork call location:
https://github.com/open-mpi/ompi-release/blob/v2.x/orte/mca/plm/rsh/plm_rsh_module.c#L911-921
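
Conceptually, the code at that location forks and then execs ssh with the
remote orted command line.  A stripped-down sketch of that fork/exec pattern
(illustrative only, not the actual Open MPI source; the host name and remote
command here are made up):

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *node = "node01";                   /* hypothetical target host */
    char *child_argv[] = { "ssh", (char *)node, "hostname", NULL };

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {                                /* child: becomes the ssh process */
        execvp("ssh", child_argv);
        perror("execvp");                          /* only reached if exec fails */
        _exit(1);
    }

    int status;                                    /* parent: wait for the launch */
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}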

BR Justin

On 07/14/2016 03:12 PM, larkym wrote:

Where in the code does the tree based launch via ssh occur in open-mpi?

I have read a few articles, but would like to understand it more, 
specifically the code that does it.


Thanks



Sent from my Verizon, Samsung Galaxy smartphone



[OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-28 Thread Justin Bronder
Brian Barrett wrote:
> On May 27, 2006, at 10:01 AM, Justin Bronder wrote:
>
>   
>> I've attached the required logs.  Essentially the problem seems to
>> be that the XL Compilers fail to recognize "__asm__ __volatile__" in
>> opal/include/sys/powerpc/atomic.h when building 64-bit.
>>
>> I've tried using various xlc wrappers such as gxlc and xlc_r to
>> no avail.  The current log uses xlc_r_64 which is just a one line
>> shell script forcing the -q64 option.
>>
>> The same works flawlessly with gcc-4.1.0.  I'm using the nightly
>> build in order to link with Torque's new shared libraries.
>>
>> Any help would be greatly appreciated.  For reference here are
>> a few other things that may provide more information.
>> 
>
> Can you send the config.log file generated by configure?  What else  
> is in the xlc_r_64 shell script, other than the -q64 option?
>
>
>   
I've attached the config.log, and here's what all of the *_64 scripts
look like.


node42 openmpi-1.0.3a1r10002 # cat /opt/ibmcmp/vac/8.0/bin/xlc_r_64
#!/bin/sh
xlc_r -q64 "$@"


Thanks,

-- 
Justin Bronder
University of Maine, Orono

Advanced Computing Research Lab
20 Godfrey Dr
Orono, ME 04473
www.clusters.umaine.edu

Mathematics Department
425 Neville Hall
Orono, ME 04469


WARNING: The virus scanner was unable to scan the next
attachment.  This attachment could possibly contain viruses
or other malicious programs.  The attachment could not be
scanned for the following reasons:

The file was corrupt

You are advised NOT to open this attachment unless you are
completely sure of its contents.  If in doubt, please
contact your system administrator.  

The identifier for this message is 'k4SEmG3W018957'.

The Management
PureMessage Admin <sys...@cs.indiana.edu>


config.log.tar.gz
Description: application/gzip


Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-31 Thread Justin Bronder

On 5/30/06, Brian Barrett <brbar...@open-mpi.org> wrote:


On May 28, 2006, at 8:48 AM, Justin Bronder wrote:

> Brian Barrett wrote:
>> On May 27, 2006, at 10:01 AM, Justin Bronder wrote:
>>
>>
>>> I've attached the required logs.  Essentially the problem seems to
>>> be that the XL Compilers fail to recognize "__asm__ __volatile__" in
>>> opal/include/sys/powerpc/atomic.h when building 64-bit.
>>>
>>> I've tried using various xlc wrappers such as gxlc and xlc_r to
>>> no avail.  The current log uses xlc_r_64 which is just a one line
>>> shell script forcing the -q64 option.
>>>
>>> The same works flawlessly with gcc-4.1.0.  I'm using the nightly
>>> build in order to link with Torque's new shared libraries.
>>>
>>> Any help would be greatly appreciated.  For reference here are
>>> a few other things that may provide more information.
>>>
>>
>> Can you send the config.log file generated by configure?  What else
>> is in the xlc_r_64 shell script, other than the -q64 option?

> I've attached the config.log, and here's what all of the *_64 scripts
> look like.

Can you try compiling without the -qkeyword=__volatile__?  It looks
like XLC now has some support for GCC-style inline assembly, but it
doesn't seem to be working in this case.  If that doesn't work, try
setting CFLAGS and CXXFLAGS to include -qnokeyword=asm, which should
disable GCC inline assembly entirely.  I don't have access to a linux
cluster with the XL compilers, so I can't verify this.  But it should
work.

Brian



No good, sadly.  The same error continues to appear.  I had actually initially
attempted to compile without -qkeyword=__volatile__, but had hoped to force xlc
to recognize it.  This is obviously more of an XL issue, especially as I've
since found that everything works flawlessly in 32-bit mode.  If anyone has
more suggestions, I'd love the help, as I'm lost at this point.

Thanks for the help thus far,

Justin.


Re: [OMPI users] [PMX:VIRUS] Re: OpenMPI 1.0.3a1r10002 Fails to build with IBM XL Compilers.

2006-05-31 Thread Justin Bronder

On 5/31/06, Brian W. Barrett <brbar...@open-mpi.org> wrote:


A quick workaround is to edit opal/include/opal_config.h and change the
#defines for OMPI_CXX_GCC_INLINE_ASSEMBLY and OMPI_CC_GCC_INLINE_ASSEMBLY
from 1 to 0.  That should allow you to build Open MPI with those XL
compilers.  Hopefully IBM will fix this in a future version ;).



Well I actually edited include/ompi_config.h and set both
OMPI_C_GCC_INLINE_ASSEMBLY
and OMPI_CXX_GCC_INLINE_ASSEMBLY to 0.  This worked until libtool tried to
create
a shared library:

/bin/sh ../libtool --tag=CC --mode=link gxlc_64  -O -DNDEBUG
-qnokeyword=asm   -export-dynamic   -o libopal.la -rpath
/usr/local/ompi-xl/lib   libltdl/libltdlc.la asm/libasm.la class/libclass.la
event/libevent.la mca/base/libmca_base.la memoryhooks/libopalmemory.la
runtime/libruntime.la threads/libthreads.la util/libopalutil.la
mca/maffinity/base/libmca_maffinity_base.la
mca/memory/base/libmca_memory_base.la
mca/memory/malloc_hooks/libmca_memory_malloc_hooks.la
mca/paffinity/base/libmca_paffinity_base.la
mca/timer/base/libmca_timer_base.la mca/timer/linux/libmca_timer_linux.la
-lm  -lutil -lnsl
mkdir .libs
gxlc_64 -shared  --whole-archive libltdl/.libs/libltdlc.a asm/.libs/libasm.a
class/.libs/libclass.a event/.libs/libevent.a mca/base/.libs/libmca_base.a
memoryhooks/.libs/libopalmemory.a runtime/.libs/libruntime.a
threads/.libs/libthreads.a util/.libs/libopalutil.a
mca/maffinity/base/.libs/libmca_maffinity_base.a
mca/memory/base/.libs/libmca_memory_base.a
mca/memory/malloc_hooks/.libs/libmca_memory_malloc_hooks.a
mca/paffinity/base/.libs/libmca_paffinity_base.a
mca/timer/base/.libs/libmca_timer_base.a
mca/timer/linux/.libs/libmca_timer_linux.a --no-whole-archive  -ldl -lm
-lutil -lnsl -lc  -qnokeyword=asm -soname libopal.so.0 -o
.libs/libopal.so.0.0.0
gxlc: 1501-257 Option --whole-archive is not recognized.  Option will be
ignored.
gxlc: 1501-257 Option --no-whole-archive is not recognized.  Option will be
ignored.
gxlc: 1501-257 Option -qnokeyword=asm is not recognized.  Option will be
ignored.
gxlc: 1501-257 Option -soname is not recognized.  Option will be ignored.
xlc: 1501-218 file libopal.so.0 contains an incorrect file suffix
xlc: 1501-228 input file libopal.so.0 not found
xlc -q64 -qthreaded -D_REENTRANT -lpthread -qmkshrobj
libltdl/.libs/libltdlc.a asm/.libs/libasm.a class/.libs/libclass.a
event/.libs/libevent.a mca/base/.libs/libmca_base.a
memoryhooks/.libs/libopalmemory.a runtime/.libs/libruntime.a
threads/.libs/libthreads.a util/.libs/libopalutil.a
mca/maffinity/base/.libs/libmca_maffinity_base.

I was able to fix this by editing libtool and replacing $CC with $LD in the
following:

# Commands used to build and install a shared archive.
archive_cmds="\$LD -shared \$libobjs \$deplibs \$compiler_flags
\${wl}-soname \$wl\$soname -o \$lib"
archive_expsym_cmds="\$echo \\\"{ global:\\\" >
\$output_objdir/\$libname.ver~
 cat \$export_symbols | sed -e \\\"s/(.*)/1;/\\\" >>
\$output_objdir/\$libname.ver~
 \$echo \\\"local: *; };\\\" >> \$output_objdir/\$libname.ver~
   \$LD -shared \$libobjs \$deplibs \$compiler_flags \${wl}-soname
\$wl\$soname \${wl}-version-script \${wl}\$output_objdir/\$libname.ver -o
\$lib"

We then fail later on at:

make[3]: Entering directory `/usr/src/openmpi-1.0.3a1r10133
/orte/tools/orted'
/bin/sh ../../../libtool --tag=CC --mode=link gxlc_64  -O -DNDEBUG
-export-dynamic   -o orted   orted.o ../../../orte/liborte.la
../../../opal/libopal.la  -lm  -lutil -lnsl
gxlc_64 -O -DNDEBUG -o .libs/orted orted.o --export-dynamic
../../../orte/.libs/liborte.so
/usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so
../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl --rpath
/usr/local/ompi-xl/lib
gxlc: 1501-257 Option --export-dynamic is not recognized.  Option will be
ignored.
gxlc: 1501-257 Option --rpath is not recognized.  Option will be ignored.
xlc: 1501-274 An incompatible level of gcc has been specified.
xlc: 1501-228 input file /usr/local/ompi-xl/lib not found
xlc -q64 -qthreaded -D_REENTRANT -lpthread -O -DNDEBUG -o .libs/orted
orted.o ../../../orte/.libs/liborte.so
/usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so
../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl /usr/local/ompi-xl/lib

Simply replacing ld for gxlc_64 here obviously won't work.
node42 orted # ld  -O -DNDEBUG -o .libs/orted orted.o --export-dynamic
../../../orte/.libs/liborte.so
/usr/src/openmpi-1.0.3a1r10133/opal/.libs/libopal.so
../../../opal/.libs/libopal.so -ldl -lm -lutil -lnsl --rpath
/usr/local/ompi-xl/lib -lpthread
ld: warning: cannot find entry symbol _start; defaulting to 10013ed8

Of course, I've been told that directly linking with ld isn't such a great
idea in the first
place.  Ideas?

Thanks,

Justin.


[OMPI users] OpenMpi 1.1 and Torque 2.1.1

2006-06-29 Thread Justin Bronder

I'm having trouble getting OpenMPI to execute jobs when submitting through
Torque.  Everything works fine if I use the included mpirun scripts, but this
is obviously not a good solution for the general users on the cluster.

I'm running under OS X 10.4, Darwin 8.6.0.  I configured OpenMpi with:
export CC=/opt/ibmcmp/vac/6.0/bin/xlc
export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
export LDFLAGS=-lSystemStubs
export LIBTOOL=glibtool

PREFIX=/usr/local/ompi-xl

./configure \
   --prefix=$PREFIX \
   --with-tm=/usr/local/pbs/ \
   --with-gm=/opt/gm \
   --enable-static \
   --disable-cxx

I also had to employ the fix listed in:
http://www.open-mpi.org/community/lists/users/2006/04/1007.php


I've attached the output of ompi_info while in an interactive job.  Looking
through the list,
I can at least save a bit of trouble by listing what does work.  Anything
outside of Torque
seems fine.  From within an interactive job, pbsdsh works fine, hence the
earlier problems
with poll are fixed.

Here is the error that is reported when I attempt to run hostname on one
processor:
node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1
-mca pls_tm_debug 1 /bin/hostname
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: orted --no-daemonize
--bootproxy 1 --name  --num_procs 2 --vpid_start 0 --nodename  --universe
jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "
0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set
prefix:/usr/local/ompi-xl
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node
localhost
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH:
/usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: found
/usr/local/ompi-xl/bin/orted
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed --
setting mpi_yield_when_idle to 0
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted
--no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0
--nodename localhost --universe
jbron...@node96.meldrew.clusters.umaine.edu:default-universe
--nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0
;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned
error -13
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found
in file rmgr_urm.c at line 184
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found
in file rmgr_urm.c at line 432
[node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with
errno=-13
node96:/usr/src/openmpi-1.1 jbronder$


My thanks for any help in advance,

Justin Bronder.


ompi_info.log.gz
Description: GNU Zip compressed data


Re: [OMPI users] Problem with Openmpi 1.1

2006-07-06 Thread Justin Bronder

As far as the nightly builds go, I'm still seeing what I believe to be
this problem in both r10670 and r10652.  This is happening with
both Linux and OS X.  Below are the systems and ompi_info for the
newest revision 10670.

As an example of the error, when running HPL with Myrinet I get the
following error.  Using tcp everything is fine and I see the results I'd
expect.

||Ax-b||_oo / ( eps * ||A||_1  * N) =
42820214496954887558164928727596662784.000 .. FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 156556068835.2711182 ..
FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 1156439380.5172558 ..
FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . =
272683853978565028754868928512.00
||A||_oo . . . . . . . . . . . . . . . . . . . =3822.884181
||A||_1  . . . . . . . . . . . . . . . . . . . =3823.922627
||x||_oo . . . . . . . . . . . . . . . . . . . =
37037692483529688659798261760.00
||x||_1  . . . . . . . . . . . . . . . . . . . =
4102704048669982798475494948864.00
===

Finished  1 tests with the following results:
 0 tests completed and passed residual checks,
 1 tests completed and failed residual checks,
 0 tests skipped because of illegal input values.


Linux node41 2.6.16.19 #1 SMP Wed Jun 21 17:22:01 EDT 2006 ppc64 PPC970FX,
altivec supported GNU/Linux
jbronder@node41 ~ $ /usr/local/ompi-gnu-1.1.1a/bin/ompi_info
   Open MPI: 1.1.1a1r10670
  Open MPI SVN revision: r10670
   Open RTE: 1.1.1a1r10670
  Open RTE SVN revision: r10670
   OPAL: 1.1.1a1r10670
  OPAL SVN revision: r10670
 Prefix: /usr/local/ompi-gnu-1.1.1a
Configured architecture: powerpc64-unknown-linux-gnu
  Configured by: root
  Configured on: Thu Jul  6 10:15:37 EDT 2006
 Configure host: node41
   Built by: root
   Built on: Thu Jul  6 10:28:14 EDT 2006
 Built host: node41
 C bindings: yes
   C++ bindings: yes
 Fortran77 bindings: yes (all)
 Fortran90 bindings: yes
Fortran90 bindings size: small
 C compiler: gcc
C compiler absolute: /usr/bin/gcc
   C++ compiler: g++
  C++ compiler absolute: /usr/bin/g++
 Fortran77 compiler: gfortran
 Fortran77 compiler abs:
/usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
 Fortran90 compiler: gfortran
 Fortran90 compiler abs:
/usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
C profiling: yes
  C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
 C++ exceptions: no
 Thread support: posix (mpi: no, progress: no)
 Internal debug support: no
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
 MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component v1.1.1)
  MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
  MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.1.1)
  MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.1)
  MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
  MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
   MCA coll: basic (MCA v1.0, API v1.0, Component v1.1.1)
  MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.1)
   MCA coll: self (MCA v1.0, API v1.0, Component v1.1.1)
   MCA coll: sm (MCA v1.0, API v1.0, Component v1.1.1)
   MCA coll: tuned (MCA v1.0, API v1.0, Component v1.1.1)
 MCA io: romio (MCA v1.0, API v1.0, Component v1.1.1)
  MCA mpool: gm (MCA v1.0, API v1.0, Component v1.1.1)
  MCA mpool: sm (MCA v1.0, API v1.0, Component v1.1.1)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.1.1)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.1.1)
 MCA rcache: rb (MCA v1.0, API v1.0, Component v1.1.1)
MCA btl: gm (MCA v1.0, API v1.0, Component v1.1.1)
MCA btl: self (MCA v1.0, API v1.0, Component v1.1.1)
MCA btl: sm (MCA v1.0, API v1.0, Component v1.1.1)
MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
   MCA topo: unity (MCA v1.0, API v1.0, Component v1.1.1)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.0)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.1.1)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.1.1)
MCA gpr: replica (MCA v1.0, API v1.0, Component v1.1.1)
MCA iof: proxy (MCA v1.0, API v1.0, Component v1.1.1)
MCA iof: svc (MCA v1.0, API v1.0, Component v1.1.1)
 

Re: [OMPI users] Problem with Openmpi 1.1

2006-07-06 Thread Justin Bronder

Disregard the failure on Linux; a rebuild from scratch of HPL and OpenMPI
seems to have resolved the issue.  At least I'm not getting the errors during
the residual checks.

However, this is persisting under OS X.

Thanks,
Justin.

On 7/6/06, Justin Bronder <jsbron...@gmail.com> wrote:


For OS X:
/usr/local/ompi-xl/bin/mpirun -mca btl gm -np 4 ./xhpl

For Linux:
ARCH=ompi-gnu-1.1.1a
/usr/local/$ARCH/bin/mpiexec -mca btl gm -np 2 -path /usr/local/$ARCH/bin
./xhpl

Thanks for the speedy response,
Justin.

On 7/6/06, Galen M. Shipman <gship...@lanl.gov> wrote:

> Hey Justin,
> Please provide us your mca parameters (if any), these could be in a
> config file, environment variables or on the command line.
>
> Thanks,
>
> Galen
>
> On Jul 6, 2006, at 9:22 AM, Justin Bronder wrote:
>
> As far as the nightly builds go, I'm still seeing what I believe to be
> this problem in both r10670 and r10652.  This is happening with
> both Linux and OS X.  Below are the systems and ompi_info for the
> newest revision 10670.
>
> As an example of the error, when running HPL with Myrinet I get the
> following error.  Using tcp everything is fine and I see the results I'd
>
> expect.
>
> 
> ||Ax-b||_oo / ( eps * ||A||_1  * N) =
> 42820214496954887558164928727596662784.000 .. FAILED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 156556068835.2711182.. 
FAILED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 1156439380.5172558 ..
> FAILED
> ||Ax-b||_oo  . . . . . . . . . . . . . . . . . =
> 272683853978565028754868928512.00
> ||A||_oo . . . . . . . . . . . . . . . . . . . =3822.884181
> ||A||_1  . . . . . . . . . . . . . . . . . . . =3823.922627
> ||x||_oo . . . . . . . . . . . . . . . . . . . =
> 37037692483529688659798261760.00
> ||x||_1  . . . . . . . . . . . . . . . . . . . =
> 4102704048669982798475494948864.00
> ===
>
> Finished  1 tests with the following results:
>   0 tests completed and passed residual checks,
>   1 tests completed and failed residual checks,
>   0 tests skipped because of illegal input values.
>
> 
>
> Linux node41 2.6.16.19 #1 SMP Wed Jun 21 17:22:01 EDT 2006 ppc64
> PPC970FX, altivec supported GNU/Linux
> jbronder@node41 ~ $ /usr/local/ompi- gnu-1.1.1a/bin/ompi_info
> Open MPI: 1.1.1a1r10670
>Open MPI SVN revision: r10670
> Open RTE: 1.1.1a1r10670
>Open RTE SVN revision: r10670
> OPAL: 1.1.1a1r10670
>OPAL SVN revision: r10670
>   Prefix: /usr/local/ompi-gnu-1.1.1a
>  Configured architecture: powerpc64-unknown-linux-gnu
>Configured by: root
>Configured on: Thu Jul  6 10:15:37 EDT 2006
>   Configure host: node41
> Built by: root
> Built on: Thu Jul  6 10:28:14 EDT 2006
>   Built host: node41
>   C bindings: yes
> C++ bindings: yes
>   Fortran77 bindings: yes (all)
>   Fortran90 bindings: yes
>  Fortran90 bindings size: small
>   C compiler: gcc
>  C compiler absolute: /usr/bin/gcc
> C++ compiler: g++
>C++ compiler absolute: /usr/bin/g++
>   Fortran77 compiler: gfortran
>   Fortran77 compiler abs:
> /usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
>   Fortran90 compiler: gfortran
>   Fortran90 compiler abs:
> /usr/powerpc64-unknown-linux-gnu/gcc-bin/4.1.0/gfortran
>  C profiling: yes
>C++ profiling: yes
>  Fortran77 profiling: yes
>  Fortran90 profiling: yes
>   C++ exceptions: no
>   Thread support: posix (mpi: no, progress: no)
>   Internal debug support: no
>  MPI parameter check: runtime
> Memory profiling support: no
> Memory debugging support: no
>  libltdl support: yes
>   MCA memory: ptmalloc2 (MCA v1.0, API v1.0, Component
> v1.1.1)
>MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.1.1)
>MCA maffinity: first_use (MCA v1.0, API v1.0, Component
> v1.1.1)
>MCA timer: linux (MCA v1.0, API v1.0, Component v1.1.1)
>MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
>MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> MCA coll: basic (MCA v1.0, API v1.0, Componentv1.1.1)
>
>MCA coll: hierarch (MCA v1.0, API v1.0, Component v1.1.1)
> MCA coll: self (MCA v1.0, API v1

Re: [OMPI users] Problem with Openmpi 1.1

2006-07-06 Thread Justin Bronder

With 1.0.3a1r10670 the same problem is occurring, with the same configure
arguments as before.  For clarity, the Myrinet driver we are using is 2.0.21.

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$ gm_board_info
GM build ID is "2.0.21_MacOSX_rc20050429075134PDT
r...@node96.meldrew.clusters.umaine.edu:/usr/src/gm-2.0.21_MacOSX Fri Jun 16
14:39:45 EDT 2006."

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$
/usr/local/ompi-xl-1.0.3/bin/mpirun
-np 2 xhpl
This succeeds.
||Ax-b||_oo / ( eps * ||A||_1  * N) =0.1196787 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =0.0283195 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =0.0063300 .. PASSED

node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$
/usr/local/ompi-xl-1.0.3/bin/mpirun
-mca btl gm -np 2 xhpl
This fails.
||Ax-b||_oo / ( eps * ||A||_1  * N) =
717370209518881444284334080.000 .. FAILED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 226686309135.4274597 ..
FAILED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 2386641249.6518722 ..
FAILED
||Ax-b||_oo  . . . . . . . . . . . . . . . . . = 2037398812542965504.00
||A||_oo . . . . . . . . . . . . . . . . . . . =2561.554752
||A||_1  . . . . . . . . . . . . . . . . . . . =2558.129237
||x||_oo . . . . . . . . . . . . . . . . . . . = 300175355203841216.00
||x||_1  . . . . . . . . . . . . . . . . . . . = 31645943341479366656.00

Does anyone have a working system with OS X and Myrinet (GM)?  If so, I'd
love to hear
the configure arguments and various versions you are using.  Bonus points if
you are
using the IBM XL compilers.

Thanks,
Justin.


On 7/6/06, Justin Bronder <jsbron...@gmail.com> wrote:


Yes, that output was actually cut and pasted from an OS X run.  I'm about
to test
against 1.0.3a1r10670.

Justin.

On 7/6/06, Galen M. Shipman <gship...@lanl.gov> wrote:

> Justin,
> Is the OS X run showing the same residual failure?
>
> - Galen
>
> On Jul 6, 2006, at 10:49 AM, Justin Bronder wrote:
>
> Disregard the failure on Linux, a rebuild from scratch of HPL and
> OpenMPI
> seems to have resolved the issue.  At least I'm not getting the errors
> during
> the residual checks.
>
> However, this is persisting under OS X.
>
> Thanks,
> Justin.
>
> On 7/6/06, Justin Bronder < jsbron...@gmail.com> wrote:
>
> > For OS X:
> > /usr/local/ompi-xl/bin/mpirun -mca btl gm -np 4 ./xhpl
> >
> > For Linux:
> > ARCH=ompi-gnu-1.1.1a
> > /usr/local/$ARCH/bin/mpiexec -mca btl gm -np 2 -path
> > /usr/local/$ARCH/bin ./xhpl
> >
> > Thanks for the speedy response,
> > Justin.
> >
> > On 7/6/06, Galen M. Shipman < gship...@lanl.gov> wrote:
> >
> > > Hey Justin,
> > Please provide us your mca parameters (if any), these could be in a
> > config file, environment variables or on the command line.
> >
> > Thanks,
> >
> > Galen
> >
> >  On Jul 6, 2006, at 9:22 AM, Justin Bronder wrote:
> >
> > As far as the nightly builds go, I'm still seeing what I believe to be
> >
> > this problem in both r10670 and r10652.  This is happening with
> > both Linux and OS X.  Below are the systems and ompi_info for the
> > newest revision 10670.
> >
> > As an example of the error, when running HPL with Myrinet I get the
> > following error.  Using tcp everything is fine and I see the results
> > I'd
> > expect.
> >
> > 
> > ||Ax-b||_oo / ( eps * ||A||_1  * N) =
> > 42820214496954887558164928727596662784.000 .. FAILED
> > ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 156556068835.2711182.. 
FAILED
> > ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 1156439380.5172558.. 
FAILED
> > ||Ax-b||_oo  . . . . . . . . . . . . . . . . . =
> > 272683853978565028754868928512.00
> > ||A||_oo . . . . . . . . . . . . . . . . . . . =3822.884181
> > ||A||_1  . . . . . . . . . . . . . . . . . . . =3823.922627
> > ||x||_oo . . . . . . . . . . . . . . . . . . . =
> > 37037692483529688659798261760.00
> > ||x||_1  . . . . . . . . . . . . . . . . . . . =
> > 4102704048669982798475494948864.00
> > ===
> >
> > Finished  1 tests with the following results:
> >   0 tests completed and passed residual checks,
> >   1 tests completed and failed residual checks,
> >   0 tests skipped because of illegal input values.
> >
> > 
> >
> > Linux node41 2.6.16.19 #1 SMP Wed Jun 21 17:22:01

Re: [OMPI users] Problem with Openmpi 1.1

2006-07-08 Thread Justin Bronder
1.)  Compiling without XL will take a little while, but I have the setup
for the
other questions ready now.  I figured I'd answer them right away.

2.)  TCP works fine, and is quite quick compared to mpich-1.2.7p1, by the way.
I just reverified this.
WR11C2R45000   160 1 2  10.10  8.253e+00
||Ax-b||_oo / ( eps * ||A||_1  * N) =0.0412956 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =0.0272613 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =0.0053214 .. PASSED


3.)  Exactly same setup, using mpichgm-1.2.6..14b
WR11C2R45000   160 1 2  10.43  7.994e+00

||Ax-b||_oo / ( eps * ||A||_1  * N) =0.0353693 .. PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =0.0233491 .. PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =0.0045577 .. PASSED

It also worked with mpichgm-1.2.6..15  (I believe this is the version, I
don't have
a node up with it at the moment).

Obviously mpich-1.2.7p1 works as well over ethernet.


Anyway, I'll begin the build with the standard gcc compilers that are included
with OS X.  This is powerpc-apple-darwin8-gcc-4.0.1.

Thanks,

Justin.

Jeff Squyres (jsquyres) wrote:
> Justin --
>  
> Can we eliminate some variables so that we can figure out where the
> error is originating?
>  
> - Can you try compiling without the XL compilers?
> - Can you try running with just TCP (and not Myrinet)?
> - With the same support library installation (such as BLAS, etc.,
> assumedly also compiled with XL), can you try another MPI (e.g., LAM,
> MPICH-gm, whatever)?
>
> Let us know what you find.  Thanks!
>  
>
> 
> *From:* users-boun...@open-mpi.org
> [mailto:users-boun...@open-mpi.org] *On Behalf Of *Justin Bronder
> *Sent:* Thursday, July 06, 2006 3:16 PM
> *To:* Open MPI Users
> *Subject:* Re: [OMPI users] Problem with Openmpi 1.1
>
> With 1.0.3a1r10670 the same problem is occuring.  Again the same
> configure arguments
> as before.  For clarity, the Myrinet drive we are using is 2.0.21
>
> node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$ gm_board_info
> GM build ID is "2.0.21_MacOSX_rc20050429075134PDT
> r...@node96.meldrew.clusters.umaine.edu:/usr/src/gm-2.0.21_MacOSX
> Fri Jun 16 14:39:45 EDT 2006."
>
> node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$
> /usr/local/ompi-xl-1.0.3/bin/mpirun -np 2 xhpl
> This succeeds.
> ||Ax-b||_oo / ( eps * ||A||_1  * N) =0.1196787
> .. PASSED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =0.0283195
> .. PASSED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =0.0063300
> .. PASSED
>
> node90:~/src/hpl/bin/ompi-xl-1.0.3 jbronder$
> /usr/local/ompi-xl-1.0.3/bin/mpirun -mca btl gm -np 2 xhpl
> This fails.
> ||Ax-b||_oo / ( eps * ||A||_1  * N) =
> 717370209518881444284334080.000 .. FAILED
> ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) = 226686309135.4274597
> .. FAILED
> ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 2386641249.6518722
> .. FAILED
> ||Ax-b||_oo  . . . . . . . . . . . . . . . . . =
> 2037398812542965504.00
> ||A||_oo . . . . . . . . . . . . . . . . . . . =2561.554752
> ||A||_1  . . . . . . . . . . . . . . . . . . . =2558.129237
> ||x||_oo . . . . . . . . . . . . . . . . . . . =
> 300175355203841216.00
> ||x||_1  . . . . . . . . . . . . . . . . . . . =
> 31645943341479366656.00
>
> Does anyone have a working system with OS X and Myrinet (GM)?  If
> so, I'd love to hear
> the configure arguments and various versions you are using.  Bonus
> points if you are
> using the IBM XL compilers.
>
> Thanks,
> Justin.
>
>
> On 7/6/06, *Justin Bronder* <jsbron...@gmail.com
> <mailto:jsbron...@gmail.com>> wrote:
>
> Yes, that output was actually cut and pasted from an OS X
> run.  I'm about to test
> against 1.0.3a1r10670.
>
> Justin.
>
> On 7/6/06, *Galen M. Shipman* < gship...@lanl.gov
> <mailto:gship...@lanl.gov>> wrote:
>
> Justin, 
>
> Is the OS X run showing the same residual failure?
>
> - Galen 
>
> On Jul 6, 2006, at 10:49 AM, Justin Bronder wrote:
>
> Disregard the failure on Linux, a rebuild from scratch of
>     HPL and OpenMPI
>

Re: [OMPI users] problem abut openmpi running

2006-10-19 Thread Justin Bronder

On a number of my Linux machines, /usr/local/lib is not searched by ldconfig,
and hence libraries installed there are not going to be found by the runtime
linker.  You can fix this by adding /usr/local/lib to /etc/ld.so.conf and
running ldconfig (add the -v flag if you want to see the output).

-Justin.

On 10/19/06, Durga Choudhury <dpcho...@gmail.com> wrote:


George

I knew that was the answer to Calin's question, but I still would like to
understand the issue:

by default, the openMPI installer installs the libraries in
/usr/local/lib, which is a standard location for the C compiler to look for
libraries. So *why* do I need to explicitly specify this with
LD_LIBRARY_PATH? For example, when I am compiling with pthread calls and
pass -lpthread to gcc, I need not specify the location of libpthread.so with
LD_LIBRARY_PATH. I had the same problem as Calin so I am curious. This
is assuming he has not redirected the installation path to some non-standard
location.

Thanks
Durga


On 10/19/06, George Bosilca <bosi...@cs.utk.edu> wrote:
>
> Calin,
>
> Look like you're missing a proper value for the LD_LIBRARY_PATH.
> Please read the Open MPI FAW at http://www.open-mpi.org/faq/?
> category=running.
>
>   Thanks,
> george.
>
> On Oct 19, 2006, at 6:41 AM, calin pal wrote:
>
> >
> >   hi,
> >  i m calin from india. i m working on openmpi. i
> > have installed openmpi 1.1.1-tar.gz in four machines in our college
> > lab. in one system the openmpi is properly working. i have written
> > "hello world" program in all machines. but in one machine its
> > working properly. in other machine gives
> > ((
> > (hello:error while loading shared libraries:libmpi.so..o;cannot
> > open shared object file:no such file or directory.)
> >
> >
> > what is the problem plz tel me..and how to solve it..please
> > tell me
> >
> > calin pal
> > india
> > fergusson college
> > msc.tech(maths and computer sc.)
> >



--
Devil wanted omnipresence;
He therefore created communists.




[OMPI users] Open-MPI 1.2 and GM

2007-03-27 Thread Justin Bronder

Having a user who requires some of the features of gfortran in 4.1.2, I
recently began building a new image.  The issue is that "-mca btl gm" fails
while "-mca mtl gm" works.  I have not yet done any benchmarking, as I was
wondering if the move to mtl is part of the upgrade.  Below are the packages
I rebuilt.

Kernel 2.6.16.27 -> 2.6.20.1
Gcc 4.1.1 -> 4.1.2
GM Drivers 2.0.26 -> 2.0.26 (with patches for newer kernels)
OpenMPI 1.1.4 -> 1.2


The following works as expected:
/usr/local/ompi-gnu/bin/mpirun -np 4 -mca mtl gm --host node84,node83 ./xhpl

The following fails:
/usr/local/ompi-gnu/bin/mpirun -np 4 -mca btl gm --host node84,node83 ./xhpl

I've attached gzipped files as suggested in the "Getting Help" section of the
website and the output from the failed mpirun.  Both nodes are known good
Myrinet nodes, using FMA to map.


Thanks in advance,

-- 
Justin Bronder

Advanced Computing Research Lab
University of Maine, Orono
20 Godfrey Dr
Orono, ME 04473
www.clusters.umaine.edu


config.log.gz
Description: Binary data


ompi_info.gz
Description: Binary data
--
Process 0.1.2 is unable to reach 0.1.2 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.1 is unable to reach 0.1.1 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.0 is unable to reach 0.1.0 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (goodbye)
--
Process 0.1.3 is unable to reach 0.1.3 for MPI communication.
If you specified the use of a BTL component, you may have
forgotten a component (such as "self") in the list of 
usable components.
--
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
prob

[OMPI users] Strange errors when running mpirun

2016-09-22 Thread Justin Chang
Dear all,

So I upgraded/updated my Homebrew on my Macbook and installed Open MPI
2.0.1 using "brew install openmpi". However, when I open up a terminal
and type "mpirun -n 1" I get the following messages:

~ mpirun -n 1
[Justins-MacBook-Pro-2.local:20793] [[13318,0],0] bind() failed on
error Address already in use (48)
[Justins-MacBook-Pro-2.local:20793] [[13318,0],0] ORTE_ERROR_LOG:
Error in file oob_usock_component.c at line 228
--
No executable was specified on the mpirun command line.

Aborting.
--


I have never seen anything like the first two lines. I also installed
python and mpi4py via pip, and I still get the same messages:

~ python -c "from mpi4py import MPI"
[Justins-MacBook-Pro-2.local:20871] [[13496,0],0] bind() failed on
error Address already in use (48)
[Justins-MacBook-Pro-2.local:20871] [[13496,0],0] ORTE_ERROR_LOG:
Error in file oob_usock_component.c at line 228

But now if I add "mpirun -n 1" I get the following:

~ mpirun -n 1 python -c "from mpi4py import MPI"
[Justins-MacBook-Pro-2.local:20935] [[13560,0],0] bind() failed on
error Address already in use (48)
[Justins-MacBook-Pro-2.local:20935] [[13560,0],0] ORTE_ERROR_LOG:
Error in file oob_usock_component.c at line 228
[Justins-MacBook-Pro-2.local:20936] [[13560,1],0]
usock_peer_send_blocking: send() to socket 17 failed: Socket is not
connected (57)
[Justins-MacBook-Pro-2.local:20936] [[13560,1],0] ORTE_ERROR_LOG:
Unreachable in file oob_usock_connection.c at line 315
[Justins-MacBook-Pro-2.local:20936] [[13560,1],0]
orte_usock_peer_try_connect: usock_peer_send_connect_ack to proc
[[13560,0],0] failed: Unreachable (-12)
[Justins-MacBook-Pro-2:20936] *** Process received signal ***
[Justins-MacBook-Pro-2:20936] Signal: Segmentation fault: 11 (11)
[Justins-MacBook-Pro-2:20936] Signal code:  (0)
[Justins-MacBook-Pro-2:20936] Failing at address: 0x0
---
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
---
--
mpirun detected that one or more processes exited with non-zero
status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[13560,1],0]
  Exit code:1
--

Clearly something is wrong here. I already tried things like "rm -rf
$TMPDIR/openmpi-sessions-*" but said directory keeps reappearing and
the error persists. Why does this happen and how do I fix it? For what
it's worth, here's some other information that may help:

~ mpicc --version
Apple LLVM version 8.0.0 (clang-800.0.38)
Target: x86_64-apple-darwin15.6.0
Thread model: posix
InstalledDir: 
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

I tested Hello World with both mpicc and mpif90, and they still work
despite showing those two error/warning messages.
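
(For reference, the hello world used for this kind of sanity check is nothing
more than a minimal sketch like the one below, built with the mpicc shown
above; the exact printf wording is of course arbitrary.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}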

Thanks,
Justin


Re: [OMPI users] Strange errors when running mpirun

2016-09-22 Thread Justin Chang
"mpirun -n 1" was just to demonstrate that I get those error messages.
I ran a simple helloworld.c and it still gives those two messages.

I did delete openmpi-sessions-* from my $TMPDIR but it doesn't solve
the problem. Here's my $TMPDIR:

~ cd $TMPDIR
~ pwd
/var/folders/jd/qh5zn6jn5kz_byz9gxz5kl2mgn/T
~ ls
MediaCache
TemporaryItems
com.apple.AddressBook.ContactsAccountsService
com.apple.AddressBook.InternetAccountsBridge
com.apple.AirPlayUIAgent
com.apple.BKAgentService
com.apple.CalendarAgent
com.apple.CalendarAgent.CalNCService
com.apple.CloudPhotosConfiguration
com.apple.DataDetectorsDynamicData
com.apple.ICPPhotoStreamLibraryService
com.apple.InputMethodKit.TextReplacementService
com.apple.PhotoIngestService
com.apple.Preview
com.apple.Safari
com.apple.SocialPushAgent
com.apple.WeatherKitService
com.apple.cloudphotosd
com.apple.dt.XCDocumenter.XCDocumenterExtension
com.apple.dt.XcodeBuiltInExtensions
com.apple.geod
com.apple.iCal.CalendarNC
com.apple.lateragent
com.apple.ncplugin.stocks
com.apple.ncplugin.weather
com.apple.notificationcenterui.WeatherSummary
com.apple.photolibraryd
com.apple.photomoments
com.apple.quicklook.ui.helper
com.apple.soagent
com.getdropbox.dropbox.garcon
icdd501
ics21406
openmpi-sessions-501@Justins-MacBook-Pro-2_0
pmix-12195
pmix-12271
pmix-12289
pmix-12295
pmix-12304
pmix-12313
pmix-12367
pmix-12397
pmix-12775
pmix-12858
pmix-17118
pmix-1754
pmix-20632
pmix-20793
pmix-20849
pmix-21019
pmix-22316
pmix-8129
pmix-8494
xcrun_db
~ rm -rf openmpi-sessions-501@Justins-MacBook-Pro-2_0
~ mpirun -n 1
[Justins-MacBook-Pro-2.local:22527] [[12992,0],0] bind() failed on
error Address already in use (48)
[Justins-MacBook-Pro-2.local:22527] [[12992,0],0] ORTE_ERROR_LOG:
Error in file oob_usock_component.c at line 228
--
No executable was specified on the mpirun command line.

Aborting.
--

and when I type "ls" the directory
"openmpi-sessions-501@Justins-MacBook-Pro-2_0" reappeared. Unless
there's a different directory I need to look for?

On Thu, Sep 22, 2016 at 4:08 AM, r...@open-mpi.org <r...@open-mpi.org> wrote:
> Maybe I’m missing something, but “mpirun -n 1” doesn’t include the name of an 
> application to execute.
>
> The error message prior to that error indicates that you have some cruft 
> sitting in your tmpdir. You just need to clean it out - look for something 
> that starts with “openmpi”
>
>
>> On Sep 22, 2016, at 1:45 AM, Justin Chang <jychan...@gmail.com> wrote:
>>
>> Dear all,
>>
>> So I upgraded/updated my Homebrew on my Macbook and installed Open MPI
>> 2.0.1 using "brew install openmpi". However, when I open up a terminal
>> and type "mpirun -n 1" I get the following messages:
>>
>> ~ mpirun -n 1
>> [Justins-MacBook-Pro-2.local:20793] [[13318,0],0] bind() failed on
>> error Address already in use (48)
>> [Justins-MacBook-Pro-2.local:20793] [[13318,0],0] ORTE_ERROR_LOG:
>> Error in file oob_usock_component.c at line 228
>> --
>> No executable was specified on the mpirun command line.
>>
>> Aborting.
>> --
>>
>>
>> I have never seen anything like the first two lines. I also installed
>> python and mpi4py via pip, and when I still get the same messages:
>>
>> ~ python -c "from mpi4py import MPI"
>> [Justins-MacBook-Pro-2.local:20871] [[13496,0],0] bind() failed on
>> error Address already in use (48)
>> [Justins-MacBook-Pro-2.local:20871] [[13496,0],0] ORTE_ERROR_LOG:
>> Error in file oob_usock_component.c at line 228
>>
>> But now if I add "mpirun -n 1" I get the following:
>>
>> ~ mpirun -n 1 python -c "from mpi4py import MPI"
>> [Justins-MacBook-Pro-2.local:20935] [[13560,0],0] bind() failed on
>> error Address already in use (48)
>> [Justins-MacBook-Pro-2.local:20935] [[13560,0],0] ORTE_ERROR_LOG:
>> Error in file oob_usock_component.c at line 228
>> [Justins-MacBook-Pro-2.local:20936] [[13560,1],0]
>> usock_peer_send_blocking: send() to socket 17 failed: Socket is not
>> connected (57)
>> [Justins-MacBook-Pro-2.local:20936] [[13560,1],0] ORTE_ERROR_LOG:
>> Unreachable in file oob_usock_connection.c at line 315
>> [Justins-MacBook-Pro-2.local:20936] [[13560,1],0]
>> orte_usock_peer_try_connect: usock_peer_send_connect_ack to proc
>> [[13560,0],0] failed: Unreachable (-12)
>> [Justins-MacBook-Pro-2:20936] *** Process 

Re: [OMPI users] Strange errors when running mpirun

2016-09-22 Thread Justin Chang
Oh, so setting this in my ~/.profile

export TMPDIR=/tmp

in fact solves my problem completely! Not sure why this is the case, but thanks!

Justin

On Thu, Sep 22, 2016 at 7:33 AM, Gilles Gouaillardet
<gilles.gouaillar...@gmail.com> wrote:
> Justin,
>
> i do not see this error on my laptop
>
> which version of OS X are you running ?
>
> can you try to
> TMPDIR=/tmp mpirun -n 1
>
> Cheers,
>
> Gilles
>
> On Thu, Sep 22, 2016 at 7:21 PM, Nathan Hjelm <hje...@me.com> wrote:
>> FWIW it works fine for me on my MacBook Pro running 10.12 with Open MPI 
>> 2.0.1 installed through homebrew:
>>
>> ✗ brew -v
>> Homebrew 1.0.0 (git revision c3105; last commit 2016-09-22)
>> Homebrew/homebrew-core (git revision 227e; last commit 2016-09-22)
>>
>> ✗ brew info openmpi
>>
>> open-mpi: stable 2.0.1 (bottled), HEAD
>> High performance message passing library
>> https://www.open-mpi.org/
>> Conflicts with: lcdf-typetools, mpich
>> /usr/local/Cellar/open-mpi/2.0.1 (688 files, 8.3M) *
>>   Poured from bottle on 2016-09-22 at 03:53:35
>> From: 
>> https://github.com/Homebrew/homebrew-core/blob/master/Formula/open-mpi.rb
>> ==> Dependencies
>> Required: libevent ✔
>> ==> Options
>> --c++11
>> Build using C++11 mode
>> --with-cxx-bindings
>> Enable C++ MPI bindings (deprecated as of MPI-3.0)
>> --with-java
>> Build with java support
>> --with-mpi-thread-multiple
>> Enable MPI_THREAD_MULTIPLE
>> --without-fortran
>> Build without fortran support
>> --HEAD
>> Install HEAD version
>>
>> ✗ type -p mpicc
>> mpicc is /usr/local/bin/mpicc
>>
>> ✗ mpirun --version
>> mpirun (Open MPI) 2.0.1
>>
>> Report bugs to http://www.open-mpi.org/community/help/
>>
>>
>> ✗ mpirun ./ring_c
>> Process 0 sending 10 to 1, tag 201 (4 processes in ring)
>> Process 0 sent to 1
>> Process 0 decremented value: 9
>> Process 0 decremented value: 8
>> Process 0 decremented value: 7
>> Process 0 decremented value: 6
>> Process 0 decremented value: 5
>> Process 0 decremented value: 4
>> Process 0 decremented value: 3
>> Process 0 decremented value: 2
>> Process 0 decremented value: 1
>> Process 0 decremented value: 0
>> Process 0 exiting
>> Process 1 exiting
>> Process 2 exiting
>> Process 3 exiting
>>
>>
>> -Nathan
>>
>>> On Sep 22, 2016, at 3:31 AM, Justin Chang <jychan...@gmail.com> wrote:
>>>
>>> I tried that and also deleted everything inside $TMPDIR. The error
>>> still persists
>>>
>>> On Thu, Sep 22, 2016 at 4:21 AM, r...@open-mpi.org <r...@open-mpi.org> 
>>> wrote:
>>>> Try removing the “pmix” entries as well
>>>>
>>>>> On Sep 22, 2016, at 2:19 AM, Justin Chang <jychan...@gmail.com> wrote:
>>>>>
>>>>> "mpirun -n 1" was just to demonstrate that I get those error messages.
>>>>> I ran a simple helloworld.c and it still gives those two messages.
>>>>>
>>>>> I did delete openmpi-sessions-* from my $TMPDIR but it doesn't solve
>>>>> the problem. Here's my $TMPDIR:
>>>>>
>>>>> ~ cd $TMPDIR
>>>>> ~ pwd
>>>>> /var/folders/jd/qh5zn6jn5kz_byz9gxz5kl2mgn/T
>>>>> ~ ls
>>>>> MediaCache
>>>>> TemporaryItems
>>>>> com.apple.AddressBook.ContactsAccountsService
>>>>> com.apple.AddressBook.InternetAccountsBridge
>>>>> com.apple.AirPlayUIAgent
>>>>> com.apple.BKAgentService
>>>>> com.apple.CalendarAgent
>>>>> com.apple.CalendarAgent.CalNCService
>>>>> com.apple.CloudPhotosConfiguration
>>>>> com.apple.DataDetectorsDynamicData
>>>>> com.apple.ICPPhotoStreamLibraryService
>>>>> com.apple.InputMethodKit.TextReplacementService
>>>>> com.apple.PhotoIngestService
>>>>> com.apple.Preview
>>>>> com.apple.Safari
>>>>> com.apple.SocialPushAgent
>>>>> com.apple.WeatherKitService
>>>>> com.apple.cloudphotosd
>>>>> com.apple.dt.XCDocumenter.XCDocumenterExtension
>>>>> com.apple.dt.XcodeBuiltInExtensions
>>>>> com.apple.geod
>>>>> com.apple.iCal.CalendarNC
>>>>> com.apple.lateragent
>>>>> com.apple.ncplugin.stocks
>>>>> com.apple.ncplugin.weat

Re: [OMPI users] Strange errors when running mpirun

2016-09-30 Thread Justin Chang
Thank you, using the default $TMPDIR works now.

On Fri, Sep 30, 2016 at 7:32 AM, Gilles Gouaillardet <
gilles.gouaillar...@gmail.com> wrote:

> Justin and all,
>
> the root cause is indeed a bug i fixed in
> https://github.com/open-mpi/ompi/pull/2135
> i also had this patch applied to home-brew, so if you re-install
> open-mpi, you should be fine.
>
> Cheers,
>
> Gilles
>
> for those who want to know more
- Open MPI uses two Unix sockets, one by oob/usock and one by pmix
> - to keep things simple, oob/usock Unix socket is based on $TMPDIR,
> hostname and quite a few more characters.
  OSX default $TMPDIR is not short, so when we append the FQDN (which
might not be short either) and other paths, the size may exceed the max
allowed path for a Unix socket (104 bytes on Yosemite; a small check
program after this list prints that limit). this path is currently silently
truncated, so bad/non-understandable things can happen. the patch
disqualifies oob/usock instead of silently truncating the path.
> a simple workaround is to
> export TMPDIR=/tmp
> a better workaround is to
> mpirun --mca oob ^usock ...
> or you can add to your environment
export OMPI_MCA_oob=^usock
> and then use mpirun as usual
> - pmix Unix socket path is only based on $TMPDIR plus a few extra
> characters
bottom line: unless your $TMPDIR is insanely long, you should be
> fine with one of these workarounds, or the patch available at
> https://github.com/open-mpi/ompi/pull/2135.patch, or by using the
> latest open-mpi from homebrew.
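
For reference, a small check program for the sun_path limit mentioned above (a
minimal sketch; on OS X the printed value is 104):

#include <stdio.h>
#include <sys/un.h>

int main(void)
{
    struct sockaddr_un sa;
    /* sun_path is a fixed-size array; a rendezvous file placed under a long
     * $TMPDIR can overflow it, which is the truncation described above */
    printf("max unix-domain socket path on this system: %zu bytes\n",
           sizeof(sa.sun_path));
    return 0;
}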
>
> On Fri, Sep 23, 2016 at 11:15 AM, Gilles Gouaillardet <gil...@rist.or.jp>
> wrote:
> > Justin,
> >
> >
> > the root cause could be the length of $TMPDIR, which might cause some
> > paths to be truncated.
> >
> > you can check that by simply using a custom $TMPDIR that has the same
> > size as the original one
> >
> >
> > which version of OSX are you running ?
> >
> > this might explain why neither Nathan nor i was able to reproduce the issue, and
> > i'd like to understand why this
> >
> > issue went undetected by Open MPI
> >
> >
> > Cheers,
> >
> >
> > Gilles
> >
> >
> >
> > On 9/23/2016 3:12 AM, Justin Chang wrote:
> >>
> >> Oh, so setting this in my ~/.profile
> >>
> >> export TMPDIR=/tmp
> >>
> >> in fact solves my problem completely! Not sure why this is the case, but
> >> thanks!
> >>
> >> Justin
> >>
> >> On Thu, Sep 22, 2016 at 7:33 AM, Gilles Gouaillardet
> >> <gilles.gouaillar...@gmail.com> wrote:
> >>>
> >>> Justin,
> >>>
> >>> i do not see this error on my laptop
> >>>
> >>> which version of OS X are you running ?
> >>>
> >>> can you try to
> >>> TMPDIR=/tmp mpirun -n 1
> >>>
> >>> Cheers,
> >>>
> >>> Gilles
> >>>
> >>> On Thu, Sep 22, 2016 at 7:21 PM, Nathan Hjelm <hje...@me.com> wrote:
> >>>>
> >>>> FWIW it works fine for me on my MacBook Pro running 10.12 with Open MPI
> >>>> 2.0.1 installed through homebrew:
> >>>>
> >>>> ✗ brew -v
> >>>> Homebrew 1.0.0 (git revision c3105; last commit 2016-09-22)
> >>>> Homebrew/homebrew-core (git revision 227e; last commit 2016-09-22)
> >>>>
> >>>> ✗ brew info openmpi
> >>>>
> >>>> open-mpi: stable 2.0.1 (bottled), HEAD
> >>>> High performance message passing library
> >>>> https://www.open-mpi.org/
> >>>> Conflicts with: lcdf-typetools, mpich
> >>>> /usr/local/Cellar/open-mpi/2.0.1 (688 files, 8.3M) *
> >>>>Poured from bottle on 2016-09-22 at 03:53:35
> >>>> From:
> >>>> https://github.com/Homebrew/homebrew-core/blob/master/Formula/open-mpi.rb
> >>>> ==> Dependencies
> >>>> Required: libevent ✔
> >>>> ==> Options
> >>>> --c++11
> >>>>  Build using C++11 mode
> >>>> --with-cxx-bindings
> >>>>  Enable C++ MPI bindings (deprecated as of MPI-3.0)
> >>>> --with-java
> >>>>  Build with java support
> >>>> --with-mpi-thread-multiple
> >>>>  Enable MPI_THREAD_MULTIPLE
> >>>> --without-fortran
> >>>>  Build without fortran support
> >>>> --HEAD
> >>>>  Install HEAD version
> >>
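
A minimal sketch of the path-length check Gilles describes above (the 104-byte
sun_path limit and the workarounds are taken from his explanation; the exact
characters Open MPI appends are not reproduced here, so treat this only as a
rough sanity check):

# if the combined length of $TMPDIR and the FQDN gets anywhere near 104
# bytes, the Unix socket path can end up silently truncated
printf '%s' "$TMPDIR" | wc -c      # length of $TMPDIR
hostname -f | wc -c                # length of the FQDN (plus a newline)
# workarounds, as described above:
#   export TMPDIR=/tmp
#   mpirun --mca oob ^usock ...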

[OMPI users] Problem building OpenMPI with CUDA 8.0

2016-10-18 Thread Justin Luitjens
I have the release version of CUDA 8.0 installed and am trying to build OpenMPI.

Here is my configure and build line:

./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME --with-tm= 
--with-openib= && make && sudo make install

Where CUDA_HOME points to the cuda install path.

When I run the above command it builds for quite a while but eventually errors 
out with this:

make[2]: Entering directory 
`/home/jluitjens/Perforce/jluitjens_dtlogin_p4sw/sw/devrel/DevtechCompute/Internal/Tools/dtlogin/scripts/mpi/openmpi-1.10.1-gcc5.0_2014_11-cuda8.0/opal/tools/wrappers'
  CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `nvmlInit_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetHandleByIndex_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetCount_v2'


Any idea what I might need to change to get around this error?

Thanks,
Justin


Re: [OMPI users] Problem building OpenMPI with CUDA 8.0

2016-10-18 Thread Justin Luitjens
After looking into this a bit more, it appears that the issue is that I am building 
on a head node which does not have the driver installed.  Building on a back node 
resolves the issue.  In CUDA 8.0 the NVML stubs can be found in the toolkit at 
the following path:  ${CUDA_HOME}/lib64/stubs

For 8.0 I'd suggest updating the configure/make scripts to look for NVML there 
and link in the stubs.  That way the build depends only on the toolkit and not 
on the driver being installed.
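
In the meantime, a minimal sketch of pointing the build at the stub library on a
driver-less head node (the LDFLAGS approach is an assumption based on the stub
path above, not a confirmed fix; adjust paths to your install):

# assumes CUDA_HOME and PREFIXPATH are set as in the original configure line
./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME \
    LDFLAGS="-L${CUDA_HOME}/lib64/stubs" && make && sudo make install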

Thanks,
Justin

From: users [mailto:users-boun...@lists.open-mpi.org] On Behalf Of Justin 
Luitjens
Sent: Tuesday, October 18, 2016 9:53 AM
To: users@lists.open-mpi.org
Subject: [OMPI users] Problem building OpenMPI with CUDA 8.0

I have the release version of CUDA 8.0 installed and am trying to build OpenMPI.

Here is my configure and build line:

./configure --prefix=$PREFIXPATH --with-cuda=$CUDA_HOME --with-tm= 
--with-openib= && make && sudo make install

Where CUDA_HOME points to the cuda install path.

When I run the above command it builds for quite a while but eventually errors 
out with this:

make[2]: Entering directory 
`/home/jluitjens/Perforce/jluitjens_dtlogin_p4sw/sw/devrel/DevtechCompute/Internal/Tools/dtlogin/scripts/mpi/openmpi-1.10.1-gcc5.0_2014_11-cuda8.0/opal/tools/wrappers'
  CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `nvmlInit_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetHandleByIndex_v2'
../../../opal/.libs/libopen-pal.so: undefined reference to 
`nvmlDeviceGetCount_v2'


Any idea what I might need to change to get around this error?

Thanks,
Justin


[OMPI users] Crash in libopen-pal.so

2017-06-19 Thread Justin Luitjens
I have an application that works on other systems, but on the system I'm 
currently running on I'm seeing the following crash:

[dt04:22457] *** Process received signal ***
[dt04:22457] Signal: Segmentation fault (11)
[dt04:22457] Signal code: Address not mapped (1)
[dt04:22457] Failing at address: 0x6a1da250
[dt04:22457] [ 0] /lib64/libpthread.so.0(+0xf370)[0x2b353370]
[dt04:22457] [ 1] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_int_free+0x50)[0x2cbcf810]
[dt04:22457] [ 2] 
/home/jluitjens/libs/openmpi/lib/libopen-pal.so.13(opal_memory_ptmalloc2_free+0x9b)[0x2cbcff3b]
[dt04:22457] [ 3] ./hacc_tpm[0x42f068]
[dt04:22457] [ 4] ./hacc_tpm[0x42f231]
[dt04:22457] [ 5] ./hacc_tpm[0x40f64d]
[dt04:22457] [ 6] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2c30db35]
[dt04:22457] [ 7] ./hacc_tpm[0x4115cf]
[dt04:22457] *** End of error message ***


This app is a CUDA app but doesn't use GPU direct so that should be irrelevant.

I'm building with gcc/5.3.0  cuda/8.0.44  openmpi/1.10.7

I'm using this on centos 7 and am using a vanilla MPI configure line:  
./configure --prefix=/home/jluitjens/libs/openmpi/

Currently I'm trying to do this with just a single MPI process but multiple MPI 
processes fail in the same way:

mpirun  --oversubscribe -np 1 ./command

What is odd is that the crash occurs around the same spot in the code, but not 
consistently at exactly the same spot.  The point where the single thread is 
at the time of the crash is nowhere near MPI code; it is just using malloc to 
allocate some memory.  This makes me think the crash is due to a thread outside 
of the application I'm working on (perhaps in OpenMPI itself), or perhaps due 
to OpenMPI hijacking malloc/free.

Does anyone have any ideas of what I could try to work around this issue?
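
One thing worth trying, if the malloc/free hooking is indeed the suspect (a
sketch only; check the configure option below against ./configure --help for
1.10.7, and the real cause may well lie elsewhere):

# rebuild Open MPI without its ptmalloc2 memory hooks, leaving malloc/free
# to the system allocator (the hooks mainly help RDMA pipelining)
./configure --prefix=/home/jluitjens/libs/openmpi/ --without-memory-manager
make -j && make install

# or run the existing build under valgrind to see where the heap corruption
# actually starts
mpirun --oversubscribe -np 1 valgrind ./hacc_tpm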

Thanks,
Justin


[OMPI users] OpenMPI 3.0.0 Failing To Compile

2018-02-28 Thread Justin Luitjens

I'm trying to build OpenMPI on Ubuntu 16.04.3 and I'm getting an error.


Here is how I configure and build:
./configure --with-cuda=$CUDA_HOME --prefix=$MPI_HOME && make clean &&  make -j 
&& make install


Here is the error I see:

make[2]: Entering directory 
'/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal/mca/crs'
  CC   base/crs_base_open.lo
  GENERATE opal_crs.7
  CC   base/crs_base_select.lo
  CC   base/crs_base_close.lo
  CC   base/crs_base_fns.lo
Option package-version requires an argument
Usage: ../../../ompi/mpi/man/make_manpage.pl --package-name= 
--package-version= --ompi-date= --opal-date= --orte-date= --input= --output= 
[--nocxx] [ --nofortran] [--nof08]
Makefile:2199: recipe for target 'opal_crs.7' failed
make[2]: *** [opal_crs.7] Error 1
make[2]: *** Waiting for unfinished jobs
make[2]: Leaving directory 
'/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal/mca/crs'
Makefile:2364: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/tmpnfs/jluitjens/libs/src/openmpi-3.0.0/opal'
Makefile:1885: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1


Any suggestions on what might be going on?
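
A couple of diagnostic steps that may help narrow it down (a sketch; the exact
variable the Makefile passes to make_manpage.pl is not confirmed here, only that
whatever it passes for --package-version is arriving empty):

# rerun the failing target serially and verbosely to see the exact
# make_manpage.pl invocation and which argument is empty
cd opal/mca/crs && make V=1 opal_crs.7

# check that configure actually recorded a version string
grep PACKAGE_VERSION opal/include/opal_config.h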


[OMPI users] Seg fault with PBS Pro 10.4

2011-07-26 Thread Wood, Justin Contractor, SAIC
I'm having a problem using OpenMPI under PBS Pro 10.4.  I tried both 1.4.3 and 
1.5.3, both behave the same.  I'm able to run just fine if I don't use PBS and 
go direct to the nodes.  Also, if I run under PBS and use only 1 node, it works 
fine, but as soon as I span nodes, I get the following:

[a4ou-n501:07366] *** Process received signal ***
[a4ou-n501:07366] Signal: Segmentation fault (11)
[a4ou-n501:07366] Signal code: Address not mapped (1)
[a4ou-n501:07366] Failing at address: 0x3f
[a4ou-n501:07366] [ 0] /lib64/libpthread.so.0 [0x3f2b20eb10]
[a4ou-n501:07366] [ 1] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(discui_+0x84) 
[0x2affa453765c]
[a4ou-n501:07366] [ 2] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(diswsi+0xc3) 
[0x2affa4534c6f]
[a4ou-n501:07366] [ 3] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 
[0x2affa453290c]
[a4ou-n501:07366] [ 4] 
/opt/ompi/1.4.3/intel/lib/libopen-rte.so.0(tm_init+0x1fe) [0x2affa4532bf8]
[a4ou-n501:07366] [ 5] /opt/ompi/1.4.3/intel/lib/libopen-rte.so.0 
[0x2affa452691c]
[a4ou-n501:07366] [ 6] mpirun [0x404c17]
[a4ou-n501:07366] [ 7] mpirun [0x403e28]
[a4ou-n501:07366] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3f2a61d994]
[a4ou-n501:07366] [ 9] mpirun [0x403d59]
[a4ou-n501:07366] *** End of error message ***
Segmentation fault

I searched the archives and found a similar issue from last year:

http://www.open-mpi.org/community/lists/users/2010/02/12084.php

The last update I saw was that someone was going to contact Altair and have 
them look at why it was failing to do the tm_init.  Does anyone have an update 
to this, and has anyone been able to run successfully using recent versions of 
PBSPro?  I've also contacted our rep at Altair, but he hasn't responded yet.
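
While waiting to hear back from Altair, one possible fallback (a sketch, not a
confirmed fix: it bypasses the TM launcher entirely and falls back to ssh, so it
assumes passwordless ssh between the allocated nodes):

# inside the PBS job script: avoid tm_init by excluding the TM components and
# launching over ssh on the nodes PBS allocated (program name and process
# count are placeholders)
mpirun --mca plm rsh --mca ras ^tm -hostfile $PBS_NODEFILE -np 16 ./a.out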

Thanks, Justin.

Justin Wood
Systems Engineer
FNMOC | SAIC
7 Grace Hopper, Stop 1
Monterey, CA
justin.g.wood@navy.mil
justin.g.w...@saic.com
office: 831.656.4671
mobile: 831.869.1576




Re: [OMPI users] CUDA mpi question

2019-11-28 Thread Justin Luitjens via users
That is not guaranteed to work.  There is no streaming concept in the MPI 
standard.  The fundamental issue here is that MPI is only asynchronous in the 
completion, not the initiation, of the send/recv.

It would be nice if the next version of MPI added something like a triggered 
send or receive that only initiates when it receives a signal saying the memory 
is ready.  This would be vendor neutral and would enable things like streaming.

For example, at the end of a kernel that produces data, the GPU could poke a 
memory location to signal that the send is ready.  The IB device could then 
initiate the transfer.

Sent from my iPhone

On Nov 28, 2019, at 8:21 AM, George Bosilca via users wrote:


Wonderful maybe but extremely unportable. Thanks but no thanks!

  George.

On Wed, Nov 27, 2019 at 11:07 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
Interesting idea. But doing MPI_THREAD_MULTIPLE has other side-effects. If MPI 
nonblocking calls could take an extra stream argument and work like a kernel 
launch, it would be wonderful.
--Junchao Zhang


On Wed, Nov 27, 2019 at 6:12 PM Joshua Ladd <josh...@mellanox.com> wrote:
Why not spawn num_threads, where num_threads is the number of kernels to 
launch, and compile with the "--default-stream per-thread" option?

Then you could use MPI in thread multiple mode to achieve your objective.

Something like:



#include <stdio.h>
#include <pthread.h>
#include <mpi.h>
#include <cuda_runtime.h>

/* N and the kernel body were not shown in the original post; both are filled
   in here only so the sketch compiles. */
#define N (1 << 20)

__global__ void kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = (float)i;
}

void *launch_kernel(void *dummy)
{
    float *data;
    cudaMalloc((void **)&data, N * sizeof(float));

    /* launch configuration was lost in the archive; values are illustrative */
    kernel<<<(N + 255) / 256, 256>>>(data, N);

    /* with --default-stream per-thread, stream 0 is this thread's default stream */
    cudaStreamSynchronize(0);

    /* destination, tag and count are illustrative; a CUDA-aware MPI is assumed
       so the device pointer can be passed directly.  Real code would also
       complete the request with MPI_Wait. */
    MPI_Request req;
    MPI_Isend(data, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
    return NULL;
}

int main()
{
    int provided;
    MPI_Init_thread(NULL, NULL, MPI_THREAD_MULTIPLE, &provided);
    const int num_threads = 8;

    pthread_t threads[num_threads];

    for (int i = 0; i < num_threads; i++) {
        if (pthread_create(&threads[i], NULL, launch_kernel, NULL)) {
            fprintf(stderr, "Error creating thread\n");
            return 1;
        }
    }

    for (int i = 0; i < num_threads; i++) {
        if (pthread_join(threads[i], NULL)) {
            fprintf(stderr, "Error joining thread\n");
            return 2;
        }
    }
    cudaDeviceReset();

    MPI_Finalize();
    return 0;
}




From: users <users-boun...@lists.open-mpi.org> On Behalf Of Zhang, Junchao via users
Sent: Wednesday, November 27, 2019 5:43 PM
To: George Bosilca <bosi...@icl.utk.edu>
Cc: Zhang, Junchao <jczh...@mcs.anl.gov>; Open MPI Users <users@lists.open-mpi.org>
Subject: Re: [OMPI users] CUDA mpi question

I was pointed to "2.7. Synchronization and Memory Ordering" of 
https://docs.nvidia.com/pdf/GPUDirect_RDMA.pdf.
It is on topic, but unfortunately it is too short and I could not understand it.
I also checked cudaStreamAddCallback/cudaLaunchHostFunc, which say the host 
function "must not make any CUDA API calls". I am not sure if MPI_Isend 
qualifies as such functions.
--Junchao Zhang


On Wed, Nov 27, 2019 at 4:18 PM George Bosilca <bosi...@icl.utk.edu> wrote:
On Wed, Nov 27, 2019 at 5:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
On Wed, Nov 27, 2019 at 3:16 PM George Bosilca <bosi...@icl.utk.edu> wrote:
Short and portable answer: you need to sync before the Isend or you will send 
garbage data.
Ideally, I want to formulate my code into a series of asynchronous "kernel 
launch, kernel launch, ..." without synchronization, so that I can hide kernel 
launch overhead. It now seems I have to sync before MPI calls (even nonblocking 
calls)

Then you need a means to ensure sequential execution, and this is what the 
streams provide. Unfortunately, I looked into the code and I'm afraid there is 
currently no realistic way to do what you need. My previous comment was based 
on an older code, that seems to be 1) unmaintained currently, and 2) only 
applicable to the OB1 PML + OpenIB BTL combo. As recent versions of OMPI have 
moved away from the OpenIB BTL, relying more heavily on UCX for Infiniband 
support, the old code is now deprecated. Sorry for giving you hope on this.

Maybe you can delegate the MPI call into a CUDA event callback ?

  George.




Assuming you are willing to go for a less portable solution you can get the 
OMPI streams and add your kernels inside, so that the sequential order will 
guarantee correctness of your isend. We have 2 hidden CUDA streams in OMPI, one 
for device-to-host and one for host-to-device, that can be queried with the 
non-MPI standard compliant functions (mca_common_cuda_get_dtoh_stream and 
mca_common_cuda_get_htod_stream).

Which streams (dtoh or htod) should I use to insert kernels producing send data 
and kernels using received data? I imagine MPI uses GPUDirect RDMA to move data 
directly from GPU to NIC. Why do we need to bother dtoh or