Re: [OMPI devel] mtt IBM SPAWN error

2008-09-04 Thread Lenny Verkhovsky
Isn't this related to https://svn.open-mpi.org/trac/ompi/ticket/1469 ?

On 6/30/08, Lenny Verkhovsky  wrote:
>
> I am not familiar with the IBM spawn test, but maybe this is the right
> behavior:
> if the spawn test allocates 3 ranks on the node and then allocates another 3,
> then the test is supposed to fail due to max_slots=4.
>
> But it also fails with the following hostfile, BUT WITH A DIFFERENT
> ERROR.
>
> #cat hostfile2
> witch2 slots=4 max_slots=4
> witch3 slots=4 max_slots=4
> witch1:/home/BENCHMARKS/IBM # /home/USERS/lenny/OMPI_ORTE_18772/bin/mpirun
> -np 3 -hostfile hostfile2 dynamic/spawn
> bash: orted: command not found
> [witch1:22789]
> --
> A daemon (pid 22791) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> There may be more information reported by the environment (see above).
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> [witch1:22789]
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> witch3 - daemon did not report back when launched
>
> On Mon, Jun 30, 2008 at 9:38 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com> wrote:
>
>> Hi,
>> trying to run mtt I failed to run IBM spawn test. It fails only when using
>> hostfile, and not when using host list.
>> ( OMPI from TRUNK )
>>
>> This is working :
>> #mpirun -np 3 -H witch2 dynamic/spawn
>>
>> This Fails:
>> # cat hostfile
>> witch2 slots=4 max_slots=4
>>
>> #mpirun -np 3 -hostfile hostfile dynamic/spawn
>> [witch1:12392]
>> --
>> There are not enough slots available in the system to satisfy the 3 slots
>> that were requested by the application:
>>   dynamic/spawn
>>
>> Either request fewer slots for your application, or make more slots
>> available
>> for use.
>> --
>> [witch1:12392]
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>>
>> There may be more information reported by the environment (see above).
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> mpirun: clean termination accomplished
>>
>>
>> Using hostfile1 also works
>> #cat hostfile1
>> witch2
>> witch2
>> witch2
>>
>>
>> Best Regards
>> Lenny.
>>
>
>


Re: [OMPI devel] mtt IBM SPAWN error

2008-06-30 Thread Lenny Verkhovsky
I saw it, but I think it is something else, since it works if I run it with a
host list:

#mpirun -np 3 -H witch2,witch3  dynamic/spawn
#


On Mon, Jun 30, 2008 at 4:03 PM, Ralph H Castain  wrote:

> Well, that error indicates that it was unable to launch the daemon on
> witch3
> for some reason. If you look at the error reported by bash, you will see
> that the "orted" binary wasn't found!
>
> Sounds like a path error - you might check to see if witch3 has the
> binaries
> installed, and if they are where you told the system to look...
>
> Ralph
>
>
>
> On 6/30/08 5:21 AM, "Lenny Verkhovsky"  wrote:
>
> > I am not familiar with the IBM spawn test, but maybe this is the right
> > behavior:
> > if the spawn test allocates 3 ranks on the node and then allocates another 3,
> > then the test is supposed to fail due to max_slots=4.
> >
> > But it also fails with the following hostfile, BUT WITH A DIFFERENT
> > ERROR.
> >
> > #cat hostfile2
> > witch2 slots=4 max_slots=4
> > witch3 slots=4 max_slots=4
> > witch1:/home/BENCHMARKS/IBM #
> /home/USERS/lenny/OMPI_ORTE_18772/bin/mpirun -np
> > 3 -hostfile hostfile2 dynamic/spawn
> > bash: orted: command not found
> > [witch1:22789]
> >
> --
> > A daemon (pid 22791) died unexpectedly with status 127 while attempting
> > to launch so we are aborting.
> > There may be more information reported by the environment (see above).
> > This may be because the daemon was unable to find all the needed shared
> > libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the
> > location of the shared libraries on the remote nodes and this will
> > automatically be forwarded to the remote nodes.
> >
> --
> > [witch1:22789]
> >
> --
> > mpirun was unable to cleanly terminate the daemons on the nodes shown
> > below. Additional manual cleanup may be required - please refer to
> > the "orte-clean" tool for assistance.
> >
> --
> > witch3 - daemon did not report back when launched
> >
> > On Mon, Jun 30, 2008 at 9:38 AM, Lenny Verkhovsky <
> lenny.verkhov...@gmail.com>
> > wrote:
> >> Hi,
> >> trying to run mtt I failed to run IBM spawn test. It fails only when
> using
> >> hostfile, and not when using host list.
> >> ( OMPI from TRUNK )
> >>
> >> This is working :
> >> #mpirun -np 3 -H witch2 dynamic/spawn
> >>
> >> This Fails:
> >> # cat hostfile
> >> witch2 slots=4 max_slots=4
> >> #mpirun -np 3 -hostfile hostfile dynamic/spawn
> >> [witch1:12392]
> >>
> --
> >> There are not enough slots available in the system to satisfy the 3
> slots
> >> that were requested by the application:
> >>   dynamic/spawn
> >>
> >> Either request fewer slots for your application, or make more slots
> available
> >> for use.
> >>
> --
> >> [witch1:12392]
> >>
> --
> >> A daemon (pid unknown) died unexpectedly on signal 1  while attempting
> to
> >> launch so we are aborting.
> >>
> >> There may be more information reported by the environment (see above).
> >>
> >> This may be because the daemon was unable to find all the needed shared
> >> libraries on the remote node. You may set your LD_LIBRARY_PATH to have
> the
> >> location of the shared libraries on the remote nodes and this will
> >> automatically be forwarded to the remote nodes.
> >>
> --
> >> mpirun: clean termination accomplished
> >>
> >>
> >> Using hostfile1 also works
> >> #cat hostfile1
> >> witch2
> >> witch2
> >> witch2
> >>
> >>
> >> Best Regards
> >> Lenny.
> >>
> >
>
>
>
>


Re: [OMPI devel] mtt IBM SPAWN error

2008-06-30 Thread Ralph H Castain
Well, that error indicates that it was unable to launch the daemon on witch3
for some reason. If you look at the error reported by bash, you will see
that the "orted" binary wasn't found!

Sounds like a path error - you might check to see if witch3 has the binaries
installed, and if they are where you told the system to look...
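
The "status 127" in the mpirun output fits this reading: 127 is the exit
status a POSIX shell returns when it cannot find the command it was asked to
run. A small sketch to confirm that locally (the helper and the binary name
are made up for illustration; it says nothing about how mpirun itself
launches orted on the remote node):

```python
import subprocess


def shell_status(command):
    """Run `command` through sh and return its exit status.

    Hypothetical helper, for illustration only.
    """
    result = subprocess.run(
        ["sh", "-c", command],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


# "no_such_orted_binary" is a made-up name assumed not to exist on PATH.
print(shell_status("no_such_orted_binary"))
# A command that is found and succeeds, for comparison.
print(shell_status("true"))
```

Running the same kind of check over ssh against witch3, with the real binary
name, would show whether orted is on the non-interactive PATH there.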

Ralph



On 6/30/08 5:21 AM, "Lenny Verkhovsky"  wrote:

> I am not familiar with the IBM spawn test, but maybe this is the right behavior:
> if the spawn test allocates 3 ranks on the node and then allocates another 3,
> then the test is supposed to fail due to max_slots=4.
>  
> But it also fails with the following hostfile, BUT WITH A DIFFERENT ERROR.
>  
> #cat hostfile2 
> witch2 slots=4 max_slots=4
> witch3 slots=4 max_slots=4
> witch1:/home/BENCHMARKS/IBM # /home/USERS/lenny/OMPI_ORTE_18772/bin/mpirun -np
> 3 -hostfile hostfile2 dynamic/spawn
> bash: orted: command not found
> [witch1:22789] 
> --
> A daemon (pid 22791) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
> There may be more information reported by the environment (see above).
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> [witch1:22789] 
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --
> witch3 - daemon did not report back when launched
>  
> On Mon, Jun 30, 2008 at 9:38 AM, Lenny Verkhovsky 
> wrote:
>> Hi, 
>> trying to run mtt I failed to run IBM spawn test. It fails only when using
>> hostfile, and not when using host list.
>> ( OMPI from TRUNK )
>>  
>> This is working :
>> #mpirun -np 3 -H witch2 dynamic/spawn
>>  
>> This Fails:
>> # cat hostfile
>> witch2 slots=4 max_slots=4
>> #mpirun -np 3 -hostfile hostfile dynamic/spawn
>> [witch1:12392] 
>> --
>> There are not enough slots available in the system to satisfy the 3 slots
>> that were requested by the application:
>>   dynamic/spawn
>> 
>> Either request fewer slots for your application, or make more slots available
>> for use.
>> --
>> [witch1:12392] 
>> --
>> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
>> launch so we are aborting.
>> 
>> There may be more information reported by the environment (see above).
>> 
>> This may be because the daemon was unable to find all the needed shared
>> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
>> location of the shared libraries on the remote nodes and this will
>> automatically be forwarded to the remote nodes.
>> --
>> mpirun: clean termination accomplished
>>  
>>  
>> Using hostfile1 also works
>> #cat hostfile1
>> witch2
>> witch2
>> witch2
>>  
>>  
>> Best Regards
>> Lenny.
>> 
> 






Re: [OMPI devel] mtt IBM SPAWN error

2008-06-30 Thread Ralph H Castain
That's correct - and is precisely the behavior it should exhibit. The
reasons:

1. when you specify -host, we assume max_slots is infinite since you cannot
provide any info to the contrary. We therefore allow you to oversubscribe
the node to your heart's desire. However, note one problem: if your original
launch is only one proc, we will set it to be aggressive in terms of
yielding the processor. Your subsequent comm_spawn'd procs will therefore
suffer degraded performance if they oversubscribe the node.

Can't be helped - there is no way to pass enough info with -host for us to
do better.


2. when you run with -hostfile, your hostfile is telling us to allow no more
than 4 procs on the node. You used three in your original launch, leaving
only one slot available. Since each of the procs in the IBM test attempts to
spawn another, your job will fail.
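
The accounting in the two cases can be sketched roughly as follows. This is
a toy model only, not ORTE's actual allocator: the function name, the
one-spawn-per-rank assumption, and the use of None to stand for "no known
limit" are all illustrative assumptions.

```python
def simulate_spawns(max_slots, initial_procs, spawns_per_proc=1):
    """Return (granted, refused) spawn requests on a single node.

    Toy model: every initial process asks for `spawns_per_proc` extra
    processes, and a request is refused once `max_slots` is reached.
    max_slots=None models -host, where no upper limit is known.
    """
    used = initial_procs
    granted = 0
    refused = 0
    for _ in range(initial_procs * spawns_per_proc):
        if max_slots is None or used < max_slots:
            used += 1
            granted += 1
        else:
            refused += 1
    return granted, refused


# hostfile case: "witch2 slots=4 max_slots=4", launched with -np 3
print(simulate_spawns(4, 3))
# -host case: "-H witch2", launched with -np 3, oversubscription allowed
print(simulate_spawns(None, 3))
```

In the hostfile case only one of the three spawn requests fits under
max_slots=4, which is enough to fail the job; in the -host case all three
are accepted because nothing caps the node.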

We can always do more to improve the error messaging...
Ralph


On 6/30/08 12:38 AM, "Lenny Verkhovsky"  wrote:

> Hi, 
> trying to run mtt I failed to run IBM spawn test. It fails only when using
> hostfile, and not when using host list.
> ( OMPI from TRUNK )
>  
> This is working :
> #mpirun -np 3 -H witch2 dynamic/spawn
>  
> This Fails:
> # cat hostfile
> witch2 slots=4 max_slots=4
> #mpirun -np 3 -hostfile hostfile dynamic/spawn
> [witch1:12392] 
> --
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>   dynamic/spawn
> 
> Either request fewer slots for your application, or make more slots available
> for use.
> --
> [witch1:12392] 
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> mpirun: clean termination accomplished
>  
>  
> Using hostfile1 also works
> #cat hostfile1
> witch2
> witch2
> witch2
>  
>  
> Best Regards
> Lenny.
> 
> 






Re: [OMPI devel] mtt IBM SPAWN error

2008-06-30 Thread Lenny Verkhovsky
I am not familiar with the IBM spawn test, but maybe this is the right behavior:
if the spawn test allocates 3 ranks on the node and then allocates another 3,
then the test is supposed to fail due to max_slots=4.

But it also fails with the following hostfile, BUT WITH A DIFFERENT ERROR.

#cat hostfile2
witch2 slots=4 max_slots=4
witch3 slots=4 max_slots=4
witch1:/home/BENCHMARKS/IBM # /home/USERS/lenny/OMPI_ORTE_18772/bin/mpirun
-np 3 -hostfile hostfile2 dynamic/spawn
bash: orted: command not found
[witch1:22789]
--
A daemon (pid 22791) died unexpectedly with status 127 while attempting
to launch so we are aborting.
There may be more information reported by the environment (see above).
This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
[witch1:22789]
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
witch3 - daemon did not report back when launched

On Mon, Jun 30, 2008 at 9:38 AM, Lenny Verkhovsky <
lenny.verkhov...@gmail.com> wrote:

> Hi,
> trying to run mtt I failed to run IBM spawn test. It fails only when using
> hostfile, and not when using host list.
> ( OMPI from TRUNK )
>
> This is working :
> #mpirun -np 3 -H witch2 dynamic/spawn
>
> This Fails:
> # cat hostfile
> witch2 slots=4 max_slots=4
>
> #mpirun -np 3 -hostfile hostfile dynamic/spawn
> [witch1:12392]
> --
> There are not enough slots available in the system to satisfy the 3 slots
> that were requested by the application:
>   dynamic/spawn
>
> Either request fewer slots for your application, or make more slots
> available
> for use.
> --
> [witch1:12392]
> --
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --
> mpirun: clean termination accomplished
>
>
> Using hostfile1 also works
> #cat hostfile1
> witch2
> witch2
> witch2
>
>
> Best Regards
> Lenny.
>