Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain
Hooray! Glad we could help track this down - sorry it was so hard to do so.

To answer your questions:

1. Yes - ORTE should bail out gracefully. It definitely should not hang. I
will log the problem and investigate. I believe I know where the problem
lies, and it may already be fixed on our trunk, but the fix may not get into
the 1.2 family (have to see what it would entail).

2. I will definitely commit that debug code and ensure it is in future
releases.

3. I'll see if we can add something about --debug-daemons to the FAQ -
thanks for pointing out that oversight.

Thanks
Ralph



On 7/18/07 12:19 PM, "Bill Johnstone" wrote:

> -Is there some reason ORTE can't bail out gracefully upon this error,
> instead of hanging like it was doing for me?
> 
> -I think leaving in the extra debug logging code you sent me in the
> patch for future Open MPI versions would be a good idea to help
> troubleshoot problems like this.
> 
> -It would be nice to see "--debug-daemons" added to the Troubleshooting
> section of the FAQ on the web site.




Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone

--- Ralph Castain wrote:

> Unfortunately, we don't have more debug statements internal to that
> function. I'll have to create a patch for you that will add some so we
> can better understand why it is failing - will try to send it to you
> on Wed.

Thank you for the patch you sent.

I solved the problem.  It was a head-slapper of an error.  It turned
out that I had forgotten that the permissions of the mounted filesystem
override the permissions of the mount point.  As I mentioned, these
machines have an NFS root filesystem.  In that filesystem, /tmp has
permissions 1777.  However, when each node mounts its local temp
partition on /tmp, the mount point takes on the permissions of that
filesystem.

In this case, I had forgotten to apply permissions 1777 to /tmp after
mounting on each machine.  As a result, /tmp really did not have the
appropriate permissions for mpirun to write to it as necessary.
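
For anyone else who hits this, a minimal sketch of the fix (the device
name is hypothetical; on a real node this belongs in the boot scripts
right after the mount):

    # Mount the node-local scratch partition over /tmp, then re-apply
    # the world-writable + sticky permissions, because the mounted
    # filesystem's root directory masks the 1777 on the mount point.
    mount /dev/sda3 /tmp
    chmod 1777 /tmp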

Your patch helped me figure this out.  Technically, I should have been
able to figure it out from the messages you'd already sent to the
mailing list, but it wasn't until I saw the line in session_dir.c where
the error was occurring that I realized it had to be some kind of
permissions error.

I've attached the new debug output below:

[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 108
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 391
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:11511] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.

The code starting at line 108 of session_dir.c is:

if (ORTE_SUCCESS != (ret = opal_os_dirpath_create(directory, my_mode))) {
    ORTE_ERROR_LOG(ret);
}

Three further points:

-Is there some reason ORTE can't bail out gracefully upon this error,
instead of hanging like it was doing for me?

-I think leaving in the extra debug logging code you sent me in the
patch for future Open MPI versions would be a good idea to help
troubleshoot problems like this.

-It would be nice to see "--debug-daemons" added to the Troubleshooting
section of the FAQ on the web site.

Thank you very much for your help, Ralph, and everyone else who replied.






Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Ralph H Castain



On 7/18/07 11:46 AM, "Bill Johnstone" wrote:

> --- Ralph Castain wrote:
> 
>> No, the session directory is created in the tmpdir - we don't create
>> anything anywhere else, nor do we write any executables anywhere.
> 
> In the case where the TMPDIR env variable isn't specified, what is the
> default assumed by Open MPI/orte?

It rattles through a logic chain (a shell sketch follows the list):

1. ompi mca param value

2. TMPDIR in environ

3. TMP in environ

4. default to /tmp just so we have something to work with...
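
In shell terms, the progression amounts to something like this (a
sketch, not the actual ORTE code; the MCA parameter is normally set
via -mca tmpdir_base or the OMPI_MCA_tmpdir_base environment variable):

    # Fall through MCA param -> TMPDIR -> TMP -> /tmp
    tmpdir_base="${OMPI_MCA_tmpdir_base:-${TMPDIR:-${TMP:-/tmp}}}"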

> 
>> Just out of curiosity: although I know you have different arch's on
>> your
>> nodes, the tests you are running are all executing on the same arch,
>> correct???
> 
> Yes, tests all execute on the same arch, although I am led to another
> question.  Can I use a headnode of a particular arch, but in my mpirun
> hostfile, specify only nodes of another arch, and launch from the
> headnode?  In other words, no computation is done on the headnode of
> arch A, all computation is done on nodes of arch B, but the job is
> launched from the headnode -- would that be acceptable?

As long as the prefix is set such that the correct binary executables can be
found, you should be fine.
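
For example (the prefix path, hostfile name, process count, and
executable here are all hypothetical):

    mpirun --prefix /usr/local/openmpi-archB --hostfile archB_nodes \
        -np 8 ./my_app

where --prefix points at the Open MPI installation the arch-B nodes
should use.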

> 
> I should be clear that for the problem you are helping me with, *all*
> the nodes involved are running the same arch, OS, compiler, system
> libraries, etc.  The multiple arch question is for edification for the
> future.

No problem - I just wanted to eliminate one possible complication for now.

Thanks
Ralph





Re: [OMPI users] mpirun hanging followup

2007-07-18 Thread Bill Johnstone

--- Ralph Castain wrote:

> No, the session directory is created in the tmpdir - we don't create
> anything anywhere else, nor do we write any executables anywhere.

In the case where the TMPDIR env variable isn't specified, what is the
default assumed by Open MPI/orte?

> Just out of curiosity: although I know you have different arch's on
> your
> nodes, the tests you are running are all executing on the same arch,
> correct???

Yes, tests all execute on the same arch, although I am led to another
question.  Can I use a headnode of a particular arch, but in my mpirun
hostfile, specify only nodes of another arch, and launch from the
headnode?  In other words, no computation is done on the headnode of
arch A, all computation is done on nodes of arch B, but the job is
launched from the headnode -- would that be acceptable?

I should be clear that for the problem you are helping me with, *all*
the nodes involved are running the same arch, OS, compiler, system
libraries, etc.  The multiple arch question is for edification for the
future.







Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Ralph Castain
No, the session directory is created in the tmpdir - we don't create
anything anywhere else, nor do we write any executables anywhere.

Unfortunately, we don't have more debug statements internal to that
function. I'll have to create a patch for you that will add some so we can
better understand why it is failing - will try to send it to you on Wed.

Just out of curiosity: although I know you have different arch's on your
nodes, the tests you are running are all executing on the same arch,
correct???

Ralph


On 7/17/07 4:06 PM, "Bill Johnstone" wrote:

> I made sure the TMPDIR environment variable was set to /tmp for
> non-interactive logins, and got the same result as before.
> 
> Specifying the "-mca tmpdir_base /tmp" command-line option gave the
> same result as well.
> 
> [...]
> 
> /tmp and the home directories are both mounted nosuid, but are mounted
> exec.  Does mpirun write/run a suid executable in any of these
> directories?

Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
I made sure the TMPDIR environment variable was set to /tmp for 
non-interactive logins, and got the same result as before.
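
(For reference, one way to do that -- assuming bash, which reads
~/.bashrc for non-interactive ssh commands on most Linux setups -- is
to put

    export TMPDIR=/tmp

near the top of ~/.bashrc on each node, before any early exit for
non-interactive shells.)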

Specifying the "-mca tmpdir_base /tmp" command-line option gave the
same result as well.

I made a mistake in my previous e-mail however -- the user home
directories are also writable by each node (again, via NFS).  /var and
/tmp are the only unique-per-node writable directories.  I'm assuming
that by default, the session directory structure is created in the run
directory, or the user's home directory, or something similar?

/tmp and the home directories are both mounted nosuid, but are mounted
exec.  Does mpirun write/run a suid executable in any of these
directories?

Thank you.

--- Ralph Castain wrote:

> Open MPI needs to create a temporary directory structure that we call
> the "session directory". This error is telling you that Open MPI was
> unable to create that directory, probably due to a permission issue.
> 
> We decide on the root directory for the session directory using a
> progression. You can direct where you want it to go by setting the
> TMPDIR environment variable, or (to set it just for us) using -mca
> tmpdir_base foo on the mpirun command (or you can set
> OMPI_MCA_tmpdir_base=foo in your environment), where "foo" is the root
> of the tmp directory you want us to use (e.g., /tmp).
> 
> Hope that helps
> Ralph







Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Ralph Castain
Open MPI needs to create a temporary directory structure that we call the
"session directory". This error is telling you that Open MPI was unable to
create that directory, probably due to a permission issue.

We decide on the root directory for the session directory using a
progression. You can direct where you want it to go by setting the TMPDIR
environment variable, or (to set it just for us) using -mca tmpdir_base foo
on the mpirun command (or you can set OMPI_MCA_tmpdir_base=foo in your
environment), where "foo" is the root of the tmp directory you want us to
use (e.g., /tmp).
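
For example (the process count and executable name are hypothetical):

    mpirun -np 4 -mca tmpdir_base /tmp ./my_mpi_app

or, equivalently, export OMPI_MCA_tmpdir_base=/tmp in the environment
before invoking mpirun.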

Hope that helps
Ralph



On 7/17/07 3:09 PM, "Bill Johnstone" wrote:

> When I run with --debug-daemons, I get:
> 
> [node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
> runtime/orte_init_stage1.c at line 626
> 
>   orte_session_dir failed
>   --> Returned value -1 instead of ORTE_SUCCESS
> 
> [...]
> 
> Where would you suggest I look next?
> 
> Also, if it makes any difference, /usr/local is on a read-only NFSROOT.
> Only /tmp and /var are writeable per-node.




Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
When I run with --debug-daemons, I get:



[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init_stage1.c at line 626
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_session_dir failed
  --> Returned value -1 instead of ORTE_SUCCESS

--------------------------------------------------------------------------
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_system_init.c at line 42
[node5.x86-64:09920] [0,0,1] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 52
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -1 instead of ORTE_SUCCESS.



Where would you suggest I look next?

Also, if it makes any difference, /usr/local is on a read-only NFSROOT.
Only /tmp and /var are writeable per-node.

Thank you very much for your help so far.

--- George Bosilca wrote:

> Sorry. The --debug was supposed to be --debug-devel. But I suspect
> that if you have a normal build then there will not be much output.
> However, --debug-daemons should give enough output so we can at least
> have a starting point.
> 
>    george.



  



Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread George Bosilca
Sorry. The --debug was supposed to be --debug-devel. But I suspect
that if you have a normal build then there will not be much output.
However, --debug-daemons should give enough output so we can at least
have a starting point.

  george.

On Jul 17, 2007, at 2:46 PM, Bill Johnstone wrote:

> OK, I added those, but got a message about needing to supply a
> suitable debugger.  If I supply the "--debugger gdb" argument, I just
> get dumped into gdb. [...]
> 
> Do I need to rebuild openmpi with --enable-debug?







Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
George Bosilca wrote:

> You can start by adding --debug-daemons and --debug to your mpirun
> command line. This will generate a lot of output related to the
> operations done internally by the launcher. If you send this output
> to the list we might be able to help you a little bit more.

OK, I added those, but got a message about needing to supply a suitable
debugger.  If I supply the "--debugger gdb" argument, I just get dumped
into gdb.  I'm not sure what I need to do next to get the launcher
output you mentioned.  My knowledge of gdb is pretty rudimentary.  Do I
need to set mpirun as the executable, and then use the gdb "run" command
with the mpirun arguments?

Do I need to rebuild openmpi with --enable-debug?






Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Michael Edwards

On 7/17/07, Bill Johnstone wrote:

> Thanks for the help.  I've replied below.
> 
> --- "G.O." wrote:
> 
> > 1- Check to make sure that there are no firewalls blocking
> > traffic between the nodes.
> 
> There is no firewall in-between the nodes.  If I run jobs directly via
> ssh, e.g. "ssh node4 env" they work.

Are you using host-based authentication of some kind?  I.e., are you
being prompted for a password when you ssh between nodes?


Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
Thanks for the help.  I've replied below.

--- "G.O."  wrote:

> 1- Check to make sure that there are no firewalls blocking
> traffic between the nodes.

There is no firewall in-between the nodes.  If I run jobs directly via
ssh, e.g. "ssh node4 env" they work.

> 2 - Check to make sure that all nodes have the openmpi installed
> and have the very same executable you are trying to run on the same
> path, have all permissions correctly.

Yes, they are all installed to /usr/local , the permissions are the
same, and if I just invoke mpirun on an individual node by logging into
it, it works.  In fact, even commands like "ssh node4 mpirun" (just to
get the mpirun help banner) work.

> 3- Check to make sure that all nodes have the same interface,
> i.e. eth0 .

They all do have the same interfaces.  In my configuration, eth1 is
the interface that corresponds to the cluster IP network.  I have tried
using "--mca btl_tcp_if_include eth1" but it seems to make no
difference.

> That's all I can think of for very quick checks for now. Hope it's
> one of these.

Thank you very much, but unfortunately it isn't any of these, as far as
I can tell.



  




Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread Michael Edwards

If you are having difficulty getting openmpi set up yourself, you
might look into OSCAR or Rocks, they make setting up your cluster much
easier and include various mpi packages as well as other utilities for
reducing your management overhead.

I can help you (off list) get set up with OSCAR if you like, and there
are very helpful mailing lists for both projects.

On 7/17/07, Bill Johnstone wrote:

> I could really use help trying to figure out why mpirun is hanging as
> detailed in my previous message yesterday, 16 July. [...]
> 
> mpirun hangs whenever I invoke any process running on a remote node.
> It runs a job fine if I invoke it so that it only runs on the local
> node.








Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread G.O.

On 7/17/07, Bill Johnstone wrote:

> I could really use help trying to figure out why mpirun is hanging as
> detailed in my previous message yesterday, 16 July. [...]
> 
> mpirun hangs whenever I invoke any process running on a remote node.
> It runs a job fine if I invoke it so that it only runs on the local
> node.  Ctrl+C never successfully cancels an mpirun job -- I have to
> use kill -9.



   1 - Check to make sure that there are no firewalls blocking traffic
between the nodes.
   2 - Check to make sure that all nodes have Open MPI installed, have
the very same executable you are trying to run on the same path, and
have all permissions set correctly.
   3 - Check to make sure that all nodes have the same interface, i.e.
eth0.

   That's all I can think of for very quick checks for now. Hope it's
one of these.

  Thanks,
 gurhan








Re: [OMPI users] mpirun hanging followup

2007-07-17 Thread George Bosilca
You can start by adding --debug-daemons and --debug to your mpirun  
command line. This will generate a lot of output related to the  
operations done internally by the launcher. If you send this output  
to the list we might be able to help you a little bit more.
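
For example (the hostfile name, process count, and executable are
hypothetical):

    mpirun --debug-daemons --hostfile myhosts -np 4 ./my_mpi_app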


  george.

On Jul 17, 2007, at 1:12 PM, Bill Johnstone wrote:

> I could really use help trying to figure out why mpirun is hanging as
> detailed in my previous message yesterday, 16 July. [...]
> 
> mpirun hangs whenever I invoke any process running on a remote node.
> It runs a job fine if I invoke it so that it only runs on the local
> node.



[OMPI users] mpirun hanging followup

2007-07-17 Thread Bill Johnstone
Hello all.

I could really use help trying to figure out why mpirun is hanging as
detailed in my previous message yesterday, 16 July.  Since there's been
no response, please allow me to give a short summary.

-Open MPI 1.2.3 on GNU/Linux, 2.6.21 kernel, gcc 4.1.2, bash 3.2.15 is
default shell
-Open MPI installed to /usr/local, which is in non-interactive session
path
-Systems are AMD64, using ethernet as interconnect, on private IP
network

mpirun hangs whenever I invoke any process running on a remote node. 
It runs a job fine if I invoke it so that it only runs on the local
node.  Ctrl+C never successfully cancels an mpirun job -- I have to use
kill -9.

I'm asking for help trying to figure out what steps have been taken by
mpirun, and how I can figure out where things are getting stuck /
crashing.  What could be happening on the remote nodes?  What debugging
steps can I take?

Without MPI running, the cluster is of no use, so I would really
appreciate some help here.




