Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-29 Thread fengguang tian
Hi,
I have solved this problem. Some directories left over from previous
Open MPI versions had not been removed from the cluster; after I removed
them, it works fine. Thank you.
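
For anyone hitting the same error, here is a minimal sketch for listing
leftover session directories before a run. The message above does not say
which directories were removed, so this is only an assumption: it looks for
the openmpi-sessions-<user>@<host>_<n> naming seen in the error output, and
the function name find_session_dirs is made up for illustration.

```python
import getpass
import glob
import os
import tempfile


def find_session_dirs(tmp_root=None, user=None):
    """Return leftover Open MPI session directories under tmp_root, sorted."""
    tmp_root = tmp_root or tempfile.gettempdir()
    user = user or getpass.getuser()
    # Assumed pattern, copied from the error message in this thread:
    # openmpi-sessions-<user>@<host>_<n>
    pattern = os.path.join(
        tmp_root, "openmpi-sessions-%s@*" % glob.escape(user))
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


if __name__ == "__main__":
    for path in find_session_dirs():
        # Only list candidates; remove them by hand once no MPI job is running.
        print(path)
```

The script only prints candidates and deletes nothing, since a session
directory may belong to a job that is still running. It would have to be run
on every node (here nimbus and nimbus1).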

cheers
fengguang

On Mon, Mar 29, 2010 at 11:47 AM, Josh Hursey  wrote:

> Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?
>
> This will help us determine if your problem is with the C/R work or with
> the ORTE runtime. I suspect that there is something odd with your system
> that is confusing the runtime (so not a C/R problem).
>
> Have you made sure to remove the previous versions of Open MPI from all
> machines on your cluster, before installing the new version? Sometimes
> problems like this come up because of mismatches in Open MPI versions on a
> machine.
>
> -- Josh
>


Re: [OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-29 Thread Josh Hursey
Does this happen when you run without '-am ft-enable-cr' (so a no-C/R run)?

This will help us determine if your problem is with the C/R work or with
the ORTE runtime. I suspect that there is something odd with your system
that is confusing the runtime (so not a C/R problem).

Have you made sure to remove the previous versions of Open MPI from all
machines on your cluster, before installing the new version? Sometimes
problems like this come up because of mismatches in Open MPI versions on a
machine.
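
One way to check for such a mismatch is to compare what 'mpirun --version'
reports on each node. The sketch below is only an illustration: it assumes
passwordless ssh to every host and that mpirun is on the PATH of a
non-interactive login, and the helper names mpirun_version and
group_by_version are hypothetical.

```python
import subprocess


def mpirun_version(host, timeout=30):
    """Return the first line printed by `mpirun --version` on host (via ssh)."""
    result = subprocess.run(
        ["ssh", host, "mpirun", "--version"],
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
        universal_newlines=True, timeout=timeout,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else "<no output>"


def group_by_version(versions):
    """Map each distinct version string to the sorted hosts reporting it."""
    groups = {}
    for host, version in versions.items():
        groups.setdefault(version, []).append(host)
    return {v: sorted(h) for v, h in groups.items()}


if __name__ == "__main__":
    hosts = ["nimbus", "nimbus1"]  # host names from the original post
    groups = group_by_version({h: mpirun_version(h) for h in hosts})
    for version, members in groups.items():
        print(version, "->", ", ".join(members))
    if len(groups) > 1:
        print("Mismatch: more than one Open MPI version is visible.")
```

This only sees the mpirun that comes first on the PATH of each node. An older
installation elsewhere on the disk would not show up, so it does not replace
checking the install directories themselves.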


-- Josh





[OMPI users] question about checkpoint on cluster, mpirun doesn't work on cluster

2010-03-23 Thread fengguang tian
I have run into the same problem described in this thread:
http://www.open-mpi.org/community/lists/users/2009/12/11374.php

The solution given there is to use Open MPI v1.4 instead of v1.3. However,
I am using Open MPI v1.7a1r22794 and still see the same problem.

Here is what I have done. My cluster is composed of two machines: nimbus
(master) and nimbus1 (slave). When I run

mpirun -np 40 -am ft-enable-cr --hostfile .mpihostfile myapplication

on nimbus, it does not work and prints:

[nimbus1:21387] opal_os_dirpath_create: Error: Unable to create the sub-directory (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759) of (/tmp/openmpi-sessions-mpiu@nimbus1_0/59759/0/1), mkdir failed [1]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 399
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file base/ess_base_std_orted.c at line 301
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file ess_env_module.c at line 143
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 129
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[nimbus1:21387] [[59759,0],1] could not get route to [[INVALID],INVALID]
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file util/show_help.c at line 602
[nimbus1:21387] [[59759,0],1] ORTE_ERROR_LOG: Error in file orted/orted_main.c at line 355
--
A daemon (pid 10737) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
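
Since the output above repeats the same few errors many times, a small script
can help reduce it to the directory that could not be created and the distinct
source locations reported. This is only a sketch written against the text of
this one log: the regular expressions and the function name summarize are
assumptions, not part of any Open MPI tool.

```python
import re

# Matches e.g. "ORTE_ERROR_LOG: Error in file util/session_dir.c at line 106"
LOCATION = re.compile(r"in file\s+(\S+)\s+at line\s+(\d+)")
# Matches the directory named in the opal_os_dirpath_create message
FAILED_DIR = re.compile(r"Unable to create the\s+sub-directory\s+\((\S+?)\)")


def summarize(log_text):
    """Return (failed_dirs, locations), de-duplicated in first-seen order."""
    # Collapse all whitespace so entries wrapped across lines still match.
    text = " ".join(log_text.split())
    failed_dirs = list(dict.fromkeys(FAILED_DIR.findall(text)))
    locations = list(dict.fromkeys(
        (name, int(line)) for name, line in LOCATION.findall(text)))
    return failed_dirs, locations


if __name__ == "__main__":
    import sys
    dirs, locs = summarize(sys.stdin.read())
    for d in dirs:
        print("could not create:", d)
    for name, line in locs:
        print("reported at: %s:%d" % (name, line))
```

Fed the log on standard input, it prints each failed directory and each
file/line pair once. The first entries are the most useful, because the later
errors appear to follow from the failed mkdir.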


cheers
fengguang