[OMPI users] Segmentation fault / Address not mapped (1) with 2-node job on Rocks 5.2

2010-06-21 Thread Riccardo Murri
Hello,

I'm using OpenMPI 1.4.2 on a Rocks 5.2 cluster.  I compiled it on my
own to have a thread-enabled MPI (the OMPI coming with Rocks 5.2
apparently only supports MPI_THREAD_SINGLE), and installed into ~/sw.

To test the newly installed library I compiled a simple "hello world"
that comes with Rocks::

  [murri@idgc3grid01 hello_mpi.d]$ cat hello_mpi.c
  #include <stdio.h>
  #include <sys/utsname.h>

  #include <mpi.h>

  int main(int argc, char **argv) {
    int myrank;
    struct utsname unam;

    MPI_Init(&argc, &argv);

    uname(&unam);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello from rank %d on host %s\n", myrank, unam.nodename);

    MPI_Finalize();
    return 0;
  }

The program runs fine as long as it only uses ranks on localhost::

  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host localhost -np 2 hello_mpi
  Hello from rank 1 on host idgc3grid01.uzh.ch
  Hello from rank 0 on host idgc3grid01.uzh.ch

However, as soon as I try to run on more than one host, I get a
segfault::

  [murri@idgc3grid01 hello_mpi.d]$ mpirun --host idgc3grid01,compute-0-11 --pernode hello_mpi
  [idgc3grid01:13006] *** Process received signal ***
  [idgc3grid01:13006] Signal: Segmentation fault (11)
  [idgc3grid01:13006] Signal code: Address not mapped (1)
  [idgc3grid01:13006] Failing at address: 0x50
  [idgc3grid01:13006] [ 0] /lib64/libpthread.so.0 [0x359420e4c0]
  [idgc3grid01:13006] [ 1] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xdb) [0x2b352d00265b]
  [idgc3grid01:13006] [ 2] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x676) [0x2b352d00e0e6]
  [idgc3grid01:13006] [ 3] /home/oci/murri/sw/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xb8) [0x2b352d015358]
  [idgc3grid01:13006] [ 4] /home/oci/murri/sw/lib/openmpi/mca_plm_rsh.so [0x2b352dcb9a80]
  [idgc3grid01:13006] [ 5] mpirun [0x40345a]
  [idgc3grid01:13006] [ 6] mpirun [0x402af3]
  [idgc3grid01:13006] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x359361d974]
  [idgc3grid01:13006] [ 8] mpirun [0x402a29]
  [idgc3grid01:13006] *** End of error message ***
  Segmentation fault

I've already tried the suggestions posted to similar messages on the
list: "ldd" reports that the executable is linked with the libraries
in my home, not the system-wide OMPI::

  [murri@idgc3grid01 hello_mpi.d]$ ldd hello_mpi
  libmpi.so.0 => /home/oci/murri/sw/lib/libmpi.so.0 (0x2ad2bd6f2000)
  libopen-rte.so.0 => /home/oci/murri/sw/lib/libopen-rte.so.0 (0x2ad2bd997000)
  libopen-pal.so.0 => /home/oci/murri/sw/lib/libopen-pal.so.0 (0x2ad2bdbe3000)
  libdl.so.2 => /lib64/libdl.so.2 (0x003593e0)
  libnsl.so.1 => /lib64/libnsl.so.1 (0x003596a0)
  libutil.so.1 => /lib64/libutil.so.1 (0x0035a100)
  libm.so.6 => /lib64/libm.so.6 (0x003593a0)
  libpthread.so.0 => /lib64/libpthread.so.0 (0x00359420)
  libc.so.6 => /lib64/libc.so.6 (0x00359360)
  /lib64/ld-linux-x86-64.so.2 (0x00359320)

I've also checked with "strace" that the "mpi.h" file used during
compile is the one in ~/sw/include and that all ".so" files being
loaded from OMPI are the ones in ~/sw/lib.  I can ssh to the target
compute node without a password.  The "mpirun" and "mpicc" are the
correct ones::

  [murri@idgc3grid01 hello_mpi.d]$ which mpirun
  ~/sw/bin/mpirun

  [murri@idgc3grid01 hello_mpi.d]$ which mpicc
  ~/sw/bin/mpicc


I'm pretty stuck now; can anybody give me a hint?

Thanks a lot for any help!

Best regards,
Riccardo


Re: [OMPI users] ompi-ps failure on WinXP

2010-06-21 Thread Shiqing Fan

Hi Stephan,

I haven't tried generating NMake Makefiles to build Open MPI on Windows;
the original purpose of using CMake was to generate Visual Studio
solution files. But if you can provide more information, e.g. error
messages, maybe I can figure out the problem with NMake.


Thanks,
Shiqing

On 2010-6-21 7:43 PM, Stephan Hackstedt wrote:
Update: NMake Makefile creation works with VC8, but using nmake to
install Open MPI produces an error...

I will try to find a way to build it.

Stephan

2010/6/21 Stephan Hackstedt:


Hi Shiqing,

I just checked out the code, but I am unable to create the NMake
makefile with CMake.
CMake tells me it is unable to define 8-bit types.
I also noticed that, on Windows, the 1.4.2 release works with
CMake, while versions above this make CMake fail.
I am using the VC10 compiler, but as an alternative I can use VC8.
Maybe it's worth a try.
If I can make it work, I will report back :)

Stephan

2010/6/21 Shiqing Fan:


Hi Stephan,

For ompi-server test, you could probably refer to this Open
MPI doc: http://www.open-mpi.org/doc/v1.4/man1/ompi-server.1.php .

Possible tests would be "ompi-server -r -", "ompi-server -r
+", "ompi-server -r file", or you can also write a MPI program
using MPI_Lookup_name/MPI_Publish_name functions.


Regards,
Shiqing





On 2010-6-20 11:14 AM, Stephan Hackstedt wrote:

Hello,

i found no solution for this until yet.
Is there anyone who has a running ompi-server.exe on Windows XP?
If so, it would be great to tell me what i can do, to make
ompi-server-exe running properly on WinXP.

Stephan


2010/6/16 Stephan Hackstedt:

Hello,

I am using Open MPI on a WinXP Professional VirtualBox machine.
Open MPI was built with cmake and nmake.
When I try to use the ompi-ps tool, I get the following failure
(the same happens with ompi-server, ompi-clean and orted):




###

D:\project\cluster_ompi>ompi-ps.exe
[vbox:03552] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file D:\project\openmpi_1_4_2_src\orte\runtime\orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--






On the other hand, when I use mpirun to start the tools, e.g.
"mpirun ompi-ps.exe", there is no error.
Is this normal, or is there a fix for my problem?
It would be nice if somebody had a solution for this.


Stephan



___
users mailing list
us...@open-mpi.org 
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
--
Shiqing Fan  http://www.hlrs.de/people/fan
High Performance Computing   Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)Fax.: +49 711 685 65832
Address:Allmandring 30   email: f...@hlrs.de
70569 Stuttgart



Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread charlie strauss

To be more specific:

I have a working xgrid with the environment variables set.  In
particular, I can run xgrid commands from the shell prompt like this:

xgrid -job submit /bin/hostname

and it runs because the environment variables are set.

My understanding is that openMPI will look for those environment
variables and, if they are present, try to run on xgrid.  My
understanding is also that no configuration files are needed for this;
it should work out of the box.

Thus I should be able to type at the same command line:
mpirun -np 3 /bin/hostname
 or
mpirun -np 3 examples/hello_c    (the MPI example)

and have them run on xgrid (for example, see
http://www.macresearch.org/getting_started_with_openmpi_and_xgrid).

But that's not what happens; instead, they always run on the localhost.

I know I'm not the only one who has this issue, since I can reproduce
it on 6 different computers around me and I see questions like mine
posted on the web.

Is there any other configuration one needs in order to use the built-in
openMPI and have it use an available xgrid?

(Separate question: if so, does it always use the default logical
grid, or is there a way to configure which grid ID is used?  A given
controller host can partition the grid into logical subsets of nodes.
In xgrid-speak these are called logical grids, and one of them is
assigned to be the default grid, used if the grid ID is not specified.)








On Jun 21, 2010, at 1:40 PM, Barrett, Brian W wrote:

You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME  
and XGRID_CONTROLLER_PASSWORD) with the correct information in order  
for the XGrid starter to work.  Due to the way XGrid works, the  
nolocal option will not work properly when launching with XGrid.


Brian

On Jun 21, 2010, at 1:28 PM, charlie strauss wrote:

Perhaps I was mistaken about 1.5rc1.As for  the installed  
openMPI on mac osx, my 10.5 OSX has v1.2.3  when I try to run it,  
it works fine locally but it never finds the xgrid.


any mpi job I run, will run on the localhost not the xgrid agents.   
If try to force the issue by specifying -nolocal then it just  
complains there are no nodes.


SO how do I use openMPI so that it uses the nodes of an xgrid  
cluster?


mpirun -nolocal -n 32 /bin/hostname
--
There are no available nodes allocated to this job. This could be  
because

no nodes were found or all the available nodes were already used.

Note that since the -nolocal option was given no processes can be
launched on the local node.
--
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of  
resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ 
rmaps/base/rmaps_base_support_fns.c at line 168
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of  
resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ 
rmaps/round_robin/rmaps_rr.c at line 402
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of  
resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ 
rmaps/base/rmaps_base_map_job.c at line 210
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of  
resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/ 
rmgr/urm/rmgr_urm.c at line 372










On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:

Where did you see that 1.5 works with xgrid? That support has been  
broken since the 1.2 series, unfortunately, so it would help to  
ensure we don't have stale docs out there to the contrary.


As for the 1.2 results, you are aware (I imagine) that OSX ships  
with the last 1.2 release already installed? You don't have to do  
anything to use it but run.


If you are getting peer timeouts, that is almost always a firewall  
issue. But I would try the factory-installed version first to be  
sure.


On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:

I'm new to openMPI.  I'm trying to set it up for using xgrid.  I  
have read
that v1.3 and v1.4 are broken on OSX 10.5 and 10.6 although I  
have seen
some discussions in the archives of this mail list saying some  
people have

v1.4 running on 10.6.

I have now compiled both openMPI 1.2 and openMPI1.5rc  and  
neither of
these is working for me with xgrid.   Both of these say they work  
with

xgrid.

The failuremodes are different.

Anyone know how to get a working install?  I am building this on  
a OSX 10.5.8
machine.  THe xgrid controller is on a OSX 10.6 server machine.   
I have tried

configuring with and without the --with-xgrid option.

Behaviour of openMPI1.2
$ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname

THe job appears in the xgrid queue, and the logs show it is  
running on a
remote machine.  However nothing ever happens and peeking in the  
xgrid

results I see:

$ xgrid -job results -id 8703
[brio.llnl.gov:38789] [0,0,1]-[0,0,0]  
mca_oob_tcp_peer_complete_connect:

connection failed: 

Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread Ralph Castain
If you want, you can upgrade to the last release in the 1.2 series from the 
www.open-mpi.org web site. Anything in 1.2 will work - just not beyond.


On Jun 21, 2010, at 1:40 PM, Barrett, Brian W wrote:

> You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME and 
> XGRID_CONTROLLER_PASSWORD) with the correct information in order for the 
> XGrid starter to work.  Due to the way XGrid works, the nolocal option will 
> not work properly when launching with XGrid.
> 
> Brian
> 
> On Jun 21, 2010, at 1:28 PM, charlie strauss wrote:
> 
>> Perhaps I was mistaken about 1.5rc1.As for  the installed openMPI on mac 
>> osx, my 10.5 OSX has v1.2.3  when I try to run it, it works fine locally but 
>> it never finds the xgrid.
>> 
>> any mpi job I run, will run on the localhost not the xgrid agents.  If try 
>> to force the issue by specifying -nolocal then it just complains there are 
>> no nodes.
>> 
>> SO how do I use openMPI so that it uses the nodes of an xgrid cluster?
>> 
>> mpirun -nolocal -n 32 /bin/hostname
>> --
>> There are no available nodes allocated to this job. This could be because
>> no nodes were found or all the available nodes were already used.
>> 
>> Note that since the -nolocal option was given no processes can be 
>> launched on the local node.
>> --
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c
>>  at line 168
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c 
>> at line 402
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file 
>> /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c
>>  at line 210
>> [ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
>> file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at 
>> line 372
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:
>> 
>>> Where did you see that 1.5 works with xgrid? That support has been broken 
>>> since the 1.2 series, unfortunately, so it would help to ensure we don't 
>>> have stale docs out there to the contrary.
>>> 
>>> As for the 1.2 results, you are aware (I imagine) that OSX ships with the 
>>> last 1.2 release already installed? You don't have to do anything to use it 
>>> but run.
>>> 
>>> If you are getting peer timeouts, that is almost always a firewall issue. 
>>> But I would try the factory-installed version first to be sure.
>>> 
>>> On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:
>>> 
>>>> I'm new to openMPI.  I'm trying to set it up for using xgrid.  I have read
>>>> that v1.3 and v1.4 are broken on OSX 10.5 and 10.6 although I have seen
>>>> some discussions in the archives of this mail list saying some people have
>>>> v1.4 running on 10.6.
>>>>
>>>> I have now compiled both openMPI 1.2 and openMPI1.5rc  and neither of
>>>> these is working for me with xgrid.   Both of these say they work with
>>>> xgrid.
>>>>
>>>> The failuremodes are different.
>>>>
>>>> Anyone know how to get a working install?  I am building this on a OSX 10.5.8
>>>> machine.  THe xgrid controller is on a OSX 10.6 server machine.  I have tried
>>>> configuring with and without the --with-xgrid option.
>>>>
>>>> Behaviour of openMPI1.2
>>>> $ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname
>>>>
>>>> THe job appears in the xgrid queue, and the logs show it is running on a
>>>> remote machine.  However nothing ever happens and peeking in the xgrid
>>>> results I see:
>>>>
>>>> $ xgrid -job results -id 8703
>>>> [brio.llnl.gov:38789] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>>>> connection failed: Operation timed out (60) - retrying
>>>> [brio.llnl.gov:38792] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect:
>>>> connection failed: Operation timed out (60) - retrying
>>>>
>>>> Perhaps a firewall issue?
>>>>
>>>> Of course I'm more interested in getting the new openMPI1.5 working.
>>>> When I run this, again I get an entry in the queue, and the job runs on a
>>>> remote machine but  I get a job failed message
>>>>
>>>> $ /usr/local/openmpi5/bin/mpirun -n 2 /bin/hostname
>>>> $ xgrid -job results -id 8702
>>>> [brio.llnl.gov:38776] Error: unknown option "-mca"
>>>>
>>>> Note I have NOT installed openMPI on any of the other computers in the
>>>> grid.  So perhaps that is the problem?  If I did install it on other
>>>> computers how would I tell mpirun where to find the path to the install
>>>> point?
>>>>
>>>> Finally in both cases, I don't see any way to pass xgrid specific argument
>>>> in on the mpi command

Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread Barrett, Brian W
You have to set two environment variables (XGRID_CONTROLLER_HOSTNAME and 
XGRID_CONTROLLER_PASSWORD) with the correct information in order for the XGrid 
starter to work.  Due to the way XGrid works, the nolocal option will not work 
properly when launching with XGrid.
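
As a rough illustration of the point above, here is a small pre-flight
check one could run before calling mpirun.  It is only a sketch: the two
variable names are the ones given in this message, but the helper names
(env_is_set, xgrid_env_ready) are made up for the example and are not part
of Open MPI or XGrid.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if the named environment variable is set to a non-empty value. */
static int env_is_set(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    return value != NULL && value[0] != '\0';
}

/* Hypothetical helper (not an Open MPI function): returns 1 only if both
 * XGrid controller variables are set, and names any missing one on stderr. */
int xgrid_env_ready(void)
{
    static const char *required[] = {
        "XGRID_CONTROLLER_HOSTNAME",
        "XGRID_CONTROLLER_PASSWORD",
    };
    int ok = 1;
    for (size_t i = 0; i < sizeof required / sizeof required[0]; i++) {
        if (!env_is_set(required[i])) {
            fprintf(stderr, "%s is not set\n", required[i]);
            ok = 0;
        }
    }
    return ok;
}
```

Calling xgrid_env_ready() from a small wrapper before launching the job
would at least rule out a missing variable as the reason the job stays on
the localhost; it says nothing about whether the values are correct.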

Brian

On Jun 21, 2010, at 1:28 PM, charlie strauss wrote:

Perhaps I was mistaken about 1.5rc1.As for  the installed openMPI on mac 
osx, my 10.5 OSX has v1.2.3  when I try to run it, it works fine locally but it 
never finds the xgrid.

any mpi job I run, will run on the localhost not the xgrid agents.  If try to 
force the issue by specifying -nolocal then it just complains there are no 
nodes.

SO how do I use openMPI so that it uses the nodes of an xgrid cluster?

mpirun -nolocal -n 32 /bin/hostname
--
There are no available nodes allocated to this job. This could be because
no nodes were found or all the available nodes were already used.

Note that since the -nolocal option was given no processes can be
launched on the local node.
--
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
file 
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c
 at line 168
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
file 
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c at 
line 402
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
file 
/SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c 
at line 210
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in 
file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at 
line 372









On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:

Where did you see that 1.5 works with xgrid? That support has been broken since 
the 1.2 series, unfortunately, so it would help to ensure we don't have stale 
docs out there to the contrary.

As for the 1.2 results, you are aware (I imagine) that OSX ships with the last 
1.2 release already installed? You don't have to do anything to use it but run.

If you are getting peer timeouts, that is almost always a firewall issue. But I 
would try the factory-installed version first to be sure.

On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:

I'm new to openMPI.  I'm trying to set it up for using xgrid.  I have read
that v1.3 and v1.4 are broken on OSX 10.5 and 10.6 although I have seen
some discussions in the archives of this mail list saying some people have
v1.4 running on 10.6.

I have now compiled both openMPI 1.2 and openMPI1.5rc  and neither of
these is working for me with xgrid.   Both of these say they work with
xgrid.

The failuremodes are different.

Anyone know how to get a working install?  I am building this on a OSX 10.5.8
machine.  THe xgrid controller is on a OSX 10.6 server machine.  I have tried
configuring with and without the --with-xgrid option.

Behaviour of openMPI1.2
$ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname

THe job appears in the xgrid queue, and the logs show it is running on a
remote machine.  However nothing ever happens and peeking in the xgrid
results I see:

$ xgrid -job results -id 8703
[brio.llnl.gov:38789] [0,0,1]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed: Operation timed out (60) - retrying
[brio.llnl.gov:38792] [0,0,2]-[0,0,0] mca_oob_tcp_peer_complete_connect:
connection failed: Operation timed out (60) - retrying

Perhaps a firewall issue?

Of course I'm more interested in getting the new openMPI1.5 working.
When I run this, again I get an entry in the queue, and the job runs on a
remote machine but  I get a job failed message

$ /usr/local/openmpi5/bin/mpirun -n 2 /bin/hostname
$ xgrid -job results -id 8702
[brio.llnl.gov:38776] Error: unknown option "-mca"



Note I have NOT installed openMPI on any of the other computers in the
grid.  So perhaps that is the problem?  If I did install it on other
computers how would I tell mpirun where to find the path to the install
point?




Finally in both cases, I don't see any way to pass xgrid specific argument
in on the mpi command line.  An xgrid controller divides the agents into
sets of logical grids and you need to specify which logical grid to submit
the job to.In xgrid cli syntax one write "xgrid -gid 2"  for grid 2.
When I use openMPI all the jobs get sent to just the default grid which is
the grid that xgrid uses if no gid is specified.



Charlie Strauss
Bioscience Division
c...@lanl.gov
505 665 4838
Quidquid latine dictum sit, altum sonatur.


Re: [OMPI users] Xgrid an openmpi 1.2 and 1.5rc1

2010-06-21 Thread charlie strauss
Perhaps I was mistaken about 1.5rc1.  As for the installed openMPI
on Mac OS X, my 10.5 OS X has v1.2.3.  When I try to run it, it works
fine locally but it never finds the xgrid.

Any MPI job I run will run on the localhost, not on the xgrid agents.
If I try to force the issue by specifying -nolocal, then it just
complains there are no nodes.

So how do I use openMPI so that it uses the nodes of an xgrid cluster?

mpirun -nolocal -n 32 /bin/hostname
--
There are no available nodes allocated to this job. This could be because
no nodes were found or all the available nodes were already used.

Note that since the -nolocal option was given no processes can be
launched on the local node.
--
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_support_fns.c at line 168
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/round_robin/rmaps_rr.c at line 402
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmaps/base/rmaps_base_map_job.c at line 210
[ocho.lanl.gov:35438] [0,0,0] ORTE_ERROR_LOG: Temporarily out of resource in file /SourceCache/openmpi/openmpi-5/openmpi/orte/mca/rmgr/urm/rmgr_urm.c at line 372










On Jun 16, 2010, at 1:36 PM, Ralph Castain wrote:

Where did you see that 1.5 works with xgrid? That support has been  
broken since the 1.2 series, unfortunately, so it would help to  
ensure we don't have stale docs out there to the contrary.


As for the 1.2 results, you are aware (I imagine) that OSX ships  
with the last 1.2 release already installed? You don't have to do  
anything to use it but run.


If you are getting peer timeouts, that is almost always a firewall  
issue. But I would try the factory-installed version first to be sure.


On Jun 16, 2010, at 1:14 PM, Charlie E. Strauss wrote:

I'm new to openMPI.  I'm trying to set it up for using xgrid.  I  
have read
that v1.3 and v1.4 are broken on OSX 10.5 and 10.6 although I have  
seen
some discussions in the archives of this mail list saying some  
people have

v1.4 running on 10.6.

I have now compiled both openMPI 1.2 and openMPI1.5rc  and neither of
these is working for me with xgrid.   Both of these say they work  
with

xgrid.

The failuremodes are different.

Anyone know how to get a working install?  I am building this on a  
OSX 10.5.8
machine.  THe xgrid controller is on a OSX 10.6 server machine.  I  
have tried

configuring with and without the --with-xgrid option.

Behaviour of openMPI1.2
$ /usr/local/openmpi/bin/mpirun -nolocal -n 2 /bin/hostname

THe job appears in the xgrid queue, and the logs show it is running  
on a
remote machine.  However nothing ever happens and peeking in the  
xgrid

results I see:

$ xgrid -job results -id 8703
[brio.llnl.gov:38789] [0,0,1]-[0,0,0]  
mca_oob_tcp_peer_complete_connect:

connection failed: Operation timed out (60) - retrying
[brio.llnl.gov:38792] [0,0,2]-[0,0,0]  
mca_oob_tcp_peer_complete_connect:

connection failed: Operation timed out (60) - retrying

Perhaps a firewall issue?

Of course I'm more interested in getting the new openMPI1.5 working.
When I run this, again I get an entry in the queue, and the job  
runs on a

remote machine but  I get a job failed message

$ /usr/local/openmpi5/bin/mpirun -n 2 /bin/hostname
$ xgrid -job results -id 8702
[brio.llnl.gov:38776] Error: unknown option "-mca"



Note I have NOT installed openMPI on any of the other computers in  
the

grid.  So perhaps that is the problem?  If I did install it on other
computers how would I tell mpirun where to find the path to the  
install

point?




Finally in both cases, I don't see any way to pass xgrid specific  
argument
in on the mpi command line.  An xgrid controller divides the agents  
into
sets of logical grids and you need to specify which logical grid to  
submit
the job to.In xgrid cli syntax one write "xgrid -gid 2"  for  
grid 2.
When I use openMPI all the jobs get sent to just the default grid  
which is

the grid that xgrid uses if no gid is specified.





Charlie Strauss
Bioscience Division
c...@lanl.gov
505 665 4838
Quidquid latine dictum sit, altum sonatur.



Re: [OMPI users] ompi-ps failure on WinXP

2010-06-21 Thread Shiqing Fan


Hi Stephan,

For testing ompi-server, you could refer to this Open MPI doc:
http://www.open-mpi.org/doc/v1.4/man1/ompi-server.1.php .


Possible tests would be "ompi-server -r -", "ompi-server -r +",
"ompi-server -r file", or you can also write an MPI program using
the MPI_Lookup_name/MPI_Publish_name functions.



Regards,
Shiqing




On 2010-6-20 11:14 AM, Stephan Hackstedt wrote:

Hello,

I have not found a solution for this yet.
Is there anyone who has ompi-server.exe running on Windows XP?
If so, it would be great if you could tell me what I can do to make
ompi-server.exe run properly on WinXP.


Stephan


2010/6/16 Stephan Hackstedt:


Hello,

i am using Open-MPI on a WinXP Professional VirtualBox machine.
Open-MPI is build with cmake and nmake.
When i'm trying to use the ompi-ps tool i got the following
failure (the same with ompi-server, ompi-clean and orted):



###

D:\project\cluster_ompi>ompi-ps.exe
[vbox:03552] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file D:\project\openmpi_1_4_2_src\orte\runtime\orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--





on the other way, when use mpirun to start the tools like "mpirun
ompi-ps.exe" there is no error.
It this normal, or maybe is there an fix to solve my problem?
I'm would be nice if somebody has a solution for this.


Stephan






--
--
Shiqing Fan  http://www.hlrs.de/people/fan
High Performance Computing   Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)Fax.: +49 711 685 65832
Address:Allmandring 30   email: f...@hlrs.de
70569 Stuttgart



Re: [OMPI users] ompi-ps failure on WinXP

2010-06-21 Thread Shiqing Fan


Hi Stephan,

ompi-ps is now fixed in trunk (r23286) and should work again with this
fix. Could you please update and try it again?



Thanks,
Shiqing

On 2010-6-16 6:55 PM, Stephan Hackstedt wrote:

Hello,

i am using Open-MPI on a WinXP Professional VirtualBox machine.
Open-MPI is build with cmake and nmake.
When i'm trying to use the ompi-ps tool i got the following failure 
(the same with ompi-server, ompi-clean and orted):




###

D:\project\cluster_ompi>ompi-ps.exe
[vbox:03552] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file D:\project\openmpi_1_4_2_src\orte\runtime\orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--





on the other way, when use mpirun to start the tools like "mpirun 
ompi-ps.exe" there is no error.

It this normal, or maybe is there an fix to solve my problem?
I'm would be nice if somebody has a solution for this.


Stephan





--
--
Shiqing Fan  http://www.hlrs.de/people/fan
High Performance Computing   Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)Fax.: +49 711 685 65832
Address:Allmandring 30   email: f...@hlrs.de
70569 Stuttgart



Re: [OMPI users] Open MPI task scheduler

2010-06-21 Thread Matthieu Brucher
2010/6/21 Jack Bryan :
> Hi,
> thank you very much for your help.
> What is the meaning of "must find a system so that every task can be
> serialized in the same form"? What is the meaning of "serialize"?

Serialization is the process of converting an object instance into a
text/binary stream, and of creating a new object instance from this
stream. It allows data to be communicated between processes. With MPI,
you send one data type after another; with serialization, you send
everything in one big chunk.
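
To make this concrete, here is a minimal sketch in plain C.  The task_t
record and the function names are invented for the illustration (they are
not from my code or from any library), and the sketch assumes that sender
and receiver share the same byte order and floating-point format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TASK_NAME_LEN 16

/* Hypothetical task record; the fields are illustrative only. */
typedef struct {
    int32_t id;
    double  weight;
    char    name[TASK_NAME_LEN];
} task_t;

/* Bytes occupied by one task in the flat stream (fields only, no padding). */
#define TASK_WIRE_SIZE (sizeof(int32_t) + sizeof(double) + TASK_NAME_LEN)

/* Copy every field into buf in a fixed order; returns the bytes written. */
size_t task_serialize(const task_t *t, unsigned char *buf)
{
    assert(t != NULL && buf != NULL);
    size_t off = 0;
    memcpy(buf + off, &t->id, sizeof t->id);         off += sizeof t->id;
    memcpy(buf + off, &t->weight, sizeof t->weight); off += sizeof t->weight;
    memcpy(buf + off, t->name, sizeof t->name);      off += sizeof t->name;
    return off;
}

/* Rebuild a task from buf, reading the fields in the same order. */
size_t task_deserialize(task_t *t, const unsigned char *buf)
{
    assert(t != NULL && buf != NULL);
    size_t off = 0;
    memcpy(&t->id, buf + off, sizeof t->id);         off += sizeof t->id;
    memcpy(&t->weight, buf + off, sizeof t->weight); off += sizeof t->weight;
    memcpy(t->name, buf + off, sizeof t->name);      off += sizeof t->name;
    return off;
}
```

The buffer filled by task_serialize() is the "one big chunk": it could be
written to a socket, or sent in a single MPI call as TASK_WIRE_SIZE
elements of type MPI_BYTE, and turned back into a task_t on the other side.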

> I have no experience of programming with python and XML.

Python is not mandatory at all. I use it to automate the generation of
the wrappers/SOAP files, and to talk to the daemon. You can do this in
any language you are comfortable with.

> I have studied your blog.
> Where can I find a simple example to use the techniques you have said ?

If you are looking for RPC samples, you can search Google for just
"SOAP"; it will find several tutorials on how this works. As Jody
said, you may want something simpler if you can have all tasks in one
MPI process, but once you go to a big cluster with variable resources,
you will be stuck.

> For example, I have 5 tasks (print "hello world !").
> I want to use 6 processors to do it in parallel.
> One processor is the manager node, which distributes tasks, and the other 5
> processors do the printing jobs and, when they are done, tell this to the
> manager node.

In this case, you have your daemon running alongside the jobs started by
the batch scheduler, and each process asks the daemon for a new ticket.
You may add tickets by talking to the daemon directly, without having to
launch a new job.
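
Leaving the network layer aside, the daemon's bookkeeping can be sketched
as a simple first-in, first-out queue of tickets.  Everything below
(ticket_queue_t, add_ticket, get_ticket, the fixed capacity) is a
hypothetical illustration in plain C, not the actual daemon:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TICKETS 64

/* Hypothetical in-memory ticket store kept by the daemon. */
typedef struct {
    int    ids[MAX_TICKETS];  /* ring buffer of pending ticket ids */
    size_t head;              /* index of the oldest pending ticket */
    size_t count;             /* number of pending tickets */
} ticket_queue_t;

void queue_init(ticket_queue_t *q)
{
    assert(q != NULL);
    q->head = 0;
    q->count = 0;
}

/* Called when someone pushes work: returns 0 on success, -1 if full. */
int add_ticket(ticket_queue_t *q, int id)
{
    assert(q != NULL);
    if (q->count == MAX_TICKETS)
        return -1;
    q->ids[(q->head + q->count) % MAX_TICKETS] = id;
    q->count++;
    return 0;
}

/* Called when a worker asks for work: returns 1 and stores the oldest
 * ticket in *id, or returns 0 if nothing is pending. */
int get_ticket(ticket_queue_t *q, int *id)
{
    assert(q != NULL && id != NULL);
    if (q->count == 0)
        return 0;
    *id = q->ids[q->head];
    q->head = (q->head + 1) % MAX_TICKETS;
    q->count--;
    return 1;
}
```

The point of the pull model is that workers call get_ticket() whenever they
are idle, while add_ticket() can be called at any time by another program,
so new work can arrive while the job is already running.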


> Boost.Asio is a cross-platform C++ library for network and low-level I/O
> programming. I have no experience of using it. Will it take a long time to
> learn how to use it ?

Most of the time will go not into mastering Boost, but into understanding
how to create your TCP server and how to serialize your parameters.

> If the messages are transferred by SOAP+TCP, how does the manager node
> call it and push tasks into it ?

You have to think of SOAP + TCP as just a simple function call that
hides everything. From the client node's point of view, it's a simple
function call, server.get_ticket(). The manager node will be talked to
by different kinds of programs: task programs, or a program that
pushes tickets. The latter is just another function call, like
this in C++:

std::vector<Ticket> tickets;   // Ticket: placeholder name for your task type
daemon.connect(somewhere, port);
daemon.add_tickets(tickets);

> Do I need to install SOAP+TCP on my cluster so that I can use it ?

As Jody said, you can do things with MPI directly. I would not
recommend it, but it will get you a quick solution. You will
have to use some MPI2 calls to create a socket on the daemon to talk
to it, and in fact, you will have to create exactly what I proposed,
but less portable.

Matthieu
-- 
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher


[OMPI users] question about reconstructPar

2010-06-21 Thread asmae . elbahlouli
Hello,

I would like to know one thing: I ran "mpirun" on 30 processors and then
ran "reconstructPar". I have now modified my controlDict file so that the
run starts from latestTime and endTime is 1000 intervals beyond the last
one. Do I need to run decomposePar again before running the calculation?

Thanks

Re: [OMPI users] Open MPI task scheduler

2010-06-21 Thread jody
Hi

I think your problem can be solved easily at the MPI level.
Just have your manager execute a loop in which it waits for any message.
Define different message types by their MPI tags. Once a message
has been received, decide what to do by looking at the tag.

Here I assume that a worker with no job sends a message with the tag
TAG_TASK_REQUEST and then waits to receive a message from the master
with either a new task or the command to exit.
Once a worker has finished a task, it sends a message with the tag TAG_RESULT,
and then sends a message containing the result.
I also assume that new tasks can be sent from a different node by using
the tag TAG_NEW_TASK.

The main loop in the Master would be:

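while (more_tasks) {
  // a is a dummy int payload, st an MPI_Status; MPI_COMM_WORLD is assumed
  MPI_Recv(&a, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
  switch (st.MPI_TAG) {
    case TAG_TASK_REQUEST:
      sendNextTask(st.MPI_SOURCE);       // reply with a task or TAG_STOP
      break;
    case TAG_RESULT:
      collectResult(st.MPI_SOURCE);      // receive the TAG_RESULT_CONTENT message
      break;
    case TAG_NEW_TASK:
      putNewTaskOnQueue(st.MPI_SOURCE);
      break;
  }
}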


In a worker:

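while (go_on) {
  // a is a dummy int payload, st an MPI_Status; MPI_COMM_WORLD is assumed
  MPI_Send(&a, 1, MPI_INT, idMaster, TAG_TASK_REQUEST, MPI_COMM_WORLD);
  MPI_Recv(&TaskDef, 1, TaskType, idMaster, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
  if (st.MPI_TAG == TAG_STOP) {
    go_on = false;
  } else {
    result = workOnTask(TaskDef, TaskLen);
    // announce the result first, then send its content
    MPI_Send(&a, 1, MPI_INT, idMaster, TAG_RESULT, MPI_COMM_WORLD);
    MPI_Send(result, 1, resultType, idMaster, TAG_RESULT_CONTENT, MPI_COMM_WORLD);
  }
}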
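
The tag dispatch can also be tried out without MPI. The following is a minimal sketch of the master's decision logic only, written as plain C++ so it can be tested on its own. It is not MPI code: the tag and payload are passed in by hand where a real master would read them from MPI_Recv and its MPI_Status. TAG_TASK, TAG_NONE, the Master struct and handleMessage are names invented for this sketch, and tasks are reduced to integer ids.

```cpp
#include <cassert>
#include <deque>
#include <vector>

// Message tags. TAG_TASK and TAG_NONE are additions for this sketch.
enum Tag { TAG_TASK_REQUEST, TAG_RESULT, TAG_NEW_TASK, TAG_STOP, TAG_TASK, TAG_NONE };

struct Master {
    std::deque<int> queue;     // ids of tasks not yet handed out
    std::vector<int> results;  // payloads collected from workers
};

// Handle one incoming message and return the tag of the reply the master
// would send back (TAG_NONE if no reply is needed). When the reply is
// TAG_TASK, taskOut holds the id of the task to send.
Tag handleMessage(Master& m, Tag tag, int payload, int& taskOut) {
    switch (tag) {
    case TAG_TASK_REQUEST:
        if (m.queue.empty()) return TAG_STOP;  // nothing left: worker may exit
        taskOut = m.queue.front();
        m.queue.pop_front();
        return TAG_TASK;
    case TAG_RESULT:
        m.results.push_back(payload);
        return TAG_NONE;
    case TAG_NEW_TASK:
        m.queue.push_back(payload);
        return TAG_NONE;
    default:
        assert(!"reply-only tag received by master");
        return TAG_NONE;
    }
}
```

In this sketch a worker that asks for work when the queue is empty is told to stop; a real master that expects more TAG_NEW_TASK messages would instead keep the request pending.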

I hope this helps
  Jody

On Mon, Jun 21, 2010 at 12:17 AM, Jack Bryan wrote:
> Hi,
> thank you very much for your help.
> What is the meaning of "must find a system so that every task can be
> serialized in the same form"? What is the meaning of "serialize"?
> I have no experience of programming with python and XML.
> I have studied your blog.
> Where can I find a simple example to use the techniques you have said ?
> For example, I have 5 tasks (print "hello world !").
> I want to use 6 processors to do it in parallel.
> One processor is the manager node, which distributes tasks, and the other 5
> processors do the printing jobs and, when they are done, tell the manager
> node.
>
> Boost.Asio is a cross-platform C++ library for network and low-level I/O
> programming. I have no experience of using it. Will it take a long time to
> learn how to use it ?
> If the messages are transferred by SOAP+TCP, how does the manager node call
> it and push tasks into it ?
> Do I need to install SOAP+TCP on my cluster so that I can use it ?
>
> Any help is appreciated.
> Jack
> June 20  2010
>> Date: Sun, 20 Jun 2010 21:00:06 +0200
>> From: matthieu.bruc...@gmail.com
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] Open MPI task scheduler
>>
>> 2010/6/20 Jack Bryan :
>> > Hi, Matthieu:
>> > Thanks for your help.
>> > Most of your ideas match what I want to do.
>> > My scheduler should be able to be called from any C++ program, which can
>> > put a list of tasks to the scheduler and then the scheduler distributes the
>> > tasks to other client nodes.
>> > It may work like in this way:
>> > while(still tasks available) {
>> > myScheduler.push(tasks);
>> > myScheduler.get(tasks results from client nodes);
>> > }
>>
>> Exactly. In your case, you want only one server, so you must find a
>> system so that every task can be serialized in the same form. The
>> easiest way to do so is to serialize your parameter set as an XML
>> fragment and add the type of task as another field.
>>
>> > My cluster has 400 nodes with Open MPI. The tasks should be transferred
>> > by MPI protocol.
>>
>> No, they should not ;) MPI can be used, but it is not the easiest way
>> to do so. You still have to serialize your ticket, and you have to use
>> some functions that are from MPI2 (so perhaps not as portable as MPI1
>> functions). Besides, it cannot be used from programs that do not know
>> how to use MPI.
>>
>> > I am not familiar with  RPC Protocol.
>>
>> RPC is not a protocol per se. SOAP is. RPC stands for Remote Procedure
>> Call. It is basically your scheduler that has several functions
>> clients can call:
>> - add tickets
>> - retrieve ticket
>> - ticket is done
>>
>> > If I use Boost.ASIO and some Python/GCCXML script to generate the code,
>> > it can be called from C++ program on Open MPI cluster ?
>>
>> Yes, SOAP is just an XML way of representing the fact that you call a
>> function on the server. You can use it with C++, Java, ... I use it
>> with Python to monitor how many tasks are remaining, for instance.
>>
>> > I cannot find the skeleton on your blog.
>> > Would you please tell me where to find it ?
>>
>> It's not complete, as some of the work is the property of my employer. This
>> is how I use GCCXML to generate the calling code:
>>
>> http://matt.eifelle.com/2009/07/21/using-gccxml-to-automate-c-wrappers-creation/
>> You have some additional code to write, but this is the main idea.
>>
>> > I really