[OMPI users] running open mpi on ubuntu 9.04

2009-09-17 Thread Hodgess, Erin
Dear Open MPI people:

I'm trying to run a simple "hello world" program on Ubuntu 9.04.

It's on a dual core laptop; no other machines.

Here is the output:
erin@erin-laptop:~$ mpirun -np 2 a.out
ssh: connect to host erin-laptop port 22: Connection refused
--
A daemon (pid 11854) died unexpectedly with status 255 while attempting
to launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
mpirun: clean termination accomplished

erin@erin-laptop:~$ 

Any help would be much appreciated.

Sincerely,
Erin


Erin M. Hodgess, PhD
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: hodge...@uhd.edu



Re: [OMPI users] Multi-threading with OpenMPI ?

2009-09-17 Thread Ashika Umanga Umagiliya

Hi Jeff, Ralph,

Yes, I call MPI_COMM_SPAWN in multiple threads simultaneously.
Because I need to expose my parallel algorithm as a web service, I need
multiple clients to connect and execute my logic at the same time (i.e., in
multiple threads).
For each client, a new thread is created (by the web service framework), and
inside the thread MPI_Init_thread() is called if MPI hasn't been
initialized.

Then the thread calls MPI_COMM_SPAWN and creates new processes.

So, if this is the case, are there any workarounds?

Thanks in advance,
umanga


Jeff Squyres wrote:

On Sep 16, 2009, at 9:53 PM, Ralph Castain wrote:


Only the obvious, and not very helpful one: comm_spawn isn't thread
safe at this time. You'll need to serialize your requests to that
function.




This is likely the cause of your issues if you are calling 
MPI_COMM_SPAWN in multiple threads simultaneously.  Can you verify?


If not, we'll need to dig a little deeper to figure out what's going 
on.  But Ralph is right -- read up on the THREAD_MULTIPLE constraints 
(check the OMPI README file) to see if that's what's biting you.
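
One way to follow Ralph's suggestion above (serialize the requests to comm_spawn) is to funnel every spawn call through a single mutex. The sketch below is only an illustration under assumptions: do_spawn, serialized_spawn, client_thread and completed_spawns are hypothetical names, and do_spawn is a stand-in for the place where a real program would call MPI_Comm_spawn. It has not been tested against Open MPI itself.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One process-wide lock guarding every spawn request. */
static pthread_mutex_t spawn_lock = PTHREAD_MUTEX_INITIALIZER;

/* Bookkeeping, only touched while spawn_lock is held. */
static int spawns_in_flight = 0;
static int spawns_completed = 0;

/* Hypothetical stand-in for the real spawn call. In an MPI program this
 * is where MPI_Comm_spawn(...) would be invoked. Returns the number of
 * processes that were requested. */
int do_spawn(int nprocs)
{
    spawns_in_flight++;
    /* Would fire if two threads were ever inside the spawn at once. */
    assert(spawns_in_flight == 1);
    /* ... the actual spawn would happen here ... */
    spawns_in_flight--;
    spawns_completed++;
    return nprocs;
}

/* Every client thread goes through this wrapper instead of calling
 * the spawn function directly, so the calls never overlap. */
int serialized_spawn(int nprocs)
{
    int rc;
    pthread_mutex_lock(&spawn_lock);
    rc = do_spawn(nprocs);
    pthread_mutex_unlock(&spawn_lock);
    return rc;
}

/* Thread entry point with the signature pthread_create() expects;
 * arg points to an int holding the number of processes to spawn. */
void *client_thread(void *arg)
{
    int nprocs = *(int *)arg;
    serialized_spawn(nprocs);
    return NULL;
}

/* Number of spawn requests finished so far. */
int completed_spawns(void)
{
    int n;
    pthread_mutex_lock(&spawn_lock);
    n = spawns_completed;
    pthread_mutex_unlock(&spawn_lock);
    return n;
}
```

Each web-service thread would call serialized_spawn(), or be started with pthread_create(&tid, NULL, client_thread, &nprocs), rather than spawning directly. The "initialize MPI if it hasn't been initialized" check would presumably need the same kind of protection, and because several threads make MPI calls (one at a time), the library would have to provide at least MPI_THREAD_SERIALIZED.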






Re: [OMPI users] Application hangs when checkpointing application (update)

2009-09-17 Thread Josh Hursey

Interesting. I'll try to take a look and see if I can reproduce today.

-- Josh

On Sep 14, 2009, at 4:54 PM, Jean Potsam wrote:


Hi Josh,
   Thanks for the response. I am actually testing it on a
single node (though in the near future I will run it on a set of
nodes). Therefore, my application is running on the same machine as
mpirun.
When I run the application and trigger the checkpointing mechanism
from a separate terminal, it checkpoints fine.

However, when I try to checkpoint it from within the main program as
shown below, it hangs.


kind regards,

Jean


--- On Mon, 14/9/09, Josh Hursey  wrote:

From: Josh Hursey 
Subject: Re: [OMPI users] Application hangs when checkpointing  
application (update)

To: "Open MPI Users" 
Date: Monday, 14 September, 2009, 1:27 PM

Is your application running on the same machine as mpirun?

How did you configure Open MPI? Note that this program will not work
without the FT thread enabled, which would be one reason why it
would seem to hang (since it is waiting for the application to enter
the MPI library):

  --enable-ft-thread --enable-mpi-threads

I do not think the message that you saw is related. Often  
orte_checkpoint cannot figure out the jobid on first contact with  
the HNP/mpirun process, so this is displayed as an INVALID handle.


-- Josh

On Sep 11, 2009, at 9:50 AM, Jean Potsam wrote:

>
> Hi Everyone,
>   I noticed that it hangs just before displaying the following while
> trying to checkpoint the application.
>
> [sun06:15252] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
> ###
>
> Can it be related to the above?
>
> Thanks
>
> --

> Hi Everyone,
> I wrote a small program with a function to trigger the checkpointing
> mechanism as follows:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include <mpi.h>
>
> void trigger_checkpoint(void);
>
> int main(int argc, char **argv)
> {
>     int rank, size;
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>     MPI_Comm_size(MPI_COMM_WORLD, &size);
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     /* Ask mpirun to checkpoint this job from inside the application. */
>     trigger_checkpoint();
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("I am processor no %d of a total of %d procs \n", rank, size);
>     system("sleep 10");
>     printf("bye \n");
>     MPI_Finalize();
>     return 0;
> }
>
> void trigger_checkpoint(void)
> {
>     printf("hi\n");
>     /* Runs ompi-checkpoint against the PID of the local mpirun. */
>     system("ompi-checkpoint -v `pidof mpirun` ");
> }
> #
>
>
> The application works fine on my laptop with Ubuntu as the OS.
> However, when I tried running it on one of the machines at my uni,
> with SUSE Linux installed, the application hangs as soon as
> ompi-checkpoint is triggered. This is what I get:
>
> ##
> I am processor no 0 of a total of 1 procs
> hi
> I am processor no 0 of a total of 1 procs
> [sun06:15426] orte_checkpoint: Checkpointing...
> [sun06:15426]PID 15411
> [sun06:15426]Connected to Mpirun [[12727,0],0]
> [sun06:15426] orte_checkpoint: notify_hnp: Contact Head Node Process PID 15411
> ###
>
> Does anyone have any ideas about this?
>
> Thanks a lot
>
> Jean.
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users





Re: [OMPI users] OpenMPI much slower than Mpich2

2009-09-17 Thread Jeff Squyres
Sorry for the delay in replying; my INBOX has become a disaster  
recently.  More below.



On Sep 14, 2009, at 5:08 AM, Sam Verboven wrote:


Dear All,

I'm having the following problem. If I execute the exact same
application using both openmpi and mpich2, the former takes more than
2 times as long. When we compared the ganglia output we could see that
openmpi generates more than 60 percent System CPU whereas mpich2 only
has about 5, the remaining cycles all going to User CPU. This about
explains the slowdown: when using openmpi we lose more than half the
cycles to operating system overhead. We would very much like to know
why our openmpi implementation incurs such a dramatic overhead.

The only reason I could think of myself is the fact that we use
bridged network interfaces on the cluster. Openmpi would not run
properly until we specified the mca command to use the br0 interface
instead of the physical eth0. Mpich2 does not require any extra
parameters.



What did Open MPI do when you did not specify br0?

I assume that br0 is a combination of some other devices, like eth0
and eth1?  If so, what happens if you use "btl_tcp_if_include eth0,eth1"
instead of br0?



The calculations themselves are done using fortran. The operating
system is ubuntu 9.04, we have 14 dual quad core nodes and both
openmpi and mpich2 are compiled from source without any configure
options.

Full command OpenMPI:
mpirun.openmpi --mca btl_tcp_if_include br0 --prefix
/usr/shares/mpi/openmpi -hostfile hostfile -np 224
/home/arickx/bin/Linux/F_me_Kl1l2_3cl_mpi_2

Full command Mpich2:
mpiexec.mpich2 -machinefile machinefile -np 113
/home/arickx/bin/Linux/F_me_Kl1l2_3cl_mpi_2



I notice that you're running almost 2x the number of processes for  
Open MPI as MPICH2 -- does increasing the number of processes increase  
the problem size, or have some other effect on overall run-time?
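
As a rough check on the process counts in the two commands above: assuming "dual quad core" means two quad-core sockets per node (an assumption, it is not spelled out in the original post), the 14 nodes provide 112 cores. The sketch below uses made-up function names and simply works out how many processes per core each command launches.

```c
#include <assert.h>

/* Total cores on the cluster: nodes x sockets per node x cores per socket. */
int total_cores(int nodes, int sockets_per_node, int cores_per_socket)
{
    return nodes * sockets_per_node * cores_per_socket;
}

/* Processes per core; a value above 1.0 means the cores are shared
 * between several MPI processes (oversubscription). */
double procs_per_core(int nprocs, int cores)
{
    assert(cores > 0);
    return (double)nprocs / (double)cores;
}
```

With these numbers, total_cores(14, 2, 4) is 112, so -np 224 comes to two processes per core, while -np 113 is roughly one per core. Whether that accounts for the difference in system CPU time is not established in this thread.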


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] How does OpenMPI decided to use which algorithm in MPI_Bcast????????????????

2009-09-17 Thread Jeff Squyres
Search through the mailing list archives -- this question has been  
discussed a few times.


On Sep 3, 2009, at 2:03 AM, shan axida wrote:


Hi,
I had a glance at the Open MPI source code and there are several
algorithms for the MPI_Bcast function.
My question is: how is the algorithm chosen for a given
MPI_Bcast call? By message size?

Could anyone give me a little more detailed information on this?

Thanks a lot.

Axida




--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] Multi-threading with OpenMPI ?

2009-09-17 Thread Jeff Squyres

On Sep 16, 2009, at 9:53 PM, Ralph Castain wrote:


Only the obvious, and not very helpful one: comm_spawn isn't thread
safe at this time. You'll need to serialize your requests to that
function.




This is likely the cause of your issues if you are calling  
MPI_COMM_SPAWN in multiple threads simultaneously.  Can you verify?


If not, we'll need to dig a little deeper to figure out what's going  
on.  But Ralph is right -- read up on the THREAD_MULTIPLE constraints  
(check the OMPI README file) to see if that's what's biting you.


--
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI users] infiniband question

2009-09-17 Thread Jeff Squyres
Correct, you don't need DAPL.  Can you send all the information listed  
here:


http://www.open-mpi.org/community/help/


On Sep 17, 2009, at 9:17 AM, Yann JOBIC wrote:


Hi,

I'm new to infiniband.
I installed the rdma_cm, rdma_ucm and ib_uverbs kernel modules.

When I run a ring test Open MPI code, I get:
[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)

And then, the program hangs.

I thought I only needed RDMA communications, and didn't need the DAPL lib
(with the ipoib module).

Am I wrong?

Thanks,

Yann



--
___

Yann JOBIC
HPC engineer
Polytech Marseille DME
IUSTI-CNRS UMR 6595
Technopôle de Château Gombert
5 rue Enrico Fermi
13453 Marseille cedex 13
Tel : (33) 4 91 10 69 39
  ou  (33) 4 91 10 69 43
Fax : (33) 4 91 10 69 69





--
Jeff Squyres
jsquy...@cisco.com




[OMPI users] infiniband question

2009-09-17 Thread Yann JOBIC

Hi,

I'm new to infiniband.
I installed the rdma_cm, rdma_ucm and ib_uverbs kernel modules.

When I run a ring test Open MPI code, I get:
[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lidia][0,1,1][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)
[Lilou][0,1,0][btl_openib_endpoint.c:992:mca_btl_openib_endpoint_qp_init_query] Set MTU to IBV value 4 (2048 bytes)


And then, the program hangs.

I thought I only needed RDMA communications, and didn't need the DAPL lib
(with the ipoib module).

Am I wrong?

Thanks,

Yann



--
___

Yann JOBIC
HPC engineer
Polytech Marseille DME
IUSTI-CNRS UMR 6595
Technopôle de Château Gombert
5 rue Enrico Fermi
13453 Marseille cedex 13
Tel : (33) 4 91 10 69 39
 ou  (33) 4 91 10 69 43
Fax : (33) 4 91 10 69 69 



Re: [OMPI users] How to build OMPI with Checkpoint/restart.

2009-09-17 Thread Joshua Hursey


On Sep 16, 2009, at 8:30 AM, Marcin Stolarek wrote:


Hi,

It seems I solved my problem. The root of the error was that I hadn't
loaded the blcr module, so I couldn't checkpoint even a single-threaded
application.


I am glad to hear that you have things working now.


However, I still can't find MCA:blcr in ompi_info --all, even though it's working.


This may have been a red herring, sorry. I think ompi_info will only
show the 'none' component due to the way it searches for components in
the system. This is a bug in how the CRS selection logic plays with
ompi_info. I will make a note/file a bug to look into fixing it.
Unfortunately I do not have a workaround other than looking in the
install directory for the mca_crs_blcr.so file.
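
The manual check described above (looking in the install directory for mca_crs_blcr.so) could also be done programmatically. This is a minimal sketch, assuming the component lives under <install_dir>/lib/openmpi/ as in the paths quoted later in this thread; the function names are hypothetical and are not part of Open MPI.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Build "<install_dir>/lib/openmpi/mca_crs_blcr.so" into buf.
 * Returns 0 on success, -1 if buf is too small to hold the path. */
int blcr_component_path(const char *install_dir, char *buf, size_t len)
{
    int n;
    assert(install_dir != NULL && buf != NULL);
    n = snprintf(buf, len, "%s/lib/openmpi/mca_crs_blcr.so", install_dir);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Returns 1 if the BLCR CRS component file exists and is readable
 * under the given install directory, 0 otherwise. */
int blcr_component_installed(const char *install_dir)
{
    char path[4096];
    if (blcr_component_path(install_dir, path, sizeof path) != 0)
        return 0;
    return access(path, R_OK) == 0;
}
```

Calling blcr_component_installed() with the Open MPI install prefix only tells you that the file is present; it says nothing about whether the component will actually be selected at run time.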


-- Josh



marcin

2009/9/15 Marcin Stolarek 
Hi,

I've done everything from the beginning:

rm  -r $ompi_install
make clean
make
make install

In $ompi_install, I've got files you mentioned:
mstol@halo2:/home/guests/mstol/openmpi/lib/openmp# ls mca_crs_bl*
mca_crs_blcr.la  mca_crs_blcr.so

but, when I try:
# ompi_info -all | grep "crs:"
mstol@halo2:/home/guests/mstol/openmpi/openmpi-1.3.3# ompi_info --all | grep "crs:"
MCA crs: none (MCA v2.0, API v2.0, Component v1.3.3)
MCA crs: parameter "crs_base_verbose" (current value: "0", data source: default value)
MCA crs: parameter "crs" (current value: "none", data source: default value)
MCA crs: parameter "crs_none_select_warning" (current value: "0", data source: default value)
MCA crs: parameter "crs_none_priority" (current value: "0", data source: default value)


I don't have the crs: blcr component.

marcin

2009/9/14 Josh Hursey 

The config.log looked fine, so I think you have fixed the configure  
problem that you previously posted about.


Though the config.log indicates that the BLCR component is scheduled  
for compile, ompi_info does not indicate that it is available. I  
suspect that the error below is because the CRS could not find any  
CRS components to select (though there should have been an error  
displayed indicating as such).


I would check your Open MPI installation to make sure that it is the  
one that you configured with. Specifically I would check to make  
sure that in the installation location there are the following files:

$install_dir/lib/openmpi/mca_crs_blcr.so
$install_dir/lib/openmpi/mca_crs_blcr.la

If that checks out, then I would remove the old installation  
directory and try reinstalling fresh.


Let me know how it goes.

-- Josh



On Sep 13, 2009, at 5:49 AM, Marcin Stolarek wrote:

I've tried again. Here is what I get when trying to run
using 1.4a1r21964:


(terminus:~) mstol% mpirun --am ft-enable-cr ./a.out
--
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

opal_cr_init() failed failed
--> Returned value -1 instead of OPAL_SUCCESS
--
[terminus:06120] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file runtime/orte_init.c at line 79
--
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

ompi_mpi_init: orte_init failed
--> Returned "Error" (-1) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[terminus:6120] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--

I've included config.log and the ompi_info --all output in the attachment.
LD_LIBRARY_PATH is set correctly.
Any ideas?

marcin





2009/9/12 Marcin Stolarek 
Hi,
I'm trying to compile Open MPI with checkpoint/restart via BLCR.
I'm not sure which path I should set as the value of the --with-blcr option.

I'm using the 1.3.3 release; which version of BLCR should I use?

I've compiled the