Re: [OMPI users] How to parallelize the algorithm...

2016-10-01 Thread Constantinos Makassikis
Dear,

It seems to me you are trying to parallelize following a Master/Slave
paradigm.

Perhaps you may want to take a look at and test some existing MPI-based
libraries for achieving that:
- DMSM: https://www.hpcvl.org/faqs/dmsm-library
- TOP-C: http://www.ccs.neu.edu/home/gene/topc.html
- ADLB: https://www.cs.mtsu.edu/~rbutler/adlb/
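
If you prefer to code the pattern directly, the control flow described in the
question below can be written with a handful of MPI calls. Here is a rough,
untested sketch (stopping_condition() and local_search() are hypothetical
placeholders for your own routines): the master broadcasts a "keep going" flag
at the top of every outer iteration, so all workers repeat the inner search
until the master decides to stop.

#include <mpi.h>

/* Placeholders for the real application code: */
static int iteration = 0;
static int stopping_condition(void) { return ++iteration > 10; }  /* stop after 10 outer steps   */
static void local_search(int rank)  { (void)rank; }                /* one share of the inner loop */

int main(int argc, char **argv)
{
    int rank, keep_going = 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    while (1) {
        if (rank == 0)                           /* only the master evaluates the criterion   */
            keep_going = !stopping_condition();
        MPI_Bcast(&keep_going, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (!keep_going)                         /* everybody leaves the outer loop together  */
            break;

        local_search(rank);                      /* the parallel inner loop                   */
        /* ... typically an MPI_Gather or MPI_Reduce here so rank 0 sees the results ... */
    }

    MPI_Finalize();
    return 0;
}

The MPI_Bcast of the flag is the "signal from the master" you mention: every
outer iteration starts with it, and the workers never need to evaluate the
stopping criterion themselves.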


Kind Regards,
--
Constantinos


On Fri, Sep 30, 2016 at 5:36 PM, Barış Keçeci via users <
users@lists.open-mpi.org> wrote:

> Hello everyone,
> I'm trying to investigate the parallelization of an algorithm with Open MPI
> on a network of distributed computers. The network has one master PC and 24
> compute-node PCs. I'm quite a newbie in this field, but I managed to install
> Open MPI and to compile and run my first parallel codes on this platform.
>
> Now here is my question.
> The algorithm is a simple one: in the outer "for loop" it repeats until a
> stopping condition is met. The master PC should run this outer loop. In the
> "inner loop" a local search procedure is performed in parallel by the 24
> compute nodes. In other words, I want to parallelize the inner loop, since
> it is the most time-consuming part of my algorithm. I have already managed
> to code this part: I know the total number of steps of the inner loop, so I
> was able to parallelize that inner "for loop" over the distributed PCs. Now
> here is the problem. I want the master PC to repeat the main loop until a
> stopping criterion is met, and at each step it should distribute the inner
> loop over the 24 compute nodes. I don't have any idea how to do this. It
> appears I should make each compute node wait for a signal from the master
> and then repeat the inner loop over and over...
>
> I hope I have made it clear despite my poor English. I would appreciate it
> if anyone could help me or at least sketch the broad methodology.
>
> Best regards to all.
>
> Doctor Keceee
>

Re: [OMPI users] Checkpointing an MPI application with OMPI

2013-01-30 Thread Constantinos Makassikis
On Wed, Jan 30, 2013 at 3:02 AM, Ralph Castain  wrote:

>
> If your node hardware is the problem, or you decide you do want/need to
> pursue an FT solution, then you might look at the OMPI-based solutions from
> parties such as http://fault-tolerance.org or the MPICH2 folks.
>

Just as Ralph said, you may look into alternatives. From what I have seen,
MPICH2 provides fault tolerance using BLCR.
The same goes for Intel's MPI (
http://software.intel.com/en-us/forums/topic/296300). Though not free, you
may try it during
a 30-day evaluation period (
http://software.intel.com/en-us/intel-mpi-library/).
It could be interesting to see how the two MPI implementations fare with respect to BLCR-based FT.

Another alternative which may be worth considering is DMTCP (
http://dmtcp.sourceforge.net/) from Northeastern University
for which there has been an interesting podcast recently (
http://www.rce-cast.com/Podcast/rce-76-distributed-multithreaded-checkpointing.html)
:-)

Finally, depending on the application, you may be interested in adding
checkpoint-based fault tolerance at the application level with the help of
libraries such as SCR (http://sourceforge.net/projects/scalablecr/). Though
you'll need to spend some time modifying the application source code,
it may be better than system-level alternatives in the long run.
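
To give a rough idea of what application-level checkpointing means, here is a
minimal sketch in plain C (not the SCR API; the file naming and the state
layout are made up for the example): every N iterations each rank writes the
data it needs to restart into a per-rank file.

#include <stdio.h>

/* Minimal sketch: dump the iteration counter and the local state of one rank. */
static int write_checkpoint(int rank, int iter, const double *state, int n)
{
    char name[64];
    FILE *f;

    sprintf(name, "ckpt_rank%d.bin", rank);       /* hypothetical per-rank file name */
    f = fopen(name, "wb");
    if (f == NULL)
        return -1;
    fwrite(&iter, sizeof(int), 1, f);             /* where to resume from            */
    fwrite(state, sizeof(double), (size_t)n, f);  /* the data needed to resume       */
    fclose(f);
    return 0;
}

On restart, each rank reads its file back instead of starting from iteration 0;
libraries such as SCR then take care of where those files are stored and how
they survive node failures.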

--
Constantinos


Re: [OMPI users] mpirun command gives ERROR

2012-07-19 Thread Constantinos Makassikis
Hi Paul,

can you give us some additional information:
- your compilation command line?
- the application you're trying to run?

Just in case, did you try checking points 1 and 2 of the error message
in the application you're trying to run?

--
Constantinos

On Thu, Jul 19, 2012 at 07:34:50PM +0800, Abhra Paul wrote:
>Respected developers and users
>I am trying to run a parallel program CPMD with the command "
>/usr/local/bin/mpirun -np 4 ./cpmd.x 1-h2-wave.inp > 1-h2-wave.out &" , it
>is giving the following error:
>
> ==
>[testcpmd@slater CPMD_3_15_3]$ /usr/local/bin/mpirun -np 4 ./cpmd.x
>1-h2-wave.inp > 1-h2-wave.out &
>[1] 1769
>[testcpmd@slater CPMD_3_15_3]$
>--
>MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>with errorcode 999.
> 
>NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>You may or may not see output from other processes, depending on
>exactly when Open MPI kills them.
>--
>--
>mpirun has exited due to process rank 0 with PID 1770 on
>node slater.rcamos.iacs exiting improperly. There are two reasons this
>could occur:
> 
>1. this process did not call "init" before exiting, but others in
>the job did. This can cause a job to hang indefinitely while it waits
>for all processes to call "init". By rule, if one process calls "init",
>then ALL processes must call "init" prior to termination.
> 
>2. this process called "init", but exited without calling "finalize".
>By rule, all processes that call "init" MUST call "finalize" prior to
>exiting or it will be considered an "abnormal termination"
> 
>This may have caused other processes in the application to be
>terminated by signals sent by mpirun (as reported here).
>--
> 
>[1]+  Exit 231    /usr/local/bin/mpirun -np 4 ./cpmd.x
>1-h2-wave.inp > 1-h2-wave.out
>
> ==
>I am unable to find out the reason of that error. Please help. My Open-MPI
>version is 1.6.
>With regards
>Abhra Paul




Re: [OMPI users] Beginners Tutorial Link broken

2012-07-15 Thread Constantinos Makassikis

Hi Alexander,

Indeed, the link is inaccessible. If you try only
http://webct.ncsa.uiuc.edu
you get redirected to http://www.citutor.org/login.php.
You need to register to access the tutorial materials.

A preview of the available materials appears at
http://www.citutor.org/browse.php

Probably the FAQ entry should be updated ...


HTH,

--
Constantinos

On Sat, Jul 14, 2012 at 07:34:08PM +0200, Alexander Brock wrote:
> Hi,
> 
> The FAQ[0] states that [1] is a great place for beginners;
> unfortunately the site is down[2]. Is there a mirror?
> 
> Alexander
> 
> [0] http://www.open-mpi.org/faq/?category=all#tutorials
> [1] http://webct.ncsa.uiuc.edu:8900/public/MPI/
> [2]
> http://www.downforeveryoneorjustme.com/webct.ncsa.uiuc.edu:8900/public/MPI/
> 






Re: [OMPI users] checkpointing/restart of hpl

2012-06-04 Thread Constantinos Makassikis
Hi,

you may start by looking at http://www.open-mpi.org/faq/?category=ft
which leads you to
https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR
and
http://osl.iu.edu/research/ft/ompi-cr/

The latter is the most up-to-date link and probably what you are looking
for.


HTH,

--
Constantinos

On Mon, Jun 4, 2012 at 3:24 AM, Ifeanyi  wrote:

> Dear,
>
> I am a new user of Open MPI. I have already installed Open MPI and built
> hpl. I want to checkpoint/restart hpl and compare its performance.
>
> Please can you point me to a useful link that will guide me through this
> checkpointing of hpl.
>
> thanks in advance.
>
> Regards,
> Ifeanyi
>
>


Re: [OMPI users] MPI books and resources

2012-05-12 Thread Constantinos Makassikis
You may be interested in:

- the MPI Standard (http://www.mpi-forum.org/docs/docs.html)

- the book chapter written by J. Squyres (see
http://www.open-mpi.org/community/lists/devel/2012/05/10981.php)

and

- a Scuba Diving Kit ... to dive into ... open source code :-D


--
Constantinos




On Sat, May 12, 2012 at 12:18 PM, Rohan Deshpande wrote:

> Hi,
>
> Can anyone point me to good resources and books to understand the detailed
> architecture and working of MPI.
>
> I would like to know all the minute details.
>
> Thanks,
>
> --
>
>
>
>
>
>


Re: [OMPI users] Regarding MPI programming

2012-04-23 Thread Constantinos Makassikis
Assuming the type of the elements in the array is known, you'll probably
have to do it in two steps:
1) Broadcast the number of elements in the array
2) Broadcast the array itself
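
As a rough illustration (assuming the sub-array holds ints; 'root' is the rank
that owns the data and the helper name below is made up):

#include <stdlib.h>
#include <mpi.h>

/* Returns a pointer to the array on every rank; on non-root ranks it is allocated here. */
int *bcast_array(int *array, int *count, int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_Bcast(count, 1, MPI_INT, root, comm);          /* 1) broadcast the number of elements */
    if (rank != root)
        array = malloc((size_t)*count * sizeof(int));  /* receivers allocate the buffer       */
    MPI_Bcast(array, *count, MPI_INT, root, comm);     /* 2) broadcast the array itself       */
    return array;
}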


HTH,

--
Constantinos

On Mon, Apr 23, 2012 at 12:41 PM, seshendra seshu wrote:

> Hi,
> I am using stacks, where I store my sub-arrays, and I need to send the
> sub-arrays to all the nodes. But I have no idea what the size of the array
> on the stack is, so how can I receive the data using MPI_Recv without
> knowing the size of the array? Can anyone please help me solve this?
>
>
> --
>  WITH REGARDS
> M.L.N.Seshendra
>


Re: [OMPI users] mpicc command not found - Fedora

2012-03-30 Thread Constantinos Makassikis
On Fri, Mar 30, 2012 at 2:39 PM, Rohan Deshpande  wrote:

> Hi,
>
> I do not know how to use *ortecc*.
>
The same way as mpicc. Actually on my machine they both are symlinks to
"opal_wrapper".
Your second screenshot suggests orte* commands have been installed.


> After looking at the details i found that* yum install did not install
> openmpi-devel package. *
>
> yum cannot find it either - *yum search openmpi-devel says not match
> found.*
>
> I am using* Red Hat 6.2 and i686 processors. *
>

> which mpicc shows -
>
> *which: no mpicc in
> (/usr/lib/qt-3.3/bin:/usr/local/ns-allinone/bin:/usr/local/ns-allinone/tcl8.4.18/unix:/usr/local/ns-allinone/tk8.4.18/unix:/usr/local/cuda/cuda/bin:/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/lib/openmpi/bin)
> *
>
> rpmquery -l openmpi-devel   *says package not installed*
>
> What could be the possible solution?
>

1) If ortecc is indeed present you can test it.
If it works you may manually create some symlinks of your own:
  ln -s /path1/ortecc /path2/mpicc
  ln -s /path1/orterun /path2/mpirun
where /path2 is a directory in your PATH.
Maybe the fastest ... but not the cleanest :-)

2) Fix the Red Hat package ...
May take some time ...

3) As Amit suggested earlier, you can also download Open MPI's source,
then compile and install it!

--
Constantinos




>
> On Fri, Mar 30, 2012 at 2:05 AM, Amit Ghadge  wrote:
>
>> You can try the source package. Extract it and run ./configure
>> --prefix=/usr/local, then make all and make install; after that you can
>> compile any MPI program using mpicc.
>>  On 29-Mar-2012 7:26 PM, "Jeffrey Squyres"  wrote:
>>
>>> I don't know exactly how Fedora packages Open MPI, but I've seen some
>>> distributions separate Open MPI into a base package and a "devel" package.
>>>  And mpicc (and some friends) are split off into that "devel" package.
>>>
>>> The rationale is that you don't need mpicc (and friends) to *run* Open
>>> MPI applications -- you only need mpicc (etc.) to *develop* Open MPI
>>> applications.
>>>
>>> Poke around and see if you can find a devel-like Open MPI package in
>>> Fedora.
>>>
>>>
>>> On Mar 29, 2012, at 7:45 AM, Rohan Deshpande wrote:
>>>
>>> > Hi,
>>> >
>>> > I have installed mpi successfully on fedora using yum install openmpi
>>> openmpi-devel openmpi-libs
>>> >
>>> > I have also added /usr/lib/openmpi/bin to PATH and LD_LIBRARY_PATH
>>> variable.
>>> >
>>> > But when I try to compile my program using mpicc hello.c or
>>> > /usr/lib/openmpi/bin/mpicc hello.c I get an error saying mpicc: command not
>>> > found
>>> >
>>> > I checked the contents of /user/lib/openmpi/bin and there is no
>>> mpicc... here is the screenshot
>>> >
>>> > 
>>> >
>>> >
>>> > The add/remove  programs show the installation details
>>> >
>>> >  
>>> >
>>> > I have tried re installing but same thing happened.
>>> >
>>> > Can someone help me to solve this issue?
>>> >
>>> > Thanks
>>> > --
>>> >
>>> > Best Regards,
>>> >
>>> > ROHAN
>>> >
>>> >
>>> >
>>>
>>>
>>> --
>>> Jeff Squyres
>>> jsquy...@cisco.com
>>> For corporate legal information go to:
>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>
>>>
>
>
>
> --
>
> Best Regards,
>
> ROHAN DESHPANDE
>
>
>
>


Re: [OMPI users] mpicc command not found - Fedora

2012-03-29 Thread Constantinos Makassikis
Hi,

Did you try the ortecc command directly?
Perhaps the RPM package you're installing does not create symlinks from
ortecc to mpicc ...

--
Constantinos


On Thu, Mar 29, 2012 at 1:45 PM, Rohan Deshpande  wrote:

> Hi,
>
> I have installed MPI successfully on Fedora using "yum install openmpi
> openmpi-devel openmpi-libs".
>
> I have also added /usr/lib/openmpi/bin to the PATH and LD_LIBRARY_PATH
> variables.
>
> But when I try to compile my program using "mpicc hello.c" or
> "/usr/lib/openmpi/bin/mpicc hello.c" I get an error saying "mpicc: command
> not found".
>
> I checked the contents of /user/lib/openmpi/bin and there is no mpicc...
> here is the screenshot
>
> [image: Inline image 2]
>
>
> The add/remove  programs show the installation details
>
>  [image: Inline image 3]
>
> I have tried re installing but same thing happened.
>
> Can someone help me to solve this issue?
>
> Thanks
> --
>
> Best Regards,
>
> ROHAN
>
>
>
>


Re: [OMPI users] mpi & mac

2011-07-06 Thread Constantinos Makassikis
On Tue, Jul 5, 2011 at 9:48 PM, Robert Sacker  wrote:

> Hi all,
>
Hello !

> I need some help. I'm trying to run C++ code in Xcode on a Mac Pro Desktop
> (OS 10.6) and utilize all 8 cores . My ultimate goal is to be able to run
> the code on the cluster here on campus. I'm in the process of
> converting into C++ the number crunching part of the stuff I previously
> wrote in Matlab.
> Is there some documentation that explains how to get started? Thanks. Bob
>

I am not sure whether this is the relevant mailing list for
general parallelization questions ...

In any case, before converting your Matlab code to C++ try using
parallelization features that come with Matlab.

Otherwise, after translating your Matlab code to C++, you should
first consider getting acquainted with OpenMP and using it to speed
up your code on your 8-core machine.
OpenMP can be rather straightforward to apply, as in the rough sketch below.
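
A minimal sketch (assuming the number crunching is a loop over independent
elements; compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp):

#include <omp.h>

void crunch(double *data, int n)
{
    int i;

    /* the iterations of the loop are split over the available cores */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        data[i] = data[i] * data[i];   /* placeholder for the real per-element work */
}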

Afterwards, if necessary, you may look into parallelizing over multiple
machines with Open MPI.

Someone recommended an interesting link to online courses on parallelism
(OpenMP and Open MPI) in the following post:
http://www.open-mpi.org/community/lists/users/2010/01/11922.php


HTH,

--
Constantinos



> "Do everything in moderation--including moderation."
>
> Robert J. Sacker, PhD
> Professor
> Department of Mathematics
> University of Southern California
> 3620 S. Vermont Ave., KAP 108
> Los Angeles, CA 90089-2532(office KAP 120B)
> (213) 740-3793,  (213) 740-2424 FAX
> http://www-bcf.usc.edu/~rsacker
> http://www.scholarpedia.org/article/Neimark-Sacker_bifurcation
> Editor, Short Notes, Journal of Difference Equations and Applications
> http://www.tandf.co.uk/journals/titles/10236198.asp
>
>
>
>
>
>
>
>
>


[OMPI users] Meaning of ./configure --with-ft=LAM option

2011-06-20 Thread Constantinos Makassikis
Hi everyone !

I have started playing a little bit with Open MPI 1.5.3 and
came across the "--with-ft=LAM" option in the ./configure script.

However, I did not find its meaning anywhere in the documentation ...


Best regards,

--
Constantinos


Re: [OMPI users] Deadlock with barrier and RMA

2011-06-11 Thread Constantinos Makassikis
On Sat, Jun 11, 2011 at 5:17 PM, Ole Kliemann <
ole-ompi-2...@mail.plastictree.net> wrote:

> On Sat, Jun 11, 2011 at 07:24:24AM -0600, Ralph Castain wrote:
> > Oh my - that is such an old version! Any reason for using it instead of
> something more recent?
>
> I'm using the cluster of the university where I work and I'm not the
> admin. So I'm going with what is installed there.
>

Provided your account is available on all the nodes of the cluster
(commonly through a shared filesystem, e.g. NFS),
you can easily install and use a more recent version of Open MPI.

mkdir -p ${HOME}/ompi-1.5.3 && ./configure --prefix=${HOME}/ompi-1.5.3
make
make install

You should not forget to modify your "PATH" and "LD_LIBRARY_PATH"
environment variables in your ".bash_profile".


>
> It's the first time I'm using MPI. Before I complain to the admins about
> old versions or anything else, I'd like to check whether my code
> actually should be okay in regard to MPI specifications.
>
> > On Jun 11, 2011, at 8:43 AM, Ole Kliemann wrote:
> >
> > > Hi everyone!
> > >
> > > I'm trying to use MPI on a cluster running OpenMPI 1.2.4 and starting
> > > processes through PBSPro_11.0.2.110766. I've been running into a couple
> > > of performance and deadlock problems and like to check whether I'm
> > > making a mistake.
> > >
> > > One of the deadlocks I managed to boil down to the attached example. I
> > > run it on 8 cores. It usually deadlocks with all except one process
> > > showing
> > >
> > > start barrier
> > >
> > > as last output.
> > >
> > > The one process out of order shows:
> > >
> > > start getting local
> > >
> > > My question at this point is simply whether this is expected behaviour
> > > of OpenMPI.
> > >
> > > Thanks in advance!
> > > Ole


Re: [OMPI users] Test OpenMPI on a cluster

2010-01-31 Thread Constantinos Makassikis

Tim wrote:
Hi,

I am learning MPI on a cluster. Here is one simple example. I expect the output to show responses from different nodes, but they all respond from the same node, node062. I just wonder why, and how I can actually get reports from different nodes to show that MPI actually distributes processes to different nodes? Thanks and regards!

ex1.c  
  
/* test of MPI */
#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
char idstr[2232]; char buff[22128];
char processor_name[MPI_MAX_PROCESSOR_NAME];
int numprocs; int myid; int i; int namelen;
MPI_Status stat;

MPI_Init(&argc, &argv);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myid);
MPI_Get_processor_name(processor_name, &namelen);

if(myid == 0)
{
  printf("WE have %d processors\n", numprocs);
  for(i=1;i<numprocs;i++)
  {
    sprintf(buff, "Hello %d", i);
    MPI_Send(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD);
  }
  for(i=1;i<numprocs;i++)
  {
    MPI_Recv(buff, 128, MPI_CHAR, i, 0, MPI_COMM_WORLD, &stat);
    printf("%s\n", buff);
  }
}
else
{
  MPI_Recv(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &stat);
  sprintf(idstr, " Processor %d at node %s ", myid, processor_name);
  strcat(buff, idstr);
  strcat(buff, "reporting for duty\n");
  MPI_Send(buff, 128, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
}
MPI_Finalize();

return 0;
}
  
ex1.pbs  
  
#!/bin/sh  
#  
#This is an example script example.sh  
#  
#These commands set up the Grid Environment for your job:  
#PBS -N ex1  
#PBS -l nodes=10:ppn=1,walltime=1:10:00  
#PBS -q dque  
  
# export OMP_NUM_THREADS=4  
  
 mpirun -np 10 /home/tim/courses/MPI/examples/ex1  
  

Try running your program with the following:

mpirun -np 10 -machinefile machines /home/tim/courses/MPI/examples/ex1 


where 'machines' is a file containing the names of your nodes (one per line)

node063
node064
...
node177


HTH,

--
Constantinos Makassikis

  
compile and run:


[tim@user1 examples]$ mpicc ./ex1.c -o ex1   
[tim@user1 examples]$ qsub ex1.pbs  
35540.mgt  
[tim@user1 examples]$ nano ex1.o35540  
  
Begin PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883  
Job ID: 35540.mgt  
Username:   tim  
Group:  Brown  
Nodes:  node062 node063 node169 node170 node171 node172 node174 node175  
node176 node177  
End PBS Prologue Sat Jan 30 21:28:03 EST 2010 1264904883  
  
WE have 10 processors  
Hello 1 Processor 1 at node node062 reporting for duty  
  
Hello 2 Processor 2 at node node062 reporting for duty  
  
Hello 3 Processor 3 at node node062 reporting for duty  
  
Hello 4 Processor 4 at node node062 reporting for duty  
  
Hello 5 Processor 5 at node node062 reporting for duty  
  
Hello 6 Processor 6 at node node062 reporting for duty  
  
Hello 7 Processor 7 at node node062 reporting for duty  
  
Hello 8 Processor 8 at node node062 reporting for duty  
  
Hello 9 Processor 9 at node node062 reporting for duty  
  
  
Begin PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891  
Job ID: 35540.mgt  
Username:   tim  
Group:  Brown  
Job Name:   ex1  
Session:15533  
Limits: neednodes=10:ppn=1,nodes=10:ppn=1,walltime=01:10:00  
Resources:  cput=00:00:00,mem=420kb,vmem=8216kb,walltime=00:00:03  
Queue:  dque  
Account:  
Nodes:  node062 node063 node169 node170 node171 node172 node174 node175 node176  
node177  
Killing leftovers...  
  
End PBS Epilogue Sat Jan 30 21:28:11 EST 2010 1264904891  





  




Re: [OMPI users] Hanging vs Stopping behaviour in communication failures

2009-12-14 Thread Constantinos Makassikis

Jeff Squyres wrote:

On Dec 9, 2009, at 3:47 AM, Constantinos Makassikis wrote:

  

sometimes when running Open MPI jobs, the application hangs. By looking at the
output I get the following error message:

[ic17][[34562,1],74][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)


I would expect Open MPI to eventually quit with an error in such situations.
Is the observed behaviour (i.e.: hanging) the intended one?



That does seem weird.  I would think that we should abort rather than hang.  
But then again, the code is fairly hairy there -- there are many corner cases.

  

If so, what would be the reason(s) behind choosing the hanging over the
stopping?



It *looks* like the code is supposed to retry the connection here, but perhaps 
something is not operating correctly (or perhaps it *is* trying to reconnect 
and the network is failing to reconnect for some reason...?).
  

I don't really know whether it is trying to reconnect. What is sure is
that the last time it happened, the destination node could indeed not be
reached (i.e.: I couldn't ssh to it, nor did it respond to ping).
How often does this happen?  Is it in the middle of the application run, or at the very beginning?  

It did not happen very often: only after long and intensive usage of the
nodes. As for when during the application's execution it happens, I couldn't
figure it out. Maybe it would be a good idea to modify the source code so
that I keep track of the progress.

Do you have any other network issues where connections get dropped, etc.?  Do 
you have any firewalls running on your cluster machines

To my knowledge there haven't been any other network issues.
There are no firewalls.

I don't know if the current information is sufficient to answer with
certainty. I am going to try and look for more info whenever it occurs
again. About that, are there some options I could use in Open MPI to
gather more info? Are there any other things I should pay attention
to?

Thanks for your help,

--
Constantinos


[OMPI users] Hanging vs Stopping behaviour in communication failures

2009-12-09 Thread Constantinos Makassikis

Dear all,

sometimes when running Open MPI jobs, the application hangs. By looking at the
output I get the following error message:

[ic17][[34562,1],74][../../../../../ompi/mca/btl/tcp/btl_tcp_frag.c:216:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: No route to host (113)


I would expect Open MPI to eventually quit with an error in such situations.
Is the observed behaviour (i.e.: hanging) the intended one?

If so, what would be the reason(s) behind choosing the hanging over the
stopping?



Best Regards,

--
Constantinos


Re: [OMPI users] Changing location where checkpoints are saved

2009-11-18 Thread Constantinos Makassikis

Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 
values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers 
indicate higher priority ?


By searching in the archives of the mailing list I found two 
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php 
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php 
(for restarting)


Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process 
PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of 
jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: 
ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with 
status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with 
status 3
-- 


WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

-- 


[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file 
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054


This is a warning about creating the global snapshot directory 
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It 
seems to indicate that the directory existed when the file gather 
started.


A couple things to check:
 - Did you clean out the /tmp on all of the nodes with any files 
starting with "opal" or "ompi"?
 - Does the error go away when you set 
(snapc_base_global_snapshot_dir=$HOME)?
 - Could you try running against a v1.3 release? (I wonder if this 
feature has been broken on the trunk)


Let me know what you find. In the next couple days, I'll try to test 
the trunk again with this feature to make sure that it is still 
working on my test machines.


-- Josh

Hello Josh,

I have switched to v1.3 and re-run with 
snapc_base_global_snapshot_dir=/tmp or $HOME

with a clean /tmp.

In both cases I get the same error as before :-(

I don't know if the following can be of any help, but after ompi-checkpoint
returns there is only a copy of the checkpoint of the process of rank 0 in
the global snapshot directory:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_.ckpt/0

So I guess the error occurs during the remote copy phase.

--
Constantinos







Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage 
below:

 https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in 
the past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS 
account. 

Re: [OMPI users] error in checkpointing an mpi application

2009-10-01 Thread Constantinos Makassikis

Hi,

from what you describe below, it seems as if you did not configure
Open MPI correctly.


You issued

./configure --with-ft=cr --enable-mpi-threads --with-blcr=/usr/local/bin 
--with-blcr-libdir=/usr/local/lib

while according to the installation paths you gave it should have been 
more like


./configure --with-ft=cr --enable-mpi-threads --with-blcr=/root/MS 
--with-blcr-libdir=/root/MS/lib



Apart from that, if you wish to have the BLCR modules loaded at start-up of
your machine, a simple way is to add the following lines to rc.local.
This file is somewhere in /etc: the exact location can vary from one Linux
distribution to another (e.g.: /etc/rc.d/rc.local or /etc/rc.local).


/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko 
/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko


Just in case, if you have multiple MPIs installed, you can check which 
you are using with the following command:


which mpirun


HTH,

--
Constantinos

Mallikarjuna Shastry wrote:

 Dear sir,


I am sending the details as follows:


1. I am using openmpi-1.3.3 and blcr 0.8.2
2. I have installed blcr 0.8.2 first under /root/MS

3. Then I installed openmpi 1.3.3 under /root/MS
4. I have configured and installed Open MPI as follows:

# ./configure --with-ft=cr --enable-mpi-threads --with-blcr=/usr/local/bin
--with-blcr-libdir=/usr/local/lib
# make
# make install


Then I added the following to the .bash_profile under the home directory (I went to
the home directory by doing cd ~):

/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr_imports.ko 
/sbin/insmod /usr/local/lib/blcr/2.6.23.1-42.fc8/blcr.ko 
PATH=$PATH:/usr/local/bin

MANPATH=$MANPATH:/usr/local/man
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib

Then I compiled and ran the file arr_add.c as follows:

[root@localhost examples]# mpicc -o res arr_add.c
[root@localhost examples]# mpirun -np 2 -am ft-enable-cr ./res

2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
--
Error: The process with PID 5790 is not checkpointable.
   This could be due to one of the following:
- An application with this PID doesn't currently exist
- The application with this PID isn't checkpointable
- The application with this PID isn't an OPAL application.
   We were looking for the named files:
 /tmp/opal_cr_prog_write.5790
 /tmp/opal_cr_prog_read.5790
--
[localhost.localdomain:05788] local) Error: Unable to initiate the handshake 
with peer [[7788,1],1]. -1
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file 
snapc_full_global.c at line 567
[localhost.localdomain:05788] [[7788,0],0] ORTE_ERROR_LOG: Error in file 
snapc_full_global.c at line 1054
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2
2   2   2   2   2   2   2   2   2   2


NOTE: the PID of mpirun is 5788

I gave the following command for taking the checkpoint:

[root@localhost examples]# ompi-checkpoint -s 5788

I got the following output, but it was hanging like this:

[localhost.localdomain:05796] Requested - Global Snapshot 
Reference: (null)
[localhost.localdomain:05796]   Pending - Global Snapshot 
Reference: (null)
[localhost.localdomain:05796]   Running - Global Snapshot 
Reference: (null)


Can anybody resolve this problem?
Kindly rectify it.


with regards

mallikarjuna shastry



  




Re: [OMPI users] Changing location where checkpoints are saved

2009-09-30 Thread Constantinos Makassikis

Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g.: crs_base_verbose) limited to 0 and 1 values ?
- in priority options (e.g.: crs_blcr_priority) do lower numbers 
indicate higher priority ?


By searching in the archives of the mailing list I found two 
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php 
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php 
(for restarting)


Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process PID 
17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of 
jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: (null)
[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: 
ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with status 3
--
WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

--
[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file 
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054




Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:

This is described in the C/R User's Guide attached to the webpage below:
  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in the 
past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

> I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account.
> By default, it seems that checkpoints are saved in $HOME. However, I would
> prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves 
checkpoints?



Best regards,

--
Constantinos





[OMPI users] Changing location where checkpoints are saved

2009-09-18 Thread Constantinos Makassikis

Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS account. By
default, it seems that checkpoints are saved in $HOME. However, I would
prefer them to be saved on a local disk (e.g.: /tmp).

Does anyone know how I can change the location where Open MPI saves 
checkpoints?



Best regards,

--
Constantinos