[OMPI users] Error while building OpenMPI on Itanium cluster with Myrinet

2006-04-14 Thread Aniruddha Shet
Hi,

I am encountering an error while building OpenMPI on a cluster with Itanium
processors and Myrinet. Please find attached a tar with configure and make
traces.

Thanks,
Aniruddha

--
Aniruddha Shet | Project webpage: http://forge-fre.ornl.gov/molar/index.html
Graduate Research Associate | Project webpage: www.cs.unm.edu/~fastos
Dept. of Comp. Sci. & Engg | Personal webpage: www.cse.ohio-state.edu/~shet
The Ohio State University | Office: DL 474
2015 Neil Avenue | Phone: +1 (614) 292 7036
Columbus OH 43210-1277 | Cell: +1 (614) 446 1630

--


ompi_output.tar.gz
Description: Binary data


Re: [OMPI users] Open MPI error

2006-04-14 Thread Prakash Velayutham

OK. Figured out that it was the wrong number of arguments to the code.

Thanks,
Prakash

Jeff Squyres (jsquyres) wrote:

I'm assuming that this is during the startup shortly after mpirun,
right?  (i.e., during MPI_INIT)

It looks like MPI processes were unable to connect back to the
rendezvous point (mpirun) during startup.  Do you have any firewalls or
port blocking running in your cluster?
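
As an aside, the numeric errno values in the quoted log below are ordinary
socket error codes (on Linux, 104 is ECONNRESET and 111 is ECONNREFUSED).
A minimal sketch, assuming Linux errno numbering, that decodes them with
strerror():

#include <stdio.h>
#include <string.h>

int main(void)
{
    const int codes[] = { 104, 111 };   /* values seen in the log below */
    for (int i = 0; i < 2; ++i) {
        printf("errno=%d: %s\n", codes[i], strerror(codes[i]));
    }
    return 0;
}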
 

  

-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On Behalf Of Prakash Velayutham

Sent: Friday, April 14, 2006 11:00 AM
To: us...@open-mpi.org
Cc: Prakash Velayutham
Subject: [OMPI users] Open MPI error

Hi All,

What does this error mean?

**

socket 10: [wins02:19102] [0,0,3]-[0,0,0] mca_oob_tcp_msg_recv: readv
failed with errno=104
socket 12: [wins01:19281] [0,0,4]-[0,0,0] mca_oob_tcp_msg_recv: readv
failed with errno=104
socket 6: [wins05:00939] [0,0,1]-[0,0,0] mca_oob_tcp_msg_send_handler:
writev failed with errno=104
socket 6: [wins05:00939] [0,0,1] ORTE_ERROR_LOG: Communication failure
in file gpr_proxy_put_get.c at line 143
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
socket 6: [wins05:00939] mca_oob_tcp_peer_timer_handler
socket 6: [wins05:00939] [0,0,1]-[0,0,0]
mca_oob_tcp_peer_complete_connect: connection failed (errno=111) -
retrying (pid=939)
**
*

I am still debugging the code I am working on, but just wanted to get
some insight into where I should be looking.

I am running openmpi-1.0.1.

Thanks,
Prakash


[OMPI users] OMPI and Xgrid

2006-04-14 Thread Warner Yuen
I did get MrBayes to run with Xgrid, compiled with OpenMPI. However, it
was set up as more of a "traditional" cluster. The agents all have a
shared NFS directory to the controller. Basically I'm only using
Xgrid as a job scheduler. MrBayes doesn't seem to be a "grid"
application so much as an application for a traditional cluster.


You will need to have the following enabled:

1) NFS shared directory across all the machines on the grid.

2) Open-MPI installed locally on all the machines or via NFS. (You'll  
need to compile Open MPI)


3) Here's the part that may make Xgrid not desirable to use for MPI  
applications:


a) Compile with MPI support:

MPI = yes
CC= $(MPIPATH)/bin/mpicc
CFLAGS = -fast

	b) Make sure that Xgrid is set to properly use password-based  
authentication.


	c) Set the environment variables for Open-MPI to use Xgrid as the
launcher/scheduler. Assuming bash:


$ export XGRID_CONTROLLER_HOSTNAME=mycomputer.apple.com
$ export XGRID_CONTROLLER_PASSWORD=passwd

You could also add the above to a .bashrc file and have  
your .bash_profile source it.


d) Run the MPI application:

$ mpirun -np X ./myapp

There are a couple of issues:

It turns out that the directory and files that MrBayes creates must
be readable and writable by all the agents. MrBayes does more than
just read and write standard input/output; it also creates and writes
other intermediate files. For an application like HP Linpack that
just reads and writes one file, things work fine. However, MrBayes
writes out and reads back two additional files for each MPI process
that is spawned.


All the files that MrBayes is trying to read/write must have
permissions for user 'nobody'. This is a bit of a problem, since
you probably (in general) don't want to allow user nobody to write
all over your home directory. One solution (if possible) would be to
have the application write into /tmp and then collect the files after
the job completes. But I don't know if you can set MrBayes to use a
temporary directory. Perhaps your MrBayes customer can let us know
how to specify a tmpdir.


I don't know how, or if, MrBayes has the option of specifying a temp
working directory. I have tested the basics of this by executing an
MPI command to copy the *.nex file to /tmp on all the agents. This
seems to allow everything to work, but I can't easily clean the
intermediate files off of the agents afterward, since the MrBayes
application created them and the user doesn't own them.


I'm hoping the OMPI developers can come to the rescue on some of  
these issues, perhaps working in conjunction with some of the Apple  
Xgrid engineers.


Lastly, this is from one of the MrBayes folks:

"Getting help with Xgrid among the phylo community will probably be  
difficult.

Fredrik can't help and probably not anyone with CIPRES either.  Fredrik
recommends mpi since it is unix based and more people use it.

He also does not recommend setting up a cluster in your lab to run  
MrBayes.
This is because of a fault with MrBayes. The way it is currently set
up, the runs are only as fast as the slowest machine, so if someone
sits down to use a machine in the cluster, everything is processed at
that speed. Here we use MPI for parallel runs and Condor to distribute
non-parallel ones.


And frankly, MrBayes can be somewhat unstable with mpi and seems to
get hung up on occasion.

Unfortunately for you, I think running large jobs will be a lot easier
in a couple of years."

-Warner

Warner Yuen
Apple Computer
email: wy...@apple.com
Tel: 408.718.2859
Fax: 408.715.0133


On Apr 14, 2006, at 8:52 AM, users-requ...@open-mpi.org wrote:


Message: 2
Date: Thu, 13 Apr 2006 14:33:29 -0400 (EDT)
From: liuli...@stat.ohio-state.edu
Subject: Re: [OMPI users] running a job problem
To: "Open MPI Users" 
Message-ID:
<1122.164.107.248.223.1144953209.squir...@www.stat.ohio-state.edu>
Content-Type: text/plain;charset=iso-8859-1

Brian,
It worked when I used the latest version of MrBayes. Thanks. By the
way, do you have any idea how to submit an OMPI job on Xgrid? Thanks
again.
Liang


On Apr 12, 2006, at 9:09 AM, liuli...@stat.ohio-state.edu wrote:

We have a Mac network running xgrid and we have successfully installed
mpi. We want to run a parallel version of MrBayes. There was no problem
when we compiled MrBayes using mpicc, but when we tried to run the
compiled MrBayes, we got lots of error messages:

mpiexec -np 4 ./mb -i  yeast_noclock_imp.txt
  Parallel version of

  Parallel version of

  Parallel version of

  Parallel version of

[ea285fltprinter.scc.ohio-state.edu:03327] *** An error occurred in
MPI_comm_size
[ea285fltprinter.scc.ohio-state.edu:03327] *** on communicator
MPI_COMM_WORLD

Re: [OMPI users] Incorrect behavior for attributes attached to MPI_COMM_SELF.

2006-04-14 Thread Jeff Squyres (jsquyres)
Martin --
 
We finally figured out the Right solution and have committed it to all
three SVN branches:
 
- trunk (head of development)
- v1.1 (recently-created branch for the upcoming v1.1)
- v1.0 (stable release)
 
The fix is in the nightly tarballs for the trunk and v1.1; due to a
different problem, it didn't make it into last night's v1.0 snapshot but
it should be there tonight.
 
Thanks for finding this!




From: users-boun...@open-mpi.org
[mailto:users-boun...@open-mpi.org] On Behalf Of Audet, Martin
Sent: Monday, April 10, 2006 4:34 PM
To: us...@open-mpi.org
Subject: [OMPI users] Incorrect behavior for attributes attached
to MPI_COMM_SELF.



Hi,

It looks like there is a problem in OpenMPI 1.0.2 with how
MPI_COMM_SELF attribute callback functions are handled by
MPI_Finalize().

The following C program registers a callback function, associated
with the MPI_COMM_SELF communicator, to be called during the first
steps of MPI_Finalize(). As shown in this example, this can be used
to make sure that global MPI_Datatype variables are freed by calling
MPI_Type_free() before program exit (thus preventing ugly memory
leaks/outstanding allocations when run under valgrind, for example).
This mechanism is used by the library I'm working on as well as by
the PETSc library.

The program works by taking advantage of the MPI-2 Standard,
Section 4.8, "Allowing User Functions at Process Termination". As it
says, the MPI_Finalize() function calls the delete callback associated
with the MPI_COMM_SELF attribute "before any other part of MPI are
affected". It also says that "calling MPI_Finalized() will return
false in any of these callback functions".

Section 4.9 of the MPI-2 Standard, "Determining Whether MPI Has
Finished", moreover says that it can be determined whether MPI is
active by calling MPI_Finalized(). It also reaffirms that MPI is
active in the callback functions invoked by MPI_Finalize().

I think that an "active" MPI library here means that basic MPI
functions like MPI_Type_free() can be called.

The following small program therefore seems to conform to the
MPI standard.

However, when I run it (compiled with OpenMPI 1.0.2 mpicc), I get
the following message:

*** An error occurred in MPI_Type_free
*** after MPI was finalized
*** MPI_ERRORS_ARE_FATAL (goodbye)

Note that this program works well with mpich2.

Please have a look at this problem.

Thanks,

Martin Audet



#include <assert.h>
#include <stddef.h>

#include <mpi.h>

static int attr_delete_function(MPI_Comm p_comm, int p_keyval,
                                void *p_attribute_val, void *p_extra_state)
{
   assert(p_attribute_val != NULL);

   /* Get a reference on the datatype received. */
   MPI_Datatype *const cur_datatype = (MPI_Datatype *)(p_attribute_val);

   /* Free it if non null. */
   if (*cur_datatype != MPI_DATATYPE_NULL) {
      MPI_Type_free(cur_datatype);
      assert(*cur_datatype == MPI_DATATYPE_NULL);
   }

   return MPI_SUCCESS;
}


/* If p_datatype refers to a non-null MPI datatype, this function will */
/*  register a callback function to free p_datatype and set it to      */
/*  MPI_DATATYPE_NULL. This callback will be called during the first   */
/*  steps of the MPI_Finalize() function, when the state of the MPI    */
/*  library still allows MPI functions to be called. This is done by   */
/*  associating an attribute with the MPI_COMM_SELF communicator, as   */
/*  allowed by the MPI-2 standard (section 4.8).                       */
static void add_type_free_callback(MPI_Datatype *p_datatype)
{
   int keyval;

   assert(p_datatype != NULL);

   /* First create the keyval.                                          */
   /*  No callback function will be called when MPI_COMM_SELF is        */
   /*  duplicated and attr_delete_function() will be called when        */
   /*  MPI_COMM_SELF is freed (e.g. during MPI_Finalize()).             */
   /*  Since many callbacks can be associated with MPI_COMM_SELF to     */
   /*  free many datatypes, a new keyval has to be created every time.  */
   MPI_Keyval_create(MPI_NULL_COPY_FN, &attr_delete_function, &keyval, NULL);

   /* Then associate this keyval to MPI_COMM_SELF and make sure the     */
   /*  pointer to the datatype p_datatype is passed to the callback.    */
   MPI_Attr_put(MPI_COMM_SELF, keyval, p_datatype);

   /* Free the keyval because it is no longer needed.                   */
   MPI_Keyval_free(&keyval);
}
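
For illustration only, here is a minimal, hypothetical main() showing how
add_type_free_callback() might be used with a global datatype. This sketch
is not part of the attached program; g_pair_type is an invented name.

/* Hypothetical usage sketch: create and commit a global datatype,       */
/*  register it for cleanup, and let MPI_Finalize() run the callback.    */
static MPI_Datatype g_pair_type = MPI_DATATYPE_NULL;

int main(int argc, char **argv)
{
   MPI_Init(&argc, &argv);

   /* Build a simple derived datatype: two contiguous ints. */
   MPI_Type_contiguous(2, MPI_INT, &g_pair_type);
   MPI_Type_commit(&g_pair_type);

   /* Arrange for g_pair_type to be freed during MPI_Finalize(). */
   add_type_free_callback(&g_pair_type);

   /* ... use g_pair_type for communication here ... */

   MPI_Finalize();   /* the MPI_COMM_SELF delete callback frees g_pair_type */
   return 0;
}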

typedef struct {
   short