Re: [OMPI users] Open-MPI and gprof

2009-04-22 Thread Brock Palen
There is a tool (not free) that I have liked, which works great with
OMPI and can use gprof information.


http://www.allinea.com/index.php?page=74

Also, I am not sure, but Tau (which is free) might support some gprof
hooks.

http://www.cs.uoregon.edu/research/tau/home.php

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985



On Apr 22, 2009, at 7:37 PM, jgans wrote:


Hi,

Yes you can profile MPI applications by compiling with -pg.  
However, by default each process will produce an output file called  
"gmon.out", which is a problem if all processes are writing to the  
same global file system (i.e. all processes will try to write to  
the same file).


There is an undocumented feature of gprof that allows you to  
specify the filename for profiling output via the environment  
variable GMON_OUT_PREFIX. For example, one can set this variable in  
the .bashrc file for every node to ensure unique profile filenames,  
i.e.:


export GMON_OUT_PREFIX='gmon.out-'`/bin/uname -n`

The filename will appear as GMON_OUT_PREFIX.pid, where pid is the  
process id on a given node (so this will work when multiple  
processes are contained in a single host).


Regards,

Jason

Tiago Almeida wrote:

Hi,
I've never done this, but I believe that an executable compiled  
with profiling support (-pg) will generate the gmon.out file in  
its current directory, regardless of running under MPI or not. So  
I think that you'll have a gmon.out on each node and therefore you  
can "gprof" them independently.


Best regards,
Tiago Almeida
-
jody wrote:

Hi
I wanted to profile my application using gprof, and proceeded like
when profiling a normal application:
- compile everything with option -pg
- run application
- call gprof
This returns a normal-looking output, but I don't know
whether this is the data for node 0 only or accumulated for all  
nodes.


Does anybody have experience in profiling parallel applications?
Is there a way to have profile data for each node separately?
If not, is there another profiling tool which can?

Thank You
  Jody






Re: [OMPI users] Open-MPI and gprof

2009-04-22 Thread jgans

Hi,

Yes you can profile MPI applications by compiling with -pg. However, by 
default each process will produce an output file called "gmon.out", 
which is a problem if all processes are writing to the same global file 
system (i.e. all processes will try to write to the same file).


There is an undocumented feature of gprof that allows you to specify the 
filename for profiling output via the environment variable 
GMON_OUT_PREFIX. For example, one can set this variable in the .bashrc 
file for every node to ensure unique profile filenames, i.e.:


export GMON_OUT_PREFIX='gmon.out-'`/bin/uname -n`

The filename will appear as GMON_OUT_PREFIX.pid, where pid is the 
process id on a given node (so this will work when multiple processes are 
contained in a single host).
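
Pulling the pieces of this thread together, a minimal end-to-end sketch
(the program name "hello" and the hostfile "chosts" are placeholders
borrowed from elsewhere in this digest, not part of the original message):

   # build with profiling enabled, using the Open MPI wrapper compiler
   mpicc -pg -o hello hello.c

   # as described above, set this in each node's .bashrc so every rank
   # inherits a host-specific prefix; each rank then writes <prefix>.<pid>
   export GMON_OUT_PREFIX='gmon.out-'`/bin/uname -n`

   mpirun -np 8 -hostfile chosts ./hello

   # inspect each rank's profile individually ...
   for f in gmon.out-*; do gprof ./hello "$f" > "$f.txt"; done

   # ... or merge them: gprof -s sums profile data files into gmon.sum
   gprof -s ./hello gmon.out-*
   gprof ./hello gmon.sum > combined-profile.txt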


Regards,

Jason

Tiago Almeida wrote:

Hi,
I've never done this, but I believe that an executable compiled with 
profiling support (-pg) will generate the gmon.out file in its 
current directory, regardless of running under MPI or not. So I think 
that you'll have a gmon.out on each node and therefore you can "gprof" 
them independently.


Best regards,
Tiago Almeida
-
jody wrote:

Hi
I wanted to profile my application using gprof, and proceeded like
when profiling a normal application:
- compile everything with option -pg
- run application
- call gprof
This returns a normal-looking output, but I don't know
whether this is the data for node 0 only or accumulated for all nodes.

Does anybody have experience in profiling parallel applications?
Is there a way to have profile data for each node separately?
If not, is there another profiling tool which can?

Thank You
  Jody



Re: [OMPI users] Open-MPI and gprof

2009-04-22 Thread Tiago Almeida

Hi,
I've never done this, but I believe that an executable compiled with 
profiling support (-pg) will generate the gmon.out file in its current 
directory, regardless of running under MPI or not. So I think that 
you'll have a gmon.out on each node and therefore you can "gprof" them 
independently.


Best regards,
Tiago Almeida
-
jody wrote:

Hi
I wanted to profile my application using gprof, and proceeded like
when profiling a normal application:
- compile everything with option -pg
- run application
- call gprof
This returns a normal-looking output, but I don't know
whether this is the data for node 0 only or accumulated for all nodes.

Does anybody have experience in profiling parallel applications?
Is there a way to have profile data for each node separately?
If not, is there another profiling tool which can?

Thank You
  Jody


Re: [OMPI users] few Problems

2009-04-22 Thread Luis Vitorio Cargnini
Thank you all. I'll try to fix this ASAP; after that I'll run a new test  
round and report back. Thanks to everyone so far.


Le 09-04-22 à 17:06, Gus Correa a écrit :


Hi Luis, list

To complement Jeff's recommendation,
see if this recipe to set up passwordless ssh connections helps.
If you use RSA keys instead of DSA, replace all "dsa" by "rsa":

http://www.sshkeychain.org/mirrors/SSH-with-Keys-HOWTO/SSH-with-Keys-HOWTO-4.html#ss4.3
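
In short, the recipe boils down to something like the following sketch,
assuming OpenSSH and (as in this thread) a home directory shared by all nodes:

   # generate a DSA key pair; an empty passphrase gives true password-less login
   ssh-keygen -t dsa -N "" -f ~/.ssh/id_dsa

   # authorize the public key; with a shared $HOME this covers every node at once
   cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
   chmod 700 ~/.ssh
   chmod 600 ~/.ssh/authorized_keys

   # verify: this must print the uptime without asking for a password
   ssh cluster-srv2 uptime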

I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-


Jeff Squyres wrote:

It looks like you need to fix your password-less ssh problems first:
> Permission denied, please try again.
> AH72000@cluster-srv2's password:
As I mentioned earlier, you need to be able to run
   ssh cluster-srv2 uptime
without being prompted for a password before Open MPI will work  
properly.
If you're still having problems after fixing this, please send all  
the information from the "help" URL I sent earlier.

Thanks!
On Apr 22, 2009, at 3:24 PM, Luis Vitorio Cargnini wrote:

ok, this is the debug information, running on 5 nodes (trying at
least); the process has been locked up until now:

each node is composed of two quad-core microprocessors.

(it doesn't finish), and one node still asked me for the password. I have
the home partition mounted (the same) on all nodes, so logging in to
cluster-srv[0-4] is the same thing. I generated the dsa key on each node,
in different files, and added them all to authorized_keys, so it should
be working.

That is it, all help is welcome.


this is the code being executed:

#include <stdio.h>
#include <mpi.h>


int main (int argc, char *argv[])
{
  int rank, size;

  MPI_Init (&argc, &argv);                 /* starts MPI */
  MPI_Comm_rank (MPI_COMM_WORLD, &rank);   /* get current process id */
  MPI_Comm_size (MPI_COMM_WORLD, &size);   /* get number of processes */

  printf( "Hello world from process %d of %d\n", rank, size );
  MPI_Finalize();
  return 0;
}


--
debug:

-bash-3.2$ mpirun -v -d -hostfile chosts -np 32 /export/cluster/home/AH72000/mpi/hello

[cluster-srv0:21606] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/0/0
[cluster-srv0:21606] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/0
[cluster-srv0:21606] top: openmpi-sessions-AH72000@cluster-srv0_0
[cluster-srv0:21606] tmp: /tmp
[cluster-srv0:21606] mpirun: reset PATH: /export/cluster/appl/x86_64/llvm/bin:/bin:/sbin:/export/cluster/appl/x86_64/llvm/bin:/usr/local/llvm/bin:/usr/local/bin:/usr/bin:/usr/sbin:/home/GTI420/AH72000/oe/bitbake/bin
[cluster-srv0:21606] mpirun: reset LD_LIBRARY_PATH: /export/cluster/appl/x86_64/llvm/lib:/lib64:/lib:/export/cluster/appl/x86_64/llvm/lib:/usr/lib64:/usr/lib:/usr/local/lib64:/usr/local/lib
AH72000@cluster-srv1's password: AH72000@cluster-srv2's password:
AH72000@cluster-srv3's password:
[cluster-srv1:07406] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv1_0/35335/0/1
[cluster-srv1:07406] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv1_0/35335/0
[cluster-srv1:07406] top: openmpi-sessions-AH72000@cluster-srv1_0
[cluster-srv1:07406] tmp: /tmp


Permission denied, please try again.
AH72000@cluster-srv2's password:
[cluster-srv3:14230] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv3_0/35335/0/3
[cluster-srv3:14230] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv3_0/35335/0
[cluster-srv3:14230] top: openmpi-sessions-AH72000@cluster-srv3_0
[cluster-srv3:14230] tmp: /tmp

Permission denied, please try again.
AH72000@cluster-srv2's password:
[cluster-srv2:17092] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv2_0/35335/0/2
[cluster-srv2:17092] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv2_0/35335/0
[cluster-srv2:17092] top: openmpi-sessions-AH72000@cluster-srv2_0
[cluster-srv2:17092] tmp: /tmp
[cluster-srv0:21606] [[35335,0],0] node[0].name cluster-srv0 daemon 0 arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[1].name cluster-srv1 daemon 1 arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[2].name cluster-srv2 daemon 2 arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[3].name cluster-srv3 daemon 3 arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[4].name cluster-srv4 daemon INVALID arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[0].name cluster-srv0 daemon 0 arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[1].name cluster-srv1 daemon 1 arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[2].name cluster-srv2 daemon 2 arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[3].name cluster-srv3 daemon 3 arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[4].name cluster-srv4 daemon INVALID arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[0].name cluster-srv0 daemon 0 arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[1].name cluster-srv1 daemon 1 arch ffc91200

Re: [OMPI users] few Problems

2009-04-22 Thread Jeff Squyres

It looks like you need to fix your password-less ssh problems first:

> Permission denied, please try again.
> AH72000@cluster-srv2's password:

As I mentioned earlier, you need to be able to run

ssh cluster-srv2 uptime

without being prompted for a password before Open MPI will work  
properly.


If you're still having problems after fixing this, please send all the  
information from the "help" URL I sent earlier.


Thanks!


On Apr 22, 2009, at 3:24 PM, Luis Vitorio Cargnini wrote:


ok, this is the debug information, running on 5 nodes (trying at
least); the process has been locked up until now:

each node is composed of two quad-core microprocessors.

(it doesn't finish), and one node still asked me for the password. I have
the home partition mounted (the same) on all nodes, so logging in to
cluster-srv[0-4] is the same thing. I generated the dsa key on each node,
in different files, and added them all to authorized_keys, so it should
be working.

That is it, all help is welcome.


this is the code being executed:

#include <stdio.h>
#include <mpi.h>


int main (int argc, char *argv[])
{
   int rank, size;

   MPI_Init (&argc, &argv);                 /* starts MPI */
   MPI_Comm_rank (MPI_COMM_WORLD, &rank);   /* get current process id */
   MPI_Comm_size (MPI_COMM_WORLD, &size);   /* get number of processes */

   printf( "Hello world from process %d of %d\n", rank, size );
   MPI_Finalize();
   return 0;
}
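
(For anyone reproducing this: the program is normally built and launched
with the Open MPI wrappers; the output name below is just a placeholder.)

   mpicc -o hello hello.c
   mpirun -np 4 ./hello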


--
debug:

-bash-3.2$ mpirun -v -d -hostfile chosts -np 32 /export/cluster/home/AH72000/mpi/hello

[cluster-srv0:21606] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/0/0
[cluster-srv0:21606] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/0
[cluster-srv0:21606] top: openmpi-sessions-AH72000@cluster-srv0_0
[cluster-srv0:21606] tmp: /tmp
[cluster-srv0:21606] mpirun: reset PATH: /export/cluster/appl/x86_64/llvm/bin:/bin:/sbin:/export/cluster/appl/x86_64/llvm/bin:/usr/local/llvm/bin:/usr/local/bin:/usr/bin:/usr/sbin:/home/GTI420/AH72000/oe/bitbake/bin
[cluster-srv0:21606] mpirun: reset LD_LIBRARY_PATH: /export/cluster/appl/x86_64/llvm/lib:/lib64:/lib:/export/cluster/appl/x86_64/llvm/lib:/usr/lib64:/usr/lib:/usr/local/lib64:/usr/local/lib
AH72000@cluster-srv1's password: AH72000@cluster-srv2's password:
AH72000@cluster-srv3's password:
[cluster-srv1:07406] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv1_0/35335/0/1
[cluster-srv1:07406] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv1_0/35335/0
[cluster-srv1:07406] top: openmpi-sessions-AH72000@cluster-srv1_0
[cluster-srv1:07406] tmp: /tmp


Permission denied, please try again.
AH72000@cluster-srv2's password:
[cluster-srv3:14230] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv3_0/35335/0/3
[cluster-srv3:14230] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv3_0/35335/0
[cluster-srv3:14230] top: openmpi-sessions-AH72000@cluster-srv3_0
[cluster-srv3:14230] tmp: /tmp

Permission denied, please try again.
AH72000@cluster-srv2's password:
[cluster-srv2:17092] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv2_0/35335/0/2
[cluster-srv2:17092] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv2_0/35335/0
[cluster-srv2:17092] top: openmpi-sessions-AH72000@cluster-srv2_0
[cluster-srv2:17092] tmp: /tmp
[cluster-srv0:21606] [[35335,0],0] node[0].name cluster-srv0 daemon 0
arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[1].name cluster-srv1 daemon 1
arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[2].name cluster-srv2 daemon 2
arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[3].name cluster-srv3 daemon 3
arch ffc91200
[cluster-srv0:21606] [[35335,0],0] node[4].name cluster-srv4 daemon
INVALID arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[0].name cluster-srv0 daemon 0
arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[1].name cluster-srv1 daemon 1
arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[2].name cluster-srv2 daemon 2
arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[3].name cluster-srv3 daemon 3
arch ffc91200
[cluster-srv1:07406] [[35335,0],1] node[4].name cluster-srv4 daemon
INVALID arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[0].name cluster-srv0 daemon 0
arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[1].name cluster-srv1 daemon 1
arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[2].name cluster-srv2 daemon 2
arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[3].name cluster-srv3 daemon 3
arch ffc91200
[cluster-srv2:17092] [[35335,0],2] node[4].name cluster-srv4 daemon
INVALID arch ffc91200
[cluster-srv0:21611] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/1/0
[cluster-srv0:21611] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/1
[cluster-srv0:21611] top: openmpi-sessions-AH72000@cluster-srv0_0
[cluster-srv0:21611] tmp: /tmp
[cluster-srv0:21613] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/1/2
[cluster-srv0:21613] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv0_0/35335/1
[cluster-srv0:21613] top: openmpi-sessions-AH72000@cluster-srv0_0

Re: [OMPI users] Question about restart

2009-04-22 Thread Yaakoub El Khamra
Incidentally, if I add a check that base->sig.sh_old is not NULL and
recompile, everything works fine. I am concerned this is just fixing a
symptom rather than the root of the problem.

if(base->sig.sh_old != NULL)
  free(base->sig.sh_old);

is what I used.

Regards
Yaakoub El Khamra




On Wed, Apr 22, 2009 at 2:13 PM, Yaakoub El Khamra
 wrote:
> Greetings
> I am trying to get the checkpoint/restart to work on a single machine
> with openmpi 1.3 (also tried an svn check-out) and ran into a few
> problems. I am guessing I am doing something wrong, and would
> appreciate some help.
>
> I built openmpi with:
>  ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
> --enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
> --enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
> --enable-trace --enable-static=yes --enable-debug
> --with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
> --enable-ft-thread --with-blcr=/usr/local/blcr/
> --with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes
>
> I am using blcr 0.8.1 configured with:
>  ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
> --enable-libcr-tracing=yes --enable-kernel-tracing=yes
> --enable-testsuite=yes --enable-all-static=yes --enable-static=yes
>
> Checkpoint works fine, without any problems, I run with:
>  mpirun  -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1  -am
> ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1  matmultf.exe
>
> I am able to checkpoint without any problems using ompi-checkpoint
> --status --term 
> but when I try to restart, I get the following error:
>
> [yye00@localhost FTOpenMPI]$ ompi-restart -v  ompi_global_snapshot_23858.ckpt
> [localhost.localdomain:24394] Checking for the existence of
> (/home/yye00/ompi_global_snapshot_23858.ckpt)
> [localhost.localdomain:24394] Restarting from file
> (ompi_global_snapshot_23858.ckpt)
> [localhost.localdomain:24394]    Exec in self
> malloc debug: Invalid free (signal.c, 304)
> malloc debug: Invalid free (signal.c, 304)
> [localhost:23860] *** Process received signal ***
> [localhost:23860] Signal: Bus error (7)
> [localhost:23860] Signal code:  (2)
> [localhost:23860] Failing at address: 0x7fcbb737ef88
> [localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
> [localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1eccae]
> [localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1ed5ba]
> [localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
> [0x7fcbbd1ed745]
> [localhost:23860] [ 4]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc)
> [0x7fcbbcba2aa0]
> [localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcbdead1]
> [localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcbde8e2]
> [localhost:23860] [ 7]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c)
> [0x7fcbbcbde17c]
> [localhost:23860] [ 8]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2)
> [0x7fcbbcba45e9]
> [localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0
> [0x7fcbbced1d9d]
> [localhost:23860] [10]
> /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
> [0x7fcbbcba4509]
> [localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
> [0x7fcbbcba4bc2]
> [localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
> [localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
> [localhost:23860] *** End of error message ***
> --
> mpirun noticed that process rank 1 with PID 24396 on node
> localhost.localdomain exited on signal 7 (Bus error).
> --
>
> running strace on the ompi-restart did not provide any useful
> information. Any suggestions are greatly appreciated. Incidentally,
> looking at signal.c line 304, it is in a deallocation subroutine in
> opal, the evsignal_dealloc subroutine; the actual line is the
> "free(base->sig.sh_old);" line. I am about to add debug statements to
> that subroutine and see if I can get further information, but was
> hoping the problem is more user-related than code-related.
>
>
> Regards
> Yaakoub El Khamra
>



[OMPI users] Question about restart

2009-04-22 Thread Yaakoub El Khamra
Greetings
I am trying to get the checkpoint/restart to work on a single machine
with openmpi 1.3 (also tried an svn check-out) and ran into a few
problems. I am guessing I am doing something wrong, and would
appreciate some help.

I built openmpi with:
 ./configure --prefix=/usr/local/openmpi-1.3/ --enable-picky
--enable-debug --enable-mpi-f77 --enable-mpi-f90 --enable-mpi-profile
--enable-mpi-cxx --enable-pretty-print-stacktrace --enable-binaries
--enable-trace --enable-static=yes --enable-debug
--with-devel-headers=1 --with-mpi-param-check=always --with-ft=cr
--enable-ft-thread --with-blcr=/usr/local/blcr/
--with-blcr-libdir=/usr/local/blcr/lib --enable-mpi-threads=yes

I am using blcr 0.8.1 configured with:
 ./configure --prefix=/usr/local/blcr/ --enable-debug=yes
--enable-libcr-tracing=yes --enable-kernel-tracing=yes
--enable-testsuite=yes --enable-all-static=yes --enable-static=yes

Checkpoint works fine, without any problems, I run with:
 mpirun  -np 2 -mca ft_cr_enabled 1 -mca ompi_cr_verbose 1  -am
ft-enable-cr -mca crs_verbose 1 -mca crs_blcr_verbose 1  matmultf.exe

I am able to checkpoint without any problems using ompi-checkpoint
--status --term 
but when I try to restart, I get the following error:

[yye00@localhost FTOpenMPI]$ ompi-restart -v  ompi_global_snapshot_23858.ckpt
[localhost.localdomain:24394] Checking for the existence of
(/home/yye00/ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394] Restarting from file
(ompi_global_snapshot_23858.ckpt)
[localhost.localdomain:24394]    Exec in self
malloc debug: Invalid free (signal.c, 304)
malloc debug: Invalid free (signal.c, 304)
[localhost:23860] *** Process received signal ***
[localhost:23860] Signal: Bus error (7)
[localhost:23860] Signal code:  (2)
[localhost:23860] Failing at address: 0x7fcbb737ef88
[localhost:23860] [ 0] /lib64/libpthread.so.0 [0x32d720f0f0]
[localhost:23860] [ 1] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1eccae]
[localhost:23860] [ 2] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed5ba]
[localhost:23860] [ 3] /usr/local/openmpi-1.3_svn/lib/libmpi.so.0
[0x7fcbbd1ed745]
[localhost:23860] [ 4]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_progress+0xbc)
[0x7fcbbcba2aa0]
[localhost:23860] [ 5] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbdead1]
[localhost:23860] [ 6] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcbde8e2]
[localhost:23860] [ 7]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_crs_blcr_checkpoint+0x19c)
[0x7fcbbcbde17c]
[localhost:23860] [ 8]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_inc_core+0xb2)
[0x7fcbbcba45e9]
[localhost:23860] [ 9] /usr/local/openmpi-1.3_svn/lib/libopen-rte.so.0
[0x7fcbbced1d9d]
[localhost:23860] [10]
/usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0(opal_cr_test_if_checkpoint_ready+0x11b)
[0x7fcbbcba4509]
[localhost:23860] [11] /usr/local/openmpi-1.3_svn/lib/libopen-pal.so.0
[0x7fcbbcba4bc2]
[localhost:23860] [12] /lib64/libpthread.so.0 [0x32d72073da]
[localhost:23860] [13] /lib64/libc.so.6(clone+0x6d) [0x32d66e62bd]
[localhost:23860] *** End of error message ***
--
mpirun noticed that process rank 1 with PID 24396 on node
localhost.localdomain exited on signal 7 (Bus error).
--

running strace on the ompi-restart did not provide any useful
information. Any suggestions are greatly appreciated. Incidentally,
looking at signal.c line 304, it is in a deallocation subroutine in
opal, the evsignal_dealloc subroutine; the actual line is the
"free(base->sig.sh_old);" line. I am about to add debug statements to
that subroutine and see if I can get further information, but was
hoping the problem is more user-related than code-related.


Regards
Yaakoub El Khamra


Re: [OMPI users] Problems with SSH

2009-04-22 Thread Jeff Squyres
It looks like something must not be right in your password-less ssh  
setup.  You need to be able to "ssh cluster-srv2.logti.etsmtl.ca  
uptime" and have it not ask for a password.  Are you able to do that?


On Apr 21, 2009, at 10:36 AM, Luis Vitorio Cargnini wrote:


Hi,
Please, I did as mentioned in the FAQ for password-less SSH, but
mpirun is still requesting the password?

-bash-3.2$ mpirun -d -v -hostfile chosts -np 16  ./hello
[cluster-srv0.logti.etsmtl.ca:31929] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv0.logti.etsmtl.ca_0/41688/0/0
[cluster-srv0.logti.etsmtl.ca:31929] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv0.logti.etsmtl.ca_0/41688/0
[cluster-srv0.logti.etsmtl.ca:31929] top: openmpi-sessions-AH72000@cluster-srv0.logti.etsmtl.ca_0
[cluster-srv0.logti.etsmtl.ca:31929] tmp: /tmp
[cluster-srv0.logti.etsmtl.ca:31929] mpirun: reset PATH: /export/cluster/appl/x86_64/llvm/bin:/bin:/sbin:/export/cluster/appl/x86_64/llvm/bin:/usr/local/llvm/bin:/usr/local/bin:/usr/bin:/usr/sbin:/home/GTI420/AH72000/oe/bitbake/bin
[cluster-srv0.logti.etsmtl.ca:31929] mpirun: reset LD_LIBRARY_PATH: /export/cluster/appl/x86_64/llvm/lib:/lib64:/lib:/export/cluster/appl/x86_64/llvm/lib:/usr/lib64:/usr/lib:/usr/local/lib64:/usr/local/lib
ah72...@cluster-srv1.logti.etsmtl.ca's password: ah72...@cluster-srv2.logti.etsmtl.ca's password: ah72...@cluster-srv3.logti.etsmtl.ca's password:
[cluster-srv1.logti.etsmtl.ca:02621] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv1.logti.etsmtl.ca_0/41688/0/1
[cluster-srv1.logti.etsmtl.ca:02621] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv1.logti.etsmtl.ca_0/41688/0
[cluster-srv1.logti.etsmtl.ca:02621] top: openmpi-sessions-AH72000@cluster-srv1.logti.etsmtl.ca_0
[cluster-srv1.logti.etsmtl.ca:02621] tmp: /tmp


Permission denied, please try again.
ah72...@cluster-srv2.logti.etsmtl.ca's password:
[cluster-srv3.logti.etsmtl.ca:09730] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv3.logti.etsmtl.ca_0/41688/0/3
[cluster-srv3.logti.etsmtl.ca:09730] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv3.logti.etsmtl.ca_0/41688/0
[cluster-srv3.logti.etsmtl.ca:09730] top: openmpi-sessions-AH72000@cluster-srv3.logti.etsmtl.ca_0
[cluster-srv3.logti.etsmtl.ca:09730] tmp: /tmp

Permission denied, please try again.
ah72...@cluster-srv2.logti.etsmtl.ca's password:
[cluster-srv2.logti.etsmtl.ca:12802] procdir: /tmp/openmpi-sessions-AH72000@cluster-srv2.logti.etsmtl.ca_0/41688/0/2
[cluster-srv2.logti.etsmtl.ca:12802] jobdir: /tmp/openmpi-sessions-AH72000@cluster-srv2.logti.etsmtl.ca_0/41688/0
[cluster-srv2.logti.etsmtl.ca:12802] top: openmpi-sessions-AH72000@cluster-srv2.logti.etsmtl.ca_0
[cluster-srv2.logti.etsmtl.ca:12802] tmp: /tmp




--
Jeff Squyres
Cisco Systems



Re: [OMPI users] few Problems

2009-04-22 Thread Jeff Squyres
This isn't really enough information for us to help you.  Can you send  
all the information here:


http://www.open-mpi.org/community/help/

Thanks.


On Apr 21, 2009, at 10:34 AM, Luis Vitorio Cargnini wrote:


Hi,
Please, can someone tell me what this problem could be?
  daemon INVALID arch ffc91200




the debug output:
[[41704,1],14] node[4].name cluster-srv4 daemon INVALID arch ffc91200
[cluster-srv3:09684] [[41704,1],13] node[0].name cluster-srv0 daemon 0
arch ffc91200
[cluster-srv3:09684] [[41704,1],13] node[1].name cluster-srv1 daemon 1
arch ffc91200
[cluster-srv3:09684] [[41704,1],13] node[2].name cluster-srv2 daemon 2
arch ffc91200
[cluster-srv3:09684] [[41704,1],13] node[3].name cluster-srv3 daemon 3
arch ffc91200
[cluster-srv3:09684] [[41704,1],13] node[4].name cluster-srv4 daemon
INVALID arch ffc91200

ORTE_ERROR_LOG: A message is attempting to be sent to a process whose
contact information is unknown in file rml_oob_send.c at line 105



--
Jeff Squyres
Cisco Systems



[OMPI users] Open MPI v1.3.2 released

2009-04-22 Thread Ralph Castain
The Open MPI Team, representing a consortium of research, academic,
and industry partners, is pleased to announce the release of Open MPI
version 1.3.2. This release is mainly a bug fix release over the v1.3.1
release, but there are a few new features.  We strongly recommend
that all users upgrade to version 1.3.2 if possible.

Version 1.3.2 can be downloaded from the main Open MPI web site or
any of its mirrors (mirrors will be updating shortly).

NOTE: The Open MPI team has uncovered a serious bug in Open MPI v1.3.0 and
v1.3.1: when running on OpenFabrics-based networks, silent data
corruption is possible in some cases. There are two workarounds to
avoid the issue -- please see the bug ticket that has been opened
about this issue for further details:

https://svn.open-mpi.org/trac/ompi/ticket/1853

We strongly encourage all users who are using Open MPI v1.3.0 and/or
v1.3.1 on OpenFabrics-based networks to upgrade to 1.3.2.
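
To verify which installation each node actually picks up before and after
upgrading, a quick check (the grep pattern assumes ompi_info's usual
banner line):

   ompi_info | grep "Open MPI:"    # shows the Open MPI version in use
   which mpirun mpicc              # confirm the intended install is first in PATH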


Here is a list of changes in v1.3.2 as compared to v1.3.1:

- Fixed a potential infinite loop in the openib BTL that could occur
 in senders in some frequent-communication scenarios.  Thanks to Don
 Wood for reporting the problem.
- Add a new checksum PML variation on ob1 (main MPI point-to-point
 communication engine) to detect memory corruption in node-to-node
 messages
- Add a new configuration option to add padding to the openib
 header so the data is aligned
- Add a new configuration option to use an alternative checksum
 algorithm when using the checksum PML
- Fixed a problem reported by multiple users on the mailing list that
 the LSF support would fail to find the appropriate libraries at
 run-time.
- Allow empty shell designations from getpwuid().  Thanks to Sergey
 Koposov for the bug report.
- Ensure that mpirun exits with non-zero status when applications die
 due to user signal.  Thanks to Geoffroy Pignot for suggesting the
 fix.
- Ensure that MPI_VERSION / MPI_SUBVERSION match what is returned by
 MPI_GET_VERSION.  Thanks to Rob Egan for reporting the error.
- Updated MPI_*KEYVAL_CREATE functions to properly handle Fortran
 extra state.
- A variety of ob1 (main MPI point-to-point communication engine) bug
 fixes that could have caused hangs or seg faults.
- Do not install Open MPI's signal handlers in MPI_INIT if there are
 already signal handlers installed.  Thanks to Kees Verstoep for
 bringing the issue to our attention.
- Fix GM support to not seg fault in MPI_INIT.
- Various VampirTrace fixes.
- Various PLPA fixes.
- No longer create BTLs for invalid (TCP) devices.
- Various man page style and lint cleanups.
- Fix critical OpenFabrics-related bug noted here:
 http://www.open-mpi.org/community/lists/announce/2009/03/0029.php.
 Open MPI now uses a much more robust memory intercept scheme that is
 quite similar to what is used by MX.  The use of "-lopenmpi-malloc"
 is no longer necessary, is deprecated, and is expected to disappear
 in a future release.  -lopenmpi-malloc will continue to work for the
 duration of the Open MPI v1.3 and v1.4 series.
- Fix some OpenFabrics shutdown errors, both regarding iWARP and SRQ.
- Allow the udapl BTL to work on Solaris platforms that support
 relaxed PCI ordering.
- Fix problem where the mpirun would sometimes use rsh/ssh to launch on
 the localhost (instead of simply forking).
- Minor SLURM stdin fixes.
- Fix to run properly under SGE jobs.
- Scalability and latency improvements for shared memory jobs: convert
 to using one message queue instead of N queues.
- Automatically size the shared-memory area (mmap file) to match
 better what is needed;  specifically, so that large-np jobs will start.
- Use fixed-length MPI predefined handles in order to provide ABI
 compatibility between Open MPI releases.
- Fix building of the posix paffinity component to properly get the
 number of processors in loosely tested environments (e.g.,
 FreeBSD).  Thanks to Steve Kargl for reporting the issue.
- Fix --with-libnuma handling in configure.  Thanks to Gus Correa for
 reporting the problem.


Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Gus Correa

Hi

Do "yum list | grep mpi" to find the correct package names.
Then uninstall them with "yum remove" using the correct package name.

Don't use yum to install different flavors of MPI.
Things like mpicc, mpirun, MPI libraries, man pages, etc,
will get overwritten in /usr or /usr/local.
If you want to use yum, install only one MPI flavor.

OR (after removing the old packages with yum):

Reinstall OpenMPI from source.
Use the --prefix=/your/target/OpenMPI/dir option during configure.

Reinstall MPICH2 from source.
Use the --prefix=/your/target/MPICH2/dir option during configure.

Use different directories for OpenMPI and MPICH2.
Or install only one MPI flavor.
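
A sketch of that cleanup and rebuild; the package names, versions, and
prefixes below are assumptions for illustration, not taken from the
poster's system:

   # find the exact package names, then remove them
   yum list installed | grep -i mpi
   yum remove openmpi mpich2            # use whatever names the list shows

   # rebuild each flavor from source into its own prefix
   cd openmpi-1.3.2 && ./configure --prefix=/opt/openmpi && make all install
   cd ../mpich2-1.0.8 && ./configure --prefix=/opt/mpich2 && make && make install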

I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:
I feel the above problem occurred due to installing the mpich package; now 
even normal mpi programs are not running.
What should we do? We even tried *yum remove mpich* but it says there are 
no packages to remove.

Please help!!!

On Wed, Apr 22, 2009 at 11:34 AM, Ankush Kaul wrote:


We are facing another problem; we were trying to install different
benchmarking packages.

Now whenever we try to run the *mpirun* command (which was working
perfectly before) we get this error:
/usr/local/bin/mpdroot: open failed for root's mpd conf file
mpdtrace (__init__ 1190): forked process failed; status=255

What's the problem here?




On Tue, Apr 21, 2009 at 11:45 PM, Gus Correa wrote:

Hi Ankush

Ankush Kaul wrote:

@Eugene
they are OK, but we wanted something better, which would more
clearly show the difference between using a single PC and the cluster.

@Prakash
I had a problem running the programs, as they were compiled
using mpcc and not mpicc.

@gus
we are trying to figure out the hpl config; it's quite complicated,


I sent you some sketchy instructions to build HPL,
on my last message to this thread.
I built HPL and run it here yesterday that way.
Did you try my suggestions?
Where did you get stuck?


also, the locate command lists lots of confusing results.


I would say the list is just long, not really confusing.
You can  find what you need if you want.
Pipe the output of locate through "more", and search carefully.
If you are talking about BLAS try "locate libblas.a" and
"locate libgoto.a".
Those are the libraries you need, and if they are not there
you need to install one of them.
Read my previous email for details.
I hope it will help you get HPL working, if you are interested
on HPL.


I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

@jeff
I think you are correct, we may have installed openmpi without
VT support, but is there anything we can do now?

One more thing: I found this program but don't know how to run
it: http://www.cis.udel.edu/~pollock/367/manual/node35.html

Thanks to all of you for putting in so much effort to help us out.

















Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Gus Correa

Hi

This is an MPICH2 error, not an Open MPI one.
I saw you sent the same message to the MPICH list.
It looks like you have mixed both MPI flavors.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:
We are facing another problem; we were trying to install different
benchmarking packages.

Now whenever we try to run the *mpirun* command (which was working
perfectly before) we get this error:
/usr/local/bin/mpdroot: open failed for root's mpd conf file
mpdtrace (__init__ 1190): forked process failed; status=255

What's the problem here?




On Tue, Apr 21, 2009 at 11:45 PM, Gus Correa wrote:


Hi Ankush

Ankush Kaul wrote:

@Eugene
they are OK, but we wanted something better, which would more
clearly show the difference between using a single PC and the cluster.

@Prakash
I had a problem running the programs, as they were compiled using
mpcc and not mpicc.

@gus
we are trying to figure out the hpl config; it's quite complicated,


I sent you some sketchy instructions to build HPL,
on my last message to this thread.
I built HPL and run it here yesterday that way.
Did you try my suggestions?
Where did you get stuck?


also, the locate command lists lots of confusing results.


I would say the list is just long, not really confusing.
You can  find what you need if you want.
Pipe the output of locate through "more", and search carefully.
If you are talking about BLAS try "locate libblas.a" and
"locate libgoto.a".
Those are the libraries you need, and if they are not there
you need to install one of them.
Read my previous email for details.
I hope it will help you get HPL working, if you are interested on HPL.


I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

@jeff
I think you are correct, we may have installed openmpi without VT
support, but is there anything we can do now?

One more thing: I found this program but don't know how to run
it: http://www.cis.udel.edu/~pollock/367/manual/node35.html

Thanks to all of you for putting in so much effort to help us out.















Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Gus Correa

Hi Ankush

I second Eugene's comments.

I already sent you on previous emails to this thread
all relevant information on where to
get HPL from Netlib (http://netlib.org/benchmark/hpl/),
Goto BLAS from TACC (http://www.tacc.utexas.edu/resources/software/),
and the standard BLAS from Netlib (http://www.netlib.org/blas/).
OK, there go the links once again!

I also sent you instructions on how to install HPL,
exactly as I installed it here two days ago.
Did you read those instructions?
It is one of the messages on this thread.
Check the mailing list archive, if you don't have that email anymore.

Eugene just sent you a gotcha on how to build Netlib BLAS.
Goto BLAS is also easy to install with its "quickbuild" scripts,
and it comes with Readme, QuickInstall, and FAQ files, which you should 
read.


However, somehow you are repeating the same questions
that I and others already answered.
There isn't much more I can say to help you out.

Good luck!

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Eugene Loh wrote:

Ankush Kaul wrote:


@gus
we are not able to make hpl successfully.

I think it has something to do with blas.

I cannot find the blas tar file on the net; I found an rpm, but the
installation steps are with a tar file.


First of all, this mailing list is for Open MPI issues.  On this list are 
people who are helpful and know about lots of stuff (including things 
having anything at all to do with MPI), but HPL and HPCC have their own 
support mechanisms and you should probably pursue those for HPL questions.


Anyhow, if I google "blas", I immediately come up with netlib.org, which 
is where you can get a BLAS source tar file.  I've had to go through the 
HPL experience myself in the last 0-2 days, and it seems to me that the 
netlib.org site is not responding.  So, one can google "netlib mirror" 
to find mirror sites, poke around a little, and end up getting BLAS from 
the Sandia mirror site.


Short version:  try http://netlib.sandia.gov/blas/blas.tgz

I found a gotcha.  I changed the "g77" in the BLAS/make.inc file to 
become mpif77.  Also, in the HPL hpl/Make.$ARCH file, I used mpif77 for 
the linker.  This way, some Fortran I/O routines used by blas (xerbla.f) 
will be found at link time.  (I was using HPL from HPCC.  Not sure if 
your HPL is the same.)





Re: [OMPI users] Could following situations be caused by RDMA mca parameters?

2009-04-22 Thread Jeff Squyres

On Apr 21, 2009, at 11:01 AM, Tsung Han Shie wrote:


I tried to increase speed of a program with openmpi-1.1.3


Did you mean 1.1.3 or 1.3.1?


by adding following 4 parameters into openmpi-mca-params.conf file.

mpi_leave_pinned=1
btl_openib_eager_rdma_num=128
btl_openib_max_eager_rdma=128
btl_openib_eager_limit=1024


If you meant 1.3.1 above, please see the following message about an  
important bug in 1.3 and 1.3.1 with the use of mpi_leave_pinned:


http://www.open-mpi.org/community/lists/announce/2009/03/0029.php


and then I ran my program twice (124 processes on 31 nodes), one
with "mpi_leave_pinned=1", another with "mpi_leave_pinned=0".
Both of them were stopped abnormally with "ctrl+c" and "killall -9
".


Why -- did they hang?


After that, I couldn't start to run that program again.


What exactly was the error?

I checked every node with "free -m" and I found that a huge amount of
cached memory was used on each node.
Could this situation be caused by those 4 parameters? Is there
any way to free them?



Probably not.

Can you send all the information listed here:

http://www.open-mpi.org/community/help/

--
Jeff Squyres
Cisco Systems



[OMPI users] 100% CPU doing nothing!?

2009-04-22 Thread Douglas Guptill
Hi Ross:

On Tue, Apr 21, 2009 at 07:19:53PM -0700, Ross Boylan wrote:
> I'm using Rmpi (a pretty thin wrapper around MPI for R) on Debian Lenny
> (amd64).  My set up has a central calculator and a bunch of slaves to
> which work is distributed.
> 
> The slaves wait like this:
> mpi.send(as.double(0), doubleType, root, requestCode, comm=comm)
> request <- request+1
> cases <- mpi.recv(cases, integerType, root, mpi.any.tag(),
> comm=comm)
> 
> I.e., they do a simple send and then a receive.
> 
> It's possible there's no one to talk to, so it could be stuck at
> mpi.send or mpi.recv.
> 
> Are either of those operations that should chew up CPU?  At this point,
> I'm just trying to figure out where to look for the source of the
> problem.

Search the list archives
  http://www.open-mpi.org/community/lists/users/
for "100% CPU" and you will get lots to look at.

I had a similar problem with a FORTRAN program.  With the help of Jeff
Squyres and Eugene Loh I wrote a solution: user-written MPI_Recv.c and
MPI_Send.c, which I load with my application, and the MPI problem
"100% CPU usage while doing nothing" is cured.

The code for MPI_Recv.c and MPI_Send.c is here:
  http://www.open-mpi.org/community/lists/users/2008/12/7563.php

Cheers,
Douglas.



Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Eugene Loh

Ankush Kaul wrote:


@gus
we are not able to make hpl successfully.

I think it has something to do with blas.

I cannot find the blas tar file on the net; I found an rpm, but the
installation steps are with a tar file.


First of all, this mailing list is for Open MPI issues.  On this list are 
people who are helpful and know about lots of stuff (including things 
having anything at all to do with MPI), but HPL and HPCC have their own 
support mechanisms and you should probably pursue those for HPL questions.


Anyhow, if I google "blas", I immediately come up with netlib.org, which 
is where you can get a BLAS source tar file.  I've had to go through the 
HPL experience myself in the last 0-2 days, and it seems to me that the 
netlib.org site is not responding.  So, one can google "netlib mirror" 
to find mirror sites, poke around a little, and end up getting BLAS from 
the Sandia mirror site.


Short version:  try http://netlib.sandia.gov/blas/blas.tgz

I found a gotcha.  I changed the "g77" in the BLAS/make.inc file to 
become mpif77.  Also, in the HPL hpl/Make.$ARCH file, I used mpif77 for 
the linker.  This way, some Fortran I/O routines used by blas (xerbla.f) 
will be found at link time.  (I was using HPL from HPCC.  Not sure if 
your HPL is the same.)
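
Concretely, the edits amount to something like this sketch (the variable
names come from a typical netlib BLAS make.inc and HPL Make.<arch> file;
yours may differ):

   # BLAS/make.inc: compile the reference BLAS with the MPI wrapper
   #     FORTRAN = g77            -->   FORTRAN = mpif77
   # hpl/Make.$ARCH: link xhpl with the same wrapper, so the Fortran I/O
   # routines used by blas (xerbla.f) resolve at link time
   #     LINKER  = /usr/bin/g77   -->   LINKER  = mpif77

   # then rebuild both
   (cd BLAS && make)
   (cd hpl && make arch=$ARCH)    # $ARCH is your Make.<arch> suffix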


Re: [OMPI users] [Fwd: mpi alltoall memory requirement]

2009-04-22 Thread Ashley Pittman
On Wed, 2009-04-22 at 12:40 +0530, vkm wrote:

> The same amount of memory is required for recvbuf. So at the least each 
> node should have 36GB of memory.
> 
> Am I calculating right ? Please correct.

Your calculation looks correct; the conclusion is slightly wrong,
however.  The application buffers will consume 36GB of memory; the rest
of the application, any comms buffers, and the usual OS overhead will be
on top of this, so putting only 36GB of RAM in your nodes will still
leave you short.

Ashley,



Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Ankush Kaul
@gus
we are not able to make hpl successfully.

I think it has something to do with blas.

I cannot find the blas tar file on the net; I found an rpm, but the
installation steps are with a tar file.

#*locate blas* gave us the following result

*[root@ccomp1 hpl]# locate blas
/hpl/include/hpl_blas.h
/hpl/makes/Make.blas
/hpl/src/blas
/hpl/src/blas/HPL_daxpy.c
/hpl/src/blas/HPL_dcopy.c
/hpl/src/blas/HPL_dgemm.c
/hpl/src/blas/HPL_dgemv.c
/hpl/src/blas/HPL_dger.c
/hpl/src/blas/HPL_dscal.c
/hpl/src/blas/HPL_dswap.c
/hpl/src/blas/HPL_dtrsm.c
/hpl/src/blas/HPL_dtrsv.c
/hpl/src/blas/HPL_idamax.c
/hpl/src/blas/ccomp
/hpl/src/blas/i386
/hpl/src/blas/ccomp/Make.inc
/hpl/src/blas/ccomp/Makefile
/hpl/src/blas/i386/Make.inc
/hpl/src/blas/i386/Makefile
/usr/include/boost/numeric/ublas
/usr/include/boost/numeric/ublas/banded.hpp
/usr/include/boost/numeric/ublas/blas.hpp
/usr/include/boost/numeric/ublas/detail
/usr/include/boost/numeric/ublas/exception.hpp
/usr/include/boost/numeric/ublas/expression_types.hpp
/usr/include/boost/numeric/ublas/functional.hpp
/usr/include/boost/numeric/ublas/fwd.hpp
/usr/include/boost/numeric/ublas/hermitian.hpp
/usr/include/boost/numeric/ublas/io.hpp
/usr/include/boost/numeric/ublas/lu.hpp
/usr/include/boost/numeric/ublas/matrix.hpp
/usr/include/boost/numeric/ublas/matrix_expression.hpp
/usr/include/boost/numeric/ublas/matrix_proxy.hpp
/usr/include/boost/numeric/ublas/matrix_sparse.hpp
/usr/include/boost/numeric/ublas/operation.hpp
/usr/include/boost/numeric/ublas/operation_blocked.hpp
/usr/include/boost/numeric/ublas/operation_sparse.hpp
/usr/include/boost/numeric/ublas/storage.hpp
/usr/include/boost/numeric/ublas/storage_sparse.hpp
/usr/include/boost/numeric/ublas/symmetric.hpp
/usr/include/boost/numeric/ublas/traits.hpp
/usr/include/boost/numeric/ublas/triangular.hpp
/usr/include/boost/numeric/ublas/vector.hpp
/usr/include/boost/numeric/ublas/vector_expression.hpp
/usr/include/boost/numeric/ublas/vector_of_vector.hpp
/usr/include/boost/numeric/ublas/vector_proxy.hpp
/usr/include/boost/numeric/ublas/vector_sparse.hpp
/usr/include/boost/numeric/ublas/detail/concepts.hpp
/usr/include/boost/numeric/ublas/detail/config.hpp
/usr/include/boost/numeric/ublas/detail/definitions.hpp
/usr/include/boost/numeric/ublas/detail/documentation.hpp
/usr/include/boost/numeric/ublas/detail/duff.hpp
/usr/include/boost/numeric/ublas/detail/iterator.hpp
/usr/include/boost/numeric/ublas/detail/matrix_assign.hpp
/usr/include/boost/numeric/ublas/detail/raw.hpp
/usr/include/boost/numeric/ublas/detail/returntype_deduction.hpp
/usr/include/boost/numeric/ublas/detail/temporary.hpp
/usr/include/boost/numeric/ublas/detail/vector_assign.hpp
/usr/lib/libblas.so.3
/usr/lib/libblas.so.3.1
/usr/lib/libblas.so.3.1.1
/usr/lib/openoffice.org/basis3.0/share/gallery/htmlexpo/cublast.gif
/usr/lib/openoffice.org/basis3.0/share/gallery/htmlexpo/cublast_.gif
/usr/share/backgrounds/images/tiny_blast_of_red.jpg
/usr/share/doc/blas-3.1.1
/usr/share/doc/blas-3.1.1/blasqr.ps
/usr/share/man/manl/intro_blas1.l.gz*

When we try to make using the following command
*# make arch=ccomp*
it gives this error:
*Makefile:47: Make.inc: No such file or directory
make[2]: *** No rule to make target `Make.inc'.  Stop.
make[2]: Leaving directory `/hpl/src/auxil/ccomp'
make[1]: *** [build_src] Error 2
make[1]: Leaving directory `/hpl'
make: *** [build] Error 2*

The *ccomp* folder is created, but the *xhpl* file is not created.
Is it some problem with the config file?




On Wed, Apr 22, 2009 at 11:40 AM, Ankush Kaul wrote:

> I feel the above problem occurred due to installing the mpich package; now
> even normal mpi programs are not running.
> What should we do? We even tried *yum remove mpich* but it says there are no
> packages to remove.
> Please help!!!
>
>   On Wed, Apr 22, 2009 at 11:34 AM, Ankush Kaul wrote:
>
>> We are facing another problem; we were trying to install different
>> benchmarking packages.
>>
>> Now whenever we try to run the *mpirun* command (which was working perfectly
>> before) we get this error:
>> */usr/local/bin/mpdroot: open failed for root's mpd conf file
>> mpdtrace (__init__ 1190): forked process failed; status=255*
>>
>> What's the problem here?
>>
>>
>>
>> On Tue, Apr 21, 2009 at 11:45 PM, Gus Correa wrote:
>>
>>> Hi Ankush
>>>
>>> Ankush Kaul wrote:
>>>
 @Eugene
 they are OK, but we wanted something better, which would more clearly
 show the difference between using a single PC and the cluster.

 @Prakash
 I had a problem running the programs, as they were compiled using mpcc
 and not mpicc.

 @gus
 we are trying to figure out the hpl config; it's quite complicated,

>>>
>>> I sent you some sketchy instructions to build HPL,
>>> on my last message to this thread.
>>> I built HPL and run it here yesterday that way.
>>> Did you try my suggestions?
>>> Where did you get stuck?
>>>
>>> also, the locate command lists lots of confusing results.


>>> I would say the list is just long, not really confusing.

[OMPI users] [Fwd: mpi alltoall memory requirement]

2009-04-22 Thread vkm

Hi,
I am running an MPI alltoall test on my 8-node cluster. The nodes all
have 24-core CPUs.
So the total number of processes that I am running is 8*24 = 192. In
summary: an alltoall test on 8 nodes with 24 processes per node.


But my test consumes all the RAM and swap space. However, if I count the
required memory, the calculation comes out as below.


The alltoall test runs up to 4M data sizes. Each process will have ONE
sendbuf and ONE recvbuf with room for the remaining 191 processes to
talk to (and one slot to talk to itself).


So, one process will need 192*4M = 768M of memory for its sendbuf. Now,
on one node there are in fact 24 processes running. So on one node, in
total, I need 768M * 24 = 18432M = ~18G for sendbuf.


The same amount of memory is required for recvbuf. So at the least each
node should have 36GB of memory.


Am I calculating right? Please correct.
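
Spelled out, that arithmetic is:

   # per-process sendbuf: 192 peers x 4 MB            =   768 MB
   # per-node sendbufs:   768 MB x 24 local processes = 18432 MB (~18 GB)
   # send + recv buffers: 2 x ~18 GB                  =  ~36 GB per node
   echo $(( 192 * 4 * 24 * 2 )) MB    # prints 36864 (MB), i.e. ~36 GB of application buffers alone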





Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Ankush Kaul
I feel the above problem occurred due to installing the mpich package; now even
normal mpi programs are not running.
What should we do? We even tried *yum remove mpich* but it says there are no
packages to remove.
Please help!!!

On Wed, Apr 22, 2009 at 11:34 AM, Ankush Kaul wrote:

> We are facing another problem; we were trying to install different
> benchmarking packages.
>
> Now whenever we try to run the *mpirun* command (which was working perfectly
> before) we get this error:
> */usr/local/bin/mpdroot: open failed for root's mpd conf file
> mpdtrace (__init__ 1190): forked process failed; status=255*
>
> What's the problem here?
>
>
>
> On Tue, Apr 21, 2009 at 11:45 PM, Gus Correa wrote:
>
>> Hi Ankush
>>
>> Ankush Kaul wrote:
>>
>>> @Eugene
>>> they are OK, but we wanted something better, which would more clearly show
>>> the difference between using a single PC and the cluster.
>>>
>>> @Prakash
>>> I had a problem running the programs, as they were compiled using mpcc and
>>> not mpicc.
>>>
>>> @gus
>>> we are trying to figure out the hpl config; it's quite complicated,
>>>
>>
>> I sent you some sketchy instructions to build HPL,
>> on my last message to this thread.
>> I built HPL and run it here yesterday that way.
>> Did you try my suggestions?
>> Where did you get stuck?
>>
>> also, the locate command lists lots of confusing results.
>>>
>>>
>> I would say the list is just long, not really confusing.
>> You can  find what you need if you want.
>> Pipe the output of locate through "more", and search carefully.
>> If you are talking about BLAS try "locate libblas.a" and
>> "locate libgoto.a".
>> Those are the libraries you need, and if they are not there
>> you need to install one of them.
>> Read my previous email for details.
>> I hope it will help you get HPL working, if you are interested on HPL.
>>
>> I hope this helps.
>>
>> Gus Correa
>> -
>> Gustavo Correa
>> Lamont-Doherty Earth Observatory - Columbia University
>> Palisades, NY, 10964-8000 - USA
>> -
>>
>>  @jeff
>>> I think you are correct, we may have installed openmpi without VT support,
>>> but is there anything we can do now?
>>>
>>> One more thing: I found this program but don't know how to run it:
>>> http://www.cis.udel.edu/~pollock/367/manual/node35.html
>>>
>>> Thanks to all of you for putting in so much effort to help us out.


Re: [OMPI users] Problem with running openMPI program

2009-04-22 Thread Ankush Kaul
We are facing another problem; we were trying to install different
benchmarking packages.

Now whenever we try to run the *mpirun* command (which was working perfectly
before) we get this error:
*/usr/local/bin/mpdroot: open failed for root's mpd conf file
mpdtrace (__init__ 1190): forked process failed; status=255*

What's the problem here?



On Tue, Apr 21, 2009 at 11:45 PM, Gus Correa  wrote:

> Hi Ankush
>
> Ankush Kaul wrote:
>
>> @Eugene
>> they are OK, but we wanted something better, which would more clearly show
>> the difference between using a single PC and the cluster.
>>
>> @Prakash
>> I had a problem running the programs, as they were compiled using mpcc and
>> not mpicc.
>>
>> @gus
>> we are trying to figure out the hpl config; it's quite complicated,
>>
>
> I sent you some sketchy instructions to build HPL,
> on my last message to this thread.
> I built HPL and run it here yesterday that way.
> Did you try my suggestions?
> Where did you get stuck?
>
> also, the locate command lists lots of confusing results.
>>
>>
> I would say the list is just long, not really confusing.
> You can  find what you need if you want.
> Pipe the output of locate through "more", and search carefully.
> If you are talking about BLAS try "locate libblas.a" and
> "locate libgoto.a".
> Those are the libraries you need, and if they are not there
> you need to install one of them.
> Read my previous email for details.
> I hope it will help you get HPL working, if you are interested on HPL.
>
> I hope this helps.
>
> Gus Correa
> -
> Gustavo Correa
> Lamont-Doherty Earth Observatory - Columbia University
> Palisades, NY, 10964-8000 - USA
> -
>
>  @jeff
>> I think you are correct, we may have installed openmpi without VT support,
>> but is there anything we can do now?
>>
>> One more thing: I found this program but don't know how to run it:
>> http://www.cis.udel.edu/~pollock/367/manual/node35.html
>>
>> Thanks to all of you for putting in so much effort to help us out.


Re: [OMPI users] 100% CPU doing nothing!?

2009-04-22 Thread Terry Frankcombe
On Tue, 2009-04-21 at 19:19 -0700, Ross Boylan wrote:
> I'm using Rmpi (a pretty thin wrapper around MPI for R) on Debian Lenny
> (amd64).  My set up has a central calculator and a bunch of slaves to
> which work is distributed.
> 
> The slaves wait like this:
> mpi.send(as.double(0), doubleType, root, requestCode, comm=comm)
> request <- request+1
> cases <- mpi.recv(cases, integerType, root, mpi.any.tag(),
> comm=comm)
> 
> I.e., they do a simple send and then a receive.
> 
> It's possible there's no one to talk to, so it could be stuck at
> mpi.send or mpi.recv.
> 
> Are either of those operations that should chew up CPU?  At this point,
> I'm just trying to figure out where to look for the source of the
> problem.


The short response is:  why do you not want it to use the whole CPU
while waiting?

There's some discussion starting here:


If you really want to make your application run slower, you can yield
the CPU when not doing much:



Ciao
Terry

-- 
Dr. Terry Frankcombe
Research School of Chemistry, Australian National University
Ph: (+61) 0417 163 509    Skype: terry.frankcombe