Re: [OMPI users] segfault with -pernode on 1.4.2

2010-06-07 Thread Ralph Castain
Thanks for reporting this.

-pernode is just -npernode 1 - see the following ticket. Not sure when a fix 
will come out.

https://svn.open-mpi.org/trac/ompi/ticket/2431


On Jun 7, 2010, at 4:27 PM, S. Levent Yilmaz wrote:

> Dear All, 
> 
> I recently installed version 1.4.2 and am having a problem specific to this 
> version only (or so it seems). Before I lay out the details, please note that 
> I built 1.4.2 *exactly* the same way as I built 1.4.1: same compiler 
> options, same OpenIB and other system libraries, same configure options, 
> same everything. Version 1.4.1 doesn't have this issue.
> 
> The error message is the following: 
> 
> $ mpirun -pernode ./hello
> 
> [n90:21674] *** Process received signal ***
> [n90:21674] Signal: Segmentation fault (11)
> [n90:21674] Signal code: Address not mapped (1)
> [n90:21674] Failing at address: 0x50
> [n90:21674] [ 0] /lib64/libpthread.so.0 [0x3654a0e4c0]
> [n90:21674] [ 1] 
> /opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7)
>  [0x2b6b2f299b87]
> [n90:21674] [ 2] 
> /opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3ce)
>  [0x2b6b2f2baefe]
> [n90:21674] [ 3] 
> /opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd5)
>  [0x2b6b2f2ce1e5]
> [n90:21674] [ 4] /opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0 
> [0x2b6b2f2d17ee]
> [n90:21674] [ 5] mpirun [0x404cff]
> [n90:21674] [ 6] mpirun [0x403e48]
> [n90:21674] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3653e1d974]
> [n90:21674] [ 8] mpirun [0x403d79]
> [n90:21674] *** End of error message ***
> Segmentation fault
> 
> [n74:21733] [[41942,0],1] routed:binomial: Connection to lifeline 
> [[41942,0],0] lost  
> 
> This last line is from mpirun, not the executable. The executable is a 
> simple hello world. All is well without the -pernode flag. All is well even 
> when there is only one process per node (say, if so allocated over PBS) and 
> the -pernode flag is not used.
> 
> Attached is the information requested at http://www.open-mpi.org/community/help/ 
> (except for the InfiniBand-specific details). I'll be happy to provide those if 
> necessary, but note that the failure is the same if I use -mca btl self,tcp.
> 
> Thank you, 
> Levent
> 



[OMPI users] segfault with -pernode on 1.4.2

2010-06-07 Thread S. Levent Yilmaz
Dear All,

I recently installed version 1.4.2 and am having a problem specific to this
version only (or so it seems). Before I lay out the details, please note that
I built 1.4.2 *exactly* the same way as I built 1.4.1: same compiler
options, same OpenIB and other system libraries, same configure options,
same everything. Version 1.4.1 doesn't have this issue.

The error message is the following:

 $ mpirun -pernode ./hello

[n90:21674] *** Process received signal ***
[n90:21674] Signal: Segmentation fault (11)
[n90:21674] Signal code: Address not mapped (1)
[n90:21674] Failing at address: 0x50
[n90:21674] [ 0] /lib64/libpthread.so.0 [0x3654a0e4c0]
[n90:21674] [ 1]
/opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xa7)
[0x2b6b2f299b87]
[n90:21674] [ 2]
/opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x3ce)
[0x2b6b2f2baefe]
[n90:21674] [ 3]
/opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xd5)
[0x2b6b2f2ce1e5]
[n90:21674] [ 4] /opt/fermi/openmpi/1.4.2/intel/fast/lib/libopen-rte.so.0
[0x2b6b2f2d17ee]
[n90:21674] [ 5] mpirun [0x404cff]
[n90:21674] [ 6] mpirun [0x403e48]
[n90:21674] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3653e1d974]
[n90:21674] [ 8] mpirun [0x403d79]
[n90:21674] *** End of error message ***
Segmentation fault

[n74:21733] [[41942,0],1] routed:binomial: Connection to lifeline
[[41942,0],0] lost

This last line is from mpirun, not the executable. The executable is a
simple hello world. All is well without the -pernode flag. All is well even
when there is only one process per node (say, if so allocated over PBS) and
the -pernode flag is not used.
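
(For reference, a minimal MPI hello world along these lines should be enough
to reproduce this; the sketch below is an assumption about what the program
looks like, since the actual source is not included in this post. The
backtrace above shows the crash happening in mpirun itself during launch, so
the application body should not matter.)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &name_len);  /* report which node each rank landed on */
    printf("Hello from rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}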

Attached is the information requested at
http://www.open-mpi.org/community/help/ (except for the InfiniBand-specific
details). I'll be happy to provide those if necessary, but note that the
failure is the same if I use -mca btl self,tcp.

Thank you,
Levent


config.log.gz
Description: GNU Zip compressed data


ompi_info.all.gz
Description: GNU Zip compressed data


Re: [OMPI users] Behaviour of MPI_Cancel when using 'large' messages

2010-06-07 Thread Jovana Knezevic
Hello Gijsbert,

I had the same problem a few months ago. I could not even cancel messages
for which there was no matching receive on the other side (so they could
not have been received! :-)). I was really wondering what was going on...
I have some experience with MPI, but I am not an expert, and I would really
appreciate an explanation from the developers. While googling for a
potential solution, I found out that some implementations (not Open MPI)
do not allow cancelling at all, so I think one cannot rely on MPI_Cancel().
If I am right, the question is then: why implement it at all? Is the logic
"better ever than never"? :-) Use it when cancellation would help, but don't
really rely on it... ?! As I said, I am not an expert, but it would be
great to hear about this from the developers. If, however, YOU find any
solution, it would be great if you wrote about it on this list! Thanks
in advance.

Regards,
Jovana Knezevic

2010/6/7  :
> On Sat, Jun 5, 2010 at 2:44 PM, David Zhang  wrote:
>
>> Dear all:
>>
>> I'm using mpi_iprobe to serve as a way to send signals between different
>> mpi executables. I'm using the following test codes (fortran):
>>
>> #1
>> program send
>> implicit none
>>         include 'mpif.h'
>>
>> real*8 :: vec(2)=1.0
>> integer :: ierr,i=0,request(1)
>>
>>         call mpi_init(ierr)
>>         do
>>                 call mpi_isend(vec,2,mpi_real8,
>> 0,1,mpi_comm_world,request(1),ierr)
>>                 i=i+1
>>                 print *,i
>>                 vec=-vec
>>                 call usleep_fortran(2.d0)
>>                 call mpi_wait(request(1),MPI_STATUS_IGNORE,ierr)
>>         end do
>>
>> end program send
>> --
>> #2
>> program send
>> implicit none
>>         include 'mpif.h'
>>
>> real*8 :: vec(2)
>> integer :: ierr
>>
>>         call mpi_init(ierr)
>>         do
>>                 if(key_present()) then
>>                         call
>> mpi_recv(vec,2,mpi_real8,1,1,mpi_comm_world,MPI_STATUS_IGNORE,ierr)
>>                 end if
>>                 call usleep_fortran(0.05d0)
>>
>>         end do
>>
>> contains
>>
>> function key_present()
>> implicit none
>>   logical :: key_present
>>
>>         key_present = .false.
>>         call
>> mpi_iprobe(1,1,mpi_comm_world,key_present,MPI_STATUS_IGNORE,ierr)
>>         print *, key_present
>>
>> end function key_present
>>
>> end program send
>> ---
>> The usleep_fortran is a routine I've written to pause the program for that
>> amount of time (in seconds). As you can see, on the receiving end I'm
>> probing every 0.05 seconds to see whether the message has been received,
>> and each probe results in a print of the probing result, while the
>> sending is once every 2 seconds.
>>
>> Doing
>> mpirun -np 1 recv : -np 1 send
>>  Naturally I expect the output to be something like:
>>
>> 1
>> (forty or so F)
>> T
>> 2
>> (another forty or so F)
>> T
>> 3
>>
>> however this is the output I get:
>>
>> 1
>> (forty or so F)
>> T
>> 2
>> (about a two second delay)
>> T
>> 3
>>
>> It seems to me that after the first set of probes, once the message was
>> received, the non-blocking mpi probe becomes blocking for some strange
>> reason.  I'm using mpi_iprobe for the first time, so I'm not sure if I'm
>> doing something blatantly wrong.
>>
>>
>> --
>> David Zhang
>> University of California, San Diego
>>
>
>
>
> --
> David Zhang
> University of California, San Diego
> 

Re: [OMPI users] Process doesn't exit on remote machine when using hostfile

2010-06-07 Thread Shiqing Fan


Hi,


The hostfile seems to work for me on my Windows XP machines, but it 
should be the same on Windows 7. The problem you had looks to me more 
like a synchronization problem. Could you send me your test program?



Regards,
Shiqing

On 2010-5-25 11:41 AM, Rajnesh Jindel wrote:
disabled the firewall and using admin account so security isnt the 
issue here. like I said this problem only occurs when using a 
hostfile,  if I actually specify the hostname on the commandfline it 
works fine
On 25 May 2010 09:08, Shiqing Fan wrote:


Hi,

What's the firewall setting on the remote node? Could you try to
add an exception for the application, or turn off the firewall
completely?

Regards,
Shiqing




On 2010-5-24 4:44 PM, Rajnesh Jindel wrote:

When I specify the hosts separately on the command line, as
follows, the process completes as expected:
mpirun -np 8 -host remotehost,localhost myapp
Output appears for the localhost and a text file is created on the
remotehost.

However, when I use a hostfile the remote processes never
complete. I can see the output from the local processes, and by
remote login I can see that the processes are being started on
the remote machine, but they never complete.

This is a simple reduce example using Boost.MPI (v1.43). I'm using
Windows 7 x64 Pro on both machines and Open MPI 1.4.2; the hostfile
and the app are in the same location on both machines.

Any idea why this is happening?

Raj



--
Shiqing Fan                      http://www.hlrs.de/people/fan
High Performance Computing       Tel.: +49 711 685 87234
  Center Stuttgart (HLRS)        Fax.: +49 711 685 65832
Address: Allmandring 30          email: f...@hlrs.de
70569 Stuttgart



Re: [hwloc-users] Is OSX a supported platform ?

2010-06-07 Thread Samuel Thibault
Wheeler, Kyle Bruce wrote on Mon 07 Jun 2010 13:00:48 -0600:
> True; but you can make each "cpu" a thread set ID.

Ok, that's what I feared :)

The problem is that you don't control _location_ at all, so yes, this
really seems like lying too much :)

Samuel


Re: [OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Sorry, I just want to add two more things:
+ I tried configuring with and without --enable-ft-thread, but nothing changed.
+ I also applied the patch for Open MPI below and reinstalled, but I got the
same error:
https://svn.open-mpi.org/trac/ompi/raw-attachment/ticket/2139/v1.4-preload-part1.diff

Can somebody help? Thank you very much.

Nguyen Toan

On Mon, Jun 7, 2010 at 11:51 PM, Nguyen Toan wrote:

> Hello everyone,
>
> I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes
> but it failed to restart (Segmentation fault).
> Here are the details concerning my problem:
>
> + OS: Centos 5.4
> + OpenMPI configure:
> ./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
> --with-blcr=/home/nguyen/opt/blcr
> --with-blcr-libdir=/home/nguyen/opt/blcr/lib \
> --prefix=/home/nguyen/opt/openmpi \
> --enable-mpirun-prefix-by-default
> + mpirun -am ft-enable-cr -machinefile host ./test
>
> I checkpointed the test program using "ompi-checkpoint -v -s PID" and the
> checkpoint file was created successfully. However it failed to restart using
> ompi-restart:
> *"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
> exited on signal 11 (Segmentation fault)"
> *
> Did I miss something in the installation of OpenMPI?
>
> Regards,
> Nguyen Toan
>


[OMPI users] ompi-restart failed

2010-06-07 Thread Nguyen Toan
Hello everyone,

I'm using OpenMPI 1.4.2 with BLCR 0.8.2 to test checkpointing on 2 nodes but
it failed to restart (Segmentation fault).
Here are the details concerning my problem:

+ OS: Centos 5.4
+ OpenMPI configure:
./configure --with-ft=cr --enable-ft-thread --enable-mpi-threads \
--with-blcr=/home/nguyen/opt/blcr
--with-blcr-libdir=/home/nguyen/opt/blcr/lib \
--prefix=/home/nguyen/opt/openmpi \
--enable-mpirun-prefix-by-default
+ mpirun -am ft-enable-cr -machinefile host ./test

I checkpointed the test program using "ompi-checkpoint -v -s PID" and the
checkpoint file was created successfully. However it failed to restart using
ompi-restart:
*"mpirun noticed that process rank 0 with PID 21242 on node rc014.local
exited on signal 11 (Segmentation fault)"
*
Did I miss something in the installation of OpenMPI?

Regards,
Nguyen Toan


[OMPI users] ompi-restart, ompi-ps problem

2010-06-07 Thread Nguyen Kim Son
Hello,

I'm trying to get functions like orte-checkpoint, orte-restart, ... to work, but
there are some errors that I don't have any clue about.

BLCR (0.8.2) apparently works fine, and I have installed Open MPI 1.4.2 from
source with the BLCR option.
The command
mpirun -np 4  -am ft-enable-cr ./checkpoint_test
seemed OK, but
orte-checkpoint --term PID_of_checkpoint_test (PID obtained after ps -ef |
grep mpirun)
does not return and shows nothing that looks like an error!

Then I checked with
ompi-ps
and this time I get:
oob-tcp: Communication retries exceeded.  Can not communicate with peer

Does anyone have the same problem?
Any idea is welcome!
Thanks,
Son.


-- 
-
Son NGUYEN KIM
Antibes 06600
Tel: 06 48 28 37 47


Re: [OMPI users] [sge::tight-integration] slot scheduling and resources handling

2010-06-07 Thread Eloi Gaudry
Hi Reuti,

I've been unable to reproduce the issue so far.

Sorry for the inconvenience,
Eloi

On Tuesday 25 May 2010 11:32:44 Reuti wrote:
> Hi,
> 
> Am 25.05.2010 um 09:14 schrieb Eloi Gaudry:
> > I do not reset any environment variable during job submission or job
> > handling. Is there a simple way to check that openmpi is working as
> > expected with SGE tight integration (as displaying environment
> > variables, setting options on the command line, etc. ) ?
> 
> a) put a command:
> 
> env
> 
> in the jobscript and check the output for $JOB_ID and various $SGE_*
> variables.
> 
> b) to confirm the misbehavior: are the tasks on the slave nodes kids of
> sge_shepherd or any system sshd/rshd?
> 
> -- Reuti
> 
> > Regards,
> > Eloi
> > 
> > On Friday 21 May 2010 17:35:24 Reuti wrote:
> >> Hi,
> >> 
> >> Am 21.05.2010 um 17:19 schrieb Eloi Gaudry:
> >>> Hi Reuti,
> >>> 
> >>> Yes, the openmpi binaries used were build after having used the
> >>> --with-sge during configure, and we only use those binaries on our
> >>> cluster.
> >>> 
> >>> [eg@moe:~]$ /opt/openmpi-1.3.3/bin/ompi_info
> >>> 
> >>>MCA ras: gridengine (MCA v2.0, API v2.0, Component
> >>>v1.3.3)
> >> 
> >> ok. As you have a Tight Integration as goal and set in your PE
> >> "control_slaves TRUE", SGE wouldn't allow `qrsh -inherit ...` to nodes
> >> which are not in the list of granted nodes. So it looks like your job
> >> is running outside of this Tight Integration with its own `rsh` or
> >> `ssh`.
> >> 
> >> Do you reset $JOB_ID or other environment variables in your jobscript,
> >> which could trigger Open MPI to assume that it's not running inside SGE?
> >> 
> >> -- Reuti
> >> 
> >>> On Friday 21 May 2010 16:01:54 Reuti wrote:
>  Hi,
>  
>  Am 21.05.2010 um 14:11 schrieb Eloi Gaudry:
> > Hi there,
> > 
> > I'm observing something strange on our cluster managed by SGE6.2u4
> > when launching a parallel computation on several nodes, using
> > OpenMPI/SGE tight- integration mode (OpenMPI-1.3.3). It seems that
> > the SGE allocated slots are not used by OpenMPI, as if OpenMPI was
> > doing is own
> > round-robin allocation based on the allocated node hostnames.
>  
>  you compiled Open MPI with --with-sge (and recompiled your
>  applications)? You are using the correct mpiexec?
>  
>  -- Reuti
>  
> > Here is what I'm doing:
> > - launch a parallel computation involving 8 processors, using for
> > each of them 14GB of memory. I'm using a qsub command where i
> > request memory_free resource and use tight integration with openmpi
> > - 3 servers are available:
> > . barney with 4 cores (4 slots) and 32GB
> > . carl with 4 cores (4 slots) and 32GB
> > . charlie with 8 cores (8 slots) and 64GB
> > 
> > Here is the output of the allocated nodes (OpenMPI output):
> > ==   ALLOCATED NODES   ==
> > 
> > Data for node: Name: charlie   Launch id: -1 Arch: ffc91200  State: 2
> > 
> > Daemon: [[44332,0],0] Daemon launched: True
> > Num slots: 4  Slots in use: 0
> > Num slots allocated: 4  Max slots: 0
> > Username on node: NULL
> > Num procs: 0  Next node_rank: 0
> > 
> > Data for node: Name: carl.fft  Launch id: -1 Arch: 0 State: 2
> > 
> > Daemon: Not defined Daemon launched: False
> > Num slots: 2  Slots in use: 0
> > Num slots allocated: 2  Max slots: 0
> > Username on node: NULL
> > Num procs: 0  Next node_rank: 0
> > 
> > Data for node: Name: barney.fft  Launch id: -1 Arch: 0 State: 2
> > 
> > Daemon: Not defined Daemon launched: False
> > Num slots: 2  Slots in use: 0
> > Num slots allocated: 2  Max slots: 0
> > Username on node: NULL
> > Num procs: 0  Next node_rank: 0
> > 
> > =
> > 
> > Here is what I see when my computation is running on the cluster:
> > # rank   pid  hostname
> > 
> >   0 28112  charlie
> >   1 11417  carl
> >   2 11808  barney
> >   3 28113  charlie
> >   4 11418  carl
> >   5 11809  barney
> >   6 28114  charlie
> >   7 11419  carl
> > 
> > Note that -the parallel environment used under SGE is defined as:
> > [eg@moe:~]$ qconf -sp round_robin
> > pe_name            round_robin
> > slots              32
> > user_lists         NONE
> > xuser_lists        NONE
> > start_proc_args    /bin/true
> > stop_proc_args     /bin/true
> > allocation_rule    $round_robin
> > control_slaves     TRUE
> > job_is_first_task  FALSE
> > urgency_slots      min
> > accounting_summary FALSE
> > 
> > I'm wondering why OpenMPI didn't use the allocated 

[OMPI users] Behaviour of MPI_Cancel when using 'large' messages

2010-06-07 Thread Gijsbert Wiesenekker
The following code tries to send a message, but if it takes too long the 
message is cancelled:

  #define DEADLOCK_ABORT   (30.0)

  MPI_Isend(message, count, MPI_BYTE, comm_id,
    MPI_MESSAGE_TAG, MPI_COMM_WORLD, &request);

  t0 = time(NULL);
  cancelled = FALSE;

  while(TRUE)
  {
    //do some work

    //test if message is delivered or cancelled
    MPI_Test(&request, &flag, &status);
    if (flag) break;

    //test if it takes too long
    t1 = time(NULL);
    wall = difftime(t1, t0);
    if (!cancelled && (wall > DEADLOCK_ABORT))
    {
      MPI_Cancel(&request);
      cancelled = TRUE;
      my_printf("cancelled!\n");
    }
  }

Now if I use a message size of about 5000 bytes and the message has not been
delivered after DEADLOCK_ABORT seconds, the MPI_Cancel is executed, but
MPI_Test still never returns TRUE, so it looks like the message cannot be
cancelled for some reason.
I am using OpenMPI 1.4.2 on Fedora Core 13.
Any ideas?
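
(For reference, the pattern the MPI standard describes for checking whether a
cancel actually took effect looks roughly like the sketch below; the variable
names are assumed to match the fragment above, and this is not a confirmed
fix for the hang reported here. Cancelling a send is explicitly allowed to
fail once delivery has started, the request must still be completed, and
MPI_Test_cancelled() on the resulting status reports the outcome.)

  /* Sketch: check whether a previously issued MPI_Cancel succeeded.
     'request' is assumed to be the handle from the MPI_Isend above. */
  MPI_Status status;
  int was_cancelled = 0;

  MPI_Cancel(&request);
  MPI_Wait(&request, &status);                  /* the request must still be completed */
  MPI_Test_cancelled(&status, &was_cancelled);  /* nonzero means the cancel succeeded */
  if (was_cancelled)
    my_printf("send was cancelled\n");
  else
    my_printf("cancel had no effect; the message was (or will be) delivered\n");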

Thanks,
Gijsbert