Re: [MTT users] MTT trivial tests fail to complete on CentOS 5.3 x86_64 platform with OFED 1.5

2009-11-18 Thread Ethan Mallove
Could you run with --debug (instead of --verbose), and send the
output.

Thanks,
Ethan
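
For reference, a minimal sketch of the rerun being asked for, assuming the
same ini files and client path as in the report below (teeing to a log file
is just a convenience, not an MTT requirement):

    # rerun the trivial section with full debug output and keep a copy of it
    cat developer.ini trivial.ini | ../client/mtt --debug 2>&1 | tee mtt-debug.log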

On Wed, Nov/18/2009 11:08:18AM, Venkat Venkatsubra wrote:
> 
> 
> 
> 
>From: Venkat Venkatsubra
>Sent: Wednesday, November 18, 2009 12:54 PM
>To: 'mtt-us...@open-mpi.org'
>Subject: MTT trivial tests fail to complete on CentOS 5.3 x86_64 platform
>with OFED 1.5
> 
> 
> 
>Hello All,
> 
> 
> 
>How do I debug this problem? Attached are the developer.ini and
>trivial.ini files.
> 
>I can provide any other information that you need.
> 
> 
> 
>[root@samples]# cat /etc/issue
> 
>CentOS release 5.3 (Final)
> 
>Kernel \r on an \m
> 
> 
> 
>[root@samples]# uname -a
> 
>Linux 2.6.18-128.el5 #1 SMP Wed Jan 21 10:41:14 EST 2009 x86_64 x86_64
>x86_64 GNU/Linux
> 
> 
> 
>I am running OFED-1.5-20091029-0617 daily build.
> 
> 
> 
>Started trivial tests using the following command:
> 
> 
> 
>[root@samples]# cat developer.ini trivial.ini | ../client/mtt --verbose -
> 
>
> 
>
> 
> >> Initializing reporter module: TextFile
> 
> *** Reporter initialized
> 
> *** MPI Get phase starting
> 
> >> MPI Get: [mpi get: my installation]
> 
>Checking for new MPI sources...
> 
>Using MPI in: /usr/mpi/gcc/openmpi-1.3.2/
> 
> *** WARNING: alreadyinstalled_mpi_type was not specified, defaulting to
> 
> "OMPI".
> 
>Got new MPI sources: version 1.3.2
> 
> *** MPI Get phase complete
> 
> *** MPI Install phase starting
> 
> >> MPI Install [mpi install: my installation]
> 
>Installing MPI: [my installation] / [1.3.2] / [my installation]...
> 
> >> Reported to text file
> 
> 
>/root/mtt-svn/samples/MPI_Install-my_installation-my_installation-1.3.2.html
> 
> >> Reported to text file
> 
> 
>/root/mtt-svn/samples/MPI_Install-my_installation-my_installation-1.3.2.txt
> 
>Completed MPI Install successfully
> 
> *** MPI Install phase complete
> 
> *** Test Get phase starting
> 
> >> Test Get: [test get: trivial]
> 
>Checking for new test sources...
> 
>Got new test sources
> 
> *** Test Get phase complete
> 
> *** Test Build phase starting
> 
> >> Test Build [test build: trivial]
> 
>Building for [my installation] / [1.3.2] / [my installation] /
>[trivial]
> 
> >> Reported to text file
> 
>   /root/mtt-svn/samples/Test_Build-trivial-my_installation-1.3.2.html
> 
> >> Reported to text file
> 
>   /root/mtt-svn/samples/Test_Build-trivial-my_installation-1.3.2.txt
> 
>Completed test build successfully
> 
> *** Test Build phase complete
> 
> *** Test Run phase starting
> 
> >> Test Run [trivial]
> 
> >> Running with [my installation] / [1.3.2] / [my installation]
> 
>Using MPI Details [open mpi] with MPI Install [my installation]
> 
> 
> 
>During this stage the test stalls.
> 
>After about 10 minutes the test gets killed.
> 
>dmesg on the node where the test is running shows the following output:
> 
> 
> 
> ==
> 
> Dmesg output
> 
> ==
> 
> Out of memory: Killed process 5346 (gdmgreeter).
> 
> audispd invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
> 
> 
> 
> Call Trace:
> 
>  [] out_of_memory+0x8e/0x2f5
> 
>  [] __alloc_pages+0x245/0x2ce
> 
>  [] __do_page_cache_readahead+0x95/0x1d9
> 
>  [] sock_readv+0xb7/0xd1
> 
>  [] __wake_up_common+0x3e/0x68
> 
>  [] filemap_nopage+0x148/0x322
> 
>  [] __handle_mm_fault+0x1f8/0xe5c
> 
>  [] do_page_fault+0x4cb/0x830
> 
>  [] error_exit+0x0/0x84
> 
> 
> 
>Thanks!
> 
> 
> 
>Venkat
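
Given the oom-killer trace above, it may also be worth watching memory on the
node while the Test Run phase is stalled. A hedged sketch with stock tools
(nothing here is MTT-specific):

    # overall memory, refreshed every 5 seconds, while the trivial test runs
    watch -n 5 free -m
    # largest resident processes, to see what is actually growing
    ps axo pid,rss,comm --sort=-rss | head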



> ___
> mtt-users mailing list
> mtt-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users



Re: [OMPI users] Changing location where checkpoints are saved

2009-11-18 Thread Constantinos Makassikis

Josh Hursey wrote:

(Sorry for the excessive delay in replying)

On Sep 30, 2009, at 11:02 AM, Constantinos Makassikis wrote:


Thanks for the reply!

Concerning the mca options for checkpointing:
- are verbosity options (e.g. crs_base_verbose) limited to the values 0 and 1?
- in priority options (e.g. crs_blcr_priority), do lower numbers
indicate higher priority?


By searching in the archives of the mailing list I found two 
interesting/useful posts:
- [1] http://www.open-mpi.org/community/lists/users/2008/09/6534.php 
(for different checkpointing schemes)
- [2] http://www.open-mpi.org/community/lists/users/2009/05/9385.php 
(for restarting)


Following the indications given in [1], I tried to make each process
checkpoint itself in its local /tmp and centralize the resulting
checkpoints in /tmp or $HOME:

Excerpt from mca-params.conf:
-
snapc_base_store_in_place=0
snapc_base_global_snapshot_dir=/tmp or $HOME
crs_base_snapshot_dir=/tmp

COMMANDS used:
--
mpirun -n 2 -machinefile machines -am ft-enable-cr a.out
ompi-checkpoint mpirun_pid



OUTPUT of ompi-checkpoint -v 16753
--
[ic85:17044] orte_checkpoint: Checkpointing...
[ic85:17044] PID 17036
[ic85:17044] Connected to Mpirun [[42098,0],0]
[ic85:17044] orte_checkpoint: notify_hnp: Contact Head Node Process 
PID 17036
[ic85:17044] orte_checkpoint: notify_hnp: Requested a checkpoint of 
jobid [INVALID]

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Requested - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Pending - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044]   Running - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] File Transfer - Global Snapshot Reference: 
(null)

[ic85:17044] orte_checkpoint: hnp_receiver: Receive a command message.
[ic85:17044] orte_checkpoint: hnp_receiver: Status Update.
[ic85:17044] Error - Global Snapshot Reference: 
ompi_global_snapshot_17036.ckpt




OUTPUT of MPIRUN


[ic85:17038] crs:blcr: blcr_checkpoint_peer: Thread finished with 
status 3
[ic86:20567] crs:blcr: blcr_checkpoint_peer: Thread finished with 
status 3
-- 


WARNING: Could not preload specified file: File already exists.

Fileset: /tmp/ompi_global_snapshot_17036.ckpt/0
Host: ic85

Will continue attempting to launch the process.

-- 


[ic85:17036] filem:rsh: wait_all(): Wait failed (-1)
[ic85:17036] [[42098,0],0] ORTE_ERROR_LOG: Error in file 
../../../../../orte/mca/snapc/full/snapc_full_global.c at line 1054


This is a warning about creating the global snapshot directory 
(ompi_global_snapshot_17036.ckpt) for the first checkpoint (seq 0). It 
seems to indicate that the directory existed when the file gather 
started.


A couple things to check:
 - Did you clean out the /tmp on all of the nodes with any files 
starting with "opal" or "ompi"?
 - Does the error go away when you set 
(snapc_base_global_snapshot_dir=$HOME)?
 - Could you try running against a v1.3 release? (I wonder if this 
feature has been broken on the trunk)


Let me know what you find. In the next couple days, I'll try to test 
the trunk again with this feature to make sure that it is still 
working on my test machines.
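
As a concrete illustration of those checks, a rough sketch (the pdsh loop is
only one way to reach both nodes; the mca settings just restate the quoted
mca-params.conf on the mpirun command line):

    # clear stale snapshot/session files left in /tmp on both nodes
    pdsh -w ic85,ic86 'rm -rf /tmp/opal* /tmp/ompi*'

    # retry with the global snapshot directory on $HOME instead of /tmp
    mpirun -n 2 -machinefile machines -am ft-enable-cr \
           -mca snapc_base_store_in_place 0 \
           -mca snapc_base_global_snapshot_dir $HOME \
           -mca crs_base_snapshot_dir /tmp \
           a.out
    ompi-checkpoint -v <mpirun_pid>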


-- Josh

Hello Josh,

I have switched to v1.3 and re-run with
snapc_base_global_snapshot_dir set to /tmp or to $HOME,
with a clean /tmp.

In both cases I get the same error as before :-(

I don't know if the following can be of any help, but after ompi-checkpoint
returns there is only a copy of the checkpoint of the rank 0 process in
the global snapshot directory:

$(snapc_base_global_snapshot_dir)/ompi_global_snapshot_.ckpt/0

So I guess the error occurs during the remote copy phase.
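
One way to narrow that down is to compare what reached the global snapshot
directory with what each node still holds locally; a sketch with hypothetical
paths (the opal_snapshot_*.ckpt naming for the per-process local snapshots is
an assumption worth verifying on your install):

    # what actually arrived in the global snapshot directory for seq 0
    ls -lR $HOME/ompi_global_snapshot_<mpirun_pid>.ckpt/0
    # what each node kept locally under crs_base_snapshot_dir
    ls -l /tmp/opal_snapshot_*.ckpt
    ssh ic86 'ls -l /tmp/opal_snapshot_*.ckpt'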

--
Constantinos







Does anyone have an idea about what is wrong?


Best regards,

--
Constantinos



Josh Hursey wrote:
This is described in the C/R User's Guide attached to the webpage 
below:

 https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

Additionally this has been addressed on the users mailing list in 
the past, so searching around will likely turn up some examples.


-- Josh

On Sep 18, 2009, at 11:58 AM, Constantinos Makassikis wrote:


Dear all,

I have installed blcr 0.8.2 and Open MPI (r21973) on my NFS 
account. By default,
it seems that 

Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Ashley Pittman
On Wed, 2009-11-18 at 01:28 -0800, Bill Broadley wrote:
> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.
> 
> That led me to this thread.

If you're investigating hangs in a parallel job, take a look at the tool
linked below (padb); it should be able to give you a parallel stack
trace and the message queues for the job.

http://padb.pittman.org.uk/full-report.html
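
As a rough sketch of what that looks like in practice (the flag spelling is
from memory of the padb documentation, so treat it as an assumption and check
the page above for the exact syntax):

    # full report (parallel stack traces plus message queues) for a hung job
    padb --full-report=<jobid>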

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



[OMPI users] Building Name Service for Intercommunication

2009-11-18 Thread Alexander Gordeyev
Hi all,

I hope this list is a good place to start.

I'm struggling with "Example 3: Building Name Service for
Intercommunication" (on page 217) from "MPI: A Message-Passing
Interface Standard Version 2.1".

1. There is an error in line 29 on page 217: the "server_key_val" integer
should be "server_keyval".
2. The array buffer[10] on line 46 of page 220 does not get initialized
for non-leaders, so MPI_Intercomm_create on lines 13-14 of page 221 gets
a corrupted remote leader rank.

Because of the latter bug and the lack of en_queue/de_queue routines, I do
not quite see the point of the example. Is there a working copy
available?


-- 
With best regards!


Re: [OMPI users] Antw: Re: mpirun not working on more than one node

2009-11-18 Thread Laurin Müller
Thanks, that's it!
 
It would have been straightforward, but there are a lot of things to
consider when setting up a cluster for the first time, and it is easy to
overlook one of them.
 
Anyway, thanks for your help.

>>> Ralph Castain  18.11.2009 15:57 >>>
Bingo! This is why we ask for info on how you configure OMPI :-)

You need to rebuild OMPI with --enable-heterogeneous. Because there is
additional overhead associated with running hetero configurations, and
so few people do so, it is disabled by default.


On Nov 18, 2009, at 2:55 AM, Laurin Müller wrote:



Now I have the same Open MPI version, 1.3.2, on both nodes.
 
I recalculated on both nodes and it works again on each node separately:
 
node1:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
mpirun (Open MPI) 1.3.2
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$
mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 4
/mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
1: pi = 0.798498008827023
2: pi = 0.773339953424083
3: pi = 0.747089984650041
0: pi = 0.822248040052981
pi = 3.141175986954128
node2 (PS3):
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
mpirun (Open MPI) 1.3.2
[...]
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
Input number of intervals:
20
0: pi = 1.595587993477064
1: pi = 1.545587993477064
pi = 3.141175986954128
But when I start it on node1 with more than 16 processes and the hostfile,
I get these errors:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 17
/mnt/projects/PS3Cluster/Benchmark/pi
--
This installation of Open MPI was configured without support for
heterogeneous architectures, but at least one node in the allocation
was detected to have a different architecture. The detected node was:
 
Node: bioclust
 
In order to operate in a heterogeneous environment, please reconfigure
Open MPI with --enable-heterogeneous.
--
--
It looks like MPI_INIT failed for some reason; your parallel process
is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  ompi_proc_set_arch failed
  --> Returned "Not supported" (-8) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1239] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1240] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1241] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1242] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1244] Abort before MPI_INIT completed
 successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1245] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1246] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1247] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1248] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were 

Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Eugene Loh




Vincent Loechner wrote:
> Bill,
>
> > A rather stable production code that has worked with various versions of MPI
> > on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3
>
> Probably this bug:
> https://svn.open-mpi.org/trac/ompi/ticket/2043
>
> Waiting for a correction, try adding this option to mpirun:
> -mca btl_sm_num_fifos 5

Bill, I noticed you updated the ticket.  Thank you.  I've been working
on this in earnest.  Something funny is going on as far as the "memory
model" goes: values written to the shared-memory FIFOs go goofy.
Like a FIFO slot that was initialized to be free and still "should be"
free, looks occupied when a writer checks, but it's empty immediately
thereafter even though no one "presumably" has accessed that location. 
I almost have a stand-alone program (C only, no OMPI infrastructure)
that demonstrates the problem, but I'm not quite there.  Then, it'll
either become evident to me what's wrong or I'll be able to show other
people more easily why I think something is wrong.  At this point, I
really have no idea if the problem is GCC 4.4.x or OMPI 1.3.x.




Re: [OMPI users] Antw: Re: mpirun not working on more than one node

2009-11-18 Thread Ralph Castain
Bingo! This is why we ask for info on how you configure OMPI :-)

You need to rebuild OMPI with --enable-heterogeneous. Because there is 
additional overhead associated with running hetero configurations, and so few 
people do so, it is disabled by default.
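
In concrete terms that means something like the sketch below (the prefix and
build directory are hypothetical; ompi_info is a handy way to confirm the new
build afterwards):

    # reconfigure and rebuild Open MPI on the nodes with heterogeneous support
    ./configure --prefix=/opt/openmpi-1.3.2 --enable-heterogeneous
    make all install
    # verify the rebuilt installation
    ompi_info | grep -i heterogeneous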


On Nov 18, 2009, at 2:55 AM, Laurin Müller wrote:

> Now I have the same Open MPI version, 1.3.2, on both nodes.
>  
> I recalculated on both nodes and it works again on each node separately:
>  
> node1:
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
> mpirun (Open MPI) 1.3.2
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile 
> /etc/openmpi/openmpi-default-hostfile -np 4 
> /mnt/projects/PS3Cluster/Benchmark/pi
> Input number of intervals:
> 20
> 1: pi = 0.798498008827023
> 2: pi = 0.773339953424083
> 3: pi = 0.747089984650041
> 0: pi = 0.822248040052981
> pi = 3.141175986954128
> node2 (PS3):
> root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
> mpirun (Open MPI) 1.3.2
> [...]
> root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
> Input number of intervals:
> 20
> 0: pi = 1.595587993477064
> 1: pi = 1.545587993477064
> pi = 3.141175986954128
> But when I start it on node1 with more than 16 processes and the hostfile,
> I get these errors:
> cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile 
> /etc/openmpi/openmpi-default-hostfile -np 17 
> /mnt/projects/PS3Cluster/Benchmark/pi
> --
> This installation of Open MPI was configured without support for
> heterogeneous architectures, but at least one node in the allocation
> was detected to have a different architecture. The detected node was:
>  
> Node: bioclust
>  
> In order to operate in a heterogeneous environment, please reconfigure
> Open MPI with --enable-heterogeneous.
> --
> --
> It looks like MPI_INIT failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or environment
> problems.  This failure appears to be an internal failure; here's some
> additional information (which may only be relevant to an Open MPI
> developer):
>  
>   ompi_proc_set_arch failed
>   --> Returned "Not supported" (-8) instead of "Success" (0)
> --
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1239] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1240] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1241] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1242] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1244] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1245] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1246] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1247] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [bioclust:1248] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed!
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL 

Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Vincent Loechner

Bill,

> A rather stable production code that has worked with various versions of MPI
> on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3

Probably this bug:
https://svn.open-mpi.org/trac/ompi/ticket/2043

Waiting for a correction, try adding this option to mpirun:
-mca btl_sm_num_fifos 5

--Vincent
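
Spelled out as a full command line, using the ./billtest reproduction that
appears later in this thread (the binary name is only an example):

    # workaround from ticket 2043: give the sm BTL more receive FIFOs so that
    # concurrent senders do not share one
    mpirun -np 4 -mca btl_sm_num_fifos 5 ./billtest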


Re: [OMPI users] Segmentation fault whilst running RaXML-MPI

2009-11-18 Thread Nick Holway
Dear all,

A quick follow up in aid of Google.

Upgrading the Intel compilers made no difference to the error message.

I contacted the researcher who wrote it, who told me that the problem
was likely the Intel compilers over-optimising the code, and he
suggested using GCC, which worked. He also pointed me to newer versions
of RAxML, which are available at
http://wwwkramer.in.tum.de/exelixis/software.html

Nick
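
When switching compilers like this, one quick sanity check (generic Open MPI
wrapper behaviour, not specific to RAxML) is to ask the wrapper compiler what
it will actually invoke:

    # show the underlying compiler and flags the Open MPI wrapper will use
    mpicc --showme
    # the backend compiler can be overridden per invocation, e.g. to use gcc
    OMPI_CC=gcc mpicc --showme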

2009/11/6 Nick Holway :
> Hi,
>
> Thank you for the information, I'm going to try the new Intel
> Compilers which I'm downloading now, but as they're taking so long to
> download I don't think I'm going to be able to look into this again
> until after the weekend. BTW using their java-based downloader is a
> bit less painful than their normal download.
>
> In the meantime, if anyone else has some suggestions then please let me know.
>
> Thanks
>
> Nick
>
> 2009/11/5 Jeff Squyres :
>> FWIW, I think Intel released 11.1.059 earlier today (I've been trying to
>> download it all morning).  I doubt it's an issue in this case, but I thought
>> I'd mention it as a public service announcement.  ;-)
>>
>> Seg faults are *usually* an application issue (never say "never", but they
>> *usually* are).  You might want to first contact the RaXML team to see if
>> there are any known issues with their software and Open MPI 1.3.3...?
>>  (Sorry, I'm totally unfamiliar with RaXML)
>>
>> On Nov 5, 2009, at 12:30 PM, Nick Holway wrote:
>>
>>> Dear all,
>>>
>>> I'm trying to run RaXML 7.0.4 on my 64bit Rocks 5.1 cluster (ie Centos
>>> 5.2). I compiled Open MPI 1.3.3 using the Intel compilers v 11.1.056
>>> using ./configure CC=icc CXX=icpc F77=ifort FC=ifort --with-sge
>>> --prefix=/usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man
>>> --with-memory-manager=none.
>>>
>>> When I run RaXML in a qlogin session using
>>> /usr/prog/mpi/openmpi/1.3.3/x86_64-no-mem-man/bin/mpirun -np 8
>>> /usr/prog/bioinformatics/RAxML/7.0.4/x86_64/RAxML-7.0.4/raxmlHPC-MPI
>>> -f a -x 12345 -p12345 -# 10 -m GTRGAMMA -s
>>> /users/holwani1/jay/ornodko-1582 -n mpitest39
>>>
>>> I get the following output:
>>>
>>> This is the RAxML MPI Worker Process Number: 1
>>> This is the RAxML MPI Worker Process Number: 3
>>>
>>> This is the RAxML MPI Master process
>>>
>>> This is the RAxML MPI Worker Process Number: 7
>>>
>>> This is the RAxML MPI Worker Process Number: 4
>>>
>>> This is the RAxML MPI Worker Process Number: 5
>>>
>>> This is the RAxML MPI Worker Process Number: 2
>>>
>>> This is the RAxML MPI Worker Process Number: 6
>>> IMPORTANT WARNING: Alignment column 1695 contains only undetermined
>>> values which will be treated as missing data
>>>
>>>
>>> IMPORTANT WARNING: Sequences A4_H10 and A3ii_E11 are exactly identical
>>>
>>>
>>> IMPORTANT WARNING: Sequences A2_A08 and A9_C10 are exactly identical
>>>
>>>
>>> IMPORTANT WARNING: Sequences A3ii_B03 and A3ii_C06 are exactly identical
>>>
>>>
>>> IMPORTANT WARNING: Sequences A9_D08 and A9_F10 are exactly identical
>>>
>>>
>>> IMPORTANT WARNING: Sequences A3ii_F07 and A9_C08 are exactly identical
>>>
>>>
>>> IMPORTANT WARNING: Sequences A6_F05 and A6_F11 are exactly identical
>>>
>>> IMPORTANT WARNING
>>> Found 6 sequences that are exactly identical to other sequences in the
>>> alignment.
>>> Normally they should be excluded from the analysis.
>>>
>>>
>>> IMPORTANT WARNING
>>> Found 1 column that contains only undetermined values which will be
>>> treated as missing data.
>>> Normally these columns should be excluded from the analysis.
>>>
>>> An alignment file with undetermined columns and sequence duplicates
>>> removed has already
>>> been printed to file /users/holwani1/jay/ornodko-1582.reduced
>>>
>>>
>>> You are using RAxML version 7.0.4 released by Alexandros Stamatakis in
>>> April 2008
>>>
>>> Alignment has 1280 distinct alignment patterns
>>>
>>> Proportion of gaps and completely undetermined characters in this
>>> alignment: 0.124198
>>>
>>> RAxML rapid bootstrapping and subsequent ML search
>>>
>>>
>>> Executing 10 rapid bootstrap inferences and thereafter a thorough ML
>>> search
>>>
>>> All free model parameters will be estimated by RAxML
>>> GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter
>>> GAMMA Model parameters will be estimated up to an accuracy of
>>> 0.10 Log Likelihood units
>>>
>>> Partition: 0
>>> Name: No Name Provided
>>> DataType: DNA
>>> Substitution Matrix: GTR
>>> Empirical Base Frequencies:
>>> pi(A): 0.261129 pi(C): 0.228570 pi(G): 0.315946 pi(T): 0.194354
>>>
>>>
>>> Switching from GAMMA to CAT for rapid Bootstrap, final ML search will
>>> be conducted under the GAMMA model you specified
>>> Bootstrap[10]: Time 44.442728 bootstrap likelihood -inf, best
>>> rearrangement setting 5
>>> Bootstrap[0]: Time 44.814948 bootstrap likelihood -inf, best
>>> rearrangement setting 5
>>> Bootstrap[6]: Time 46.470371 bootstrap likelihood -inf, best
>>> rearrangement setting 6
>>> 

[OMPI users] Antw: Re: mpirun not working on more than one node

2009-11-18 Thread Laurin Müller
Now I have the same Open MPI version, 1.3.2, on both nodes.
 
I recalculated on both nodes and it works again on each node separately:
 
node1:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --version
mpirun (Open MPI) 1.3.2
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$
mpirun --hostfile /etc/openmpi/openmpi-default-hostfile -np 4
/mnt/projects/PS3Cluster/Benchmark/pi
Input number of intervals:
20
1: pi = 0.798498008827023
2: pi = 0.773339953424083
3: pi = 0.747089984650041
0: pi = 0.822248040052981
pi = 3.141175986954128
node2 (PS3):
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun --version
mpirun (Open MPI) 1.3.2
[...]
root@kasimir:/mnt/projects/PS3Cluster/Benchmark# mpirun -np 2 pi
Input number of intervals:
20
0: pi = 1.595587993477064
1: pi = 1.545587993477064
pi = 3.141175986954128
But when I start it on node1 with more than 16 processes and the hostfile,
I get these errors:
cluster@bioclust:/mnt/projects/PS3Cluster/Benchmark$ mpirun --hostfile
/etc/openmpi/openmpi-default-hostfile -np 17
/mnt/projects/PS3Cluster/Benchmark/pi
--
This installation of Open MPI was configured without support for
heterogeneous architectures, but at least one node in the allocation
was detected to have a different architecture. The detected node was:
 
Node: bioclust
 
In order to operate in a heterogeneous environment, please reconfigure
Open MPI with --enable-heterogeneous.
--
--
It looks like MPI_INIT failed for some reason; your parallel process
is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or
environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
 
  ompi_proc_set_arch failed
  --> Returned "Not supported" (-8) instead of "Success" (0)
--
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1239] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1240] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1241] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1242] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1244] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1245] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1246] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
***
 An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1247] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1248] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1250] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[bioclust:1251] Abort before MPI_INIT completed successfully; not able
to guarantee that all other processes were killed!
*** An error occurred in MPI_Init
*** before MPI was 

Re: [OMPI users] MPI processes hang when using OpenMPI 1.3.2 and Gcc-4.4.0

2009-11-18 Thread Bill Broadley
A rather stable production code that has worked with various versions of MPI
on various architectures started hanging with gcc-4.4.2 and openmpi 1.3.3.

That led me to this thread.

I made some very small changes to Eugene's code, here's the diff:
$ diff testorig.c billtest.c
3,5c3,4
<
< #define N 4
< #define M 4
---
> #define N 8000
> #define M 8000
17c16
<
---
>   fprintf (stderr, "Initialized\n");
32,33c31,39
< MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0,
< rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
---
> {
>   if ((me == 0) && (i % 100 == 0))
>   {
> fprintf (stderr, "%d\n", i);
>   }
>   MPI_Sendrecv (sbuf, N, MPI_FLOAT, top, 0, rbuf, N, MPI_FLOAT, bottom, 0,
>   MPI_COMM_WORLD, &status);
> }
>

Basically print some occasional progress, and shrink M and N.

I'm running on a new intel dual socket nehalem system with centos-5.4.  I
compiled gcc-4.4.2 and openmpi myself with all the defaults, except that I
had to point gcc at mpfr-2.4.1.

If I run:
$ mpirun -np 4 ./billtest

About 1 in 2 times I get something like:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100


Next time worked, next time:
[bill@farm bill]$ mpirun -np 4 ./billtest
Initialized
Initialized
Initialized
Initialized
0
100
200
300
400
500
600
700
800
900
1000
1100
1200
1300
1400
1500
1600
1700
1800
1900
2000
2100
2200
2300
2400
2500
2600
2700
2800
2900
3000
3100
3200
3300
3400
3500


Next time hung at 7100.

Next time worked.

If I strace it when hung I get something like:
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN},
{fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) =
0 (Timeout)

If I run gdb on a hung job (compiled with -O4 -g)
(gdb) bt
#0  0x2ab3b34cb385 in ompi_request_default_wait ()
   from /share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#1  0x2ab3b34f0d48 in PMPI_Sendrecv () from
/share/apps/openmpisb-1.3/gcc-4.4/lib/libmpi.so.0
#2  0x00400b88 in main (argc=1, argv=0x7fff083fd298) at billtest.c:36
(gdb)

If I recompile with -O1 I get the same thing.

Even -g I get the same thing.

If I compile the application with gcc-4.3 and still use a gcc-4.4 compiled
openmpi I still get hangs.

If I compile openmpi-1.3.3 with gcc-4.3 and the application with gcc-4.3 and
run it 20 times, I get zero hangs.  It seems that gcc-4.4 and openmpi-1.3.3
are incompatible.  In my production code I'd always get hung at MPI_Waitall,
but the above is obviously inside of Sendrecv.

To be paranoid I just reran it 40 times without a hang.
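
For anyone repeating this, a sketch of the sort of loop used for the repeated
runs (the timeout watchdog is an addition, not part of Bill's setup, and
needs a reasonably recent coreutils):

    # run the test 40 times; flag any run that hangs or exits non-zero
    for i in $(seq 1 40); do
        timeout 600 mpirun -np 4 ./billtest || echo "run $i hung or failed"
    done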

Original code below.

Eugene Loh wrote:
...

> #include <stdio.h>
> #include <mpi.h>
> 
> #define N 4
> #define M 4
> 
> int main(int argc, char **argv) {
>  int np, me, i, top, bottom;
>  float sbuf[N], rbuf[N];
>  MPI_Status status;
> 
>  MPI_Init(&argc, &argv);
>  MPI_Comm_size(MPI_COMM_WORLD, &np);
>  MPI_Comm_rank(MPI_COMM_WORLD, &me);
> 
>  top= me + 1;   if ( top  >= np ) top-= np;
>  bottom = me - 1;   if ( bottom < 0 ) bottom += np;
> 
>  for ( i = 0; i < N; i++ ) sbuf[i] = 0;
>  for ( i = 0; i < N; i++ ) rbuf[i] = 0;
> 
>  MPI_Barrier(MPI_COMM_WORLD);
>  for ( i = 0; i < M - 1; i++ )
>MPI_Sendrecv(sbuf, N, MPI_FLOAT, top   , 0,
> rbuf, N, MPI_FLOAT, bottom, 0, MPI_COMM_WORLD, &status);
>  MPI_Barrier(MPI_COMM_WORLD);
> 
>  MPI_Finalize();
>  return 0;
> }
> 
> Can you reproduce your problem with this test case?
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users