Re: [OMPI users] openmpi.ld.conf file
On Mar 31, 2010, at 5:25 PM, Abhishek Gupta wrote: > I am trying to find out the location of openmpi.ld.conf file for my > openmpi/openmpi-libs. Can someone tell me where that file is placed? There is no openmpi.ld.conf in the official Open MPI distribution. Are you installing Open MPI from a package? Other Open MPI packagers may have created this file and put it in a supplemental RPM (or whatever package you're using)...? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
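If a downstream package does ship an openmpi.ld.conf, it is most likely a dynamic-loader configuration fragment installed under /etc/ld.so.conf.d/ so the runtime linker can find the Open MPI shared libraries. A hypothetical sketch (the filename and library path are assumptions, not from the official distribution):

```
# /etc/ld.so.conf.d/openmpi.ld.conf -- hypothetical packager-provided file
/usr/lib64/openmpi/lib
```

Running ldconfig as root after installing such a file rebuilds the loader cache.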
Re: [OMPI users] ompi-checkpoint --term
On Wed, Mar 31, 2010 at 7:39 PM, Addepalli, Srirangam V wrote: > Hello All. > I am trying to checkpoint an MPI application that has been started using the > following mpirun command > > mpirun -am ft-enable-cr -np 8 pw.x < Ge46.pw.in > Ge46.ph.out > > ompi-checkpoint 31396 (works). However, when I try to terminate the process > > ompi-checkpoint --term 31396 it never finishes. How do I debug this issue? ompi-checkpoint --term is exactly ompi-checkpoint plus sending SIGTERM to your app. If plain ompi-checkpoint finishes, then your app is not dealing with SIGTERM correctly. Make sure you're not ignoring SIGTERM; you need to either handle it or let it kill your app. If it's a multithreaded app, make sure you can "distribute" the SIGTERM to ALL the threads, i.e., when you receive SIGTERM, notify all other threads that they should join or quit. Regards,
[OMPI users] ompi-checkpoint --term
Hello All. I am trying to checkpoint an MPI application that has been started using the following mpirun command mpirun -am ft-enable-cr -np 8 pw.x < Ge46.pw.in > Ge46.ph.out ompi-checkpoint 31396 (works). However, when I try to terminate the process ompi-checkpoint --term 31396 it never finishes. How do I debug this issue? Rangam
Re: [OMPI users] Hide Abort output
Yes, Dick has isolated the issue - novice users often believe Open MPI (not their application) had a problem. Anything along the lines he suggests can only help. David On 04/01/2010 01:12 AM, Richard Treumann wrote: I do not know what the OpenMPI message looks like or why people want to hide it. It should be phrased to avoid any implication of a problem with OpenMPI itself. How about something like this: "The application has called MPI_Abort. The application is terminated by OpenMPI as the application demanded" Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 From: "Jeff Squyres (jsquyres)" To: , Date: 03/31/2010 06:43 AM Subject: Re: [OMPI users] Hide Abort output Sent by: users-boun...@open-mpi.org At present there is no such feature, but it should not be hard to add. Can you guys be a little more specific about exactly what you are seeing and exactly what you want to see? (And what version you're working with - I'll caveat my discussion that this may be a 1.5-and-forward thing) -jms Sent from my PDA. No type good. - Original Message - From: users-boun...@open-mpi.org To: Open MPI Users Sent: Wed Mar 31 05:38:48 2010 Subject: Re: [OMPI users] Hide Abort output I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful. Cheers, David On 03/31/2010 07:58 PM, Yves Caniou wrote: Dear all, I am using the MPI_Abort() command in an MPI program. I would like to not see the note explaining that the command caused Open MPI to kill all the jobs and so on. I thought that I could find an --mca parameter, but couldn't grep it. The only ones deal with the delay and printing more information (the stack).
Is there a way to avoid the printing of the note (except the 2>/dev/null trick)? Or to delay this printing? Thank you. .Yves.
[OMPI users] openmpi.ld.conf file
Hi, I am trying to find out the location of openmpi.ld.conf file for my openmpi/openmpi-libs. Can someone tell me where that file is placed? Thanks, Abhi.
Re: [OMPI users] openMPI on Xgrid
Yes, good idea. SGE is a fine scheduler; it's actively supported by Open MPI. On Mar 31, 2010, at 11:21 AM, Cristobal Navarro wrote: > and how about Sun Grid Engine + openMPI, good idea? > > I'm asking because I just checked that Mathematica 7 supports cluster > integration with SGE, which will be a plus apart from our C programs. > > Cristobal > > On Tue, Mar 30, 2010 at 4:06 PM, Gus Correa wrote: > Craig Tierney wrote: > Jody Klymak wrote: > On Mar 30, 2010, at 11:12 AM, Cristobal Navarro wrote: > > I just have some questions, > Torque requires moab, but from what I've read on the site you have to > buy moab, right? > I am pretty sure you can download torque w/o moab. I do not use moab, > which I think is a higher-level scheduling layer on top of pbs. However, > there are folks here who would know far more than I do about > these sorts of things. > > Cheers, Jody > > Moab is a scheduler, which works with Torque and several other > products. Torque comes with a basic scheduler, and Moab is not > required. If you want more features but don't want to pay for Moab, you > can look at Maui. > > Craig > > Hi > > Just adding to what Craig and Jody said. > Moab is not required for Torque. > > A small cluster with a few users can work well with > the basic Torque/PBS scheduler (pbs_sched), > and its first-in-first-out job policy. > An alternative is to replace pbs_sched with the > free Maui scheduler, if you need fine-grained job control. > > You can install both Torque and Maui from source code (available at > http://www.clusterresources.com/), but it takes some work. > > Some Linux distributions have Torque and Maui available as packages > through yum, apt-get, etc. > I would guess for the Mac you can get at least Torque through fink, > or not?
> > Gus Correa > - > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > - > > -- > Jody Klymak > http://web.uvic.ca/~jklymak/ > > ___ > users mailing list > us...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/users -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] openMPI on Xgrid
and how about Sun Grid Engine + openMPI, good idea? I'm asking because I just checked that Mathematica 7 supports cluster integration with SGE, which will be a plus apart from our C programs. Cristobal On Tue, Mar 30, 2010 at 4:06 PM, Gus Correa wrote: > Craig Tierney wrote: > >> Jody Klymak wrote: >> >>> On Mar 30, 2010, at 11:12 AM, Cristobal Navarro wrote: >>> >>> I just have some questions, Torque requires moab, but from what I've read on the site you have to buy moab, right? >>> I am pretty sure you can download torque w/o moab. I do not use moab, >>> which I think is a higher-level scheduling layer on top of pbs. However, >>> there are folks here who would know far more than I do about >>> these sorts of things. >>> >>> Cheers, Jody >>> >> Moab is a scheduler, which works with Torque and several other >> products. Torque comes with a basic scheduler, and Moab is not >> required. If you want more features but don't want to pay for Moab, you >> can look at Maui. >> >> Craig >> > Hi > > Just adding to what Craig and Jody said. > Moab is not required for Torque. > > A small cluster with a few users can work well with > the basic Torque/PBS scheduler (pbs_sched), > and its first-in-first-out job policy. > An alternative is to replace pbs_sched with the > free Maui scheduler, if you need fine-grained job control. > > You can install both Torque and Maui from source code (available at > http://www.clusterresources.com/), but it takes some work. > > Some Linux distributions have Torque and Maui available as packages > through yum, apt-get, etc. > I would guess for the Mac you can get at least Torque through fink, > or not?
> > Gus Correa > - > Gustavo Correa > Lamont-Doherty Earth Observatory - Columbia University > Palisades, NY, 10964-8000 - USA > - > > >> -- >>> Jody Klymak >>> http://web.uvic.ca/~jklymak/ >>> >>> ___ >>> users mailing list >>> us...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Segmentation fault (11)
That is interesting. I cannot think of any reason why this might be causing a problem just in Open MPI. popen() is similar to fork()/system(), so you have to be careful with interconnects that do not play nice with fork(), like openib. But since it looks like you are excluding openib, this should not be the problem. I wonder if this has something to do with the way we use BLCR (maybe we need to pass additional parameters to cr_checkpoint()). When the process fails, are there any messages in the system logs from BLCR indicating an issue that it encountered? It is common for BLCR to post a 'socket open' warning, but that is expected/normal since we leave TCP sockets open in most cases as an optimization. I am wondering if there is a warning about the popen'ed process. Personally, I will not have an opportunity to look into this in more detail until probably mid-April. :/ Let me know what you find, and maybe we can sort out what is happening on the list. -- Josh

On Mar 29, 2010, at 2:28 PM, Jean Potsam wrote:
> Hi Josh/All,
> I just tested a simple C application with BLCR and it worked fine.
>
> ##
> /* headers reconstructed; the archive stripped the <...> names */
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
>
> char * getprocessid()
> {
>     FILE * read_fp;
>     char buffer[BUFSIZ + 1];
>     int chars_read;
>     char * buffer_data = "12345";
>     memset(buffer, '\0', sizeof(buffer));
>     read_fp = popen("uname -a", "r");
>     /*
>     ...
>     */
>     return buffer_data;
> }
>
> int main(int argc, char ** argv)
> {
>     int rank;
>     int size;
>     char * thedata;
>     int n = 0;
>     thedata = getprocessid();
>     printf(" the data is %s", thedata);
>     while (n < 10)
>     {
>         printf("value is %d\n", n);
>         n++;
>         sleep(1);
>     }
>     printf("bye\n");
> }
>
> jean@sun32:/tmp$ cr_run ./pipetest3 &
> [1] 31807
> jean@sun32:~$ the data is 12345value is 0
> value is 1
> value is 2
> ...
> value is 9
> bye
>
> jean@sun32:/tmp$ cr_checkpoint 31807
> jean@sun32:/tmp$ cr_restart context.31807
> value is 7
> value is 8
> value is 9
> bye
> ##
>
> It looks like it's more to do with Open MPI. Any ideas from your side?
>
> Thank you.
> Kind regards,
> Jean.
>
> --- On Mon, 29/3/10, Josh Hursey wrote:
> From: Josh Hursey
> Subject: Re: [OMPI users] Segmentation fault (11)
> To: "Open MPI Users"
> Date: Monday, 29 March, 2010, 16:08
>
> I wonder if this is a bug with BLCR (since the segv stack is in the BLCR thread). Can you try a non-MPI version of this application that uses popen(), and see if BLCR properly checkpoints/restarts it?
> If so, we can start to see what Open MPI might be doing to confuse things, but I suspect that this might be a bug with BLCR. Either way let us know what you find out.
> Cheers,
> Josh
>
> On Mar 27, 2010, at 6:17 AM, jody wrote:
> > I'm not sure if this is the cause of your problems:
> > You define the constant BUFFER_SIZE, but in the code you use a constant called BUFSIZ...
> > Jody
> >
> > On Fri, Mar 26, 2010 at 10:29 PM, Jean Potsam wrote:
> > Dear All,
> > I am having a problem with Open MPI. I have installed openmpi 1.4 and blcr 0.8.1.
> > I have written a small MPI application as follows below:
> >
> > ###
> > /* headers reconstructed; the archive stripped the <...> names */
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <string.h>
> > #include <unistd.h>
> > #include <limits.h>
> > #include <mpi.h>
> >
> > #define BUFFER_SIZE PIPE_BUF
> >
> > char * getprocessid()
> > {
> >     FILE * read_fp;
> >     char buffer[BUFSIZ + 1];
> >     int chars_read;
> >     char * buffer_data = "12345";
> >     memset(buffer, '\0', sizeof(buffer));
> >     read_fp = popen("uname -a", "r");
> >     /*
> >     ...
> >     */
> >     return buffer_data;
> > }
> >
> > int main(int argc, char ** argv)
> > {
> >     MPI_Status status;
> >     int rank;
> >     int size;
> >     char * thedata;
> >     MPI_Init(&argc, &argv);
> >     MPI_Comm_size(MPI_COMM_WORLD, &size);
> >     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
> >     thedata = getprocessid();
> >     printf(" the data is %s", thedata);
> >     MPI_Finalize();
> > }
> >
> > I get the following result:
> >
> > ###
> > jean@sunn32:~$ mpicc pipetest2.c -o pipetest2
> > jean@sunn32:~$ mpirun -np 1 -am ft-enable-cr -mca btl ^openib pipetest2
> > [sun32:19211] *** Process received signal ***
> > [sun32:19211] Signal: Segmentation fault (11)
> > [sun32:19211] Signal code: Address not mapped (1)
> > [sun32:19211] Failing at address: 0x4
> > [sun32:19211] [ 0] [0xb7f3c40c]
> > [sun32:19211] [ 1] /lib/libc.so.6(cfree+0x3b) [0xb796868b]
> > [sun32:19211] [ 2]
Re: [OMPI users] Hide Abort output
I do not know what the OpenMPI message looks like or why people want to hide it. It should be phrased to avoid any implication of a problem with OpenMPI itself. How about something like this: "The application has called MPI_Abort. The application is terminated by OpenMPI as the application demanded" Dick Treumann - MPI Team IBM Systems & Technology Group Dept X2ZA / MS P963 -- 2455 South Road -- Poughkeepsie, NY 12601 Tele (845) 433-7846 Fax (845) 433-8363 From: "Jeff Squyres (jsquyres)" To: , Date: 03/31/2010 06:43 AM Subject: Re: [OMPI users] Hide Abort output Sent by: users-boun...@open-mpi.org At present there is no such feature, but it should not be hard to add. Can you guys be a little more specific about exactly what you are seeing and exactly what you want to see? (And what version you're working with - I'll caveat my discussion that this may be a 1.5-and-forward thing) -jms Sent from my PDA. No type good. - Original Message - From: users-boun...@open-mpi.org To: Open MPI Users Sent: Wed Mar 31 05:38:48 2010 Subject: Re: [OMPI users] Hide Abort output I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful. Cheers, David On 03/31/2010 07:58 PM, Yves Caniou wrote: > Dear all, > > I am using the MPI_Abort() command in an MPI program. > I would like to not see the note explaining that the command caused Open MPI > to kill all the jobs and so on. > I thought that I could find an --mca parameter, but couldn't grep it. The only > ones deal with the delay and printing more information (the stack). > > Is there a way to avoid the printing of the note (except the 2>/dev/null > trick)? Or to delay this printing? > > Thank you. > > .Yves.
> ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
I have tried up to kernel 2.6.33.1 on both architectures (Core2 Duo and i5) with the same results. The "slow" results also appear for distribution of processes onto the 4 cores of one single node. We use btl = self,sm,tcp in /etc/openmpi/openmpi-mca-params.conf. Distributing several processes to one core each on several machines is fast and has "normal" communication times, so I guess TCP communication shouldn't be the problem. Also, multiple instances of the program, started on one "master" node, with each instance distributing several processes to one core of "slave" nodes, don't seem to be a problem. In effect 4 instances of the program occupy all 4 cores on each node, which doesn't influence communication and overall calculation time much. But running 4 processes from the same "master" instance on 4 cores on the same node does. Do you have some more ideas about what I can test? I tried to test connectivity_c from the Open MPI examples on 8 nodes/32 processes. It is hard to get reliable/consistent figures from 'top' since the program terminates quite fast and the interesting usage is very short. But these are some shots of 'top' (master and slave nodes show similar images). System and/or wait time are up. sh-3.2$ mpirun -np 4 -host cluster-05 connectivity_c : -np 28 -host cluster-06,cluster-07,cluster-08,cluster-09,cluster-10,cluster-11,cluster-12 connectivity_c Connectivity test on 32 processes PASSED.
Cpu(s): 37.5%us, 46.6%sy, 0.0%ni, 0.0%id, 15.9%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 168200k used, 8013036k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 132092k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
25179 oli 20 0 143m 3436 2196 R 43 0.0 0:00.57 0
25180 oli 20 0 142m 3392 2180 R 100 0.0 0:00.85 3
25182 oli 20 0 142m 3312 2172 R 100 0.0 0:00.93 2
25181 oli 20 0 134m 3052 2172 R 100 0.0 0:00.93 1

Cpu(s): 10.3%us, 8.7%sy, 0.0%ni, 21.4%id, 58.7%wa, 0.8%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 171352k used, 8009884k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 130572k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
29496 oli 20 0 142m 3300 2176 D 33 0.0 0:00.21 2
29497 oli 20 0 142m 3280 2160 R 25 0.0 0:00.17 0
29494 oli 20 0 134m 3044 2180 D 0 0.0 0:00.01 1
29495 oli 20 0 134m 3036 2172 R 16 0.0 0:00.11 3

Cpu(s): 18.3%us, 36.3%sy, 0.0%ni, 38.0%id, 6.3%wa, 1.1%hi, 0.0%si, 0.0%st
Mem: 8181236k total, 141704k used, 8039532k free, 0k buffers
Swap: 0k total, 0k used, 0k free, 99828k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
29452 oli 20 0 143m 3452 2212 R 52 0.0 0:00.37 1
29455 oli 20 0 143m 3452 2212 S 57 0.0 0:00.41 3
29453 oli 20 0 143m 3440 2200 S 55 0.0 0:00.39 0
29454 oli 20 0 143m 3440 2200 R 55 0.0 0:00.39 2

Thanks for your thoughts, each input is appreciated. Oli On 3/31/2010 8:38 AM, Jeff Squyres wrote: > I have a very dim recollection of some kernel TCP issues back in some older > kernel versions -- such issues affected all TCP communications, not just MPI. > Can you try a newer kernel, perchance? > > On Mar 30, 2010, at 1:26 PM, wrote: > >> Hello List, >> >> I hope you can help us out on that one, as we have been trying to figure it out for weeks. >> >> The situation: We have a program capable of splitting into several >> processes to be shared on nodes within a cluster network using openmpi.
>> We were running that system on "older" cluster hardware (Intel Core2 Duo >> based, 2GB RAM) using an "older" kernel (2.6.18.6). All nodes are >> diskless network booting. Recently we upgraded the hardware (Intel i5, >> 8GB RAM), which also required an upgrade to a recent kernel version >> (2.6.26+). >> >> Here is the problem: We experience overall performance loss on the new >> hardware and think we can break it down to a communication issue >> between the processes. >> >> Also, we found out the issue arises in the transition from kernel >> 2.6.23 to 2.6.24 (tested on the Core2 Duo system). >> >> Here is an output from our program: >> >> 2.6.23.17 (64bit), MPI 1.2.7 >> 5 iterations (Core2 Duo) 6 CPU: >> 93.33 seconds per iteration. >> Node 0 communication/computation time: 6.83 / 647.64 seconds. >> Node 1 communication/computation time: 10.09 / 644.36 seconds. >> Node 2 communication/computation time: 7.27 / 645.03 seconds. >> Node 3 communication/computation time: 165.02 / 485.52 seconds. >> Node 4 communication/computation time: 6.50 / 643.82 seconds. >> Node 5 communication/computation time: 7.80 / 627.63 seconds. >> Computation time: 897.00 seconds. >> >> 2.6.24.7 (64bit) ..
Re: [OMPI users] strange problem with OpenMPI + rankfile + Intelcompiler 11.0.074 + centos/fedora-12
On Mar 24, 2010, at 12:49 AM, Anton Starikov wrote: > Two different OSes: centos 5.4 (2.6.18 kernel) and Fedora-12 (2.6.32 kernel) > Two different CPUs: Opteron 248 and Opteron 8356. > > same binary for OpenMPI. Same binary for user code (vasp compiled for older > arch) Are you sure that the code is binary compatible between the two platforms? Can you repeat the process with native builds of Open MPI and the app for both architectures? > When I supply rankfile, then depending on combo of OS and CPU results are > different > > centos+Opt8356 : works > centos+Opt248 : works > fedora+Opt8356 : works > fedora+Opt248 : fails > > rankfile is (in case of Opt248) > > rank 0=node014 slot=1 > rank 1=node014 slot=0 > > I tried play with formats, leave one slot (and start one process) - it > doesn't change result > Without rankfile it works on all combos. Nifty (meaning: ick!). I wonder if the processor affinity code is causing the problem here...? It could be a problem in a heterogeneous environment if the systems are "close" but not "exact" in terms of binary compatibility...? > Just in case, all this happens inside of cpuset which always wraps all slots > given in rankfile (I just use torque with cpusets and my custom patch for > torque which also creates rankfile for openmpi, in this case MPI tasks are > bound to particular cores and multithreaded codes limited by given cpuset). > > AFAIR, it also works without problem on both hardware setups with 1.3.x/1.4.0 > and 2.6.30 kernel from OpenSuSE 11.1. > > Strangely, but when I run OSU benchmarks (osu_bw etc), it works without any > problems. Can you re-run with a trivial test, like MPI hello world and/or ring? See the examples/ directory. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] kernel 2.6.23 vs 2.6.24 - communication/wait times
I have a very dim recollection of some kernel TCP issues back in some older kernel versions -- such issues affected all TCP communications, not just MPI. Can you try a newer kernel, perchance? On Mar 30, 2010, at 1:26 PM, wrote: > Hello List, > > I hope you can help us out on that one, as we have been trying to figure it out for weeks. > > The situation: We have a program capable of splitting into several > processes to be shared on nodes within a cluster network using openmpi. > We were running that system on "older" cluster hardware (Intel Core2 Duo > based, 2GB RAM) using an "older" kernel (2.6.18.6). All nodes are > diskless network booting. Recently we upgraded the hardware (Intel i5, > 8GB RAM), which also required an upgrade to a recent kernel version > (2.6.26+). > > Here is the problem: We experience overall performance loss on the new > hardware and think we can break it down to a communication issue > between the processes. > > Also, we found out the issue arises in the transition from kernel > 2.6.23 to 2.6.24 (tested on the Core2 Duo system). > > Here is an output from our program: > > 2.6.23.17 (64bit), MPI 1.2.7 > 5 iterations (Core2 Duo) 6 CPU: > 93.33 seconds per iteration. > Node 0 communication/computation time: 6.83 / 647.64 seconds. > Node 1 communication/computation time: 10.09 / 644.36 seconds. > Node 2 communication/computation time: 7.27 / 645.03 seconds. > Node 3 communication/computation time: 165.02 / 485.52 seconds. > Node 4 communication/computation time: 6.50 / 643.82 seconds. > Node 5 communication/computation time: 7.80 / 627.63 seconds. > Computation time: 897.00 seconds. > > 2.6.24.7 (64bit) .. re-evaluated, MPI 1.2.7 > 5 iterations (Core2 Duo) 6 CPU: > 131.33 seconds per iteration. > Node 0 communication/computation time: 364.15 / 645.24 seconds. > Node 1 communication/computation time: 362.83 / 645.26 seconds. > Node 2 communication/computation time: 349.39 / 645.07 seconds. > Node 3 communication/computation time: 508.34 / 485.53 seconds.
> Node 4 communication/computation time: 349.94 / 643.81 seconds. > Node 5 communication/computation time: 349.07 / 627.47 seconds. > Computation time: 1251.00 seconds. > > The program is 32-bit software, but it doesn't make any difference > whether the kernel is 64 or 32 bit. Also OpenMPI version 1.4.1 was > tested; it cut communication times by half (which is still too high), but > the improvement decreased with increasing kernel version number. > > The communication time is meant to be the time the master process > distributes the data portions for calculation and collects the results > from the slave processes. The value also contains the time a slave has to > wait to communicate with the master while he is occupied. This explains the > extended communication time of node #3, as its calculation time is > reduced (based on the nature of the data). > > The command to start the calculation: > mpirun -np 2 -host cluster-17 invert-master -b -s -p inv_grav.inp : -np > 4 -host cluster-18,cluster-19 > > Using top (with 'f' and 'j' showing the P row) we could track which process > runs on which core. We found processes stayed on their initial cores with > kernel 2.6.23, but started to flip around with 2.6.24. Using the > --bind-to-core option in openmpi 1.4.1 kept the processes on their cores > again, but that didn't influence the overall outcome, didn't fix the issue. > > We found top showing ~25% CPU wait time, and processes showing 'D', > also on slave-only nodes. According to our programmer, communications are > only between the master process and its slaves, but not among slaves. On > kernel 2.6.23 and lower CPU usage is 100% user, no wait or system > percentage.
> > Example from top: > > Cpu(s): 75.3%us, 0.6%sy, 0.0%ni, 0.0%id, 23.1%wa, 0.7%hi, 0.3%si, > 0.0%st > Mem: 8181236k total, 131224k used, 8050012k free, 0k buffers > Swap: 0k total, 0k used, 0k free, 49868k cached > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND > 3386 oli 20 0 90512 20m 3988 R 74 0.3 12:31.80 0 invert- > 3387 oli 20 0 85072 15m 3780 D 67 0.2 11:59.30 1 invert- > 3388 oli 20 0 85064 14m 3588 D 77 0.2 12:56.90 2 invert- > 3389 oli 20 0 84936 14m 3436 R 85 0.2 13:28.30 3 invert- > > Some system information that might be helpful: > > Nodes hardware: > 1. "older": Intel Core2 Duo, (2x1)GB RAM > 2. "newer": Intel(R) Core(TM) i5 CPU, mainboard ASUS RS100-E6, (4x2)GB RAM > > Debian stable (lenny) distribution with > ii libc6 2.7-18lenny2 > ii libopenmpi1 1.2.7~rc2-2 > ii openmpi-bin 1.2.7~rc2-2 > ii openmpi-common 1.2.7~rc2-2 > > Nodes are booting diskless with
Re: [OMPI users] OPEN_MPI macro for mpif.h?
On Mar 29, 2010, at 4:10 PM, Martin Bernreuther wrote: > looking at the Open MPI mpi.h include file there's a preprocessor macro > OPEN_MPI defined, as well as e.g. OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION > and OMPI_RELEASE_VERSION. version.h e.g. also defines OMPI_VERSION > This seems to be missing in mpif.h, and therefore something like > > include 'mpif.h' > [...] > #ifdef OPEN_MPI > write( *, '("MPI library: OpenMPI",I2,".",I2,".",I2)' ) & > & OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION > #endif > > doesn't work for a Fortran Open MPI program. Correct. The reason we didn't do this is that not all Fortran compilers will submit your code through a preprocessor. For example:

shell% cat bogus.h
#define MY_VALUE 1
shell% cat bogus.f90
program main
#include "bogus.h"
implicit none
integer a
a = MY_VALUE
end program
shell% ln -s bogus.f90 bogus-preproc.F90
shell% gfortran bogus.f90
Warning: bogus.f90:2: Illegal preprocessor directive
bogus.f90:5.14:

a = MY_VALUE
             1
Error: Symbol 'my_value' at (1) has no IMPLICIT type
shell% gfortran bogus-preproc.F90

That's one example. I used gfortran here; I learned during the process that include'd files are not preprocessed by gfortran, but #include'd files are (regardless of the filename of the main source file). The moral of the story here is that it's a losing game for our wrappers to try to keep up with which file extensions and/or compiler switches enable preprocessing, and to try to determine whether mpif.h was include'd or #include'd. :-( That being said, I have a [very] dim recollection of adding some -D's to the wrapper compiler command line so that -DOPEN_MPI would be defined and we wouldn't have to worry about all the .f90 vs. .F90 / include vs. #include muckety muck... I don't remember what happened with that, though... Are you enough of a Fortran person to know whether -D is pretty universally supported among Fortran compilers?
It wouldn't be too hard to add a configure test to see if -D is supported. Would you have any time/interest to create a patch for this, perchance? -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Problem in remote nodes
On Mar 30, 2010, at 4:28 PM, Robert Collyer wrote: > I changed the SELinux config to permissive (log only), and it didn't > change anything. Back to the drawing board. I'm afraid I have no experience with SELinux -- I don't know what it restricts. Generally, you need to be able to run processes on remote nodes without entering a password and also be able to open random TCP and Unix sockets between previously unrelated processes. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI users] Help om Openmpi
Yes, you need to install Open MPI on all nodes, and you need to be able to log in to each node without being prompted for a password. Also, note that v1.2.7 is pretty ancient. If you're just starting with Open MPI, can you upgrade to the latest version? -jms Sent from my PDA. No type good. From: users-boun...@open-mpi.org To: us...@open-mpi.org Sent: Wed Mar 31 03:39:08 2010 Subject: [OMPI users] Help om Openmpi Dear all, I have installed my cluster with the following configuration: - headnode: + Linux CentOS 5.4, 4 CPUs, 3G RAM + Sun Grid Engine sge6.0u12. The headnode is the admin and submit node too. + Openmpi 1.2.9. In the Open MPI installation: ./configure --prefix=/opt/openmpi --with-sge ... Compilation and make were fine. + I have 2 other nodes whose config is: 4 CPUs, 1G RAM, on which sgeexecd runs. Testing SGE on the headnode and nodes with qsub was fine. When testing Open MPI as follows: [guser1@ioitg2 examples]$ /opt/openmpi/bin/mpirun -np 4 --hostfile myhosts hello_cxx Hello, world! I am 0 of 4 Hello, world! I am 1 of 4 Hello, world! I am 3 of 4 Hello, world! I am 2 of 4 [guser1@ioitg2 examples]$ Open MPI runs well. My file myhosts: ioitg2.ioit-grid.ac.vn slots=4 node1.ioit-grid.ac.vn slots=4 node2.ioit-grid.ac.vn slots=4 Now for more processes: [guser1@ioitg2 examples]$ /opt/openmpi/bin/mpirun -np 6 --hostfile myhosts hello_cxx gus...@node1.ioit-grid.ac.vn's password: -- Failed to find the following executable: Host: node1.ioit-grid.ac.vn Executable: hello_cxx Cannot continue. -- mpirun noticed that job rank 0 with PID 19164 on node ioitg2.ioit-grid.ac.vn exited on signal 15 (Terminated). 3 additional processes aborted (not shown) [guser1@ioitg2 examples]$ This is the error message. I can log in to node1 successfully. Please help me. What problems do I have (installation, configuration, ...)? Do I have to install Open MPI on all nodes? Thank you very much; I am waiting for your help.
Re: [OMPI users] Hide Abort output
At present there is no such feature, but it should not be hard to add. Can you guys be a little more specific about exactly what you are seeing and exactly what you want to see? (And what version you're working with - I'll caveat my discussion that this may be a 1.5-and-forward thing) -jms Sent from my PDA. No type good. - Original Message - From: users-boun...@open-mpi.orgTo: Open MPI Users Sent: Wed Mar 31 05:38:48 2010 Subject: Re: [OMPI users] Hide Abort output I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful. Cheers, David On 03/31/2010 07:58 PM, Yves Caniou wrote: > Dear all, > > I am using the MPI_Abort() command in a MPI program. > I would like to not see the note explaining that the command caused Open MPI > to kill all the jobs and so on. > I thought that I could find a --mca parameter, but couldn't grep it. The only > ones deal with the delay and printing more information (the stack). > > Is there a mean to avoid the printing of the note (except the 2>/dev/null > tips)? Or to delay this printing? > > Thank you. > > .Yves. > ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] Problem in remote nodes
Those are normal ssh messages, I think; an ssh session may try multiple auth methods before one succeeds. You're absolutely sure that there's no firewalling software and SELinux is disabled? OMPI is behaving as if it is trying to communicate and failing (e.g., it's hanging while trying to open some TCP sockets back). Can you open random TCP sockets between your nodes? (E.g., in non-MPI processes)

-jms
Sent from my PDA. No type good.

----- Original Message -----
From: users-boun...@open-mpi.org
To: Open MPI Users
Sent: Wed Mar 31 06:25:43 2010
Subject: Re: [OMPI users] Problem in remote nodes

I've been checking /var/log/messages on the compute node and there is nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out', but in the /var/log/messages file on the remote node the following messages appear; nothing about unix_chkpwd.

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1 user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from 192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message it connects with the node...

On Tue, 30 March 2010, 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora Core 9. I believe the
> issue may be with SELinux, but this is just an educated guess. In my
> setup, shortly after a login via MPI, there is a notation in
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45 kernel: type=1400 audit(1269970785.534:588):
> avc: denied { read } for pid=8047 comm="unix_chkpwd" name="hosts"
> dev=dm-0 ino=24579
> scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check whether allowing unix_chkpwd read access to
> hosts eliminates the problem on my system, and if it works, I'll post
> the steps involved.
>
> uriz.49...@e.unavarra.es wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2 helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no register function
>> [itanium1:08311] mca: base: components_open: component rsh open function successful
>> [itanium1:08311] mca: base: components_open: found loaded component slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no register function
>> [itanium1:08311] mca: base: components_open: component slurm open function successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:( plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [itanium1:08311] mca:base:select:( plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>> [itanium1:08311] mca:base:select:( plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> It seems a slurm problem??
>>
>> Thanks for any ideas
>>
>> On Fri, 19 March 2010, 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and
>>> processes, so you aren't getting to the btl's at all.
>>>
>>> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>>>
>>>> The processes are running on the remote nodes but they don't give the
>>>> response to the origin node. I don't know why. With the option --mca
>>>> btl_base_verbose 30, I have the same problems and it doesn't show any
>>>> message.
>>>>
>>>> Thanks
>>>>
>>>>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
>>>>>
>>>>>> On Mar 17, 2010, at 4:39 AM, wrote:
>>>>>>
>>>>>>> Hi everyone I'm a new Open MPI user and I
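Jeff's question in this thread ("Can you open random TCP sockets between your nodes?") can be answered without installing any extra tools, using bash's /dev/tcp pseudo-device. The host and port below are examples, not values from the thread; substitute the compute node's name and a port in the range Open MPI's TCP components would use.

```shell
# Quick TCP reachability probe via bash's built-in /dev/tcp redirection.
# Usage: probe <host> <port>. Prints "open" if a connection succeeds,
# "closed" otherwise. Host/port below are illustrative.
probe() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}

probe 127.0.0.1 22   # e.g., sshd on the local host
# then try it between the nodes:  probe itanium2 <some-port>
```

If the probe reports "closed" for every port between head node and compute node, a firewall or routing problem is blocking the callback sockets mpirun is waiting on.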
Re: [OMPI users] Problem in remote nodes
I've been checking /var/log/messages on the compute node and there is nothing new after executing 'mpirun --host itanium2 -np 2 helloworld.out', but in the /var/log/messages file on the remote node the following messages appear; nothing about unix_chkpwd.

Mar 31 11:56:51 itanium2 sshd(pam_unix)[15349]: authentication failure; logname= uid=0 euid=0 tty=NODEVssh ruser= rhost=itanium1 user=otro
Mar 31 11:56:53 itanium2 sshd[15349]: Accepted publickey for otro from 192.168.3.1 port 40999 ssh2
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session opened for user otro by (uid=500)
Mar 31 11:56:53 itanium2 sshd(pam_unix)[15351]: session closed for user otro

It seems that the authentication fails at first, but in the next message it connects with the node...

On Tue, 30 March 2010, 20:02, Robert Collyer wrote:
> I've been having similar problems using Fedora Core 9. I believe the
> issue may be with SELinux, but this is just an educated guess. In my
> setup, shortly after a login via MPI, there is a notation in
> /var/log/messages on the compute node as follows:
>
> Mar 30 12:39:45 kernel: type=1400 audit(1269970785.534:588):
> avc: denied { read } for pid=8047 comm="unix_chkpwd" name="hosts"
> dev=dm-0 ino=24579
> scontext=system_u:system_r:system_chkpwd_t:s0-s0:c0.c1023
> tcontext=unconfined_u:object_r:etc_runtime_t:s0 tclass=file
>
> which says SELinux denied unix_chkpwd read access to hosts.
>
> Are you getting anything like this?
>
> In the meantime, I'll check whether allowing unix_chkpwd read access to
> hosts eliminates the problem on my system, and if it works, I'll post
> the steps involved.
>
> uriz.49...@e.unavarra.es wrote:
>> I've been investigating and there is no firewall that could stop TCP
>> traffic in the cluster. With the option --mca plm_base_verbose 30 I get
>> the following output:
>>
>> [itanium1] /home/otro > mpirun --mca plm_base_verbose 30 --host itanium2 helloworld.out
>> [itanium1:08311] mca: base: components_open: Looking for plm components
>> [itanium1:08311] mca: base: components_open: opening plm components
>> [itanium1:08311] mca: base: components_open: found loaded component rsh
>> [itanium1:08311] mca: base: components_open: component rsh has no register function
>> [itanium1:08311] mca: base: components_open: component rsh open function successful
>> [itanium1:08311] mca: base: components_open: found loaded component slurm
>> [itanium1:08311] mca: base: components_open: component slurm has no register function
>> [itanium1:08311] mca: base: components_open: component slurm open function successful
>> [itanium1:08311] mca:base:select: Auto-selecting plm components
>> [itanium1:08311] mca:base:select:( plm) Querying component [rsh]
>> [itanium1:08311] mca:base:select:( plm) Query of component [rsh] set priority to 10
>> [itanium1:08311] mca:base:select:( plm) Querying component [slurm]
>> [itanium1:08311] mca:base:select:( plm) Skipping component [slurm]. Query failed to return a module
>> [itanium1:08311] mca:base:select:( plm) Selected component [rsh]
>> [itanium1:08311] mca: base: close: component slurm closed
>> [itanium1:08311] mca: base: close: unloading component slurm
>>
>> --Hangs here
>>
>> It seems a slurm problem??
>>
>> Thanks for any ideas
>>
>> On Fri, 19 March 2010, 17:57, Ralph Castain wrote:
>>
>>> Did you configure OMPI with --enable-debug? You should do this so that
>>> more diagnostic output is available.
>>>
>>> You can also add the following to your cmd line to get more info:
>>>
>>> --debug --debug-daemons --leave-session-attached
>>>
>>> Something is likely blocking proper launch of the daemons and
>>> processes, so you aren't getting to the btl's at all.
>>>
>>> On Mar 19, 2010, at 9:42 AM, uriz.49...@e.unavarra.es wrote:
>>>
>>>> The processes are running on the remote nodes but they don't give the
>>>> response to the origin node. I don't know why. With the option --mca
>>>> btl_base_verbose 30, I have the same problems and it doesn't show any
>>>> message.
>>>>
>>>> Thanks
>>>>
>>>>> On Wed, Mar 17, 2010 at 1:41 PM, Jeff Squyres wrote:
>>>>>
>>>>>> On Mar 17, 2010, at 4:39 AM, wrote:
>>>>>>
>>>>>>> Hi everyone I'm a new Open MPI user and I have just installed Open
>>>>>>> MPI in a 6-node cluster with Scientific Linux. When I execute it
>>>>>>> locally it works perfectly, but when I try to execute it on the
>>>>>>> remote nodes with the --host option it hangs and gives no message.
>>>>>>> I think that the problem could be with the shared libraries, but
>>>>>>> I'm not sure. In my opinion the problem is not ssh because I can
>>>>>>> access the nodes with no password
>>>>>>
>>>>>> You might want to check that Open MPI processes are actually running
>>>>>> on the remote nodes -- check with ps if you see any "orted" or other
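The quoted advice at the end of this thread (check with ps whether any "orted" daemons are actually running on the remote nodes) can be scripted. The helper below is a sketch I'm adding for illustration; the itanium2 hostname in the comment is the one from the thread, and the ssh loop assumes passwordless login is already working.

```shell
# Count Open MPI daemons (orted) running on the current node. Run this
# locally, then via ssh on each compute node, to see how far the launch got.
count_orted() {
  # grep -c prints 0 when nothing matches (and exits nonzero); `|| true`
  # keeps the function's exit status clean in `set -e` scripts.
  ps -e -o comm= | grep -c '^orted$' || true
}

count_orted
# Remote check (illustrative):
#   for h in itanium2; do ssh "$h" "ps -e -o comm= | grep -c '^orted\$'"; done
```

A count of 0 on the remote node while mpirun hangs means the launch itself failed; a nonzero count means the daemons started but the TCP callback to mpirun is being blocked.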
Re: [OMPI users] Hide Abort output
I have to say this is a very common issue for our users. They repeatedly report the long Open MPI MPI_Abort() message in help queries and fail to look for the application error message about the root cause. A short MPI_Abort() message that said "look elsewhere for the real error message" would be useful.

Cheers,
David

On 03/31/2010 07:58 PM, Yves Caniou wrote:

Dear all,

I am using the MPI_Abort() command in an MPI program. I would like to not see the note explaining that the command caused Open MPI to kill all the jobs and so on. I thought that I could find an --mca parameter, but couldn't grep it. The only ones deal with the delay and printing more information (the stack).

Is there a way to avoid the printing of the note (except the 2>/dev/null trick)? Or to delay this printing?

Thank you.

.Yves.
[OMPI users] Hide Abort output
Dear all,

I am using the MPI_Abort() command in an MPI program. I would like to not see the note explaining that the command caused Open MPI to kill all the jobs and so on. I thought that I could find an --mca parameter, but couldn't grep it. The only ones deal with the delay and printing more information (the stack).

Is there a way to avoid the printing of the note (except the 2>/dev/null trick)? Or to delay this printing?

Thank you.

.Yves.

--
Yves Caniou
Associate Professor at Université Lyon 1,
Member of the team project INRIA GRAAL in the LIP ENS-Lyon,
Délégation CNRS in Japan French Laboratory of Informatics (JFLI),
* in Information Technology Center, The University of Tokyo,
  2-11-16 Yayoi, Bunkyo-ku, Tokyo 113-8658, Japan
  tel: +81-3-5841-0540
* in National Institute of Informatics
  2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
  tel: +81-3-4212-2412
http://graal.ens-lyon.fr/~ycaniou/
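Until an option to suppress the note exists, one workaround less blunt than 2>/dev/null is to filter mpirun's stderr and drop only the abort help block. This is a sketch I'm adding, not something from the thread: the marker strings below follow the usual wording of Open MPI's note ("MPI_ABORT was invoked ..." between dashed rules), but the exact text varies by version, so check them against your installation.

```shell
# Drop Open MPI's MPI_Abort help note from a stderr stream while passing
# everything else through. Marker patterns are assumptions -- verify them
# against the note your Open MPI version actually prints.
filter_abort_note() {
  grep -v -e '^-\{10,\}$' -e 'MPI_ABORT was invoked'
}

# Intended use (bash process substitution; sketch only):
#   mpirun -np 4 ./app 2> >(filter_abort_note >&2)

# Demonstration with a canned stderr stream:
printf '%s\n' \
  'app: fatal: bad input on rank 2' \
  '--------------------------------------------------------------------------' \
  'MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD' \
  | filter_abort_note
# prints only: app: fatal: bad input on rank 2
```

The application's own error line survives, which addresses David's point: users should see their root-cause message, not the boilerplate.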
[OMPI users] Help om Openmpi
Dear all,

I have installed my cluster with the following configuration:

- Head node: CentOS 5.4 Linux, 4 CPUs, 3 GB RAM; Sun Grid Engine sge6.0u12 (the head node is the admin and submit node too); Open MPI 1.2.9, configured with ./configure --prefix=/opt/openmpi --with-sge ... Compilation and make were fine.
- Two other nodes, each with 4 CPUs and 1 GB RAM, running sgeexecd.

Testing SGE on the head node and the compute nodes via qsub was fine. Testing Open MPI as follows:

[guser1@ioitg2 examples]$ /opt/openmpi/bin/mpirun -np 4 --hostfile myhosts hello_cxx
Hello, world! I am 0 of 4
Hello, world! I am 1 of 4
Hello, world! I am 3 of 4
Hello, world! I am 2 of 4
[guser1@ioitg2 examples]$

So Open MPI runs well. My myhosts file:

ioitg2.ioit-grid.ac.vn slots=4
node1.ioit-grid.ac.vn slots=4
node2.ioit-grid.ac.vn slots=4

Now with more processes:

[guser1@ioitg2 examples]$ /opt/openmpi/bin/mpirun -np 6 --hostfile myhosts hello_cxx
gus...@node1.ioit-grid.ac.vn's password:
--------------------------------------------------------------------------
Failed to find the following executable:

Host: node1.ioit-grid.ac.vn
Executable: hello_cxx

Cannot continue.
--------------------------------------------------------------------------
mpirun noticed that job rank 0 with PID 19164 on node ioitg2.ioit-grid.ac.vn exited on signal 15 (Terminated).
3 additional processes aborted (not shown)
[guser1@ioitg2 examples]$

This is the error message. I can log in to node1 successfully. Please help me. What problems do I have (installation, configuration, ...)? Do I have to install Open MPI on all nodes?

Thank you very much; I am waiting for your help.
Re: [OMPI users] Best way to reduce 3D array
On Tue, 30 Mar 2010, Gus Correa wrote:

> Hello Ricardo Reis! How is Radio Zero doing? :)

Busy, busy, busy. We are preparing to celebrate Yuri's Night, April the 12th!

> Doesn't this serialize the I/O operation across the processors, whereas
> MPI_Gather followed by rank 0 I/O may perhaps move the data faster to
> rank 0, and eventually to disk (particularly when the number of
> processes is large)?

Oh, yes. I remember now why I thought of this. If the problem is large enough you will run out of memory on the master machine (for me MPI-IO is the way to go unless you're tied to NFS). Of course one could always send the data in chunks: let the master write one chunk, then send another...

Cheers!

Ricardo Reis
'Non Serviam'
PhD candidate @ Lasef
Computational Fluid Dynamics, High Performance Computing, Turbulence
http://www.lasef.ist.utl.pt
Cultural Instigator @ Rádio Zero
http://www.radiozero.pt
Keep them Flying! Ajude a/help Aero Fénix!
http://www.aeronauta.com/aero.fenix
http://www.flickr.com/photos/rreis/
< sent with alpine 2.00 >
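Ricardo's memory argument is easy to quantify: gathering a 3D field to rank 0 requires a full-size receive buffer on the master, no matter how many ranks the field was split across. The grid size below is illustrative (the thread does not give one), but the arithmetic shows why a gather-then-write scheme stops scaling before MPI-IO does.

```shell
# Back-of-envelope: memory needed on rank 0 to gather a 512^3 array of
# doubles (grid size is an assumed, illustrative value).
nx=512; ny=512; nz=512
bytes=$(( nx * ny * nz * 8 ))                     # 8 bytes per double
echo "$(( bytes / 1024 / 1024 )) MiB on rank 0"   # prints: 1024 MiB on rank 0
```

On a 1 GB compute node like the ones in the other thread, a single such gather buffer already exhausts RAM, which is exactly the failure mode Ricardo describes; chunked sends or MPI-IO avoid it.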