Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4

2014-09-18 Thread Beichuan Yan
Rob,

Thank you very much for the suggestion. There are two independent scenarios 
using parallel IO in my code:

1. MPI processes print conditionally: a process may print in the current loop 
but not in the next one, or vice versa, and it does not matter which process 
prints first or last (NOT ordered). Clearly we cannot use a collective call for 
this scenario because participation is conditional, and I do not need the output 
ordered, so I chose MPI_File_write_shared (non-collective, shared file pointer, 
unordered). It works well when Lustre is mounted with "flock", but does not work 
without "flock".

In scenario 1 we cannot use an individual file pointer or an explicit offset, 
because the offset for each process cannot be predetermined. That is why I had 
to use a shared file pointer.
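
To make scenario 1 concrete, here is a minimal sketch of the pattern 
(illustrative only; the file handle, buffer, and condition are placeholders 
rather than my actual code):

    #include <mpi.h>
    #include <string>

    // Scenario 1 sketch: conditional, unordered writes through a shared file
    // pointer.  On Lustre this path needs the "flock" mount option, because
    // ROMIO serializes access to the shared pointer with file locks.
    void conditionalWrite(MPI_File fh, bool hasOutput, const std::string &text)
    {
        if (!hasOutput)
            return;                      // this rank prints nothing in this loop
        MPI_Status status;
        // Non-collective, shared file pointer, no ordering among ranks.
        MPI_File_write_shared(fh, const_cast<char *>(text.c_str()),
                              static_cast<int>(text.size()), MPI_CHAR, &status);
    }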

2. Each MPI process unconditionally prints to a shared file (even if it prints 
nothing) and the order does not matter. Your suggestion works for this scenario; 
in fact it is even simpler because ordering is not required. We have two options:

(2A) Use a shared file pointer: either MPI_File_write_shared (non-collective) or 
MPI_File_write_ordered (collective) works, and no offset needs to be 
predetermined, but this requires "flock".

(2B) Use an individual file pointer or an explicit offset, e.g., MPI_File_seek 
(or MPI_File_set_view) followed by MPI_File_write_all (collective). This 
requires calculating each process's offset, which is pre-determinable, and it 
does not require "flock" (see the sketch below).

In summary, scenario 2 can avoid the "flock" requirement by using 2B, but 
scenario 1 cannot.

Thanks,
Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rob Latham
Sent: Thursday, September 18, 2014 08:49
To: us...@open-mpi.org
Subject: Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4



On 09/17/2014 05:46 PM, Beichuan Yan wrote:
> Hi Rob,
>
> As you pointed out in April, there are many cases that can trigger the 
> ADIOI_Set_lock error. My code writes to a file at a location specified by a 
> shared file pointer (a blocking, collective call): 
> MPI_File_write_ordered(contactFile, 
> const_cast<char*>(inf.str().c_str()), length, MPI_CHAR, &status);
>
> That is why disabling data sieving does not work for me, even when I tested 
> with the latest openmpi-1.8.2 and gcc-4.9.1.
>
> Can I ask a question? Other than mounting Lustre with the "flock" option, is 
> there another workaround to avoid this ADIOI_Set_lock error in MPI-2 parallel 
> IO?
>

Shared file pointer operations don't get a lot of attention.

ROMIO is going to try to lock a hidden file that contains the 8-byte location 
of the shared file pointer.

Do you mix independent shared file pointer operations with ordered mode 
operations?  If not, read on for a better way to achieve ordering:

It's pretty easy to replace ordered mode operations with a collective call of 
the same behavior.  The key is to use MPI_SCAN:

   MPI_File_get_position(mpi_fh, &offset);

   MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
            MPI_SUM, MPI_COMM_WORLD);
   new_offset -= incr;
   new_offset += offset;

   ret = MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
                               datatype, &status);

See: every process has "incr" amount of data.  The MPI_Scan ensures the computed 
offsets ascend in rank order (as they would for ordered-mode I/O), and the 
actual I/O happens with the much faster MPI_File_write_at_all.

We wrote this up in our 2005 paper on using shared memory for shared file 
pointers, even though this approach doesn't need RMA shared memory.

==rob

> Thanks,
> Beichuan
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rob 
> Latham
> Sent: Monday, April 14, 2014 14:24
> To: Open MPI Users
> Subject: Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4
>
>
>
> On 04/08/2014 05:49 PM, Daniel Milroy wrote:
>> Hello,
>>
>> The file system in question is indeed Lustre, and mounting with flock 
>> isn't possible in our environment.  I recommended the following 
>> changes to the users' code:
>
> Hi.  I'm the ROMIO guy, though I do rely on the community to help me keep the 
> lustre driver up to snuff.
>
>> MPI_Info_set(info, "collective_buffering", "true");
>> MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");
>> MPI_Info_set(info, "romio_ds_read", "disable");
>> MPI_Info_set(info, "romio_ds_write", "disable");
>>
>> Which results in the same error as before.  Are there any other MPI 
>> options I can set?
>
> I'd like to hear more about the workload generating these lock messages, but 
> I can tel

Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4

2014-09-17 Thread Beichuan Yan
Hi Rob,

As you pointed out in April, there are many cases that can trigger the 
ADIOI_Set_lock error. My code writes to a file at a location specified by a 
shared file pointer (a blocking, collective call): 
MPI_File_write_ordered(contactFile, const_cast<char*>(inf.str().c_str()), 
length, MPI_CHAR, &status);

That is why disabling data sieving does not work for me, even when I tested 
with the latest openmpi-1.8.2 and gcc-4.9.1.

Can I ask a question? Other than mounting Lustre with the "flock" option, is 
there another workaround to avoid this ADIOI_Set_lock error in MPI-2 parallel IO?

Thanks,
Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Rob Latham
Sent: Monday, April 14, 2014 14:24
To: Open MPI Users
Subject: Re: [OMPI users] File locking in ADIO, OpenMPI 1.6.4



On 04/08/2014 05:49 PM, Daniel Milroy wrote:
> Hello,
>
> The file system in question is indeed Lustre, and mounting with flock 
> isn't possible in our environment.  I recommended the following 
> changes to the users' code:

Hi.  I'm the ROMIO guy, though I do rely on the community to help me keep the 
lustre driver up to snuff.

> MPI_Info_set(info, "collective_buffering", "true");
> MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");
> MPI_Info_set(info, "romio_ds_read", "disable");
> MPI_Info_set(info, "romio_ds_write", "disable");
>
> Which results in the same error as before.  Are there any other MPI 
> options I can set?

I'd like to hear more about the workload generating these lock messages, but I 
can tell you the situations in which ADIOI_SetLock gets called:
- everywhere in NFS.  If you have a Lustre file system exported to some clients 
as NFS, you'll get NFS (er, that might not be true unless you pick up a recent 
patch)
- when writing a non-contiguous region in file, unless you disable data 
sieving, as you did above.
- note: you don't need to disable data sieving for reads, though you might want 
to if the data sieving algorithm is wasting a lot of data.
- if atomic mode was set on the file (i.e., you called MPI_File_set_atomicity)
- if you use any of the shared file pointer operations
- if you use any of the ordered mode collective operations

You've turned off data sieving for writes, which is what I would have first 
guessed would trigger this lock message.  So I guess you are hitting one of the 
other cases.
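
For reference, the hints quoted above are typically attached to the file at open 
time through an MPI_Info object; a minimal sketch (the path, communicator, and 
function name are placeholders, not code from this thread):

    #include <mpi.h>

    // Sketch: pass the ROMIO hints shown earlier via MPI_Info at open time.
    MPI_File open_with_hints(MPI_Comm comm, const char *path)
    {
        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "romio_ds_write", "disable");           // no data sieving on writes
        MPI_Info_set(info, "romio_ds_read", "disable");            // optional: reads too
        MPI_Info_set(info, "romio_lustre_ds_in_coll", "disable");

        MPI_File fh;
        MPI_File_open(comm, const_cast<char *>(path),
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        MPI_Info_free(&info);
        return fh;
    }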

==rob

--
Rob Latham
Mathematics and Computer Science Division Argonne National Lab, IL USA 


Re: [OMPI users] OpenMPI job initializing problem

2014-03-21 Thread Beichuan Yan
Good suggestion.

This overall walltime reveals little difference between Intel MPI and Open MPI: 
for example, intelmpi=3.76 min vs. openmpi=3.73 min, while PBS Pro shows 
intelmpi=3.82 min vs. openmpi=3.80 min.

Beichuan


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Friday, March 21, 2014 07:06
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

One thing to check would be the time spent between MPI_Init and MPI_Finalize, 
i.e., see whether the time difference is caused by differences in init and 
finalize themselves. My guess is that this is the source; checking would help us 
target the problem.
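
A minimal way to instrument that (a sketch, not taken from your code):

    #include <mpi.h>
    #include <cstdio>

    // Sketch: bracket the application with MPI_Wtime so the time spent between
    // MPI_Init and MPI_Finalize can be compared against the walltime PBS reports.
    // The gap between the two is launch + MPI_Init + MPI_Finalize + teardown.
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        double tStart = MPI_Wtime();

        // ... application work ...

        double tEnd = MPI_Wtime();
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0)
            std::printf("between Init and Finalize: %.3f s\n", tEnd - tStart);
        MPI_Finalize();
        return 0;
    }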


On Mar 20, 2014, at 9:00 PM, Beichuan Yan <beichuan@colorado.edu> wrote:

> Here is an example of my data measured in seconds:
>
> communication overhead = commuT + migraT + printT; compuT is the
> computational cost; totalT = compuT + communication overhead;
> overhead% denotes the percentage of communication overhead
>
> intelmpi (walltime=00:03:51)
> iter   commuT         migraT         printT         compuT         totalT         overhead%
> 3999   4.945993e-03   2.689362e-04   1.440048e-04   1.689100e-02   2.224994e-02   2.343795e+01
> 5999   4.938126e-03   1.451969e-04   2.689362e-04   1.663089e-02   2.198315e-02   2.312373e+01
> 7999   4.904985e-03   1.490116e-04   1.451969e-04   1.678491e-02   2.198410e-02   2.298933e+01
>        4.915953e-03   1.380444e-04   1.490116e-04   1.687193e-02   2.207494e-02   2.289473e+01
>
> openmpi (walltime=00:04:32)
> iter   commuT         migraT         printT         compuT         totalT         overhead%
> 3999   3.574133e-03   1.139641e-04   1.089573e-04   1.598001e-02   1.977706e-02   1.864836e+01
> 5999   3.574848e-03   1.189709e-04   1.139641e-04   1.599526e-02   1.980305e-02   1.865278e+01
> 7999   3.571033e-03   1.168251e-04   1.189709e-04   1.601100e-02   1.981783e-02   1.860879e+01
>        3.587008e-03   1.258850e-04   1.168251e-04   1.596618e-02   1.979589e-02   1.875587e+01
>
> It can be seen that Open MPI is faster in both communication and computation 
> measured by MPI_Wtime calls, but the wall time reported by PBS pro is larger.
>
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
> Correa
> Sent: Thursday, March 20, 2014 15:08
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> On 03/20/2014 04:48 PM, Beichuan Yan wrote:
>> Ralph and Noam,
>>
>> Thanks for the clarifications, they are important.
> I could be wrong in understanding the filesystem.
>>
>> Spirit appears to use a scratch directory for
> shared memory backing which is mounted on Lustre, and does not seem to have 
> local directories or does not allow user to change TEMPDIR. Here is the info:
>> [compute node]$ stat -f -L -c %T /tmp tmpfs [compute node]$ stat -f
>> -L -c %T /home/yanb/scratch lustre
>>
>
> So, /tmp is a tmpfs, in memory/RAM.
> Maybe they don't open writing permissions for regular users on /tmp?
>
>> On another university supercomputer, I found the following:
>> node0448[~]$ stat -f -L -c %T /tmp
>> ramfs
>> node0448[~]$ stat -f -L -c %T /home/yanb/scratch/ lustre Is this /tmp
>> at compute node a local directory? I don't know how to tell it.
>>
>> Thanks,
>> Beichuan
>>
>>
>>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
>> Castain
>> Sent: Thursday, March 20, 2014 12:13
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>>
>> On Mar 20, 2014, at 9:48 AM, Beichuan Yan <beichuan@colorado.edu> wrote:
>>
>>> Hi,
>>>
>>> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm!
>>>
>>> I found discussions related to this issue:
>>>
>>> 1. http://www.open-mpi.org/community/lists/users/2011/11/17688.php
>>> The correct solution here is get your sys admin to make /tmp local. Making 
>>> /tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux 
>>> world - it should never be done, for the reasons stated by Jeff.
>>>
>>> my comment: for most clusters I have used, /tmp is NOT local. Open MPI 
>>> community may not enforce it.
>>
>> We don't enforce anything, but /tmp being network mounted is a VERY
>> unusual situation in the cluster world, and highly unrecommended
>>
>>
>>>
>>> 2. http://www.open-mpi.org/community/lists/users/201

Re: [OMPI users] OpenMPI job initializing problem

2014-03-21 Thread Beichuan Yan
Here is an example of my data measured in seconds:

communication overhead = commuT + migraT + printT; compuT is the computational 
cost; totalT = compuT + communication overhead; overhead% denotes the percentage 
of communication overhead

intelmpi (walltime=00:03:51)
iter   commuT         migraT         printT         compuT         totalT         overhead%
3999   4.945993e-03   2.689362e-04   1.440048e-04   1.689100e-02   2.224994e-02   2.343795e+01
5999   4.938126e-03   1.451969e-04   2.689362e-04   1.663089e-02   2.198315e-02   2.312373e+01
7999   4.904985e-03   1.490116e-04   1.451969e-04   1.678491e-02   2.198410e-02   2.298933e+01
       4.915953e-03   1.380444e-04   1.490116e-04   1.687193e-02   2.207494e-02   2.289473e+01

openmpi (walltime=00:04:32)
iter   commuT         migraT         printT         compuT         totalT         overhead%
3999   3.574133e-03   1.139641e-04   1.089573e-04   1.598001e-02   1.977706e-02   1.864836e+01
5999   3.574848e-03   1.189709e-04   1.139641e-04   1.599526e-02   1.980305e-02   1.865278e+01
7999   3.571033e-03   1.168251e-04   1.189709e-04   1.601100e-02   1.981783e-02   1.860879e+01
       3.587008e-03   1.258850e-04   1.168251e-04   1.596618e-02   1.979589e-02   1.875587e+01

It can be seen that Open MPI is faster in both communication and computation 
measured by MPI_Wtime calls, but the wall time reported by PBS pro is larger.
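
For reference, a simplified sketch of how such per-phase timers and the 
overhead% could be collected (not my exact instrumentation; the phase names 
follow the table above, and each field would be filled from a pair of 
MPI_Wtime() calls around the corresponding section of the loop body):

    #include <mpi.h>
    #include <cstdio>

    // Sketch: reduce the maximum of each phase time across ranks, then report
    // overhead% = (commuT + migraT + printT) / totalT * 100, following the
    // definitions given above.
    struct PhaseTimes { double commu, migra, print, compu; };

    void reportOverhead(const PhaseTimes &local, int iter)
    {
        double in[4] = { local.commu, local.migra, local.print, local.compu };
        double mx[4];
        MPI_Reduce(in, mx, 4, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            double overhead = mx[0] + mx[1] + mx[2];
            double total    = overhead + mx[3];
            std::printf("%d  %e  %e  %e  %e  %e  %e\n", iter,
                        mx[0], mx[1], mx[2], mx[3], total,
                        100.0 * overhead / total);
        }
    }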


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 20, 2014 15:08
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/20/2014 04:48 PM, Beichuan Yan wrote:
> Ralph and Noam,
>
> Thanks for the clarifications, they are important.
I could be wrong in understanding the filesystem.
>
> Spirit appears to use a scratch directory for
shared memory backing which is mounted on Lustre, and does not seem to have 
local directories or does not allow user to change TEMPDIR. Here is the info:
> [compute node]$ stat -f -L -c %T /tmp
> tmpfs
> [compute node]$ stat -f -L -c %T /home/yanb/scratch lustre
>

So, /tmp is a tmpfs, in memory/RAM.
Maybe they don't open writing permissions for regular users on /tmp?

> On another university supercomputer, I found the following:
> node0448[~]$ stat -f -L -c %T /tmp
> ramfs
> node0448[~]$ stat -f -L -c %T /home/yanb/scratch/ lustre Is this /tmp
> at compute node a local directory? I don't know how to tell it.
>
> Thanks,
> Beichuan
>
>
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph
> Castain
> Sent: Thursday, March 20, 2014 12:13
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
>
> On Mar 20, 2014, at 9:48 AM, Beichuan Yan <beichuan@colorado.edu> wrote:
>
>> Hi,
>>
>> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm!
>>
>> I found discussions related to this issue:
>>
>> 1. http://www.open-mpi.org/community/lists/users/2011/11/17688.php
>> The correct solution here is get your sys admin to make /tmp local. Making 
>> /tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux 
>> world - it should never be done, for the reasons stated by Jeff.
>>
>> my comment: for most clusters I have used, /tmp is NOT local. Open MPI 
>> community may not enforce it.
>
> We don't enforce anything, but /tmp being network mounted is a VERY
> unusual situation in the cluster world, and highly unrecommended
>
>
>>
>> 2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
>> In the upcoming OMPI v1.7, we revamped the shared memory setup code such 
>> that it'll actually use /dev/shm properly, or use some other mechanism other 
>> than a mmap file backed in a real filesystem. So the issue goes away.
>>
>> my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it 
>> is resolved in OMPI v1.7.5rc5. This is surprising.
>>
>> Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) 
>> mode on Spirit. There is no need to tune TCP or IB parameters to use it. My 
>> code just runs well:
>>
>> My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
>> minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering 
>> when OMPI v1.7.5 final will be released.
>>
>> I will update performance comparison between Intel MPI and Open MPI.
>>
>> Thanks,
>> Beichuan
>>
>>
>>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
>> Correa
>> Sent: Friday, March 07, 2014 18:41
>> T

Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Beichuan Yan
Here is an example of my data measured in seconds:

communication overhead = commuT + migraT + printT,
compuT is the computational cost,
totalT = compuT + communication overhead,
overhead% denotes the percentage of communication overhead

intelmpi (walltime=00:03:51)
iter   commuT         migraT         printT         compuT         totalT         overhead%
3999   4.945993e-03   2.689362e-04   1.440048e-04   1.689100e-02   2.224994e-02   2.343795e+01
5999   4.938126e-03   1.451969e-04   2.689362e-04   1.663089e-02   2.198315e-02   2.312373e+01
7999   4.904985e-03   1.490116e-04   1.451969e-04   1.678491e-02   2.198410e-02   2.298933e+01
       4.915953e-03   1.380444e-04   1.490116e-04   1.687193e-02   2.207494e-02   2.289473e+01

openmpi (walltime=00:04:32)
iter   commuT         migraT         printT         compuT         totalT         overhead%
3999   3.574133e-03   1.139641e-04   1.089573e-04   1.598001e-02   1.977706e-02   1.864836e+01
5999   3.574848e-03   1.189709e-04   1.139641e-04   1.599526e-02   1.980305e-02   1.865278e+01
7999   3.571033e-03   1.168251e-04   1.189709e-04   1.601100e-02   1.981783e-02   1.860879e+01
       3.587008e-03   1.258850e-04   1.168251e-04   1.596618e-02   1.979589e-02   1.875587e+01

It can be seen that Open MPI is faster in both communication and computation 
measured by MPI_Wtime calls, but the wall time reported by PBS pro is larger.

Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Beichuan Yan
Sent: Thursday, March 20, 2014 15:15
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

As for the performance, my 4-node (64-processes) 3-hour job indicates Intel MPI 
and OpenMPI have close benchmarks. Intel MPI takes 2:53 while Open MPI takes 
3:10.

It is interesting that all my MPI_Wtime calls show Open MPI is faster (up to 
twice as fast or more) than Intel MPI in communication for a single loop, but in 
overall wall time Open MPI is 10% slower over roughly 500K loops. The computing 
times are nearly the same. This is a little confusing.

I may set up and run a new test.

Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Thursday, March 20, 2014 11:15
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On Mar 20, 2014, at 12:48 PM, Beichuan Yan <beichuan@colorado.edu> wrote:

> 2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
> In the upcoming OMPI v1.7, we revamped the shared memory setup code such that 
> it'll actually use /dev/shm properly, or use some other mechanism other than 
> a mmap file backed in a real filesystem. So the issue goes away.

Woo hoo!

> my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it 
> is resolved in OMPI v1.7.5rc5. This is surprising.
> 
> Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) 
> mode on Spirit. There is no need to tune TCP or IB parameters to use it. My 
> code just runs well:

Great!

> My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
> minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering 
> when OMPI v1.7.5 final will be released.

Wow -- that sounds like a fundamental difference there.  Could be something to 
do with the NFS tmp directory...?  I could see how that could cause oodles of 
unnecessary network traffic.

1.7.5 should be released ...imminently...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Beichuan Yan
As for the performance, my 4-node (64-processes) 3-hour job indicates Intel MPI 
and OpenMPI have close benchmarks. Intel MPI takes 2:53 while Open MPI takes 
3:10.

It is interesting that all my MPI_Wtime calls show Open MPI is faster (up to 
twice as fast or more) than Intel MPI in communication for a single loop, but in 
overall wall time Open MPI is 10% slower over roughly 500K loops. The computing 
times are nearly the same. This is a little confusing.

I may set up and run a new test.

Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Thursday, March 20, 2014 11:15
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On Mar 20, 2014, at 12:48 PM, Beichuan Yan <beichuan@colorado.edu> wrote:

> 2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
> In the upcoming OMPI v1.7, we revamped the shared memory setup code such that 
> it'll actually use /dev/shm properly, or use some other mechanism other than 
> a mmap file backed in a real filesystem. So the issue goes away.

Woo hoo!

> my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it 
> is resolved in OMPI v1.7.5rc5. This is surprising.
> 
> Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) 
> mode on Spirit. There is no need to tune TCP or IB parameters to use it. My 
> code just runs well:

Great!

> My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
> minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering 
> when OMPI v1.7.5 final will be released.

Wow -- that sounds like a fundamental difference there.  Could be something to 
do with the NFS tmp directory...?  I could see how that could cause oodles of 
unnecessary network traffic.

1.7.5 should be released ...imminently...

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Beichuan Yan
Good for me to read it.

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 20, 2014 15:00
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/20/2014 02:13 PM, Ralph Castain wrote:
>
> On Mar 20, 2014, at 9:48 AM, Beichuan Yan <beichuan@colorado.edu> wrote:
>
>> Hi,
>>
>> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm!
>>
>> I found discussions related to this issue:
>>
>> 1. http://www.open-mpi.org/community/lists/users/2011/11/17688.php
>> The correct solution here is get your sys admin to make /tmp local.
Making /tmp NFS mounted across multiple nodes is a major "faux pas"
in the Linux world - it should never be done, for the reasons stated by Jeff.
>>

Actually, besides the previous discussions on this thread,
this problem is documented in the OMPI FAQ:

http://www.open-mpi.org/faq/?category=sm#poor-sm-btl-performance

>> my comment: for most clusters I have used, /tmp is NOT local.
Open MPI community may not enforce it.
>
> We don't enforce anything, but /tmp being network mounted is a
> VERY unusual situation in the cluster world, and highly unrecommended
>

I agree that it is bad.
Perhaps unusual also, but not unheard of.
If these nodes are diskless,
I guess that the cluster vendor would probably
recommend mounting /tmp as a tmpfs / ramfs (in RAM / shared memory).
That is what is usually done in diskless computers, right?
Why some installations mount /tmp over the network is unclear.

I guess OpenMPI is not alone in using /tmp to store
temporary and readily accessible stuff,
which, given its name, /tmp is supposed to do.
So, it is not a matter of OMPI enforcing it.

However, reducing the dependence on /tmp may be a plus anyway.

>
>>
>> 2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
>> In the upcoming OMPI v1.7, we revamped the shared memory setup code such 
>> that it'll actually use /dev/shm properly, or use some other mechanism other 
>> than a mmap file backed in a real filesystem. So the issue goes away.
>>
>> my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it 
>> is resolved in OMPI v1.7.5rc5. This is surprising.
>>
>> Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) 
>> mode on Spirit. There is no need to tune TCP or IB parameters to use it. My 
>> code just runs well:
>>
>> My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
>> minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering 
>> when OMPI v1.7.5 final will be released.
>>
>> I will update performance comparison between Intel MPI and Open MPI.
>>
>> Thanks,
>> Beichuan
>>
>>
>>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
>> Sent: Friday, March 07, 2014 18:41
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> On 03/06/2014 04:52 PM, Beichuan Yan wrote:
>>> No, I did all these and none worked.
>>>
>>> I just found, with exact the same code, data and job settings, a job can 
>>> really run one day while cannot the other day. It is NOT repeatable. I 
>>> don't know what the problem is: hardware? OpenMPI? PBS Pro?
>>>
>>> Anyway, I may have to give up using OpenMPI on that system and switch to 
>>> IntelMPI which always work.
>>>
>>> Thanks,
>>> Beichuan
>>
>> Well, this machine may have been setup to run only Intel MPI (DAPL?) and SGI 
>> MPI.
>> It is a pity that it doesn't seem to work with OpenMPI.
>>
>> In any case, good luck with your research project.
>>
>> Gus Correa
>>
>>>
>>> -Original Message-
>>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
>>> Correa
>>> Sent: Thursday, March 06, 2014 13:51
>>> To: Open MPI Users
>>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>>
>>> On 03/06/2014 03:35 PM, Beichuan Yan wrote:
>>>> Gus,
>>>>
>>>> Yes, 10.148.0.0/16 is the IB subnet.
>>>>
>>>> I did try others but none worked:
>>>> #export
>>>> TCP="--mca btl sm,openib"
>>>> No run, no output
>>>
>>> If I remember right, and unless this changed in recent OMPI versions, you 
>>> also need "self":
>>>
>>> -mca btl sm,openib,self
>>>
>>> Alternatively, you could rule 

Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Beichuan Yan
Ralph and Noam,

Thanks for the clarifications, they are important. I could be wrong in 
understanding the filesystem.

Spirit appears to use a scratch directory for shared-memory backing which is 
mounted on Lustre, and it does not seem to have local directories, or does not 
allow users to change TMPDIR. Here is the info:
[compute node]$ stat -f -L -c %T /tmp
tmpfs
[compute node]$ stat -f -L -c %T /home/yanb/scratch
lustre

On another university supercomputer, I found the following:
node0448[~]$ stat -f -L -c %T /tmp
ramfs
node0448[~]$ stat -f -L -c %T /home/yanb/scratch/
lustre
Is this /tmp on the compute node a local directory? I don't know how to tell.

Thanks,
Beichuan



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
Sent: Thursday, March 20, 2014 12:13
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem


On Mar 20, 2014, at 9:48 AM, Beichuan Yan <beichuan@colorado.edu> wrote:

> Hi,
>
> Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm!
>
> I found discussions related to this issue:
>
> 1. http://www.open-mpi.org/community/lists/users/2011/11/17688.php
> The correct solution here is get your sys admin to make /tmp local. Making 
> /tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux 
> world - it should never be done, for the reasons stated by Jeff.
>
> my comment: for most clusters I have used, /tmp is NOT local. Open MPI 
> community may not enforce it.

We don't enforce anything, but /tmp being network mounted is a VERY unusual 
situation in the cluster world, and highly unrecommended


>
> 2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
> In the upcoming OMPI v1.7, we revamped the shared memory setup code such that 
> it'll actually use /dev/shm properly, or use some other mechanism other than 
> a mmap file backed in a real filesystem. So the issue goes away.
>
> my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it 
> is resolved in OMPI v1.7.5rc5. This is surprising.
>
> Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) 
> mode on Spirit. There is no need to tune TCP or IB parameters to use it. My 
> code just runs well:
>
> My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
> minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering 
> when OMPI v1.7.5 final will be released.
>
> I will update performance comparison between Intel MPI and Open MPI.
>
> Thanks,
> Beichuan
>
>
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
> Correa
> Sent: Friday, March 07, 2014 18:41
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> On 03/06/2014 04:52 PM, Beichuan Yan wrote:
>> No, I did all these and none worked.
>>
>> I just found, with exact the same code, data and job settings, a job can 
>> really run one day while cannot the other day. It is NOT repeatable. I don't 
>> know what the problem is: hardware? OpenMPI? PBS Pro?
>>
>> Anyway, I may have to give up using OpenMPI on that system and switch to 
>> IntelMPI which always work.
>>
>> Thanks,
>> Beichuan
>
> Well, this machine may have been setup to run only Intel MPI (DAPL?) and SGI 
> MPI.
> It is a pity that it doesn't seem to work with OpenMPI.
>
> In any case, good luck with your research project.
>
> Gus Correa
>
>>
>> -----Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
>> Correa
>> Sent: Thursday, March 06, 2014 13:51
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> On 03/06/2014 03:35 PM, Beichuan Yan wrote:
>>> Gus,
>>>
>>> Yes, 10.148.0.0/16 is the IB subnet.
>>>
>>> I did try others but none worked:
>>> #export
>>> TCP="--mca btl sm,openib"
>>> No run, no output
>>
>> If I remember right, and unless this changed in recent OMPI versions, you 
>> also need "self":
>>
>> -mca btl sm,openib,self
>>
>> Alternatively, you could rule out tcp:
>>
>> -mca btl ^tcp
>>
>>>
>>> #export
>>> TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
>>> No run, no output
>>>
>>> Beichuan
>>
>> Likewise, "self" is missing here.
>>
>> Also, I don't know if you can ask for openib and also add --mca 
>> btl_tcp_if_include 10.148.0.0/16.
>> Note that one turns off tcp (I think), whereas the other r

Re: [OMPI users] OpenMPI job initializing problem

2014-03-20 Thread Beichuan Yan
Hi,

Today I tested OMPI v1.7.5rc5 and surprisingly, it works like a charm!

I found discussions related to this issue:

1. http://www.open-mpi.org/community/lists/users/2011/11/17688.php
The correct solution here is get your sys admin to make /tmp local. Making /tmp 
NFS mounted across multiple nodes is a major "faux pas" in the Linux world - it 
should never be done, for the reasons stated by Jeff.

my comment: for most clusters I have used, /tmp is NOT local. Open MPI 
community may not enforce it.

2. http://www.open-mpi.org/community/lists/users/2011/11/17684.php
In the upcoming OMPI v1.7, we revamped the shared memory setup code such that 
it'll actually use /dev/shm properly, or use some other mechanism other than a 
mmap file backed in a real filesystem. So the issue goes away.

my comment: up to OMPI v1.7.4, this shmem issue is still there. However, it is 
resolved in OMPI v1.7.5rc5. This is surprising.

Anyway, OMPI v1.7.5rc5 works well for multi-processes-on-one-node (shmem) mode 
on Spirit. There is no need to tune TCP or IB parameters to use it. My code 
just runs well:

My test data takes 20 minutes to run with OMPI v1.7.4, but needs less than 1 
minute with OMPI v1.7.5rc5. I don't know what the magic is. I am wondering when 
OMPI v1.7.5 final will be released.

I will update performance comparison between Intel MPI and Open MPI.

Thanks,
Beichuan



-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Friday, March 07, 2014 18:41
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/06/2014 04:52 PM, Beichuan Yan wrote:
> No, I did all these and none worked.
>
> I just found, with exact the same code, data and job settings, a job can 
> really run one day while cannot the other day. It is NOT repeatable. I don't 
> know what the problem is: hardware? OpenMPI? PBS Pro?
>
> Anyway, I may have to give up using OpenMPI on that system and switch to 
> IntelMPI which always work.
>
> Thanks,
> Beichuan

Well, this machine may have been set up to run only Intel MPI (DAPL?) and SGI 
MPI.
It is a pity that it doesn't seem to work with OpenMPI.

In any case, good luck with your research project.

Gus Correa

>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
> Correa
> Sent: Thursday, March 06, 2014 13:51
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> On 03/06/2014 03:35 PM, Beichuan Yan wrote:
>> Gus,
>>
>> Yes, 10.148.0.0/16 is the IB subnet.
>>
>> I did try others but none worked:
>> #export
>> TCP="--mca btl sm,openib"
>> No run, no output
>
> If I remember right, and unless this changed in recent OMPI versions, you 
> also need "self":
>
> -mca btl sm,openib,self
>
> Alternatively, you could rule out tcp:
>
> -mca btl ^tcp
>
>>
>> #export
>> TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
>> No run, no output
>>
>   >  Beichuan
>
> Likewise, "self" is missing here.
>
> Also, I don't know if you can ask for openib and also add --mca 
> btl_tcp_if_include 10.148.0.0/16.
> Note that one turns off tcp (I think), whereas the other requests a
> tcp interface (or that the IB interface with IPoIB functionality).
> That combination sounds weird to me.
> The OMPI developers may clarify if this is valid syntax/syntax combination.
>
> I would try simply -mca btl sm,openib,self, which is likely to give
> you the IB transport with verbs, plus shared memory intra-node, plus
> the
> (mandatory?) self (loopback interface?).
> In my experience, this will also help identify any malfunctioning IB HCA in 
> the nodes (with a failure/error message).
>
>
> I hope it helps,
> Gus Correa
>
>
>>
>> -Original Message-
>> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
>> Correa
>> Sent: Thursday, March 06, 2014 13:16
>> To: Open MPI Users
>> Subject: Re: [OMPI users] OpenMPI job initializing problem
>>
>> Hi Beichuan
>>
>> So, it looks like that now the program runs, even though with specific 
>> settings depending on whether you're using OMPI 1.6.5 or 1.7.4, right?
>>
>> It looks like the problem now is performance, right?
>>
>> System load affects performance, but unless the network is overwhelmed, or 
>> perhaps the Lustre file system is hanging or too slow, I would think that a 
>> walltime increase from 1min to 10min is not related to system load, but 
>> something else.
>>
>> Do you remember the setup that gave you 1min walltime?
>> Was it the same that you sent below?
>> Do you happen to

Re: [OMPI users] OpenMPI job initializing problem

2014-03-06 Thread Beichuan Yan
No, I did all these and none worked.

I just found, with exact the same code, data and job settings, a job can really 
run one day while cannot the other day. It is NOT repeatable. I don't know what 
the problem is: hardware? OpenMPI? PBS Pro?

Anyway, I may have to give up using OpenMPI on that system and switch to 
IntelMPI which always work.

Thanks,
Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 06, 2014 13:51
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

On 03/06/2014 03:35 PM, Beichuan Yan wrote:
> Gus,
>
> Yes, 10.148.0.0/16 is the IB subnet.
>
> I did try others but none worked:
> #export
> TCP="--mca btl sm,openib"
> No run, no output

If I remember right, and unless this changed in recent OMPI versions, you also 
need "self":

-mca btl sm,openib,self

Alternatively, you could rule out tcp:

-mca btl ^tcp

>
> #export
> TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
> No run, no output
>
 > Beichuan

Likewise, "self" is missing here.

Also, I don't know if you can ask for openib and also add --mca 
btl_tcp_if_include 10.148.0.0/16.
Note that one turns off tcp (I think),
whereas the other requests a tcp interface (or rather the IB interface with 
IPoIB functionality).
That combination sounds weird to me.
The OMPI developers may clarify whether this is a valid syntax combination.

I would try simply -mca btl sm,openib,self, which is likely to give you the IB 
transport with verbs, plus shared memory intra-node, plus the
(mandatory?) self (loopback interface?).
In my experience, this will also help identify any malfunctioning IB HCA in the 
nodes (with a failure/error message).


I hope it helps,
Gus Correa


>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
> Sent: Thursday, March 06, 2014 13:16
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> So, it looks like that now the program runs, even though with specific 
> settings depending on whether you're using OMPI 1.6.5 or 1.7.4, right?
>
> It looks like the problem now is performance, right?
>
> System load affects performance, but unless the network is overwhelmed, or 
> perhaps the Lustre file system is hanging or too slow, I would think that a 
> walltime increase from 1min to 10min is not related to system load, but 
> something else.
>
> Do you remember the setup that gave you 1min walltime?
> Was it the same that you sent below?
> Do you happen to know which nodes?
> Are you sharing nodes with other jobs, or are you running alone on the nodes?
> Sharing with other processes may slow down your job.
> If you request all cores in the node, PBS should give you a full node (unless 
> they tricked PBS to think the nodes have more cores than they actually do).
> How do you request the nodes in your #PBS directives?
> Do you request nodes and ppn, or do you request procs?
>
> I suggest that you do:
> cat $PBS_NODEFILE
> in your PBS script, just to document which nodes are actually given to you.
>
> Also helpful to document/troubleshoot is to add -v and -tag-output to your 
> mpiexec command line.
>
>
> The difference in walltime could be due to some malfunction of IB HCAs on the 
> nodes, for instance.
> Since you are allowing (if I remember right) the use of TCP, OpenMPI will try 
> to use any interfaces that you did not rule out.
> If your mpiexec command line doesn't make any restriction, it will use 
> anything available, if I remember right.
> (Jeff will correct me in the next second.) If your mpiexec command line has 
> mca btl_tcp_if_include 10.148.0.0/16 it will use the 10.148.0.0/16 subnet in 
> with TCP transport, I think.
> (Jeff will cut my list subscription after that one, for spreading 
> misinformation.)
>
> In either case my impression is that you may have left a door open to the use 
> of non-IB (and non-IB-verbs) transport.
>
> Is 10.148.0.0/16 an InfiniBand subnet or an Ethernet subnet?
>
> Did you remember Jeff's suggestion from a while ago to avoid TCP (over 
> Ethernet or over IB), and stick to IB verbs?
>
>
> Is 10.148.0.0/16 the IB or the Ethernet subnet?
>
> On 03/02/2014 02:38 PM, Jeff Squyres (jsquyres) wrote:
>   >  Both 1.6.x and 1.7.x/1.8.x will need verbs.h to use the native verbs
>   >  network stack.
>   >
>   >  You can use emulated TCP over IB (e.g., using the OMPI TCP BTL), but
>   >  it's nowhere near as fast/efficient the native verbs network stack.
>   >
>
>
> You could force the use of IB verbs with
>
> -mca btl ^tcp
>
> or with
>
> -mca bt

Re: [OMPI users] OpenMPI job initializing problem

2014-03-06 Thread Beichuan Yan
Gus,

Yes, 10.148.0.0/16 is the IB subnet.

I did try others but none worked:
#export
TCP="--mca btl sm,openib"
No run, no output

#export
TCP="--mca btl sm,openib --mca btl_tcp_if_include 10.148.0.0/16"
No run, no output

Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Thursday, March 06, 2014 13:16
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

So, it looks like the program now runs, albeit with specific settings depending 
on whether you're using OMPI 1.6.5 or 1.7.4, right?

It looks like the problem now is performance, right?

System load affects performance, but unless the network is overwhelmed, or 
perhaps the Lustre file system is hanging or too slow, I would think that a 
walltime increase from 1min to 10min is not related to system load, but 
something else.

Do you remember the setup that gave you 1min walltime?
Was it the same that you sent below?
Do you happen to know which nodes?
Are you sharing nodes with other jobs, or are you running alone on the nodes?
Sharing with other processes may slow down your job.
If you request all cores in the node, PBS should give you a full node (unless 
they tricked PBS to think the nodes have more cores than they actually do).
How do you request the nodes in your #PBS directives?
Do you request nodes and ppn, or do you request procs?

I suggest that you do:
cat $PBS_NODEFILE
in your PBS script, just to document which nodes are actually given to you.

Also helpful to document/troubleshoot is to add -v and -tag-output to your 
mpiexec command line.


The difference in walltime could be due to some malfunction of IB HCAs on the 
nodes, for instance.
Since you are allowing (if I remember right) the use of TCP, OpenMPI will try 
to use any interfaces that you did not rule out.
If your mpiexec command line doesn't make any restriction, it will use anything 
available, if I remember right.
(Jeff will correct me in the next second.) If your mpiexec command line has 
--mca btl_tcp_if_include 10.148.0.0/16, it will use the 10.148.0.0/16 subnet 
with TCP transport, I think.
(Jeff will cut my list subscription after that one, for spreading 
misinformation.)

In either case my impression is that you may have left a door open to the use 
of non-IB (and non-IB-verbs) transport.

Is 10.148.0.0/16 an InfiniBand subnet or an Ethernet subnet?

Did you remember Jeff's suggestion from a while ago to avoid TCP (over Ethernet 
or over IB), and stick to IB verbs?


Is 10.148.0.0/16 the IB or the Ethernet subnet?

On 03/02/2014 02:38 PM, Jeff Squyres (jsquyres) wrote:
 > Both 1.6.x and 1.7.x/1.8.x will need verbs.h to use the native verbs
 > network stack.
 >
 > You can use emulated TCP over IB (e.g., using the OMPI TCP BTL), but
 > it's nowhere near as fast/efficient the native verbs network stack.
 >


You could force the use of IB verbs with

-mca btl ^tcp

or with

-mca btl sm,openib,self

on the mpiexec command line.

In this case, if any of the IB HCAs on the nodes is bad,
the job will abort with an error message, instead of running too slow
(if it is using other networks).

There are also ways to tell OMPI to do a more verbose output,
that may perhaps help diagnose the problem.
ompi_info | grep verbose
may give some hints (I confess I don't remember them).


Believe me, this did happen to me, i.e., to run MPI programs in a
cluster that had all sorts of non-homogeneous nodes, some with
faulty IB HCAs, some with incomplete OFED installation, some that
were not mounting shared file systems properly, etc.
[I didn't administer that one!]
Hopefully that is not the problem you are facing, but verbose output
may help anyways.

I hope this helps,
Gus Correa



On 03/06/2014 01:49 PM, Beichuan Yan wrote:
> 1. For $TMPDIR and $TCP, there are four combinations by commenting on/off 
> (note the system's default TMPDIR=/work3/yanb):
> export TMPDIR=/work1/home/yanb/tmp
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>
> 2. I tested the 4 combinations for OpenMPI 1.6.5 and OpenMPI 1.7.4 
> respectively for the pure-MPI mode (no OPENMP threads; 8 nodes, each node 
> runs 16 processes). The results are weird: of all 8 cases, only TWO of them 
> can run, but run so slow:
>
> OpenMPI 1.6.5:
> export TMPDIR=/work1/home/yanb/tmp
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> Warning: shared-memory, /work1/home/yanb/tmp/
> Run, take 10 minutes, slow
>
> OpenMPI 1.7.4:
> #export TMPDIR=/work1/home/yanb/tmp
> #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> Warning: shared-memory /work3/yanb/605832.SPIRIT/
> Run, take 10 minutess, slow
>
> So you see, a) openmpi 1.6.5 and 1.7.4 need different settings to run;
b) whether specifying TMPDIR, I got the shared memory warning.
>
> 3. But a few days ago, OpenMPI 1.6.5 worked great and took o

Re: [OMPI users] OpenMPI job initializing problem

2014-03-06 Thread Beichuan Yan
1. For $TMPDIR and $TCP, there are four combinations obtained by commenting the 
following on/off (note the system's default TMPDIR=/work3/yanb):
export TMPDIR=/work1/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"

2. I tested the 4 combinations for OpenMPI 1.6.5 and OpenMPI 1.7.4 respectively 
in the pure-MPI mode (no OpenMP threads; 8 nodes, each node runs 16 processes). 
The results are weird: of all 8 cases, only TWO of them run, and they run very 
slowly:

OpenMPI 1.6.5:
export TMPDIR=/work1/home/yanb/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
Warning: shared-memory file under /work1/home/yanb/tmp/
Runs, takes 10 minutes, slow

OpenMPI 1.7.4:
#export TMPDIR=/work1/home/yanb/tmp
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
Warning: shared-memory file under /work3/yanb/605832.SPIRIT/
Runs, takes 10 minutes, slow

So you see: a) OpenMPI 1.6.5 and 1.7.4 need different settings to run; b) whether 
or not I specify TMPDIR, I get the shared-memory warning.

3. But a few days ago, OpenMPI 1.6.5 worked great and took only 1 minute (now 
it takes 10 minutes). I am confused by the results. Does the system load level, 
its fluctuation, or PBS Pro affect OpenMPI performance?

Thanks,
Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Tuesday, March 04, 2014 08:48
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

So, from "df" it looks like /home is /work1, right?

Also, "mount" shows only /work[1-4], not the other
7 CWFS panfs (Panasas?), which apparently are not available in the compute 
nodes/blades.

I presume you have access and are using only some of the /work[1-4]
(lustre) file systems for all your MPI and other software installation, right? 
Not the panfs, right?

Awkward that it doesn't work, because lustre is supposed to be a parallel file 
system, highly available to all nodes (assuming it is mounted on all nodes).

It also shows a small /tmp with a tmpfs file system, which is volatile, in 
memory:

http://en.wikipedia.org/wiki/Tmpfs

I would guess they don't let you write there, so TMPDIR=/tmp may not be a 
possible option, but this is just a wild guess.
Or maybe OMPI requires an actual non-volatile file system to write its shared 
memory auxiliary files and other stuff that normally goes on /tmp?  [Jeff, 
Ralph, help!!] I kind of remember some old discussion on this list about this, 
but maybe it was in another list.

[You could ask the sys admin about this, and perhaps what he recommends to use 
to replace /tmp.]

Just in case they have some file system mount point mixup, you could perhaps try 
TMPDIR=/work1/yanb/tmp (rather than /home). You could also try 
TMPDIR=/work3/yanb/tmp, since if I remember right this is another file system 
you have access to (not sure anymore; it may have been in the previous emails).
Either way, you may need to create the tmp directory beforehand.

**

Any chances that this is an environment mixup?

Say, you may be inadvertently using the SGI MPI mpiexec. Using a 
/full/path/to/mpiexec in your job may clarify this.

"which mpiexec" will tell, but since the environment on the compute nodes may 
not be exactly the same as in the login node, it may not be reliable 
information.

Or perhaps you may not be pointing to the OMPI libraries?
Are you exporting PATH and LD_LIBRARY_PATH on .bashrc/.tcshrc, with the OMPI 
items (bin and lib) *PREPENDED* (not appended), so as to take precedence over 
other possible/SGI/pre-existent MPI items?

Those are pretty (ugly) common problems.

**

I hope this helps,
Gus Correa

On 03/03/2014 10:13 PM, Beichuan Yan wrote:
> 1. info from a compute node
> -bash-4.1$ hostname
> r32i1n1
> -bash-4.1$ df -h /home
> Filesystem            Size  Used Avail Use% Mounted on
> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>   1.2P  136T  1.1P  12% /work1
> -bash-4.1$ mount
> devpts on /dev/pts type devpts (rw,gid=5,mode=620)
> tmpfs on /tmp type tmpfs (rw,size=150m)
> none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
> cpuset on /dev/cpuset type cpuset (rw)
> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)
>
>
> 2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see 
> mpi-related temporary files there when the job gets started.
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus
> Correa
> Sent: Monday, March 03, 2014 18:23
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> Hi Beichuan
>
> OK, it says "unclassified.html"

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
1. info from a compute node
-bash-4.1$ hostname
r32i1n1
-bash-4.1$ df -h /home
Filesystem            Size  Used Avail Use% Mounted on
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
  1.2P  136T  1.1P  12% /work1
-bash-4.1$ mount
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /tmp type tmpfs (rw,size=150m)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
cpuset on /dev/cpuset type cpuset (rw)
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1 on /work1 type lustre (rw,flock)
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2 on /work2 type lustre (rw,flock)
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3 on /work3 type lustre (rw,flock)
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4 on /work4 type lustre (rw,flock)


2. For "export TMPDIR=/home/yanb/tmp", I created it beforehand, and I did see 
mpi-related temporary files there when the job gets started.

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 18:23
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

OK, it says "unclassified.html", so I presume it is not a problem.

The web site says the computer is an SGI ICE X.
I am not familiar to it, so what follows are guesses.

The SGI site brochure suggests that the nodes/blades have local disks:
https://www.sgi.com/pdfs/4330.pdf

The file systems prefixed with IP addresses (work[1-4]) and with panfs (cwfs 
and CWFS[1-6]) and a colon (:) are shared exports (not local), but not 
necessarily NFS (panfs may be Panasas?).
From this output it is hard to tell where /home is, but I would guess it is
also shared (not local).
Maybe "df -h /home" will tell.  Or perhaps "mount".

You may be logged in to a login/service node, so although it does have a /tmp 
(your ls / shows tmp), this doesn't guarantee that the compute nodes/blades 
also do.

Since your jobs failed when you specified TMPDIR=/tmp, I would guess /tmp 
doesn't exist on the nodes/blades, or is not writable.

Did you try to submit a job with, say, "mpiexec -np 16 ls -ld /tmp"?
This should tell whether /tmp exists on the nodes and whether it is writable.

A stupid question:
When you tried your job with this:

export TMPDIR=/home/yanb/tmp

Did you create the directory /home/yanb/tmp beforehand?

Anyway, you may need to ask the help of a system administrator of this machine.

Gus Correa

On 03/03/2014 07:43 PM, Beichuan Yan wrote:
> Gus,
>
> I am using this system: 
> http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know exactly 
> configurations of the file system. Here is the output of "df -h":
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda6 919G   16G  857G   2% /
> tmpfs  32G 0   32G   0% /dev/shm
> /dev/sda5 139M   33M  100M  25% /boot
> adfs3v-s:/adfs3/hafs14
>6.5T  678G  5.5T  11% /scratch
> adfs3v-s:/adfs3/hafs16
>6.5T  678G  5.5T  11% /var/spool/mail
> 10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
>1.2P  136T  1.1P  12% /work1
> 10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
>1.2P  793T  368T  69% /work4
> 10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
>1.2P  509T  652T  44% /work3
> 10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
>1.2P  521T  640T  45% /work2 
> panfs://172.16.0.10/CWFS
>728T  286T  443T  40% /p/cwfs
> panfs://172.16.1.61/CWFS1
>728T  286T  443T  40% /p/CWFS1
> panfs://172.16.0.210/CWFS2
>728T  286T  443T  40% /p/CWFS2
> panfs://172.16.1.125/CWFS3
>728T  286T  443T  40% /p/CWFS3
> panfs://172.16.1.224/CWFS4
>728T  286T  443T  40% /p/CWFS4
> panfs://172.16.1.224/CWFS5
>728T  286T  443T  40% /p/CWFS5
> panfs://172.16.1.224/CWFS6
>728T  286T  443T  40% /p/CWFS6
> panfs://172.16.1.224/CWFS7
>728T  286T  443T  40% /p/CWFS7
>
> 1. My home directory is /home/yanb.
> My simulation files are located at /work3/yanb.
> The default TMPDIR set by system is just /work3/yanb
>
> 2. I did try not to set TMPDIR and let it default, which is just case 1 and 
> case 2.
>Case1: #export TMPDIR=/home/yanb/tmp
>  TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> It gives no apparent reason.
>Case2: #export TMPDIR=/home/yanb/tmp
>  #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> It gives warning of shared memory file on network file system.
>
> 3. With "export TMPDIR=/tmp", the job gives the same, no apparent reason.
>
> 4. FYI, "ls /" gives:

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
Gus,

I am using this system: 
http://centers.hpc.mil/systems/unclassified.html#Spirit. I don't know the exact 
configuration of the file system. Here is the output of "df -h":
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda6 919G   16G  857G   2% /
tmpfs  32G 0   32G   0% /dev/shm
/dev/sda5 139M   33M  100M  25% /boot
adfs3v-s:/adfs3/hafs14
  6.5T  678G  5.5T  11% /scratch
adfs3v-s:/adfs3/hafs16
  6.5T  678G  5.5T  11% /var/spool/mail
10.148.18.45@o2ib:10.148.18.46@o2ib:/fs1
  1.2P  136T  1.1P  12% /work1
10.148.18.132@o2ib:10.148.18.133@o2ib:/fs4
  1.2P  793T  368T  69% /work4
10.148.18.104@o2ib:10.148.18.165@o2ib:/fs3
  1.2P  509T  652T  44% /work3
10.148.18.76@o2ib:10.148.18.164@o2ib:/fs2
  1.2P  521T  640T  45% /work2
panfs://172.16.0.10/CWFS
  728T  286T  443T  40% /p/cwfs
panfs://172.16.1.61/CWFS1
  728T  286T  443T  40% /p/CWFS1
panfs://172.16.0.210/CWFS2
  728T  286T  443T  40% /p/CWFS2
panfs://172.16.1.125/CWFS3
  728T  286T  443T  40% /p/CWFS3
panfs://172.16.1.224/CWFS4
  728T  286T  443T  40% /p/CWFS4
panfs://172.16.1.224/CWFS5
  728T  286T  443T  40% /p/CWFS5
panfs://172.16.1.224/CWFS6
  728T  286T  443T  40% /p/CWFS6
panfs://172.16.1.224/CWFS7
  728T  286T  443T  40% /p/CWFS7

1. My home directory is /home/yanb.
My simulation files are located at /work3/yanb.
The default TMPDIR set by system is just /work3/yanb

2. I did try not setting TMPDIR and letting it default, which is just cases 1 
and 2:
  Case 1: #export TMPDIR=/home/yanb/tmp
          TCP="--mca btl_tcp_if_include 10.148.0.0/16"
  The job fails with no apparent reason given.
  Case 2: #export TMPDIR=/home/yanb/tmp
          #TCP="--mca btl_tcp_if_include 10.148.0.0/16"
  The job gives the warning about a shared-memory file on a network file system.

3. With "export TMPDIR=/tmp", the job fails the same way, with no apparent 
reason given.

4. FYI, "ls /" gives:
ELTapps  cgroup  hafs1   hafs12  hafs2  hafs5  hafs8home   
lost+found  mnt  p  root selinux  tftpboot  varwork3
admin  bin   dev hafs10  hafs13  hafs3  hafs6  hafs9libmedia
   net  panfs  sbin srv  tmp   work1  work4
appboot  etc hafs11  hafs15  hafs4  hafs7  hafs_x86_64  lib64  misc 
   opt  proc   scratch  sys  usr   work2  workspace

Beichuan

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Monday, March 03, 2014 17:24
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

Hi Beichuan

If you are using the university cluster, chances are that /home is not local, 
but on an NFS share, or perhaps Lustre (which you may have mentioned before, I 
don't remember).

Maybe "df -h" will show what is local what is not.
It works for NFS, it prefixes file systems with the server name, but I don't 
know about Lustre.
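
If "df -h" is ambiguous, "df -T" (or GNU stat) prints the filesystem type 
directly. A quick sketch, assuming the GNU coreutils/util-linux tools shipped 
with RHEL 6:

# print the filesystem type of the directory in question
df -hT /home/yanb/tmp
# alternative: print the type name only (older coreutils may show Lustre as an
# unknown magic number rather than "lustre")
stat -f -c %T /home/yanb/tmp

Local types show up as ext3/ext4/xfs/tmpfs; network types show up as nfs, 
lustre, or panfs.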

Did you try just not setting TMPDIR and letting it default?
If the default TMPDIR is on Lustre (did you say this? anyway, I don't
remember), you could perhaps try to force it to /tmp:
export TMPDIR=/tmp
If the cluster nodes are diskful, /tmp is likely to exist and be local to the 
cluster nodes (a sketch of such a job script follows below).
[But the cluster nodes may be diskless ... :( ]
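
A sketch of the whole job script with that change, reusing the mpirun line from 
your earlier mail (the #PBS resource line is only a placeholder; adjust it to 
your queueing setup):

#!/bin/bash
#PBS -l select=4:ncpus=16:mpiprocs=16
# keep Open MPI's session directory and shared-memory backing file node-local
export TMPDIR=/tmp
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt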

I hope this helps,
Gus Correa

On 03/03/2014 07:10 PM, Beichuan Yan wrote:
> How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local 
> filesystem? I don't know how to tell whether a directory is on a local file 
> system or a network file system.
>
> -Original Message-
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff 
> Squyres (jsquyres)
> Sent: Monday, March 03, 2014 16:57
> To: Open MPI Users
> Subject: Re: [OMPI users] OpenMPI job initializing problem
>
> How about setting TMPDIR to a local filesystem?
>
>
> On Mar 3, 2014, at 3:43 PM, Beichuan Yan<beichuan@colorado.edu>  wrote:
>
>> I agree there are two cases for pure-MPI mode: 1. the job fails with no 
>> apparent error message; 2. the job complains about the shared-memory file 
>> being on a network file system, which can be resolved by "export 
>> TMPDIR=/home/yanb/tmp" (/home/yanb/tmp is my local directory). The default 
>> TMPDIR points to a Lustre directory.
>>
>> There is no other output. I checked my job with "qstat -n" and found that the 
>> processes were never actually started on the compute nodes, even though PBS 
>> Pro had "started" my job.
>>
>> Beichuan
>>
>>> 3. Then I tested pure-MPI mode: OpenMP is turned off, and each compute node 
>>> runs 16 processes (so MPI's shared memory is clearly used). Fo

Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
How do I set TMPDIR to a local filesystem? Is /home/yanb/tmp a local filesystem? 
I don't know how to tell whether a directory is on a local file system or a 
network file system.

-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Jeff Squyres 
(jsquyres)
Sent: Monday, March 03, 2014 16:57
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

How about setting TMPDIR to a local filesystem?


On Mar 3, 2014, at 3:43 PM, Beichuan Yan <beichuan@colorado.edu> wrote:

> I agree there are two cases for pure-MPI mode: 1. the job fails with no apparent 
> error message; 2. the job complains about the shared-memory file being on a 
> network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp" 
> (/home/yanb/tmp is my local directory). The default TMPDIR points to a Lustre 
> directory.
> 
> There is no other output. I checked my job with "qstat -n" and found that the 
> processes were never actually started on the compute nodes, even though PBS Pro 
> had "started" my job.
> 
> Beichuan
> 
> 3. Then I tested pure-MPI mode: OpenMP is turned off, and each compute node 
> runs 16 processes (so MPI's shared memory is clearly used). Four combinations 
> of "TMPDIR" and "TCP" were tested:
>> case 1:
>> #export TMPDIR=/home/yanb/tmp
>> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
>> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d 
>> input.txt
>> output:
>> Start Prologue v2.5 Mon Mar  3 15:47:16 EST 2014 End Prologue v2.5 
>> Mon Mar  3 15:47:16 EST 2014
>> -bash: line 1: 448597 Terminated  
>> /var/spool/PBS/mom_priv/jobs/602244.service12.SC
>> Start Epilogue v2.5 Mon Mar  3 15:50:51 EST 2014 Statistics 
>> cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltim
>> e
>> =00:03:24 End Epilogue v2.5 Mon Mar  3 15:50:52 EST 2014
> 
> It looks like you have two general cases:
> 
> 1. The job fails for no apparent reason (like above), or
> 2. The job complains that your TMPDIR is on a shared filesystem
> 
> Right?
> 
> I think the real issue, then, is to figure out why your jobs are failing with 
> no output.
> 
> Is there anything in the stderr output?
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] OpenMPI job initializing problem

2014-03-03 Thread Beichuan Yan
I agree there are two cases for pure-MPI mode: 1. the job fails with no apparent 
error message; 2. the job complains about the shared-memory file being on a 
network file system, which can be resolved by "export TMPDIR=/home/yanb/tmp" 
(/home/yanb/tmp is my local directory). The default TMPDIR points to a Lustre 
directory.

There is no other output. I checked my job with "qstat -n" and found that the 
processes were never actually started on the compute nodes, even though PBS Pro 
had "started" my job.

Beichuan

> 3. Then I tested pure-MPI mode: OpenMP is turned off, and each compute node 
> runs 16 processes (so MPI's shared memory is clearly used). Four combinations 
> of "TMPDIR" and "TCP" were tested:
> case 1:
> #export TMPDIR=/home/yanb/tmp
> TCP="--mca btl_tcp_if_include 10.148.0.0/16"
> mpirun $TCP -np 64 -npernode 16 -hostfile $PBS_NODEFILE ./paraEllip3d 
> input.txt
> output:
> Start Prologue v2.5 Mon Mar  3 15:47:16 EST 2014 End Prologue v2.5 Mon 
> Mar  3 15:47:16 EST 2014
> -bash: line 1: 448597 Terminated  
> /var/spool/PBS/mom_priv/jobs/602244.service12.SC
> Start Epilogue v2.5 Mon Mar  3 15:50:51 EST 2014 Statistics  
> cpupercent=0,cput=00:00:00,mem=7028kb,ncpus=128,vmem=495768kb,walltime
> =00:03:24 End Epilogue v2.5 Mon Mar  3 15:50:52 EST 2014

It looks like you have two general cases:

1. The job fails for no apparent reason (like above), or
2. The job complains that your TMPDIR is on a shared filesystem

Right?

I think the real issue, then, is to figure out why your jobs are failing with 
no output.

Is there anything in the stderr output?

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] OpenMPI job initializing problem

2014-03-02 Thread Beichuan Yan
Ralph and Gus,

1. Thank you for your suggestion. I built Open MPI 1.6.5 with the following 
command: 
./configure --prefix=/work4/projects/openmpi/openmpi-1.6.5-gcc-compilers-4.7.3 
--with-tm=/opt/pbs/default --with-openib=  --with-openib-libdir=/usr/lib64

In my job script, I need to specify the IB subnet like this:
TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -np 64 -hostfile $PBS_NODEFILE ./paraEllip3d input.txt

Then my job can get initialized and run correctly each time!

2. However, when I build Open MPI 1.7.4 with another command (in order to 
test/compare the shared-memory performance of Open MPI):
./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 
--with-tm=/opt/pbs/default --with-verbs=  --with-verbs-libdir=/usr/lib64

It fails with the following error:

== Modular Component Architecture (MCA) setup

checking for subdir args...  
'--prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3' 
'--with-tm=/opt/pbs/default' '--with-verbs=' '--with-verbs-libdir=/usr/lib64' 
'CC=gcc' 'CXX=g++'
checking --with-verbs value... simple ok (unspecified)
checking --with-verbs-libdir value... sanity check ok (/usr/lib64)
configure: WARNING: Could not find verbs.h in the usual locations under
configure: error: Cannot continue

Our system is Red Hat 6.4. Do we need to install more InfiniBand packages? 
Can you please advise?
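
My own guess, in case that is the issue: on a stock RHEL 6.x node the verbs.h 
header usually comes from the libibverbs-devel package, so the fix might look 
roughly like the sketch below (package name assumed; a vendor OFED stack may 
call it something else):

# install the InfiniBand verbs development headers
sudo yum install libibverbs-devel
# then re-run configure, which should now find infiniband/verbs.h
./configure --prefix=/work4/projects/openmpi/openmpi-1.7.4-gcc-compilers-4.7.3 \
--with-tm=/opt/pbs/default --with-verbs --with-verbs-libdir=/usr/lib64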

Thanks,
Beichuan Yan


-Original Message-
From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Gus Correa
Sent: Friday, February 28, 2014 15:59
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI job initializing problem

HI Beichuan

To add to what Ralph said,
the RHEL OpenMPI package probably wasn't built with PBS Pro support either.
Besides, OMPI 1.5.4 (RHEL version) is old.

**

You will save yourself time and grief if you read the installation FAQs, before 
you install from the source tarball:

http://www.open-mpi.org/faq/?category=building

However, as Ralph said, that is your best bet, and it is quite easy to get 
right.


See this FAQ on how to build with PBS Pro support:

http://www.open-mpi.org/faq/?category=building#build-rte-tm

And this one on how to build with Infiniband support:

http://www.open-mpi.org/faq/?category=building#build-p2p

Here is how to select the installation directory (--prefix):

http://www.open-mpi.org/faq/?category=building#easy-build

Here is how to select the compilers (gcc, g++, and gfortran are fine):

http://www.open-mpi.org/faq/?category=building#build-compilers
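
Putting those FAQ items together, a configure line for your setup might look 
roughly like the sketch below (untested on my side; the prefix is only an 
example, and the PBS path is taken from your earlier mail):

./configure --prefix=$HOME/local/openmpi-1.6.5 \
--with-tm=/opt/pbs/default \
--with-openib --with-openib-libdir=/usr/lib64 \
CC=gcc CXX=g++ F77=gfortran FC=gfortran
make -j4 all && make install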

I hope this helps,
Gus Correa

On 02/28/2014 12:36 PM, Ralph Castain wrote:
> Almost certainly, the redhat package wasn't built with matching 
> infiniband support and so we aren't picking it up. I'd suggest 
> downloading the latest 1.7.4 or 1.7.5 nightly tarball, or even the 
> latest 1.6 tarball if you want the stable release, and build it 
> yourself so you *know* it was built for your system.
>
>
> On Feb 28, 2014, at 9:20 AM, Beichuan Yan <beichuan@colorado.edu 
> <mailto:beichuan@colorado.edu>> wrote:
>
>> Hi there,
>> I am running jobs on a cluster with an InfiniBand interconnect. The site 
>> installed OpenMPI v1.5.4 via the Red Hat 6 yum package. My problem is 
>> that although my jobs get queued and started by PBS Pro quickly, 
>> most of the time they don't really run (occasionally they do), and they 
>> give error output like this (even though there are plenty of CPU/IB 
>> resources available):
>> [r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
>> connect() to 192.168.159.156 failed: Connection refused (111)
>> And even when a job does get started and runs well, it prints this
>> warning:
>> --
>> WARNING: There was an error initializing an OpenFabrics device.
>> Local host: r1i2n6
>> Local device: mlx4_0
>> --
>> 1. Here is the info from one of the compute nodes:
>> -bash-4.1$ /sbin/ifconfig
>> eth0 Link encap:Ethernet HWaddr 8C:89:A5:E3:D2:96 inet 
>> addr:192.168.159.205 Bcast:192.168.159.255 Mask:255.255.255.0
>> inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link UP BROADCAST 
>> RUNNING MULTICAST MTU:1500 Metric:1 RX packets:48879864 errors:0 
>> dropped:0 overruns:17 frame:0 TX packets:39286060 errors:0 dropped:0 
>> overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:54771093645 (51.0 GiB) TX bytes:37512462596 (34.9 GiB)
>> Memory:dfc0-dfc2
>> Ifconfig uses the ioctl access method to get the full address 
>> informatio

[OMPI users] OpenMPI job initializing problem

2014-02-28 Thread Beichuan Yan
Hi there,

I am running jobs on a cluster with an InfiniBand interconnect. The site installed 
OpenMPI v1.5.4 via the Red Hat 6 yum package. My problem is that although my jobs 
get queued and started by PBS Pro quickly, most of the time they don't really 
run (occasionally they do), and they give error output like this (even though 
there are plenty of CPU/IB resources available):

[r2i6n7][[25564,1],0][btl_tcp_endpoint.c:638:mca_btl_tcp_endpoint_complete_connect]
 connect() to 192.168.159.156 failed: Connection refused (111)

And even when a job does get started and runs well, it prints this warning:
--
WARNING: There was an error initializing an OpenFabrics device.
  Local host:   r1i2n6
  Local device: mlx4_0
--

1. Here is the info from one of the compute nodes:
-bash-4.1$ /sbin/ifconfig
eth0  Link encap:Ethernet  HWaddr 8C:89:A5:E3:D2:96
  inet addr:192.168.159.205  Bcast:192.168.159.255  Mask:255.255.255.0
  inet6 addr: fe80::8e89:a5ff:fee3:d296/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:48879864 errors:0 dropped:0 overruns:17 frame:0
  TX packets:39286060 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:54771093645 (51.0 GiB)  TX bytes:37512462596 (34.9 GiB)
  Memory:dfc0-dfc2

Ifconfig uses the ioctl access method to get the full address information, 
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed 
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0   Link encap:InfiniBand  HWaddr 
80:00:00:48:FE:C0:00:00:00:00:00:00:00:00:00:00:00:00:00:00
  inet addr:10.148.0.114  Bcast:10.148.255.255  Mask:255.255.0.0
  inet6 addr: fe80::202:c903:fb:3489/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:43807414 errors:0 dropped:0 overruns:0 frame:0
  TX packets:10534050 errors:0 dropped:24 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:47824448125 (44.5 GiB)  TX bytes:44764010514 (41.6 GiB)

loLink encap:Local Loopback
  inet addr:127.0.0.1  Mask:255.0.0.0
  inet6 addr: ::1/128 Scope:Host
  UP LOOPBACK RUNNING  MTU:16436  Metric:1
  RX packets:17292 errors:0 dropped:0 overruns:0 frame:0
  TX packets:17292 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:0
  RX bytes:1492453 (1.4 MiB)  TX bytes:1492453 (1.4 MiB)

-bash-4.1$ chkconfig --list iptables
iptables        0:off   1:off   2:on    3:on    4:on    5:on    6:off
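
Since the ifconfig output above notes that ifconfig is obsolete, here is roughly 
what the same queries look like with the ip tool from iproute2 (available on 
RHEL 6):

# addresses and state of the Ethernet and InfiniBand interfaces
ip addr show eth0
ip addr show ib0
# full 20-byte InfiniBand link-layer address of the IB port
ip link show ib0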

2. I tried the various parameters below, but none of them ensures that my jobs 
get initialized and run:
#TCP="--mca btl ^tcp"
#TCP="--mca btl self,openib"
#TCP="--mca btl_tcp_if_exclude lo"
#TCP="--mca btl_tcp_if_include eth0"
#TCP="--mca btl_tcp_if_include eth0, ib0"
#TCP="--mca btl_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8 --mca 
oob_tcp_if_exclude 192.168.0.0/24,127.0.0.1/8"
#TCP="--mca btl_tcp_if_include 10.148.0.0/16"
mpirun $TCP -hostfile $PBS_NODEFILE -np 8 ./paraEllip3d input.txt

3. Then I turned to Intel MPI, which surprisingly starts and runs my job 
correctly each time (though it is a little slower than OpenMPI, maybe 15% 
slower, but it works each time).

Can you please advise? Many thanks.

Sincerely,
Beichuan Yan