Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-08 Thread Jeff Squyres
We talked about this issue on the weekly OMPI engineering teleconf today.

It seems like it would be a good idea to bring over the new shared memory 
revamp to the v1.5 series before it transitions to v1.6 so that it can avoid 
network-mounted /tmp filesystem issues.  LANL will be evaluating this; the gut 
feeling was that it would not be a lot of work to bring this over to the v1.5 
branch.

I've created https://svn.open-mpi.org/trac/ompi/ticket/2908 to track the issue.



On Nov 8, 2011, at 8:21 AM, Jeff Squyres wrote:

> On Nov 7, 2011, at 12:12 PM, Blosch, Edwin L wrote:
> 
>> Thanks for the valuable input. I'll change to a wait-and-watch approach.
>> 
>> The FAQ on tuning sm says "If the session directory is located on a network 
>> filesystem, the shared memory BTL latency will be extremely high."  And the 
>> title is 'Why am I seeing incredibly poor performance...'.  So I made the 
>> leap that this configuration must be avoided at all costs...
> 
> (sorry for jumping in late; it's the week before SC, and lots of deadlines 
> are approaching!)
> 
> This is definitely true: if OMPI's mmap files are located on a network 
> filesystem (such as if /tmp is NFS-mounted), your latencies will be higher.  
> I don't claim to know all the exact reasons why, but I have personally seen 
> enough empirical evidence to believe it.  Perhaps newer versions of 
> Linux/NFS/whatever have made the issue better.  But I'm quite sure that it 
> was happening; that's why we put in that warning.
> 
> Here are a few points to add to this discussion, in no particular order:
> 
> 1. Keep in mind the difference between the session directory and the shared 
> memory backing files: the session directory contains some meta data that OMPI 
> processes need.  In general, most of that data is not performance-critical, 
> such that if it's on a networked filesystem, general MPI performance will not 
> be affected.  In 1.4.x and 1.5.x, the shared memory mmap files are also 
> located in the session directory, and as described above, we have definitely 
> seen a negative MPI latency performance impact when this file is on a 
> networked file system.
> 
> 2. In the upcoming OMPI v1.7, we revamped the shared memory backing system 
> such that mmap does not have to be used, and therefore will not care if /tmp 
> is on a networked filesystem.
> 
> 3. I don't know whether /tmp on a networked filesystem is 100% "proper" or 
> not.  I know that some people do it, but there are uniqueness requirements 
> that can definitely be violated in various other tools in this case.  OMPI 
> may not be the only software package that can run into problems here, even if 
> the problems are rare and difficult to track down (e.g., because two 
> processes with the same PID on different machines tried to use the same 
> filename in /tmp, or attempts to use file locking, etc.).
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-08 Thread Jeff Squyres
On Nov 7, 2011, at 12:12 PM, Blosch, Edwin L wrote:

> Thanks for the valuable input. I'll change to a wait-and-watch approach.
> 
> The FAQ on tuning sm says "If the session directory is located on a network 
> filesystem, the shared memory BTL latency will be extremely high."  And the 
> title is 'Why am I seeing incredibly poor performance...'.  So I made the 
> leap that this configuration must be avoided at all costs...

(sorry for jumping in late; it's the week before SC, and lots of deadlines are 
approaching!)

This is definitely true: if OMPI's mmap files are located on a network 
filesystem (such as if /tmp is NFS-mounted), your latencies will be higher.  I 
don't claim to know all the exact reasons why, but I have personally seen 
enough empirical evidence to believe it.  Perhaps newer versions of 
Linux/NFS/whatever have made the issue better.  But I'm quite sure that it was 
happening; that's why we put in that warning.

Here are a few points to add to this discussion, in no particular order:

1. Keep in mind the difference between the session directory and the shared 
memory backing files: the session directory contains some meta data that OMPI 
processes need.  In general, most of that data is not performance-critical, 
such that if it's on a networked filesystem, general MPI performance will not 
be affected.  In 1.4.x and 1.5.x, the shared memory mmap files are also located 
in the session directory, and as described above, we have definitely seen a 
negative MPI latency performance impact when this file is on a networked file 
system.

2. In the upcoming OMPI v1.7, we revamped the shared memory backing system such 
that mmap does not have to be used, and therefore will not care if /tmp is on a 
networked filesystem.

3. I don't know whether /tmp on a networked filesystem is 100% "proper" or 
not.  I know that some people do it, but there are uniqueness requirements that 
can definitely be violated in various other tools in this case.  OMPI may not 
be the only software package that can run into problems here, even if the 
problems are rare and difficult to track down (e.g., because two processes with 
the same PID on different machines tried to use the same filename in /tmp, or 
attempts to use file locking, etc.).
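
To make the points above concrete, here is a minimal shell sketch for checking what kind of filesystem a candidate session directory location sits on. It assumes GNU coreutils stat; the function names and the list of network filesystem types are illustrative assumptions, not the check that OMPI itself performs.

```shell
# Sketch only: classify a filesystem type name as "network" or not.
# The list of types below is an illustrative assumption, not the list
# that Open MPI itself checks.
is_network_fs_type() {
    case "$1" in
        nfs|nfs3|nfs4|lustre|panfs|gpfs|cifs|smb2|afs) return 0 ;;
        *) return 1 ;;
    esac
}

# Report the type of the filesystem holding a path (GNU coreutils stat assumed).
fs_type_of() {
    stat -f -c %T -- "$1"
}

# Print a one-line verdict for a candidate session directory location.
check_session_dir() {
    dir=$1
    fstype=$(fs_type_of "$dir") || return 2
    if is_network_fs_type "$fstype"; then
        echo "$dir: $fstype (network filesystem)"
    else
        echo "$dir: $fstype (not in the network list)"
    fi
}
```

Running check_session_dir /tmp on a compute node prints the reported filesystem type next to the verdict, which is a quick way to see whether the mmap backing files would land on a network mount.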

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-07 Thread Blosch, Edwin L
Thanks for the valuable input. I'll change to a wait-and-watch approach.

The FAQ on tuning sm says "If the session directory is located on a network 
filesystem, the shared memory BTL latency will be extremely high."  And the 
title is 'Why am I seeing incredibly poor performance...'.  So I made the leap 
that this configuration must be avoided at all costs...

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of David Singleton
Sent: Sunday, November 06, 2011 4:15 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage


On 11/05/2011 09:11 AM, Blosch, Edwin L wrote:
..
>
> I know where you're coming from, and I probably didn't title the post 
> correctly because I wasn't sure what to ask.  But I definitely saw it, and 
> still see it, as an OpenMPI issue.  Having /tmp mounted over NFS on a 
> stateless cluster is not a broken configuration, broadly speaking. The 
> vendors made those decisions and presumably that's how they do it for other 
> customers as well. There are two other (Platform/HP) MPI applications that 
> apparently work normally. But OpenMPI doesn't work normally. So it's 
> deficient.
>

I'm also concerned that there is a bit of an over-reaction to network
filesystems.  Stores to mmap'd files do not instantly turn into filesystem
writes - there are dirty_writeback parameters to control how often
writes occur, and it's typically 5-20 seconds.  Ideally, memory or a local
disk is used for session directories but, in many cases, you just won't
notice a performance hit from network filesystems - we didn't when we
tested session directories on Lustre.  If your app is one of the handful
that are slowed by OS jitter at megascale, then you may well notice.
Obviously, it's something to test.
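
As a rough illustration of the writeback interval mentioned above, the sketch below reads the Linux dirty-page writeback settings and prints them in seconds. The two /proc/sys/vm files are standard Linux sysctl entries expressed in hundredths of a second; the function names are made up for this example.

```shell
# Convert a value in centiseconds (hundredths of a second) to seconds.
centisecs_to_seconds() {
    printf '%d.%02d\n' "$(( $1 / 100 ))" "$(( $1 % 100 ))"
}

# Print the kernel's dirty-page writeback settings in seconds, if readable.
show_writeback_intervals() {
    for name in dirty_writeback_centisecs dirty_expire_centisecs; do
        file=/proc/sys/vm/$name
        if [ -r "$file" ]; then
            echo "$name: $(centisecs_to_seconds "$(cat "$file")") s"
        fi
    done
}
```

Calling show_writeback_intervals on a compute node shows how often the flusher wakes up and how old dirty pages must be before they are written out, which gives a feel for how rarely stores to an mmap'd file actually reach the filesystem.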

For our 1.5 install, I removed Lustre from the list of filesystem types
that generate the warning message about network filesystems.  It would be
nice if it was a site choice whether or not to produce that message and
when.

David



Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread Blosch, Edwin L
Thanks, Ralph, 

> Having a local /tmp is typically required by Linux for proper operation as 
> the OS itself needs to ensure its usage is protected, as was previously 
> stated and is reiterated in numerous books on managing Linux systems. 

There is a /tmp, but it's not local.  I don't know if that passes muster as a 
proper setup or not.  I'll gift a Linux book for Christmas to the two reputable 
vendors who have configured diskless clusters for us where /tmp was not local, 
and both /usr/tmp and /var/tmp were linked to /tmp. :)

> IMO, discussions of how to handle /tmp on diskless systems go beyond the 
> bounds of OMPI - it is a Linux system management issue that is covered in 
> depth by material on that subject. Explaining how the session directory is 
> used, and why we now include a test and warning if the session directory is 
> going to land on a networked file system (pretty sure this is now in the 1.5 
> series, but certainly is in the trunk for future releases), would be 
> reasonable.

I know where you're coming from, and I probably didn't title the post correctly 
because I wasn't sure what to ask.  But I definitely saw it, and still see it, 
as an OpenMPI issue.  Having /tmp mounted over NFS on a stateless cluster is 
not a broken configuration, broadly speaking. The vendors made those decisions 
and presumably that's how they do it for other customers as well. There are two 
other (Platform/HP) MPI applications that apparently work normally. But OpenMPI 
doesn't work normally. So it's deficient.

I'll ask the vendor to rebuild the stateless image with a /usr/tmp partition so 
that the end-user application in question can then set orte_tmpdir_base to 
/usr/tmp and all will then work beautifully...
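
For illustration, redirecting the session directory as described above could look roughly like the sketch below. The orte_tmpdir_base parameter is the one named in this message; the helper function, the process count, and ./my_app are hypothetical.

```shell
# Emit mpirun arguments that point the session directory base at a given
# directory, after checking that the directory exists and is writable.
session_base_args() {
    base=$1
    if [ ! -d "$base" ] || [ ! -w "$base" ]; then
        echo "session_base_args: $base is not a writable directory" >&2
        return 1
    fi
    printf '%s\n' "--mca orte_tmpdir_base $base"
}
```

A job script might then run something like: mpirun $(session_base_args /usr/tmp) -np 16 ./my_app. The unquoted substitution relies on word splitting, so this sketch only suits paths without spaces. MCA parameters can generally also be set through environment variables of the form OMPI_MCA_<name>, for example OMPI_MCA_orte_tmpdir_base=/usr/tmp.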

Thanks again,

Ed





Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread David Turner

I should have been more careful.  When we first started using OpenMPI,
version 1.4.1, there was a bug that caused session directories to be
left behind.  This was fixed in subsequent releases (and via a patch
for 1.4.1).

Our batch epilogue still removes everything in /tmp that belongs to the
owner of the batch job.  It is invoked after the user's application has
terminated, so the session directories are already gone by that time.
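
A minimal sketch of the file-cleanup part of such an epilogue is shown below. It is not our actual epilogue script; the function name and argument order are assumptions, it relies on the -mindepth and -maxdepth options of GNU or BSD find, and it leaves out the orphan-process handling.

```shell
# Remove every top-level entry in a directory that is owned by the given
# numeric user ID. Sketch only; a production epilogue would need more care.
purge_owned_entries() {
    dir=$1
    uid=$2
    # Refuse obviously dangerous or empty arguments.
    case "$dir" in
        ''|/) echo "purge_owned_entries: refusing to clean '$dir'" >&2; return 1 ;;
    esac
    [ -n "$uid" ] || return 1
    find "$dir" -mindepth 1 -maxdepth 1 -user "$uid" -exec rm -rf -- {} +
}
```

An epilogue could call purge_owned_entries /tmp with the numeric ID of the job owner once the user's processes have exited. How the batch system passes the job owner to the epilogue is site-specific and not shown here.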

Sorry for the confusion!

On 11/4/11 3:43 AM, TERRY DONTJE wrote:

David, are you saying your jobs consistently leave behind session files
after the job exits? It really shouldn't even in the case when a job
aborts, I thought, mpirun took great pains to cleanup after itself. Can
you tell us what version of OMPI you are running with? I think I could
see kill -9 of mpirun and processes below would cause turds to be left
behind.

--td

On 11/4/2011 2:37 AM, David Turner wrote:

% df /tmp
Filesystem 1K-blocks Used Available Use% Mounted on
- 12330084 822848 11507236 7% /
% df /
Filesystem 1K-blocks Used Available Use% Mounted on
- 12330084 822848 11507236 7% /

That works out to 11GB. But...

The compute nodes have 24GB. Freshly booted, about 3.2GB is
consumed by the kernel, various services, and the root file system.
At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.

I would imagine that the size of the session directories depends on a
number of factors; perhaps the developers can comment on that. I have
only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete
with the application for available memory.

On 11/3/11 8:40 PM, Ed Blosch wrote:

Thanks very much, exactly what I wanted to hear. How big is /tmp?

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of David Turner
Sent: Thursday, November 03, 2011 6:36 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage

I'm not a systems guy, but I'll pitch in anyway. On our cluster,
all the compute nodes are completely diskless. The root file system,
including /tmp, resides in memory (ramdisk). OpenMPI puts these
session directories therein. All our jobs run through a batch
system (torque). At the conclusion of each batch job, an epilogue
process runs that removes all files belonging to the owner of the
current batch job from /tmp (and also looks for and kills orphan
processes belonging to the user). This epilogue had to be written
by our systems staff.

I believe this is a fairly common configuration for diskless
clusters.

On 11/3/11 4:09 PM, Blosch, Edwin L wrote:

Thanks for the help. A couple follow-up-questions, maybe this starts to
go outside OpenMPI:

What's wrong with using /dev/shm? I think you said earlier in this
thread that this was not a safe place.

If the NFS-mount point is moved from /tmp to /work, would a /tmp
magically appear in the filesystem for a stateless node? How big would
it be, given that there is no local disk, right? That may be something
I have to ask the vendor, which I've tried, but they don't quite seem
to get the question.


Thanks




-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
Sent: Thursday, November 03, 2011 5:22 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less
node /tmp for OpenMPI usage



On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:


I might be missing something here. Is there a side-effect or
performance loss if you don't use the sm btl? Why would it exist if
there is a wholly equivalent alternative? What happens to traffic that
is intended for another process on the same node?

There is a definite performance impact, and we wouldn't recommend doing
what Eugene suggested if you care about performance.

The correct solution here is to get your sys admin to make /tmp local.
Making /tmp NFS mounted across multiple nodes is a major "faux pas" in
the Linux world - it should never be done, for the reasons stated by
Jeff.




Thanks


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Eugene Loh
Sent: Thursday, November 03, 2011 1:23 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage


Right. Actually "--mca btl ^sm". (Was missing "btl".)
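
As a small illustration of the corrected option, the sketch below builds the exclusion argument for a named BTL. The helper name is hypothetical, as are the process count and ./my_app in the usage note. Keep in mind the performance caveat raised elsewhere in this thread about running without the sm BTL.

```shell
# Build the mpirun option that excludes one BTL component by name.
btl_exclude_args() {
    printf '%s\n' "--mca btl ^$1"
}
```

On a command line the equivalent would be: mpirun --mca btl '^sm' -np 8 ./my_app. Quoting the ^sm value is a cheap precaution, since some shells give the caret a special meaning (for example zsh with extended globbing enabled).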

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:

I don't tell OpenMPI what BTLs to use. The default uses sm and puts a
session file on /tmp, which is NFS-mounted and thus not a good choice.

Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Eugene Loh
Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI user

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread Ralph Castain

On Nov 4, 2011, at 10:19 AM, Blosch, Edwin L wrote:

> OK, I wouldn't have guessed that the space for /tmp isn't actually in RAM 
> until it's needed.  That's the key piece of knowledge I was missing; I really 
> appreciate it.  So you can allow /tmp to be reasonably sized, but if you 
> aren't actually using it, then it doesn't take up 11 GB of RAM.  And you 
> prevent users from crashing the node by setting mem limit to 4 GB less than 
> the available memory. Got it.
> 
> I agree with your earlier comment:  these are fairly common systems now.  We 
> have program- and owner-specific disks where I work, and after the program 
> ends, the disks are archived or destroyed.  Before the stateless 
> configuration option, the entire computer, nodes and switches as well as 
> disks, were archived or destroyed after each program.  Not too cost-effective.
> 
> Is this a reasonable final summary? :  OpenMPI uses temporary files in such a 
> way that it is performance-critical that these so-called session files, used 
> for shared-memory communications, must be "local".  For state-less clusters, 
> this means the node image must include a /tmp or /wrk partition, 
> intelligently sized so as not to enable an application to exhaust the 
> physical memory of the node, and care must be taken not to mask this 
> in-memory /tmp with an NFS mounted filesystem.  


> It is not uncommon for cluster enablers to exclude /tmp from a typical base 
> Linux filesystem image or mount it over NFS, as a means of providing users 
> with a larger-sized /tmp that is not limited to a fraction of the node's 
> physical memory, or to avoid garbage accumulation in /tmp taking up the 
> physical RAM.

Not sure I agree with this statement, but it is irrelevant here.

>  But not having /tmp or mounting it over NFS is not a viable stateless-node 
> configuration option if you intend to run OpenMPI. Instead you could have a 
> /bigtmp which is NFS-mounted and a /tmp which is local, for example.
> Starting in OpenMPI 1.7.x, shared-memory 
> communication will no longer go through memory-mapped files, and 
> vendors/users will no longer need to be vigilant concerning this OpenMPI 
> performance requirement on stateless node configuration. 

Having a local /tmp is typically required by Linux for proper operation as the 
OS itself needs to ensure its usage is protected, as was previously stated and 
is reiterated in numerous books on managing Linux systems. The "usual" way of 
dealing with what you describe is for sys admins to add a /usr/tmp space which 
is solely intended for use by users, with the understanding that they may stomp 
on each other if they don't take care in naming their files. This is why we 
provided the ability to redirect the placement of the session directories.

> 
> 
> Is that a reasonable summary?
> 
> If so, would it be helpful to include this as an FAQ entry under General 
> category?  Or the "shared memory" category?  Or the "troubleshooting" 
> category?

IMO, discussions of how to handle /tmp on diskless systems go beyond the 
bounds of OMPI - it is a Linux system management issue that is covered in depth 
by material on that subject. Explaining how the session directory is used, and 
why we now include a test and warning if the session directory is going to land 
on a networked file system (pretty sure this is now in the 1.5 series, but 
certainly is in the trunk for future releases), would be reasonable.

> 
> 
> Thanks
> 
> 
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of David Turner
> Sent: Friday, November 04, 2011 1:38 AM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
> 
> % df /tmp
> Filesystem   1K-blocks    Used Available Use% Mounted on
> -             12330084  822848  11507236   7% /
> % df /
> Filesystem   1K-blocks    Used Available Use% Mounted on
> -             12330084  822848  11507236   7% /
> 
> That works out to 11GB.  But...
> 
> The compute nodes have 24GB.  Freshly booted, about 3.2GB is
> consumed by the kernel, various services, and the root file system.
> At this time, usage of /tmp is essentially nil.
> 
> We set user memory limits to 20GB.
> 
> I would imagine that the size of the session directories depends on a
> number of factors; perhaps the developers can comment on that.  I have
> only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.
> 
> As long as they're removed after each job, they don't really compete
> with the application for available memory.
> 
> On 11/3/11 8:40 PM, Ed Blosch wrote:
>> Thanks very much, e

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread Blosch, Edwin L
OK, I wouldn't have guessed that the space for /tmp isn't actually in RAM until 
it's needed.  That's the key piece of knowledge I was missing; I really 
appreciate it.  So you can allow /tmp to be reasonably sized, but if you aren't 
actually using it, then it doesn't take up 11 GB of RAM.  And you prevent users 
from crashing the node by setting mem limit to 4 GB less than the available 
memory. Got it.
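
The arithmetic behind those numbers can be checked with a small sketch like the one below, which uses the figures quoted in this thread (df reporting 12330084 total and 11507236 available 1K-blocks, 24 GB nodes, a 20 GB user limit). The function names are made up for this example.

```shell
# Convert a size in 1K-blocks (KiB), as printed by df, to GiB with one
# decimal place, using integer arithmetic only (rounds to nearest tenth).
kib_to_gib() {
    tenths=$(( ($1 * 10 + 524288) / 1048576 ))
    printf '%d.%d\n' "$(( tenths / 10 ))" "$(( tenths % 10 ))"
}

# Headroom (in GB) left on a node once the per-user memory limit is
# subtracted from the physical memory.
headroom_gb() {
    echo "$(( $1 - $2 ))"
}
```

With the quoted figures, kib_to_gib 11507236 gives 11.0 and kib_to_gib 12330084 gives 11.8, while headroom_gb 24 20 gives 4. That 4 GB has to cover the roughly 3.2 GB used after boot plus whatever actually lands in /tmp, which is why cleaning /tmp after each job matters.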

I agree with your earlier comment:  these are fairly common systems now.  We 
have program- and owner-specific disks where I work, and after the program 
ends, the disks are archived or destroyed.  Before the stateless configuration 
option, the entire computer, nodes and switches as well as disks, were archived 
or destroyed after each program.  Not too cost-effective.

Is this a reasonable final summary? :  OpenMPI uses temporary files in such a 
way that it is performance-critical that these so-called session files, used 
for shared-memory communications, must be "local".  For state-less clusters, 
this means the node image must include a /tmp or /wrk partition, intelligently 
sized so as not to enable an application to exhaust the physical memory of the 
node, and care must be taken not to mask this in-memory /tmp with an NFS 
mounted filesystem.  It is not uncommon for cluster enablers to exclude /tmp 
from a typical base Linux filesystem image or mount it over NFS, as a means of 
providing users with a larger-sized /tmp that is not limited to a fraction of 
the node's physical memory, or to avoid garbage accumulation in /tmp taking up 
the physical RAM.  But not having /tmp or mounting it over NFS is not a viable 
stateless-node configuration option if you intend to run OpenMPI. Instead you 
could have a /bigtmp which is NFS-mounted and a /tmp which is local, for 
example. Starting in OpenMPI 1.7.x, shared-memory communication will no longer 
go through memory-mapped files, and vendors/users will no longer need to be 
vigilant concerning this OpenMPI performance requirement on stateless node 
configuration. 


Is that a reasonable summary?

If so, would it be helpful to include this as an FAQ entry under General 
category?  Or the "shared memory" category?  Or the "troubleshooting" category?


Thanks



-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of David Turner
Sent: Friday, November 04, 2011 1:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

% df /tmp
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /
% df /
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /

That works out to 11GB.  But...

The compute nodes have 24GB.  Freshly booted, about 3.2GB is
consumed by the kernel, various services, and the root file system.
At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.

I would imagine that the size of the session directories depends on a
number of factors; perhaps the developers can comment on that.  I have
only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete
with the application for available memory.

On 11/3/11 8:40 PM, Ed Blosch wrote:
> Thanks very much, exactly what I wanted to hear. How big is /tmp?
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
> Behalf Of David Turner
> Sent: Thursday, November 03, 2011 6:36 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp
> for OpenMPI usage
>
> I'm not a systems guy, but I'll pitch in anyway.  On our cluster,
> all the compute nodes are completely diskless.  The root file system,
> including /tmp, resides in memory (ramdisk).  OpenMPI puts these
> session directories therein.  All our jobs run through a batch
> system (torque).  At the conclusion of each batch job, an epilogue
> process runs that removes all files belonging to the owner of the
> current batch job from /tmp (and also looks for and kills orphan
> processes belonging to the user).  This epilogue had to be written
> by our systems staff.
>
> I believe this is a fairly common configuration for diskless
> clusters.
>
> On 11/3/11 4:09 PM, Blosch, Edwin L wrote:
>> Thanks for the help.  A couple follow-up-questions, maybe this starts to
> go outside OpenMPI:
>>
>> What's wrong with using /dev/shm?  I think you said earlier in this thread
> that this was not a safe place.
>>
>> If the NFS-mount point is moved from /tmp to /work, would a /tmp magically
> appear in the filesystem for a stateless node?  How big would it be, given
> that there is no loc

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread Ralph Castain
That isn't the situation, Terry. We had problems with early OMPI releases, 
particularly the 1.2 series. In response, the labs wrote an epilogue to ensure 
that the session directories were removed. Executing the epilogue is now 
standard operating procedure, even though our more recent releases do a much 
better job of cleanup.

Frankly, it's a good idea anyway. It hurts nothing, takes milliseconds to do, 
and guarantees nothing got left behind (e.g., if someone was using a debug 
version of OMPI and directed opal_output to a file).

On Nov 4, 2011, at 4:43 AM, TERRY DONTJE wrote:

> David, are you saying your jobs consistently leave behind session files after 
> the job exits?  It really shouldn't even in the case when a job aborts, I 
> thought, mpirun took great pains to cleanup after itself.  Can you tell us 
> what version of OMPI you are running with?  I think I could see kill -9 of 
> mpirun and processes below would cause turds to be left behind.
> 
> --td
> 
> On 11/4/2011 2:37 AM, David Turner wrote:
>> 
>> % df /tmp 
>> Filesystem   1K-blocks    Used Available Use% Mounted on 
>> -             12330084  822848  11507236   7% / 
>> % df / 
>> Filesystem   1K-blocks    Used Available Use% Mounted on 
>> -             12330084  822848  11507236   7% / 
>> 
>> That works out to 11GB.  But... 
>> 
>> The compute nodes have 24GB.  Freshly booted, about 3.2GB is 
>> consumed by the kernel, various services, and the root file system. 
>> At this time, usage of /tmp is essentially nil. 
>> 
>> We set user memory limits to 20GB. 
>> 
>> I would imagine that the size of the session directories depends on a 
>> number of factors; perhaps the developers can comment on that.  I have 
>> only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes. 
>> 
>> As long as they're removed after each job, they don't really compete 
>> with the application for available memory. 
>> 
>> On 11/3/11 8:40 PM, Ed Blosch wrote: 
>>> Thanks very much, exactly what I wanted to hear. How big is /tmp? 
>>> 
>>> -Original Message- 
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of David Turner 
>>> Sent: Thursday, November 03, 2011 6:36 PM 
>>> To: us...@open-mpi.org 
>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
>>> for OpenMPI usage 
>>> 
>>> I'm not a systems guy, but I'll pitch in anyway.  On our cluster, 
>>> all the compute nodes are completely diskless.  The root file system, 
>>> including /tmp, resides in memory (ramdisk).  OpenMPI puts these 
>>> session directories therein.  All our jobs run through a batch 
>>> system (torque).  At the conclusion of each batch job, an epilogue 
>>> process runs that removes all files belonging to the owner of the 
>>> current batch job from /tmp (and also looks for and kills orphan 
>>> processes belonging to the user).  This epilogue had to be written 
>>> by our systems staff. 
>>> 
>>> I believe this is a fairly common configuration for diskless 
>>> clusters. 
>>> 
>>> On 11/3/11 4:09 PM, Blosch, Edwin L wrote: 
>>>> Thanks for the help.  A couple follow-up-questions, maybe this starts to 
>>> go outside OpenMPI: 
>>>> 
>>>> What's wrong with using /dev/shm?  I think you said earlier in this thread 
>>> that this was not a safe place. 
>>>> 
>>>> If the NFS-mount point is moved from /tmp to /work, would a /tmp magically 
>>> appear in the filesystem for a stateless node?  How big would it be, given 
>>> that there is no local disk, right?  That may be something I have to ask 
>>> the 
>>> vendor, which I've tried, but they don't quite seem to get the question. 
>>>> 
>>>> Thanks 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -Original Message- 
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>>> Behalf Of Ralph Castain 
>>>> Sent: Thursday, November 03, 2011 5:22 PM 
>>>> To: Open MPI Users 
>>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
>>> for OpenMPI usage 
>>>> 
>>>> 
>>>> On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote: 
>>>> 
>>>>> I might be missing something here. Is there a side-effect or performance 
>>> loss if you don't use the sm btl?  Why would it exist if there is a wholly 

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread TERRY DONTJE
David, are you saying your jobs consistently leave behind session files 
after the job exits?  That really shouldn't happen, even when a job 
aborts; I thought mpirun took great pains to clean up after itself.  
Can you tell us what version of OMPI you are running with?  I think I 
could see a kill -9 of mpirun and the processes below it causing turds 
to be left behind.


--td

On 11/4/2011 2:37 AM, David Turner wrote:

% df /tmp
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /
% df /
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /

That works out to 11GB.  But...

The compute nodes have 24GB.  Freshly booted, about 3.2GB is
consumed by the kernel, various services, and the root file system.
At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.

I would imagine that the size of the session directories depends on a
number of factors; perhaps the developers can comment on that.  I have
only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete
with the application for available memory.

On 11/3/11 8:40 PM, Ed Blosch wrote:

Thanks very much, exactly what I wanted to hear. How big is /tmp?

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of David Turner
Sent: Thursday, November 03, 2011 6:36 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage

I'm not a systems guy, but I'll pitch in anyway.  On our cluster,
all the compute nodes are completely diskless.  The root file system,
including /tmp, resides in memory (ramdisk).  OpenMPI puts these
session directories therein.  All our jobs run through a batch
system (torque).  At the conclusion of each batch job, an epilogue
process runs that removes all files belonging to the owner of the
current batch job from /tmp (and also looks for and kills orphan
processes belonging to the user).  This epilogue had to be written
by our systems staff.

I believe this is a fairly common configuration for diskless
clusters.

On 11/3/11 4:09 PM, Blosch, Edwin L wrote:
Thanks for the help.  A couple follow-up-questions, maybe this starts to
go outside OpenMPI:

What's wrong with using /dev/shm?  I think you said earlier in this
thread that this was not a safe place.

If the NFS-mount point is moved from /tmp to /work, would a /tmp
magically appear in the filesystem for a stateless node?  How big would
it be, given that there is no local disk, right?  That may be something
I have to ask the vendor, which I've tried, but they don't quite seem
to get the question.


Thanks




-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On

Behalf Of Ralph Castain

Sent: Thursday, November 03, 2011 5:22 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less 
node /tmp

for OpenMPI usage



On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:

I might be missing something here. Is there a side-effect or 
performance
loss if you don't use the sm btl?  Why would it exist if there is a 
wholly

equivalent alternative?  What happens to traffic that is intended for
another process on the same node?


There is a definite performance impact, and we wouldn't recommend doing

what Eugene suggested if you care about performance.


The correct solution here is get your sys admin to make /tmp local. 
Making
/tmp NFS mounted across multiple nodes is a major "faux pas" in the 
Linux

world - it should never be done, for the reasons stated by Jeff.





Thanks


-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On

Behalf Of Eugene Loh

Sent: Thursday, November 03, 2011 1:23 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node

/tmp for OpenMPI usage


Right.  Actually "--mca btl ^sm".  (Was missing "btl".)

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:

I don't tell OpenMPI what BTLs to use. The default uses sm and puts a

session file on /tmp, which is NFS-mounted and thus not a good choice.


Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org 
[mailto:users-boun...@open-mpi.org] On

Behalf Of Eugene Loh

Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node

/tmp for OpenMPI usage


I've not been following closely.  Why must one use shared-memory
communications?  How about using other BTLs in a "loopback" fashion?
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread David Turner

% df /tmp
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /
% df /
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /

That works out to 11GB.  But...

The compute nodes have 24GB.  Freshly booted, about 3.2GB is
consumed by the kernel, various services, and the root file system.
At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.

I would imagine that the size of the session directories depends on a
number of factors; perhaps the developers can comment on that.  I have
only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete
with the application for available memory.
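
For anyone checking the arithmetic, here is a rough sketch in Python. It takes the Available column of the df output as 11507236 1K-blocks and uses the 24GB / 3.2GB / 20GB figures quoted above. The "headroom" line is only an illustration of how those three numbers relate, not a measured value, and all the variable names are made up for the example.

```python
# Rough arithmetic on the figures quoted in this message (illustrative only).
KIB_PER_GIB = 1024 ** 2

available_kib = 11507236          # "Available" column from df, in 1K-blocks
available_gib = available_kib / KIB_PER_GIB

node_ram_gb = 24.0                # memory per compute node
boot_usage_gb = 3.2               # kernel, services, root file system after boot
user_limit_gb = 20.0              # per-user memory limit

# What is left once the boot footprint and a full user allocation are counted.
headroom_gb = node_ram_gb - boot_usage_gb - user_limit_gb

print(f"/tmp available: {available_gib:.2f} GiB")
print(f"headroom beyond the user limit: {headroom_gb:.1f} GB")
```

Session directories in the tens of MBs fit comfortably inside that margin, provided they are cleaned up after each job.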

On 11/3/11 8:40 PM, Ed Blosch wrote:

Thanks very much, exactly what I wanted to hear. How big is /tmp?

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of David Turner
Sent: Thursday, November 03, 2011 6:36 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp
for OpenMPI usage

I'm not a systems guy, but I'll pitch in anyway.  On our cluster,
all the compute nodes are completely diskless.  The root file system,
including /tmp, resides in memory (ramdisk).  OpenMPI puts these
session directories therein.  All our jobs run through a batch
system (torque).  At the conclusion of each batch job, an epilogue
process runs that removes all files belonging to the owner of the
current batch job from /tmp (and also looks for and kills orphan
processes belonging to the user).  This epilogue had to be written
by our systems staff.

I believe this is a fairly common configuration for diskless
clusters.

On 11/3/11 4:09 PM, Blosch, Edwin L wrote:

Thanks for the help.  A couple follow-up-questions, maybe this starts to

go outside OpenMPI:


What's wrong with using /dev/shm?  I think you said earlier in this thread

that this was not a safe place.


If the NFS-mount point is moved from /tmp to /work, would a /tmp magically

appear in the filesystem for a stateless node?  How big would it be, given
that there is no local disk, right?  That may be something I have to ask the
vendor, which I've tried, but they don't quite seem to get the question.


Thanks




-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On

Behalf Of Ralph Castain

Sent: Thursday, November 03, 2011 5:22 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp

for OpenMPI usage



On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:


I might be missing something here. Is there a side-effect or performance

loss if you don't use the sm btl?  Why would it exist if there is a wholly
equivalent alternative?  What happens to traffic that is intended for
another process on the same node?


There is a definite performance impact, and we wouldn't recommend doing

what Eugene suggested if you care about performance.


The correct solution here is get your sys admin to make /tmp local. Making

/tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux
world - it should never be done, for the reasons stated by Jeff.





Thanks


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On

Behalf Of Eugene Loh

Sent: Thursday, November 03, 2011 1:23 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node

/tmp for OpenMPI usage


Right.  Actually "--mca btl ^sm".  (Was missing "btl".)

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:

I don't tell OpenMPI what BTLs to use. The default uses sm and puts a

session file on /tmp, which is NFS-mounted and thus not a good choice.


Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On

Behalf Of Eugene Loh

Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node

/tmp for OpenMPI usage


I've not been following closely.  Why must one use shared-memory
communications?  How about using other BTLs in a "loopback" fashion?
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread Ed Blosch
Thanks very much, exactly what I wanted to hear. How big is /tmp?

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of David Turner
Sent: Thursday, November 03, 2011 6:36 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp
for OpenMPI usage

I'm not a systems guy, but I'll pitch in anyway.  On our cluster,
all the compute nodes are completely diskless.  The root file system,
including /tmp, resides in memory (ramdisk).  OpenMPI puts these
session directories therein.  All our jobs run through a batch
system (torque).  At the conclusion of each batch job, an epilogue
process runs that removes all files belonging to the owner of the
current batch job from /tmp (and also looks for and kills orphan
processes belonging to the user).  This epilogue had to be written
by our systems staff.

I believe this is a fairly common configuration for diskless
clusters.

On 11/3/11 4:09 PM, Blosch, Edwin L wrote:
> Thanks for the help.  A couple follow-up-questions, maybe this starts to
go outside OpenMPI:
>
> What's wrong with using /dev/shm?  I think you said earlier in this thread
that this was not a safe place.
>
> If the NFS-mount point is moved from /tmp to /work, would a /tmp magically
appear in the filesystem for a stateless node?  How big would it be, given
that there is no local disk, right?  That may be something I have to ask the
vendor, which I've tried, but they don't quite seem to get the question.
>
> Thanks
>
>
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Ralph Castain
> Sent: Thursday, November 03, 2011 5:22 PM
> To: Open MPI Users
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp
for OpenMPI usage
>
>
> On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:
>
>> I might be missing something here. Is there a side-effect or performance
loss if you don't use the sm btl?  Why would it exist if there is a wholly
equivalent alternative?  What happens to traffic that is intended for
another process on the same node?
>
> There is a definite performance impact, and we wouldn't recommend doing
what Eugene suggested if you care about performance.
>
> The correct solution here is get your sys admin to make /tmp local. Making
/tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux
world - it should never be done, for the reasons stated by Jeff.
>
>
>>
>> Thanks
>>
>>
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Eugene Loh
>> Sent: Thursday, November 03, 2011 1:23 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage
>>
>> Right.  Actually "--mca btl ^sm".  (Was missing "btl".)
>>
>> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>>> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a
session file on /tmp, which is NFS-mounted and thus not a good choice.
>>>
>>> Are you suggesting something like --mca ^sm?
>>>
>>>
>>> -----Original Message-
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Eugene Loh
>>> Sent: Thursday, November 03, 2011 12:54 PM
>>> To: us...@open-mpi.org
>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage
>>>
>>> I've not been following closely.  Why must one use shared-memory
>>> communications?  How about using other BTLs in a "loopback" fashion?
>>> ___
>>> users mailing list
>>> us...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Best regards,

David Turner
User Services Group        email: dptur...@lbl.gov
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab      fax: (510) 486-4316
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread David Turner

I'm not a systems guy, but I'll pitch in anyway.  On our cluster,
all the compute nodes are completely diskless.  The root file system,
including /tmp, resides in memory (ramdisk).  OpenMPI puts these
session directories therein.  All our jobs run through a batch
system (torque).  At the conclusion of each batch job, an epilogue
process runs that removes all files belonging to the owner of the
current batch job from /tmp (and also looks for and kills orphan
processes belonging to the user).  This epilogue had to be written
by our systems staff.

I believe this is a fairly common configuration for diskless
clusters.
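
To make the cleanup step concrete, here is a minimal sketch of what such an epilogue might do. It is not our site's actual script. The function names are hypothetical, it assumes the batch system hands the job owner's user name to the epilogue as its second argument, and it leaves out the orphan-process cleanup entirely.

```python
import os
import pwd


def remove_user_files(root, uid):
    """Delete everything under root that is owned by uid; return removed paths."""
    removed = []
    # Walk bottom-up so directories are emptied before we try to remove them.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: do not follow symlinks out of root
                if st.st_uid != uid:
                    continue
                if os.path.isdir(path) and not os.path.islink(path):
                    os.rmdir(path)   # only succeeds if the directory is empty
                else:
                    os.unlink(path)
                removed.append(path)
            except OSError:
                # The file vanished, or the directory still holds other
                # users' files; skip it and carry on.
                continue
    return removed


def epilogue(argv, tmp_root="/tmp"):
    """Hypothetical entry point: argv[2] is assumed to be the job owner."""
    owner_uid = pwd.getpwnam(argv[2]).pw_uid
    return remove_user_files(tmp_root, owner_uid)
```

A real epilogue would call something like epilogue(sys.argv) and would also need to deal with processes the job left behind, which is deliberately omitted here.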

On 11/3/11 4:09 PM, Blosch, Edwin L wrote:

Thanks for the help.  A couple follow-up-questions, maybe this starts to go 
outside OpenMPI:

What's wrong with using /dev/shm?  I think you said earlier in this thread that 
this was not a safe place.

If the NFS-mount point is moved from /tmp to /work, would a /tmp magically 
appear in the filesystem for a stateless node?  How big would it be, given that 
there is no local disk, right?  That may be something I have to ask the vendor, 
which I've tried, but they don't quite seem to get the question.

Thanks




-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, November 03, 2011 5:22 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage


On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:


I might be missing something here. Is there a side-effect or performance loss 
if you don't use the sm btl?  Why would it exist if there is a wholly 
equivalent alternative?  What happens to traffic that is intended for another 
process on the same node?


There is a definite performance impact, and we wouldn't recommend doing what 
Eugene suggested if you care about performance.

The correct solution here is get your sys admin to make /tmp local. Making /tmp NFS 
mounted across multiple nodes is a major "faux pas" in the Linux world - it 
should never be done, for the reasons stated by Jeff.




Thanks


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Eugene Loh
Sent: Thursday, November 03, 2011 1:23 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

Right.  Actually "--mca btl ^sm".  (Was missing "btl".)

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:

I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session 
file on /tmp, which is NFS-mounted and thus not a good choice.

Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Eugene Loh
Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

I've not been following closely.  Why must one use shared-memory
communications?  How about using other BTLs in a "loopback" fashion?
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users



--
Best regards,

David Turner
User Services Group        email: dptur...@lbl.gov
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab      fax: (510) 486-4316


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Blosch, Edwin L
Thanks for the help.  A couple follow-up-questions, maybe this starts to go 
outside OpenMPI:

What's wrong with using /dev/shm?  I think you said earlier in this thread that 
this was not a safe place.

If the NFS-mount point is moved from /tmp to /work, would a /tmp magically 
appear in the filesystem for a stateless node?  How big would it be, given that 
there is no local disk, right?  That may be something I have to ask the vendor, 
which I've tried, but they don't quite seem to get the question.

Thanks




-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Ralph Castain
Sent: Thursday, November 03, 2011 5:22 PM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage


On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:

> I might be missing something here. Is there a side-effect or performance loss 
> if you don't use the sm btl?  Why would it exist if there is a wholly 
> equivalent alternative?  What happens to traffic that is intended for another 
> process on the same node?

There is a definite performance impact, and we wouldn't recommend doing what 
Eugene suggested if you care about performance.

The correct solution here is get your sys admin to make /tmp local. Making /tmp 
NFS mounted across multiple nodes is a major "faux pas" in the Linux world - it 
should never be done, for the reasons stated by Jeff.


> 
> Thanks
> 
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Eugene Loh
> Sent: Thursday, November 03, 2011 1:23 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
> 
> Right.  Actually "--mca btl ^sm".  (Was missing "btl".)
> 
> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a 
>> session file on /tmp, which is NFS-mounted and thus not a good choice.
>> 
>> Are you suggesting something like --mca ^sm?
>> 
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Eugene Loh
>> Sent: Thursday, November 03, 2011 12:54 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
>> for OpenMPI usage
>> 
>> I've not been following closely.  Why must one use shared-memory
>> communications?  How about using other BTLs in a "loopback" fashion?
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Jeff Squyres
The sm btl is definitely more performant than loopback on other devices.

On Nov 3, 2011, at 4:55 PM, Blosch, Edwin L wrote:

> I might be missing something here. Is there a side-effect or performance loss 
> if you don't use the sm btl?  Why would it exist if there is a wholly 
> equivalent alternative?  What happens to traffic that is intended for another 
> process on the same node?
> 
> Thanks
> 
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Eugene Loh
> Sent: Thursday, November 03, 2011 1:23 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
> 
> Right.  Actually "--mca btl ^sm".  (Was missing "btl".)
> 
> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a 
>> session file on /tmp, which is NFS-mounted and thus not a good choice.
>> 
>> Are you suggesting something like --mca ^sm?
>> 
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Eugene Loh
>> Sent: Thursday, November 03, 2011 12:54 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
>> for OpenMPI usage
>> 
>> I've not been following closely.  Why must one use shared-memory
>> communications?  How about using other BTLs in a "loopback" fashion?
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Ralph Castain

On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:

> I might be missing something here. Is there a side-effect or performance loss 
> if you don't use the sm btl?  Why would it exist if there is a wholly 
> equivalent alternative?  What happens to traffic that is intended for another 
> process on the same node?

There is a definite performance impact, and we wouldn't recommend doing what 
Eugene suggested if you care about performance.

The correct solution here is get your sys admin to make /tmp local. Making /tmp 
NFS mounted across multiple nodes is a major "faux pas" in the Linux world - it 
should never be done, for the reasons stated by Jeff.
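
If you want to confirm what /tmp is actually mounted on before talking to your sys admin, something along these lines may help. It is a sketch under assumptions: it expects mount-table text in the Linux /proc/mounts layout (device, mount point, type, ...), the function names are made up, and the set of "network" filesystem types is illustrative, not exhaustive.

```python
import os

# Illustrative list only; extend it for whatever your site uses.
NETWORK_FS_TYPES = {"nfs", "nfs4", "lustre", "gpfs", "panfs", "cifs"}


def fs_type_for(path, mounts_text):
    """Return the filesystem type of the mount that holds path."""
    path = os.path.abspath(path)
    best_mount, best_type = "", None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Keep the longest matching mount point; on a tie the later entry
        # wins, since it is mounted on top of the earlier one.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_network_fs(path, mounts_text):
    """True if path lives on one of the filesystem types listed above."""
    return fs_type_for(path, mounts_text) in NETWORK_FS_TYPES
```

On a Linux node you could call is_network_fs("/tmp", open("/proc/mounts").read()) and expect True if /tmp is NFS-mounted as described in this thread.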


> 
> Thanks
> 
> 
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Eugene Loh
> Sent: Thursday, November 03, 2011 1:23 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
> 
> Right.  Actually "--mca btl ^sm".  (Was missing "btl".)
> 
> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a 
>> session file on /tmp, which is NFS-mounted and thus not a good choice.
>> 
>> Are you suggesting something like --mca ^sm?
>> 
>> 
>> -Original Message-
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
>> Behalf Of Eugene Loh
>> Sent: Thursday, November 03, 2011 12:54 PM
>> To: us...@open-mpi.org
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
>> for OpenMPI usage
>> 
>> I've not been following closely.  Why must one use shared-memory
>> communications?  How about using other BTLs in a "loopback" fashion?
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Blosch, Edwin L
I might be missing something here. Is there a side-effect or performance loss 
if you don't use the sm btl?  Why would it exist if there is a wholly 
equivalent alternative?  What happens to traffic that is intended for another 
process on the same node?

Thanks


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Eugene Loh
Sent: Thursday, November 03, 2011 1:23 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

Right.  Actually "--mca btl ^sm".  (Was missing "btl".)

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session 
> file on /tmp, which is NFS-mounted and thus not a good choice.
>
> Are you suggesting something like --mca ^sm?
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Eugene Loh
> Sent: Thursday, November 03, 2011 12:54 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
>
> I've not been following closely.  Why must one use shared-memory
> communications?  How about using other BTLs in a "loopback" fashion?
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Eugene Loh

Right.  Actually "--mca btl ^sm".  (Was missing "btl".)
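
For completeness, a small sketch of how that option sits on a full command line, built as an argument list in Python. The application name ./my_app and the process count are placeholders, and the helper name is made up. Only "--mca btl ^sm" comes from this thread.

```python
def mpirun_command(app, nprocs, excluded_btls=("sm",)):
    """Build an mpirun argument list that excludes the given BTLs."""
    args = ["mpirun", "-np", str(nprocs)]
    if excluded_btls:
        # A leading "^" tells the MCA framework to exclude the listed
        # components rather than select them.
        args += ["--mca", "btl", "^" + ",".join(excluded_btls)]
    args.append(app)
    return args


# Placeholder application and process count.
command = mpirun_command("./my_app", 4)
```

Passing the list to subprocess.run(command) avoids any shell interpretation of the "^" character. Bear in mind that excluding sm moves on-node traffic to another BTL, so treat this as a workaround.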

On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session 
> file on /tmp, which is NFS-mounted and thus not a good choice.
>
> Are you suggesting something like --mca ^sm?
>
>
> -Original Message-
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On 
> Behalf Of Eugene Loh
> Sent: Thursday, November 03, 2011 12:54 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp 
> for OpenMPI usage
>
> I've not been following closely.  Why must one use shared-memory
> communications?  How about using other BTLs in a "loopback" fashion?
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Blosch, Edwin L
I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session 
file on /tmp, which is NFS-mounted and thus not a good choice.

Are you suggesting something like --mca ^sm?


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Eugene Loh
Sent: Thursday, November 03, 2011 12:54 PM
To: us...@open-mpi.org
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for 
OpenMPI usage

I've not been following closely.  Why must one use shared-memory 
communications?  How about using other BTLs in a "loopback" fashion?
___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Jeff Squyres
On Nov 3, 2011, at 1:36 PM, Blosch, Edwin L wrote:

> Yes it sucks, so that's what led me to post my original question: If /dev/shm 
> isn't the right place to put the session file, and /tmp is NFS-mounted, then 
> what IS the "right" way to set up a diskless cluster?  I don't think the idea 
> of tmpfs sounds very appealing, after reading the discussion in FAQ #8 about 
> shared-memory usage. We definitely have a job-queueing system and jobs are 
> very often killed using qdel, and writing a post-script handler is way beyond 
> the level of involvement or expertise we can expect from our sys admins.

In the upcoming OMPI v1.7, we revamped the shared memory setup code such that 
it'll actually use /dev/shm properly, or use some mechanism other than an 
mmap file backed in a real filesystem.  So the issue goes away.  But it doesn't 
help you yet.  :-\

> Surely there's some reasonable guidance that can be offered to work around an 
> issue that is so disabling.

Other than the shared memory file, the session directory shouldn't be large.  
So keeping it in a tmpfs should be ok.  It's just that putting the shared 
memory in a tmpfs has the potential to cost you "twice": the actual shared 
memory itself, and then taking up space in tmpfs (although I have not verified 
this myself -- perhaps Linux is smart enough to not do this?).

Is there *no* local disk on the machines at all?

> A related question would be: How is it that HP-MPI works just fine on this 
> cluster as it is configured now?  Are they doing something different for 
> shared memory communications?

They're probably either not warning you about the issue or not using mmaped 
files that are backed in a filesystem (warning you about the issue is actually 
a relatively new feature in OMPI, IIRC -- since 1.0, IIRC, OMPI has used mmap 
files in a filesystem).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Eugene Loh
I've not been following closely.  Why must one use shared-memory 
communications?  How about using other BTLs in a "loopback" fashion?


Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-03 Thread Blosch, Edwin L
Cross-thread response here, as this is related to the shared-memory thread:

Yes it sucks, so that's what led me to post my original question: If /dev/shm 
isn't the right place to put the session file, and /tmp is NFS-mounted, then 
what IS the "right" way to set up a diskless cluster?  I don't think the idea 
of tmpfs sounds very appealing, after reading the discussion in FAQ #8 about 
shared-memory usage. We definitely have a job-queueing system and jobs are very 
often killed using qdel, and writing a post-script handler is way beyond the 
level of involvement or expertise we can expect from our sys admins.

Surely there's some reasonable guidance that can be offered to work around an 
issue that is so disabling.

A related question would be: How is it that HP-MPI works just fine on this 
cluster as it is configured now?  Are they doing something different for shared 
memory communications?


Thanks


-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf 
Of Jeff Squyres
Sent: Thursday, November 03, 2011 11:35 AM
To: Open MPI Users
Subject: EXTERNAL: Re: [OMPI users] How to set up state-less node /tmp for 
OpenMPI usage

On Nov 1, 2011, at 7:31 PM, Blosch, Edwin L wrote:

> I'm getting this message below which is observing correctly that /tmp is 
> NFS-mounted.   But there is no other directory which has user or group write 
> permissions.  So I think I'm kind of stuck, and it sounds like a serious 
> issue.

That does kinda suck.  :-\

> Before I ask the administrators to change their image, i.e. mount this 
> partition under /work instead of /tmp, I'd like to ask if anyone is using 
> OpenMPI on a state-less cluster, and are there any gotchas with regards to 
> performance of OpenMPI, i.e. like handling of /tmp, that one would need to 
> know?

I don't have much empirical information here -- I know that some people have 
done this (make /tmp be NFS-mounted).  I think there are at least some issues 
with this, though -- many applications believe that a sufficient condition for 
uniqueness in /tmp is to simply append your PID to a filename.  But this may no 
longer be true if /tmp is shared across multiple OS instances.

I don't have a specific case where this is problematic, but it's not a large 
stretch to imagine that this could happen in practice with random applications 
that make temp files in /tmp.
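
To illustrate the PID point, here is a tiny sketch. The node names, PID, and prefix are invented, and the two helper functions are hypothetical stand-ins for what an application might do. It only shows that two OS instances can legitimately hand out the same PID, so a PID-only name can collide in a shared /tmp, while adding the host name keeps the names apart (assuming host names are unique across the cluster).

```python
def pid_only_name(prefix, pid):
    """Temp-file name that assumes the PID alone is unique in /tmp."""
    return f"/tmp/{prefix}.{pid}"


def host_and_pid_name(prefix, host, pid):
    """Temp-file name that also includes the host name."""
    return f"/tmp/{prefix}.{host}.{pid}"


# Two different nodes whose processes happen to share PID 4242.
same_pid = 4242
name_a = pid_only_name("scratch", same_pid)                 # created on node01
name_b = pid_only_name("scratch", same_pid)                 # created on node02
safe_a = host_and_pid_name("scratch", "node01", same_pid)
safe_b = host_and_pid_name("scratch", "node02", same_pid)

print(name_a == name_b)   # both nodes would pick the same path
print(safe_a == safe_b)   # the host name keeps them distinct
```

In real code the values would come from os.getpid() and socket.gethostname(). Python's tempfile.mkstemp() is another option: it picks a random name and creates the file exclusively.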

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/


___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users