FWIW, we have a home-grown project here at UD called "VALET" that sets up 
users' environments on our HPC clusters.  It's implemented in Python and uses 
only standard packages present on a RHEL-based installation (and probably 
Ubuntu et al., too).  The main reasons we wrote it:


- Automatically detect standard paths under a base prefix (e.g. for prefix = 
/usr, make sure /usr/bin is on PATH, /usr/lib and /usr/lib64 are on 
LD_LIBRARY_PATH, -I/usr/include is in CPPFLAGS, etc.; see the sketch after 
this list)

- Automatically handle dependencies (e.g. open-mpi/1.10.2-intel2015 requires 
the intel/2015 package)

- Create environment "snapshots" before changes are made (thus, a package can 
source a script to make env changes and _still_ have them removed from the 
environment on a rollback)
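
For the first and last of those, think of VALET as doing the moral equivalent 
of the following shell logic for each package prefix (an illustrative sketch 
with a hypothetical prefix, not VALET's actual code; VALET also records each 
variable's prior value so the changes can be rolled back):


prefix=/opt/shared/mypkg/1.0
# only add directories that actually exist under the prefix:
[ -d "$prefix/bin" ]     && export PATH="$prefix/bin:$PATH"
[ -d "$prefix/lib" ]     && export LD_LIBRARY_PATH="$prefix/lib:$LD_LIBRARY_PATH"
[ -d "$prefix/lib64" ]   && export LD_LIBRARY_PATH="$prefix/lib64:$LD_LIBRARY_PATH"
[ -d "$prefix/include" ] && export CPPFLAGS="-I$prefix/include $CPPFLAGS"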


This happened when Tcl modules was in its prime, and we didn't feel comfortable 
asking our users to write Tcl code to handle multiple revisions of their 
software.  Since we have a LOT of software that uses the standard filesystem 
layout (bin/lib/include), we install into an NFS share à la


/opt/shared/open-mpi/1.10.2-gcc
/opt/shared/open-mpi/1.10.2-intel2015


and the package definition in VALET looks like this (in YAML syntax, but JSON 
and XML are available, too):


open-mpi:
  description: "Open MPI: Message-Passing Interface"
  url: http://www.open-mpi.org/
  prefix: /opt/shared/open-mpi
  default-version: 1.10.2-gcc
  versions:
    1.10.2-gcc:
      description: 1.10.2 (with system GCC)
    1.10.2-intel2015:
      description: 1.10.2 (with Intel 2015)
      dependencies:
        - intel/2015
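
With that definition in place, loading and unloading a build from the shell 
looks something like the following (vpkg_require and vpkg_rollback are VALET's 
load and undo commands; the intel/2015 dependency is added to the environment 
automatically per the definition above):


vpkg_require open-mpi/1.10.2-intel2015
vpkg_rollback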


As with any solution to this problem, once our users grok the config format it 
becomes very easy for them to define their own packages and maintain multiple 
versions of their own software.


The project is up on GitLab:


https://gitlab.com/valet/




> On Oct 5, 2017, at 12:51 PM, Mike Cammilleri <mi...@stat.wisc.edu> wrote:
> 
> Thanks to everyone for your responses.
>  
> We primarily use bcfg2 config management to keep the system packages and 
> configs the same throughout the cluster. I have not generally used it for 
> add-on software used for research unless it really needed to be installed 
> locally for some reason.  I think we’ll be putting various versions of gcc, 
> R, matlab, etc. in the NFS-mounted space, so it seems to make sense to just 
> build (from source) the Tcl-based Environment Modules software into the NFS 
> space as well and have the scripts in /etc/profile.d/ point to it. I’m not 
> using RPM because, believe it or not, we’re running an Ubuntu-based SLURM 
> cluster, where we built slurm from source as well.
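>  
> For reference, the /etc/profile.d/ stub only needs to source the init script 
> that Environment Modules installs under its prefix; something like this, 
> assuming a hypothetical NFS prefix of /mounted-fs/modules:
>  
> # /etc/profile.d/modules.sh
> source /mounted-fs/modules/init/bash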
>  
> I’m interested in the Lmod modules available from TACC with EasyBuild if it’s 
> something that works with our Ubuntu-based environment. However, I disagree 
> that the modules-tcl-1.923 flavor is out of date; perhaps the pre-built 
> packages are, but the source was updated 2017-07-20 and seems to work great. 
> 
> http://modules.sourceforge.net/tcl/NEWS.html
>  
> Most of what we need out of this is setting environments for various versions 
> of software and setting some library paths, nothing much more complex than 
> that, yet!
>  
> --mike
>  
> From: r...@open-mpi.org [mailto:r...@open-mpi.org] 
> Sent: Thursday, October 5, 2017 10:24 AM
> To: slurm-dev <slurm-dev@schedmd.com>
> Subject: [slurm-dev] Re: Setting up Environment Modules package
>  
> 
> On Oct 5, 2017, at 12:08 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> 
> wrote:
>  
> 
> On 10/04/2017 06:11 PM, Mike Cammilleri wrote:
> 
> I'm in search of a best practice for setting up Environment Modules for our 
> Slurm 16.05.6 installation (we have not had the time to upgrade to 17.02 
> yet). We're a small group and had no explicit need for this in the beginning, 
> but as we are growing larger with more users we clearly need something like 
> this.
> I see there are a couple of ways to implement Environment Modules, and I'm 
> wondering which would be the cleanest, most sensible way. I'll list my ideas 
> below:
> 1. Install the Environment Modules package and relevant modulefiles on the 
> slurm head/submit/login node, perhaps in the default /usr/local/ location. 
> The modulefiles would define paths to various software packages that exist in 
> a location visible/readable to the compute nodes (NFS or similar). The user 
> then loads the modules manually at the command line on the submit/login node 
> and not in the slurm submit script, but specifies #SBATCH --export=ALL to 
> import the environment when submitting the sbatch job.
> 2. Install the Environment Modules package in a location visible to the 
> entire cluster (NFS or similar), including the compute nodes; the user then 
> includes their 'module load' commands in their actual slurm submit scripts, 
> since the command would be available on the compute nodes, loading software 
> (either local or from network locations, depending on what they're loading) 
> visible to the nodes.
> 3. Another variation would be to use a configuration manager like bcfg2 to 
> make sure Environment Modules and all necessary modulefiles and 
> configurations are present on all compute/submit nodes. That seems like it 
> has the potential to become a mess, though.
> Is there a preferred approach? I see in the archives that some folks have 
> seen strange behavior when a user uses --export=ALL, so it would seem to me 
> that the cleaner approach is to have the 'module load' command available on 
> all compute nodes and have users do this in their submit scripts. If this is 
> the case, I'll need to configure Environment Modules and relevant modulefiles 
> to live in special places when I build Environment Modules (./configure 
> --prefix=/mounted-fs --modulefilesdir=/mounted-fs, etc.).
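> To be concrete, a submit script under approach 2 would just be something like 
> this (module names hypothetical):
> 
> #!/bin/bash
> #SBATCH --ntasks=1
> # loaded on the compute node itself, not inherited from the login node:
> module load gcc/7.2.0
> module load R/3.4.1
> Rscript analysis.R
> 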
> We've been testing with modules-tcl-1.923
> 
> I strongly recommend uninstalling the Linux distro "environment-modules" 
> package because this old Tcl-based software hasn't been maintained for 5+ 
> years.  I recommend a very readable paper on various module systems:
> http://dl.acm.org/citation.cfm?id=2691141
>  
> As was pointed out to me when I made a similar comment on another mailing 
> list, the Tcl-based system is actively maintained; the repo simply moved. 
> I’m not recommending either direction, as it gets into religion rather 
> quickly.
> 
> 
> 
> We use the modern and actively maintained Lmod modules developed at TACC 
> (https://www.tacc.utexas.edu/research-development/tacc-projects/lmod) 
> together with the EasyBuild module building system (a strong HPC community 
> effort, https://github.com/easybuilders/easybuild).
> 
> I believe that the TACC supercomputer systems provide Slurm as a loadable 
> module, but I don't know any details.  We just install Slurm as RPMs on 
> CentOS 7.
> 
> We're extremely happy with Lmod and EasyBuild because of the simplicity with 
> which 1300+ modules are made available.  I've written a Wiki about how we 
> have installed this: https://wiki.fysik.dtu.dk/niflheim/EasyBuild_modules.  
> We put all of our modules on a shared NFS file system for all nodes.
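> 
> (For anyone new to EasyBuild: once it is installed, building a package and 
> its entire dependency chain as modules is a single command, e.g. 
> 
> eb OpenMPI-1.10.2-GCC-4.9.3.eb --robot
> 
> where the easyconfig file name here is illustrative.)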
> 
> /Ole


::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE  19716
Office: (302) 831-6034  Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::



