Re: [Beowulf] Bright Cluster Manager

2018-05-09 Thread John Hearns via Beowulf
> All of a sudden simple “send the same command to all nodes” just doesn’t
> work.  And that’s what will inevitably be the case as we scale up in the
> HPC world – there will always be dead or malfunctioning nodes.

Jim, this is true. And 'we' should be looking to the webscale generation
for the answers. They thought about computing at scale from the beginning.

Regarding hardware failures, I heard a shaggy dog story that
Microsoft/Amazon/Google order servers ready-racked in shipping containers.
When a certain proportion of the servers in a container are dead, they simply
close it down and move on.
Can anyone confirm or deny this story?

Which brings me to another of my hobby horses - the environmental costs
of HPC. When pitching HPC clusters you often put in an option for a
mid-life upgrade. I think upping the RAM is quite common, but upgrading
processors and interconnect is much less so.

So the kit is hopefully worked hard for five years, until the cost of power and
cooling outweighs the benefit of moving to a new generation. But where
does the kit get recycled? Again, when pitching clusters you have to put in
guarantees about WEEE compliance (or the equivalent in the USA).



Re: [Beowulf] Bright Cluster Manager

2018-05-08 Thread Lux, Jim (337K)


And another aspect of this - I’ve been doing stuff with “loose clusters” of 
low-capability processors (Arduino, RPi, Beagle) doing distributed sensing 
kinds of tasks. Leaving aside the Arduino (no OS), the other two wind up with 
some flavor of Debian but often with lots of stuff you don’t need (e.g. 
Apache). Once you’ve fiddled with one node to get the configuration right, you 
want to replicate it across a bunch of nodes - right now that means sneakernet 
of SD cards - although in theory, one should be able to push an image out to 
the local file system (typically 4 GB eMMC in the case of Beagles), and tell it 
to write that to the “boot area” - but I’ve not tried it.
While I’d never claim my pack of Beagles is HPC, it does share some aspects – 
there’s parallel work going on, the nodes need to be aware of each other and 
synchronize their behavior (that is, it’s not an embarrassingly parallel task 
that’s farmed out from a queue), and most importantly, the management has to be 
scalable.   While I might have 4 Beagles on the bench right now, the idea is 
to scale the approach to hundreds.  Typing “sudo apt-get install tbd-package” 
on 4 nodes sequentially might be OK (although pdsh and csshX help a lot), but 
it’s not viable for 100 nodes.
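
(Not part of Jim's setup - just a sketch of the kind of fan-out this implies, 
assuming pdsh is installed, the nodes share SSH keys and sudo is passwordless; 
the hostnames are made up.)

    # Run the same package install across 100 nodes in parallel rather
    # than logging in to each one sequentially.
    pdsh -w beagle[001-100] 'sudo apt-get -y install tbd-package'

    # Collapse identical output so the odd failing node stands out.
    pdsh -w beagle[001-100] 'uname -r' | dshbak -c
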
The other aspect of my application that’s interesting, and applicable to 
exascale kinds of problems, is tolerance to failures – if I have a low data 
rate link among nodes (with not necessarily all-to-all connectivity), one can 
certainly distribute a new OS image (or container) given time. There are some 
ways to deal with errors in the transfers (other than just retransmitting 
everything – which doesn’t work if the error rate is high enough that you can 
guarantee at least one error will occur in a long transfer).  But how do you 
*manage* a cluster with hundreds or thousands of nodes where some fail 
randomly, reset randomly, etc.?
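
(Again a sketch rather than anything Jim described: one low-tech way to avoid 
retransmitting a whole image is to chunk it and ship a checksum manifest, so 
only the chunks that arrived corrupted need to be resent. The image name is a 
placeholder.)

    # Sender: split the image into 4 MB chunks and record their checksums.
    split -b 4M node-image.img chunk.
    sha256sum chunk.* > manifest.sha256

    # Receiver: after the chunks arrive over whatever low-rate link,
    # verify them and print only the chunk names that must be resent.
    sha256sum -c manifest.sha256 2>/dev/null | awk -F': ' '/FAILED/ {print $1}'
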
All of a sudden simple “send the same command to all nodes” just doesn’t 
work.  And that’s what will inevitably be the case as we scale up in the HPC 
world – there will always be dead or malfunctioning nodes.

Re: [Beowulf] Bright Cluster Manager

2018-05-04 Thread Douglas Eadline

Good points. I should have mentioned I was talking more about
"generic mainstream HPC" (or, as you say, "cloud")
and not about the performance cases where running
on bare metal is essential.

--
Doug



___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-04 Thread Chris Samuel
On Thursday, 3 May 2018 11:04:38 PM AEST Douglas Eadline wrote:

> Here is where I see it going
> 
> 1. Compute nodes with a base minimal generic Linux OS
>    (with PR_SET_NO_NEW_PRIVS in the kernel, added in 3.5)

Depends on your containerisation method; some don't need to rely on that, as 
they proactively disarm containers of dangerous abilities (setuid/setgid/
capabilities) before the user gets near them.

That said, even RHEL6 has support for that, so you'd be hard pressed to find an 
up-to-date system that doesn't have that ability.
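
(A quick way to see this property on a node, as a hedged sketch using 
util-linux's setpriv, assuming it is installed and the kernel is recent enough 
to expose the NoNewPrivs field: once no_new_privs is set, setuid binaries such 
as sudo can no longer escalate, which is what the container runtimes rely on.)

    # Does the current process already have no_new_privs set?
    grep NoNewPrivs /proc/self/status

    # Run a command with no_new_privs set; setuid escalation via sudo
    # should now fail instead of granting root.
    setpriv --no-new-privs sudo -n id || echo "no_new_privs blocked escalation"
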

> 2. A Scheduler (that supports containers)
> 
> 3. Containers (Singularity mostly)
> 
> All "provisioning" is moved to the container. There will be edge cases of
> course, but applications will be pulled down from
> a container repo and "just run"

This then relies on people building containers that have the right libraries 
for the hardware you are using.  For instance I tried to use some Singularity 
containers on our system for MPI work but can't because the base OS is too old 
to include support for our OmniPath interconnect.
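
(The usual workaround - sketched here rather than taken from Chris's setup - is 
the hybrid model: the host's MPI and interconnect stack launches the ranks and 
each rank execs inside the container, assuming the MPI inside the image is 
ABI-compatible with the host MPI; interconnect libraries can additionally be 
bind-mounted in with --bind. Image and binary names are invented.)

    # Host-side mpirun starts the ranks; each rank runs inside the container.
    mpirun -np 64 singularity exec mpi_app.simg /opt/app/bin/mpi_app
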

The other issue is that it encourages people to build generic binaries rather 
than optimised binaries to broaden the systems the container can run on and/or 
because they don't have a proprietary compiler (or the distro has a version of 
GCC too old to optimise for the hardware).

I would argue that there is a place for that sort of work, but that it's the 
cloud not so much HPC (as they're not trying to get the most out of the 
hardware).

I'm conflicted on this because I also have great sympathies for the 
reproducibility side of the coin!

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-04 Thread Chris Samuel
On Thursday, 3 May 2018 11:53:14 PM AEST John Hearns via Beowulf wrote:

> The best successes I have seen on clusters are where the heavy parallel
> applications get exclusive compute nodes. Cleaner, you get all the memory
> and storage bandwidth and it is easy to clean up. Hell, reboot the things
> after each job. You got an exclusive node.

You are describing the BlueGene/Q philosophy there John. :-)

This idea tends to break when you throw GPUs into the mix, as there 
(hopefully) you only need a couple of cores on the node to shovel data around 
and the GPU does the gruntwork.  That means you'll generally have cores left 
over that could be doing something useful.

On the cluster I'm currently involved with we've got 36 cores per node and a 
pair of P100 GPUs.  We have 2 Slurm partitions per node, one for non-GPU jobs 
that can only use up to 32 cores per node and another for GPU jobs that has no 
restriction.   This means we always keep at least 4 cores per node free for 
GPU jobs.
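
(In slurm.conf terms that split can be expressed per partition; a rough sketch 
with invented node and partition names, not Chris's actual config, using 
MaxCPUsPerNode to hold cores back for the GPU partition.)

    # 36-core nodes with two P100s each; CPU-only jobs may use at most
    # 32 cores per node, which leaves 4 cores free for GPU jobs.
    NodeName=gpu[001-020] CPUs=36 Gres=gpu:p100:2 State=UNKNOWN
    PartitionName=cpu Nodes=gpu[001-020] MaxCPUsPerNode=32 Default=YES State=UP
    PartitionName=gpu Nodes=gpu[001-020] State=UP
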

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-04 Thread Chris Samuel
On Thursday, 3 May 2018 5:52:41 AM AEST Jeff White wrote:

> Nothing special.  xcat/Warewulf/Scyld/Rocks just get in the way more than
> they help IMO.

To my mind, having built clusters with xCAT and then used systems that have 
been put together in a DIY manner, I always run into tooling that I'm missing 
with the latter - usually around node discovery (and BMC config), centralised 
logging and IPMI/HMC tooling (remote power control, SoL console logging, IPMI 
sensor information, event logs, etc).

Yes you can roll your own there, but having a consistent toolset that takes 
the drudgery out of rolling your own and means you don't need to think "wait, 
is this an IPMI v2 node or managed via an HMC?" and then use different methods 
depending on the answer is a big win.
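
(For anyone who hasn't used it, the consistent tooling looks roughly like this 
- standard xCAT commands, node names invented - and the same commands work 
whether the node is driven via IPMI or an HMC.)

    # Power state and remote power control for a range of nodes.
    rpower node01-node99 status
    rpower node42 reset

    # Serial-over-LAN console and hardware event log for one node,
    # without caring which management interface sits underneath.
    rcons node42
    reventlog node42 all
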

It's the same reason things like EasyBuild and Spack exist; we've spent 
decades building software from scratch and creating little shell scripts to do 
the config/build for each new version, but abstracting that and building a 
framework to make it easy is a good thing at scale.   It also means you can 
add things like checksums for tarballs and catch projects that re-release 
their 1.7.0 tarball with new patches without changing the version number (yes, 
TensorFlow, I'm looking at you).
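
(The checksum habit is cheap to adopt even outside those frameworks; a tiny 
sketch, with a placeholder tarball name and no claim about real TensorFlow 
hashes.)

    # Record the hash the day the release is first downloaded...
    sha256sum project-1.7.0.tar.gz > project-1.7.0.tar.gz.sha256

    # ...and verify it on every later download or mirror sync; a silently
    # re-released tarball shows up as FAILED here.
    sha256sum -c project-1.7.0.tar.gz.sha256
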

But unpopular opinions are good, and the great thing about the Beowulf 
philosophy is that you have the ability to do things your own way.  It's like 
building a Linux system with Linux From Scratch: yes, you could install Ubuntu 
or some other distro that makes it easy, but you learn a hell of a lot from 
doing it the hard way - and anyone with a strong interest in Linux should try 
that at least once in their life.

Aside: Be aware if you are using Puppet that some folks on the Slurm list have 
found that when it runs it can move HPC jobs out of their Slurm control group (cgroup).

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-04 Thread Chris Samuel
On Thursday, 3 May 2018 6:19:48 AM AEST Chris Dagdigian wrote:

> - Keeping the DIY ZFS appliances online and running took the FULL TIME
> efforts of FIVE STORAGE ENGINEERS

That sounds very fishy.  Either they had really flakey hardware or something 
else weird was going on there.

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread Douglas Eadline


And, I forgot to mention, the other important aspect here is
reproducibility. Create/modify a code, put it in
a signed container (like Singularity), use it,
write the paper. Five years later the machine on which it ran is
gone and your new grad student wants to re-run some data. Easy: because it is
in a container, just run it on any system that supports your
containers. No need to ask a kindly sysadmin to help you track down
libraries, compile, and run an older code.
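
(A hedged sketch of that workflow, assuming a Singularity release new enough to 
support image signing - the SIF-based 3.x series - with made-up file names and 
application arguments.)

    # Build the container once, sign it, and archive it alongside the paper.
    singularity build paper-code.sif paper-code.def
    singularity sign paper-code.sif

    # Years later, on a different machine: check the signature, then
    # re-run the analysis with exactly the same userspace.
    singularity verify paper-code.sif
    singularity run paper-code.sif --input new_data/
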

--
Doug



___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread Joe Landman
I agree with both John and Doug.  I've believed for a long time that 
OSes are merely specific details of a particular job, and you should be 
ready to change them out at a moment's notice, as part of a job.   This 
way you can always start in a pristine and identical state across your 
fleet of compute nodes.


Moreover, with the emergence of Docker, k8s, and others on Linux, I've 
been of the opinion that most of the value of distributions has been 
usurped, in that you can craft an ideal environment for your job, which 
is then portable across nodes.


Singularity looks like it has done the job correctly as compared to 
Docker et al. so now you can far more securely distribute your jobs as 
statically linked black boxes to nodes.   All you need is a good 
substrate to run them on.


Not astroturfing here ... have a look at 
https://github.com/joelandman/nyble , an early stage project of mine*, 
which is all about building a PXE or USB bootable substrate system, 
based upon your favorite OS (currently supporting Debian9, CentOS7, 
others to be added).  No real docs just yet, though I expect to add them 
soon.  Basically,


    git clone https://github.com/joelandman/nyble
    cd nyble
    # edit the makefile to set the DISTRO= variable, and config/all.conf
    # edit urls.conf and OS/${DISTRO}/distro_urls.conf as needed to
    # point to local package repos and kernel repos

    make

then a sane PXE-bootable kernel and initramfs appear some time later in 
/mnt/root/boot.   My goal here is to make sure we view the substrate OS 
as a software appliance substrate upon which to run containers, jobs, etc.
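
(To give a feel for how those artifacts get consumed - not part of nyble 
itself, and the file names are placeholders - a PXELINUX entry on the 
management/TFTP node pointing at the generated kernel and initramfs might look 
like this.)

    # pxelinux.cfg/default fragment
    DEFAULT nyble
    LABEL nyble
      KERNEL nyble/vmlinuz
      APPEND initrd=nyble/initramfs.img console=tty0 console=ttyS0,115200
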


Why this is better than other substrates for Singularity/kvm/etc. comes 
down to the fact that you start from a known immutable image.  That is, 
you always boot the same image, unless you decide to change it.  You 
configure everything you need after boot.  You don't need to worry about 
various package manager states and collisions.   You only need to 
install what you need for the substrate (HVM, containers, drivers, 
etc.).  Also, there is no OS disk management, which for very large 
fleets, is an issue.  Roll forwards/backs are trivial and exact, testing 
is trivial, and can be done in prod on a VM or canary machine.  This can 
easily become part of a CD system, so that your hardware and OS 
substrate can be treated as if it were code.  Which is what you want.




* This is a reworking of the SIOS framework from Scalable Informatics 
days.  We had used that successfully for years to PXE boot all of our 
systems from a single small management node.   It's not indicated there 
yet, but it has an Apache 2.0 license.  A commit this afternoon should 
show this.



Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread John Hearns via Beowulf
I agree with Doug. The way forward is a lightweight OS with containers for
the applications.
I think we need to learn from the new kids on the block - the webscale
generation.
They did not go out and look at how massive supercomputer clusters are put
together.
No, they went out and built scale-out applications on public clouds.
We see 'applications designed to fail' and 'serverless'.

Yes, I KNOW that scale-out applications like these are Web-type
applications, and all the application examples you
see are based on the load balancer/web server/database (or whatever style)
paradigm.

The art of this will be deploying the more tightly coupled applications
which HPC has - the ones which depend upon MPI communications over a reliable
fabric, or which depend upon GPUs, etc.

The other hat I will toss into the ring is separating parallel tasks which
require computation on several
servers and MPI communication between them versus 'embarrassingly parallel'
operations which may run on many, many cores
but do not particularly need communication between them.

The best successes I have seen on clusters are where the heavy parallel
applications get exclusive compute nodes.
It is cleaner, you get all the memory and storage bandwidth, and it is easy to
clean up. Hell, reboot the things after each job - you have got an exclusive node.
I think many designs of HPC clusters still try to cater for all workloads:
oh yes, we can run an MPI weather forecasting/ocean simulation,
but at the same time we have this really fast IO system and we can run your
Hadoop jobs too.

I wonder if we are going to see a fork in HPC, with the massively parallel
applications being deployed, as Doug says, on specialised
lightweight OSes which have dedicated high-speed, reliable fabrics, and with
containers.
You won't really be able to manage those systems like individual Linux
servers. Will you be able to ssh in for instance?
ssh assumes there is an ssh daemon running. Does a lightweight OS have ssh?
Authentication Services? The kitchen sink?

The less parallel applications will be run more and more on cloud-type
installations, either on-premise clouds or public clouds.
I confound myself here, as I can't say what the actual difference between
those two types of machine is: you always need
an interconnect fabric and storage, so why not have the same for both types
of task?
Maybe one further quip to stimulate some conversation. Silicon is cheap.
No, really it is.
Your friendly Intel salesman may wince when you say that. After all, those
lovely Xeon CPUs cost north of 1000 dollars each.
But again I throw in some talking points:

power and cooling cost as much as, if not more than, your purchase cost over
several years

are we exploiting all the capabilities of those Xeon CPUs?

Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread Douglas Eadline


Here is where I see it going

1. Compute nodes with a base minimal generic Linux OS
   (with PR_SET_NO_NEW_PRIVS in the kernel, added in 3.5)

2. A Scheduler (that supports containers)

3. Containers (Singularity mostly)

All "provisioning" is moved to the container. There will be edge cases of
course, but applications will be pulled down from
a container repos and "just run"
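
(A hedged sketch of what that end state looks like from the user's side; the 
scheduler directives are plain Slurm, while the image and application names are 
only illustrative.)

    #!/bin/bash
    #SBATCH --job-name=myapp
    #SBATCH --nodes=1
    #SBATCH --ntasks=4

    # The node provides little more than the kernel, the scheduler and a
    # container runtime; everything else ships inside the image, fetched
    # beforehand from a container repo (e.g. with "singularity pull").
    srun singularity exec myapp.simg /opt/myapp/bin/run_case input.dat
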

--
Doug




___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread John Hearns via Beowulf
Regarding storage, Chris Dagdigian comments:

> And you know what? After the Isilon NAS was deployed the management of
> *many* petabytes of single-namespace storage was now handled by the IT
> Director in his 'spare time' -- And the five engineers who used to do
> nothing but keep ZFS from falling over were re-assigned to more
> impactful and presumably more fun/interesting work.

The person who runs the huge JASMIN climate research project in the UK
makes the same comment, only with Panasas storage.
He is able to manage petabytes of Panasas storage with himself and one
other person. A lot of that storage was installed by my fair hands.
To be honest though, installing Panasas is a matter of how fast you can
unbox the blades. (*)

(*) Well, that is not so in real life! During that install we had several
'funnies' - all of which were diagnosed and a fix given by the superb
Panasas support.
Including the shelf where, after replacing every component over the period
of two weeks - something like Trigger's Broom
(http://foolsandhorses.weebly.com/triggers-broom.html) -
we at last found the bent pin in the multiway connector (ahem).






Re: [Beowulf] Bright Cluster Manager

2018-05-03 Thread John Hearns via Beowulf
Jorg,  I did not know that you used Bright.  Or I may have forgotten!
I thought you were a Debian fan.  Of relevance, Bright 8 now supports
Debian.

You commented on the Slurm configuration file being changed.
I found during the install at Greenwich, where we put in a custom
slurm.conf, that Bright has an option
to 'freeze' files. This is defined in the cmd.conf file.  So if new nodes
are added, or other changes made,
the slurm.conf file is left unchanged and you have to manually manage it.
I am not 100% sure what happens with an update of the RPMs, but I would
imagine the freeze state is respected.
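
(For anyone hunting for it, the cmd.conf entry is - if memory serves, so do 
check the Bright admin manual for your version - a FrozenFile list, something 
along these lines; the slurm.conf path will differ per site.)

    # /cm/local/apps/cmd/etc/cmd.conf (fragment, from memory)
    # Tell CMDaemon to leave the hand-edited Slurm config alone.
    FrozenFile = { "/etc/slurm/slurm.conf" }
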


> I should add I am working in academia and I know little about the commercial
> world here. Having said that, my friends in commerce are telling me that the
> company likes to outsource as it is 'cheaper'.
I would not say cheaper. However (see below) HPC skills are scarce.
And if you are in industry you commit to your management that HPC resources
will be up and running
for XX % of the year - i.e. you have some explaining to do if there is extended
downtime.
HPC is looked upon as something comparable to machine tools - in Formula 1
we competed for budget against
five-axis milling machines, for instance. Can you imagine what would happen
if the machine shop supervisor said
"Sorry - no parts being made today. My guys have the covers off and we are
replacing one of the motors with one we got off Ebay"?


So yes you do want commercial support for aspects of your setup - let us
say that jobs are going into hold states
on your batch system, or jobs are immediately terminating. Do you:

a) spend all day going through logs with a fine tooth comb, and send out an
email to the Slurm/PBS/SGE list and hope you get
some sort of answer

b) take a dump of the relevant logs and get a ticket opened with your
support people

Actually in real life you do both, but path (b) is going to get you up and
running quicker.

Also for storage, in industry you really want support on your storage.




> Anyhow, I don't want to digress here too much. However, "..do HPC work in
> commercial environments where the skills simply don't exist onsite."
> Are we a dying art?

Jorg, yes. HPC skills are rare, as are the people who take the time and
trouble to learn deeply about the systems they operate.
I know this as recruitment consultants tell me this regularly.
I find that often in life people do the minimum they need, and once they
are given instructions they never change,
even when the configuration steps they carry out have lost meaning.
I have met that attitude in several companies. Echoing Richard Feynman, I
call this 'cargo cult systems'.
The people like you who are willing to continually learn and to abandon old
ways of working
are invaluable.

I am consulting at the moment with a biotech firm in Denmark. Replying to
Chris Dagdigian, this company does have excellent in-house
Linux skills, so I suppose is the exception to the rule!



Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread Jörg Saßmannshausen
Dear Chris,

further to your email:

> - And if miracles occur and they do have expert level linux people then
> more often than not these people are overworked or stretched in many
> directions

This is exactly what has happened to me at the old work place: pulled into too 
many different directions. 

I am a bit surprised about the ZFS experiences. Although I did not have 
petabytes of storage and I did not generate 300 TB per week, I did have a 
fairly large storage space running on xfs and ext4 for backups and 
provisioning of file space. Some of it was running on old hardware (please sit 
down, I am talking about me messing around with SCSI cables) and I gradually 
upgraded to newer hardware. So, I am not quite sure what went wrong with the 
ZFS storage here. 

However, there is a common trend, at least from what I observe here in the UK, 
to out-source problems: pass the buck to somebody else and pay for it. 
I am personally still more of an in-house expert than an out-sourced person 
who may or may not be able to understand what you are doing. 
I should add I am working in academia and I know little about the commercial 
world here. Having said that, my friends in commerce are telling me that the 
company likes to outsource as it is 'cheaper'. 
I agree with the Linux expertise. I think I am one of the two who are Linux 
admins in the present work place. The official line is: we do not support Linux 
(but we teach it). 

Anyhow, I don't want to digress here too much. However, "..do HPC work in 
commercial environments where the skills simply don't exist onsite."
Are we a dying art?

My 1 shilling here from a still cold and dark London.

Jörg




Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread Jörg Saßmannshausen
Dear all,

at least something I can contribute here: at the new work place the small 
cluster I am looking after is using Bright Cluster Manager to manage the 20 
nodes and the 10 or so GPU nodes.

I was not around when it all got installed so I cannot comment on how quickly 
it can be done or how easily. 

I used to do larger installations with up to 112 compute nodes which had 
different physical hardware, so I needed at least 2 images. I did all of that 
with a bit of scripting and not with a GUI. I did not use LDAP and 
authentication was done locally. It all provided a robust system. Maybe not as 
easy to manage as a system which has a GUI which does it all for you, but on 
the flip side I knew exactly what the scripts were doing and what I needed to 
do if there was a problem. 

By and large I agree with what John Hearns said, for example. To be frank: I 
still consider the Bright Cluster Manager tool to be good for people who do 
not know about HPC (I stick to that for this argument), don't know much about 
Linux etc. So in my personal opinion it is good for those whose day-to-day job 
is not HPC but something different, people who are coming from a GUI world (I 
don't mean that nastily), and for situations where it is not reckoned 
worthwhile to have dedicated support. So for this it is fantastic: it works, 
and there is good support if things go wrong. 
We are using Slurm, and the only issue I had when I first started at the new 
place a year ago was that during a routine update Slurm got re-installed and 
all the configurations were gone. This could be because it was not installed 
properly in the first place, or it could be a bug; we don't know, as the 
support did not manage to reproduce this. 
I am having some other minor issues with the authentication (we are 
authenticating against an external AD) but again that could be the way it was 
installed at the time. I don't know who did that. 

Having said all of that: I am personally more of a hands-on person, so I know 
what the system is doing. This usually gets obscured by a GUI which does things 
in the background you may or may not want it to do. I had some problems at the 
old work place with ROCKS which led me to remove it and install Debian on 
the clusters. They were working rock solid, even on hardware which had issues 
with the ROCKS installation. 

So, for me the answer to the question is: it depends. If you have got a capable 
HPC admin who is well networked and you have got a larger, specialized cluster, 
you might be better off using the money to buy some additional compute nodes. 
For an installation where you do not have a dedicated admin, and you might have 
a smaller, homogeneous installation, you might be better off with a cluster 
management tool like the one Bright is offering. 
If money is an issue, you need to carefully balance the two: a good HPC admin 
does more than installing software; they do user support as well, for example, 
and make sure users can work. If you are lucky, you get one who actually 
understands what the users are doing. 

I think that is basically what everybody here says in different words: your 
mileage will vary.

My two shillings from a rather cold London! :-)

Jörg

On Tuesday, 1 May 2018, 16:57:40 BST, Robert Taylor wrote:
> Hi Beowulfers.
> Does anyone have any experience with Bright Cluster Manager?
> My boss has been looking into it, so I wanted to tap into the collective
> HPC consciousness and see
> what people think about it.
> It appears to do node management, monitoring, and provisioning, so we would
> still need a job scheduler like lsf, slurm,etc, as well. Is that correct?
> 
> If you have experience with Bright, let me know. Feel free to contact me
> off list or on.

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread Chris Dagdigian

Jeff White wrote:


I never used Bright.  Touched it and talked to a salesperson at a 
conference but I wasn't impressed.


Unpopular opinion: I don't see a point in using "cluster managers" 
unless you have a very tiny cluster and zero Linux experience.  These 
are just Linux boxes with a couple applications (e.g. Slurm) running 
on them.  Nothing special. xcat/Warewulf/Scyld/Rocks just get in the 
way more than they help IMO.  They are mostly crappy wrappers around 
free software (e.g. ISC's dhcpd) anyway.  When they aren't it's 
proprietary trash.


I install CentOS nodes and use 
Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and 
software.  This also means I'm not stuck with "node images" and can 
instead build everything as plain old text files (read: write 
SaltStack states), update them at will, and push changes any time.  My 
"base image" is CentOS and I need no "baby's first cluster" HPC 
software to install/PXEboot it.  YMMV





Totally legit opinion and probably not unpopular at all given the user 
mix on this list!


The issue here is assuming a level of domain expertise with Linux, 
bare-metal provisioning, DevOps and (most importantly) HPC-specific 
config stuff that may be pervasive or easily available in your 
environment but is often not easily available in a 
commercial/industrial  environment where HPC or "scientific computing" 
is just another business area that a large central IT organization must 
support.


If you have that level of expertise available then the self-managed DIY 
method is best. It's also my preference


But in the commercial world where HPC is becoming more and more 
important you run into stuff like:


- Central IT may not actually have anyone on staff who knows Linux (more 
common than you expect; I see this in Pharma/Biotech all the time)


- The HPC user base is not given budget or resource to self-support 
their own stack because of a drive to centralize IT ops and support


- And if they do have Linux people on staff they may be novice-level 
people or have zero experience with HPC schedulers, MPI fabric tweaking 
and app needs (the domain stuff)


- And if miracles occur and they do have expert level linux people then 
more often than not these people are overworked or stretched in many 
directions



So what happens in these environments is that organizations will 
willingly (and happily) pay commercial pricing and adopt closed-source 
products if they can deliver a measurable reduction in administrative 
burden, operational effort or support burden.


This is where Bright, Univa etc. all come in -- you can buy stuff from 
them that dramatically reduces what onsite/local IT has to manage the 
care and feeding of.


Just having a vendor to call for support on Grid Engine oddities makes 
the cost of Univa licensing worthwhile. Just having a vendor like Bright 
be on the hook for "cluster operations" is a huge win for an overworked 
IT staff that does not have Linux or HPC specialists on-staff or easily 
available.


My best example of "paying to reduce operational burden in HPC" comes 
from a massive, well-known genome shop in the Cambridge, MA area. They 
often tell this story:


- 300 TB of new data generation per week (many years ago)
- One of the initial storage tiers was ZFS running on commodity server 
hardware
- Keeping the DIY ZFS appliances online and running took the FULL TIME 
efforts of FIVE STORAGE ENGINEERS


They realized that staff support was not scalable with DIY/ZFS at 
300TB/week of new data generation so they went out and bought a giant 
EMC Isilon scale-out NAS platform.


And you know what? After the Isilon NAS was deployed the management of 
*many* petabytes of single-namespace storage was now handled by the IT 
Director in his 'spare time' -- and the five engineers who used to do 
nothing but keep ZFS from falling over were re-assigned to more 
impactful and presumably more fun/interesting work.



They actually went on stage at several conferences and told the story of 
how Isilon allowed senior IT leadership to manage petabyte volumes of 
data "in their spare time" -- this was a huge deal and really resonated. 
It reinforced for me how, in some cases, it's actually a good idea 
to pay $$$ for commercial stuff if it delivers gains in 
ops/support/management.



Sorry to digress! This is a topic near and dear to me. I often have to 
do HPC work in commercial environments where the skills simply don't 
exist onsite. Or more commonly -- they have budget to buy software or 
hardware but they are under a hiring freeze and are not allowed to bring 
in new Humans.


Quite a bit of my work on projects like this is helping people make 
sober decisions regarding "build" or "buy" -- and in those environments 
it's totally clear that for some things it makes sense for them to pay 
for an expensive commercially supported "thing" that they don't have to 
manage or support themselves.



My $.02 ...







Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread Jeff White
I never used Bright.  Touched it and talked to a salesperson at a 
conference but I wasn't impressed.


Unpopular opinion: I don't see a point in using "cluster managers" 
unless you have a very tiny cluster and zero Linux experience.  These 
are just Linux boxes with a couple applications (e.g. Slurm) running on 
them.  Nothing special. xcat/Warewulf/Scyld/Rocks just get in the way 
more than they help IMO.  They are mostly crappy wrappers around free 
software (e.g. ISC's dhcpd) anyway.  When they aren't it's proprietary 
trash.


I install CentOS nodes and use 
Salt/Chef/Puppet/Ansible/WhoCares/Whatever to plop down my configs and 
software.  This also means I'm not stuck with "node images" and can 
instead build everything as plain old text files (read: write SaltStack 
states), update them at will, and push changes any time.  My "base 
image" is CentOS and I need no "baby's first cluster" HPC software to 
install/PXEboot it.  YMMV
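
To make that concrete, here is a minimal sketch of the kind of plain-text
state I mean, written with Salt's pure-Python renderer (the package names,
paths and salt:// source below are illustrative assumptions, not a canonical
setup; the YAML or Ansible equivalent is about as short):

#!py
# slurm.sls -- rendered by Salt's pure-Python renderer.
# Installs the scheduler packages, drops the config file, keeps slurmd running.
# Package/service names and paths are illustrative; adjust to your own repos.
def run():
    return {
        'slurm-packages': {
            'pkg.installed': [
                {'pkgs': ['slurm', 'slurm-slurmd', 'munge']},
            ],
        },
        '/etc/slurm/slurm.conf': {
            'file.managed': [
                {'source': 'salt://slurm/files/slurm.conf'},
                {'user': 'root'},
                {'mode': '0644'},
                {'require': [{'pkg': 'slurm-packages'}]},
            ],
        },
        'slurmd': {
            'service.running': [
                {'enable': True},
                {'watch': [{'file': '/etc/slurm/slurm.conf'}]},
            ],
        },
    }

Apply it with something like "salt 'node*' state.apply slurm" and every node
converges to the same result, no golden image involved.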



Jeff White

On 05/01/2018 01:57 PM, Robert Taylor wrote:

Hi Beowulfers.
Does anyone have any experience with Bright Cluster Manager?
My boss has been looking into it, so I wanted to tap into the 
collective HPC consciousness and see

what people think about it.
It appears to do node management, monitoring, and provisioning, so we 
would still need a job scheduler like lsf, slurm,etc, as well. Is that 
correct?


If you have experience with Bright, let me know. Feel free to contact 
me off list or on.






___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread Andrew Holway
On 1 May 2018 at 22:57, Robert Taylor  wrote:

> Hi Beowulfers.
> Does anyone have any experience with Bright Cluster Manager?
>

I used to work for ClusterVision from which Bright Cluster Manager was
born. Although my experience is now quite some years out of date I would
still recommend it mainly because Martijn de Vries is still CTO after 8
years and they have a very stable team of gifted developers. The company
has a single focus and they have been at it for a long time.

Back in the day I was able to deploy a complete cluster within a couple of
hours using BCM. All the nodes would boot over PXE and perform an
interesting "pivot root" operation to switch to the freshly installed HDD
from the PXE target. The software supported roles which would integrate
with SLURM, allowing GPU node pools, for instance. It was quite impressive
that people were able to get their code running so quickly.
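
For anyone who has not seen the trick before: Bright's actual implementation
is its own, but generically the pivot-root step is the standard pivot_root(8)
dance -- mount the freshly installed disk, make it the new root, and exec the
real init from there. A rough, conceptual Python sketch (device, mount point
and init path are assumptions; in practice this runs as root from an
initramfs-like environment, not a full userspace):

# Conceptual sketch only: switch the running system onto a freshly
# installed root filesystem instead of rebooting.
import os
import subprocess

NEW_ROOT = "/mnt/newroot"   # where the installed HDD root gets mounted
DEVICE = "/dev/sda2"        # freshly installed root partition (assumption)

os.makedirs(NEW_ROOT, exist_ok=True)
subprocess.run(["mount", DEVICE, NEW_ROOT], check=True)
os.makedirs(os.path.join(NEW_ROOT, "oldroot"), exist_ok=True)

os.chdir(NEW_ROOT)
# pivot_root(8) from util-linux: "." becomes the new /, the old root
# is parked under ./oldroot so it can be unmounted later.
subprocess.run(["pivot_root", ".", "oldroot"], check=True)
os.chroot(".")
os.chdir("/")
os.execv("/sbin/init", ["/sbin/init"])  # hand control to the installed system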

I would say that, as a package, it's definitely worth the money unless you
have a team of engineers kicking around. The CLI and API were a bit rough
and ready, but it's been six years since I last used it.

They also managed to successfully integrate OpenStack, which is a bit of a
feat in itself.
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread John Hearns via Beowulf
Chris Samuel says:
>I've not used it, but I've heard from others that it can/does supply
> schedulers like Slurm, but (at least then) out of date versions.

Chris, this is true to some extent. When a new release of Slurm or, say,
Singularity comes out, you need to wait for Bright to package it up and test
that it works with their setup.
This makes sense if you think about it - Bright is a supported product, and
no company worth its salt would rush out a bleeding-edge version of X
without testing.
I can say that the versions tend to be up to date but not bleeding edge - I
cannot give a specific example at the moment, sorry.

But as I say above, if it really matters to you, you can install your own
version on the master and the node images and create a Module file which
brings it into the users' PATH.
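
As a rough illustration of that last step (paths, version and layout here are
assumptions, not Bright defaults), a small Python helper that writes a
Tcl-style modulefile pointing at a locally built Slurm could look like this:

#!/usr/bin/env python3
# Sketch: publish a locally built Slurm via an environment modulefile so
# users pick it up with 'module load slurm/custom'. Paths are illustrative.
from pathlib import Path

PREFIX = Path("/cm/shared/apps/slurm-custom/18.08")      # your own build
MODULEFILE = Path("/cm/shared/modulefiles/slurm/custom")

modulefile_text = f"""#%Module1.0
proc ModulesHelp {{ }} {{
    puts stderr "Locally built Slurm in {PREFIX}"
}}
prepend-path PATH            {PREFIX}/bin
prepend-path MANPATH         {PREFIX}/share/man
prepend-path LD_LIBRARY_PATH {PREFIX}/lib
"""

MODULEFILE.parent.mkdir(parents=True, exist_ok=True)
MODULEFILE.write_text(modulefile_text)
print(f"Wrote {MODULEFILE}; users can now 'module load slurm/custom'")

Putting both the build and the modulefile on shared storage keeps them
visible from every node.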

On 2 May 2018 at 09:32, John Hearns  wrote:

> Robert,
> I have had a great deal of experience with Bright Cluster Manager and
> I am happy to share my thoughts.
>
>
> My experience with Bright has been as a system integrator in the UK, where
> I deployed Bright for a government defence client,
> for a university in London and on our in-house cluster for benchmarking
> and demos.
> I have a good relationship with the Bright employees in the UK and in
> Europe.
>
> Over the last year I have worked with a very big high tech company in the
> Netherlands, who use Bright to manage their clusters
> which run a whole range of applications.
>
> I would say that Bright is surprisingly easy to install - you should be
> going from bare metal to a functioning cluster within an hour.
> The node discovery mechanism is either to switch on each node in turn and
> confirm its name, or to note down which port in your Ethernet switch a
> node is connected to and let Bright do a MAC address lookup on that port.
> Hint - do the Ethernet port mapping. Make a sensible choice of node-to-port
> numbering on each switch.
> You of course also have to identify the switches to Bright.
> But it is then a matter of switching all the nodes on at once, then go off
> for well deserved coffee. Happy days.
>
> Bright can cope with most network topologies, including booting over
> Infiniband.
> If you run into problems their support guys are pretty responsive and very
> clueful. If you get stuck they will schedule a Webex
> and get you out of whatever hole you have dug for yourself. There is even
> a reverse ssh tunnel built in to their software,
> so you can 'call home' and someone can log in to help diagnose your
> problem.
>
> I back up what Chris Dagdigian says.  You pays your money and you takes
> your choice.
>
> Regarding the job scheduler, Bright comes with pre-packaged and integrated
> Slurm, PBSPro, Gridengine and, I am sure, LSF.
> So right out of the box you have a default job scheduler set up. All you
> have to do is choose which one at install time.
> Bright rather likes Slurm, as I do also. But I stress that it works
> perfectly well with PBSPro, as I have worked in that environment over the
> last year.
> Should you wish to install your own version of Slurm/PBSPro etc. you can
> do that; again, I know this works.
>
> I also stress PBSPro - this is now on a dual support model, so it is open
> source if you don't need the formal support from Altair.
>
> Please ask some more questions - I will tune in later.
>
> Also it should be said that if you choose not to go with Bright a good
> open source alternative is OpenHPC.
> But that is a different beast, and takes a lot more care and feeding.
>
> On 2 May 2018 at 01:24, Christopher Samuel  wrote:
>
>> On 02/05/18 06:57, Robert Taylor wrote:
>>
>> It appears to do node management, monitoring, and provisioning, so we
>>> would still need a job scheduler like lsf, slurm,etc, as well. Is
>>> that correct?
>>>
>>
>> I've not used it, but I've heard from others that it can/does supply
>> schedulers like Slurm, but (at least then) out of date versions.
>>
>> I've heard from people who like Bright and who don't, so YMMV. :-)
>>
>> --
>>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
>> ___
>> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
>> To change your subscription (digest mode or unsubscribe) visit
>> http://www.beowulf.org/mailman/listinfo/beowulf
>>
>
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-02 Thread John Hearns via Beowulf
Robert,
I have had a great deal of experience with Bright Cluster Manager and I
am happy to share my thoughts.


My experience with Bright has been as a system integrator in the UK, where
I deployed Bright for a government defence client,
for a university in London and on our in-house cluster for benchmarking and
demos.
I have a good relationship with the Bright employees in the UK and in
Europe.

Over the last year I have worked with a very big high tech company in the
Netherlands, who use Bright to manage their clusters
which run a whole range of applications.

I would say that Bright is surprisingly easy to install - you should be
going from bare metal to a functioning cluster within an hour.
The node discovery mechanism is either to switch on each node in turn and
confirm its name, or to note down which port in your Ethernet switch a node
is connected to and let Bright do a MAC address lookup on that port.
Hint - do the Ethernet port mapping. Make a sensible choice of node-to-port
numbering on each switch.
You of course also have to identify the switches to Bright.
But it is then a matter of switching all the nodes on at once, then go off
for well deserved coffee. Happy days.

Bright can cope with most network topologies, including booting over
Infiniband.
If you run into problems their support guys are pretty responsive and very
clueful. If you get stuck they will schedule a Webex
and get you out of whatever hole you have dug for yourself. There is even a
reverse ssh tunnel built in to their software,
so you can 'call home' and someone can log in to help diagnose your problem.

I back up what Chris Dagdigian says.  You pays your money and you takes
your choice.

Regarding the job scheduler, Bright comes with pre-packaged and integrated
Slurm, PBSPro, Gridengine and, I am sure, LSF.
So right out of the box you have a default job scheduler set up. All you
have to do is choose which one at install time.
Bright rather likes Slurm, as I do also. But I stress that it works
perfectly well with PBSPro, as I have worked in that environment over the
last year.
Should you wish to install your own version of Slurm/PBSPro etc. you can do
that; again, I know this works.

I also stress PBSPro - this is now on a dual support model, so it is open
source if you don't need the formal support from Altair.

Please ask some more questions - I will tune in later.

Also it should be said that if you choose not to go with Bright a good open
source alternative is OpenHPC.
But that is a different beast, and takes a lot more care and feeding.

On 2 May 2018 at 01:24, Christopher Samuel  wrote:

> On 02/05/18 06:57, Robert Taylor wrote:
>
> It appears to do node management, monitoring, and provisioning, so we
>> would still need a job scheduler like lsf, slurm,etc, as well. Is
>> that correct?
>>
>
> I've not used it, but I've heard from others that it can/does supply
> schedulers like Slurm, but (at least then) out of date versions.
>
> I've heard from people who like Bright and who don't, so YMMV. :-)
>
> --
>  Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
> ___
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-01 Thread Christopher Samuel

On 02/05/18 06:57, Robert Taylor wrote:


It appears to do node management, monitoring, and provisioning, so we
would still need a job scheduler like lsf, slurm,etc, as well. Is
that correct?


I've not used it, but I've heard from others that it can/does supply
schedulers like Slurm, but (at least then) out of date versions.

I've heard from people who like Bright and who don't, so YMMV. :-)

--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Bright Cluster Manager

2018-05-01 Thread Chris Dagdigian


Bright Cluster Manager is a great product and the only knock is it can 
be pretty expensive. The most value/love I've seen for it is in the 
enterprise / corporate space where there is nobody who can do real 
hands-on HPC support/operations and the reduction in 
administrative/operational burden it brings is worth 10x the price tag. 
Corporate IT shops that are forced to manage a research/HPC environment 
love it.


Basically it's fantastic in shops where software dollars are easier to 
come by than specialist Linux or HPC support staff but the hardcore HPC 
snobs are suspicious because Bright does a lot of the knob and feature 
fiddling that they are used to doing themselves -- and there will always 
be legit and valid disagreement over the 'proper' way to do deployment,  
provisioning and configuration management.


I tell my clients that Bright is legit and it's worth sitting through 
their sales pitch / overview presentation to get a sense of what they 
offer. After that the decision is up to them.


My $.02 of course!

Chris


Robert Taylor 
May 1, 2018 at 4:57 PM
Hi Beowulfers.
Does anyone have any experience with Bright Cluster Manager?
My boss has been looking into it, so I wanted to tap into the 
collective HPC consciousness and see

what people think about it.
It appears to do node management, monitoring, and provisioning, so we 
would still need a job scheduler like lsf, slurm,etc, as well. Is that 
correct?


If you have experience with Bright, let me know. Feel free to contact 
me off list or on.




___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf