Renat,
I understand better here. It does look like job arrays are what you need.
Maybe you could ask for the Slurm version on your HPC to be updated. I
guess that may be difficult.
I am also guessing that you are doing some sort of Big Data task, maybe in
Python?
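If job arrays are new to you, a minimal sketch (the script name and range
are just placeholders):

  #!/bin/bash
  #SBATCH --array=1-100          # one array task per input file
  #SBATCH --job-name=myarray
  # each array task picks its own input via the index Slurm sets for it
  python process.py input_${SLURM_ARRAY_TASK_ID}.dat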
On 10 November 2017
David,
the common practice is to install blast, or any other application software
packages, on a drive which is to be exported via NFS to the compute nodes.
Or indeed a section on your parallel filesystem (BeeGFS, GPFS, Lustre etc.)
You might call such an area /opt/shared or /cm/shared
Forgive me for saying this. I do have a bit of experience in building HPC
systems.
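To sketch what I mean (the hostname, network and paths here are assumptions):

  # on the NFS server, in /etc/exports:
  /opt/shared  10.1.0.0/24(ro,no_root_squash)
  # on every compute node:
  mount -t nfs master:/opt/shared /opt/shared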
Distro supplied software packages have improved a lot over the years.
But they do tend to be out of date compared to the latest versions of (say)
Slurm.
I really would say you should consider downloading and
"Otherwise a user can have a sing le job that takes the entire cluster,
and insidesplit it up the way he wants to."
Yair, I agree. That is what I was referring to regarding interactive jobs.
Perhaps not a user reserving the entire cluster,
but a user reserving a lot of compute nodes and not making
> On May 12, 2018, at 00:08, John Hearns <hear...@googlemail.com> wrote:
>
> Eric, I'm sorry to be a little prickly here.
> Each node has an independent home directory
directory too!
On 12 May 2018 at 22:02, John Hearns <hear...@googlemail.com> wrote:
> Well I DID say that you need 'what looks like a home directory'.
> So yes indeed you prove, correctly, that this works just fine!
>
> On 12 May 2018 at 20:17, Eric F. Alemany <ealem..
> On May 11, 2018, at 12:56 AM, Chris Samuel <ch...@csamuel.org> wrote:
>
> On Friday, 11 May 2018 5:11:38 PM AEST John Hearns wrote:
>
> Eric, my advice would be to
Mahmood,
you should check that the slurm.conf files are identical on the head node
and the compute nodes after you run the rocks sync.
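A quick way to check, assuming passwordless ssh to the nodes (node names
and the config path are placeholders):

  # compare slurm.conf checksums against the head node's copy
  md5sum /etc/slurm/slurm.conf
  for h in compute-0-0 compute-0-1; do
    ssh $h md5sum /etc/slurm/slurm.conf
  done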
On 16 May 2018 at 11:07, Mahmood Naderan wrote:
> Yes I did that prior to my first email. However, I thought that is
> similar to the
ould be the same. I've placed the
> script in github, if you want to try it:
> https://github.com/irush-cs/slurm-scripts
>
> Yair.
>
>
> On Mon, Jun 18, 2018 at 3:33 PM, John Hearns
> wrote:
> > Your problem is that you are listening to Lennart Poettering...
> > I c
Matt, I back up what Loris said regarding interactive jobs.
I am sorry to sound ranty here, but my experience teaches me that in cases
like this you must ask why this is being desired.
Hey - you are the systems expert. If you get the user to explain why they
desire this functionality, it actually
And your permissions on the directory /var/spool/slurmctld/ are
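(For reference, the slurmctld state directory normally has to be owned and
writable by the SlurmUser; a hedged check, assuming SlurmUser=slurm:)

  ls -ld /var/spool/slurmctld
  chown slurm:slurm /var/spool/slurmctld
  chmod 0755 /var/spool/slurmctld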
On 15 June 2018 at 09:11, UGI wrote:
> When I start slurmctld, there are some errors in the log. And the job running
> information doesn't get stored to mysql via slurmdbd.
>
> I set
>
>
This of course is very dependent on what your environment and applications
are. Would you be able to say please what problems you are having with
memory?
On 29 May 2018 at 12:26, John Hearns wrote:
> Alexandre, it would be helpful if you could say why this behavi
nk you for your inputs.
>
>
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of* John Hearns
> *Sent:* Tuesday 29 May 2018 12:39
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] Using free memory available when all
Tueur, what are you trying to achieve here? The example you give is
'touch /tmp/newfile.txt'
I think you are trying to send a signal to another process. Could this be
'Hey - the job has finished and there is a new file for you to process'
If that is so, there may be better ways to do this. If you
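For example, a dependent job is often cleaner than watching for a file to
appear (the script names here are made up):

  # submit the main job, then a follow-up that starts only if it succeeds
  jobid=$(sbatch --parsable main_job.sh)
  sbatch --dependency=afterok:${jobid} postprocess.sh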
> during the job, I would like to run a program on the machine running the
> job
> but I'd like the program to keep running even after the job ends.
>
> 2018-06-04 15:30 GMT+02:00 John Hearns :
>
>> Tueur what are you trying to achieve here? The example you give is
>> to
urm itself. For example accounting
> and other things.
>
>
> Regards,
> Mahmood
>
>
>
>
>
> On Tue, May 1, 2018 at 9:35 PM, Cooper, Trevor <tcoo...@sdsc.edu> wrote:
> >
> >> On May 1, 2018, at 2:58 AM, John Hearns <hear...@googlemail.com> wro
Mahmood, do you have Hyperthreading enabled?
That may be the root cause of your problem. If you have hyperthreading,
then when you start to run more than the number of PHYSICAL cores you
will get over-subscription. Now, with certain workloads that is fine - that
is what hyperthreading is all about.
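An easy check:

  # 'Thread(s) per core' greater than 1 means hyperthreading is enabled
  lscpu | grep -E 'Socket|Core|Thread'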
I quickly downloaded that roll and unpacked the RPMs.
I cannot quite see how Slurm is configured, so to my shame I gave up (I did
say that Rocks was not my thing)
On 1 May 2018 at 11:58, John Hearns <hear...@googlemail.com> wrote:
> Rocks 7 is now available, which is based on CentOS 7.4
Andrew, I looked at this about a year ago. You might find the thread in
the archives of this list.
At the time, the Cray plugin for the burst buffer was supported. However
staging in/out for other devices was not being developed.
You can achieve the same staging behaviour by using job
Elisabetta, I will not answer your question directly.
However I think that everyone has heard of the Meltdown bug by now, and
there are updated kernels being made available for this.
You should have a look on the Debian pages to see what they are saying
about this, and choose which kernel you need
Elisabetta, I am not an expert on Debian systems.
I think to solve your problem with the kernels, you need to recreate the
initial ramdisk and make sure it has the modules you need.
So boot the system in kernel 3.2 and then run:
mkinitrd 3.16.0-4-amd64
How was the kernel version 3.16.0-4-amd64
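(A note of caution: on Debian the native tool is update-initramfs rather
than mkinitrd, so the equivalent would be something like:)

  # rebuild the initramfs for the 3.16 kernel, then refresh grub
  update-initramfs -c -k 3.16.0-4-amd64
  update-grub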
That's it. I am calling JohnH's Law:
"Any problem with a batch queueing system is due to hostname resolution"
On 15 January 2018 at 16:30, Elisabetta Falivene
wrote:
> slurmd -Dvvv says
>
> slurmd: fatal: Unable to determine this slurmd's NodeName
>
> b
>
> 2018-01-15
I should also say that Modules should be easy to install on Ubuntu. It will
be the package named "environment-modules"
You probably will have to edit the configuration file a little bit since
the default install will assume all Modules files are local.
You need to set your MODULEPATH to include
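Something along these lines (the shared path is an assumption):

  apt-get install environment-modules
  # make your shared modulefiles visible in addition to the local ones
  module use /shared/modulefiles
  echo $MODULEPATH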
Juan, my knee-jerk reaction is to say 'containerisation' here.
However I guess that means that Slurm would have to be able to inspect the
contents of a container, and I do not think that is possible.
I may be very wrong here. Anyone?
However, have a look at the XALT stuff from TACC
Not specifically Slurm, but it can be useful to have alerts on jobs which
either will never start or which are 'stalled'.
You might want to have an alert on jobs which (say) request more slots or
nodes than physically exist, so the user's job will never run.
Or you can look for 'stalled' jobs where
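squeue gets you most of the way there, for example:

  # pending jobs with the scheduler's reason - watch for requests
  # that can never be satisfied
  squeue --state=PENDING -o '%.10i %.10u %.6D %r'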
Hi Elisabetta. No, you normally do not need to install software on all the
compute nodes separately.
It is quite common to use the 'modules' environment to manage software like
this
http://www.admin-magazine.com/HPC/Articles/Environment-Modules
Once you have numpy installed on a shared drive on
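Once set up, usage on a node is just (the module name is hypothetical):

  module avail
  module load python/2.7
  python -c 'import numpy'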
Brian, not my area of expertise. Do you want 'preemption' - i.e. the VIP
user runs something and other jobs are pre-empted?
https://slurm.schedmd.com/preempt.html
On 25 January 2018 at 16:27, Brian Novogradac
wrote:
> I'm new to Slurm, and looking for some
If it is any help, https://slurm.schedmd.com/sinfo.html
NODE STATE CODES
Node state codes are shortened as required for the field size. These node
states may be followed by a special character to identify state flags
associated with the node. The following node suffixes and states are used:
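To see per-node states directly, for example:

  # one line per node with its current state (idle, alloc, drain*, down* ...)
  sinfo -N -o '%N %T'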
***
/var/log ??
>
>
>
> *From:* slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] *On
> Behalf Of *John Hearns
> *Sent:* Tuesday, July 17, 2018 8:57 AM
>
> *To:* Slurm User Community List
> *Subject:* Re: [slurm-users] 'srun hostname' hangs on the command line
>
>
>
ock
> by the firewall. That will prevent srun from running properly
>
> Sent from my iPhone
>
>
> On 17 Jul 2018, at 10:16, John Hearns wrote:
>
> Ronan, as far as I can see this means that you cannot launch a job.
>
>
>
> What state are the compute nodes in whe
Following on from what Chris Samuel says
/root/sl/sl2 kinda suggests Scientific Linux to me (SL - a Red Hat-like
distribution used by Fermilab and CERN)
Or it could just be sl = slurm
I would run ldd `which slurmctld` and let us know what libraries it is
linked to
On Wed, 5 Sep 2018 at 08:51,
Not an answer to your question - a good diagnostic for cgroups is the
utility 'lscgroups'
On Sat, 8 Sep 2018 at 10:10, Gennaro Oliva wrote:
>
> Hi Mike,
>
> On Fri, Sep 07, 2018 at 03:53:44PM +, Mike Cammilleri wrote:
> > I'm getting this error lately for everyone's jobs, which results in
> >
Mahmood, please please forgive me for saying this. A quick Google shows
that Opteron 61xx have eight or twelve cores.
Have you checked that all the servers have 12 cores?
I realise I am appearing stupid here.
On 11 July 2018 at 10:39, Mahmood Naderan wrote:
> > Try running ps -eaf
Mahmood,
I am sure you have checked this. Try running ps -eaf --forest while
a job is running.
I often find the --forest option helps to understand how batch jobs are
being run.
On 11 July 2018 at 09:12, Mahmood Naderan wrote:
> >Check the Gaussian log file for mention of its using just
(s): 2
> NUMA node(s): 4
> Vendor ID: AuthenticAMD
> CPU family: 21
> Model: 1
> Model name: AMD Opteron(tm) Processor 6282 SE
> Stepping: 2
>
>
> Regards,
> Mahmood
>
>
>
> On We
Loris, Ole, thank you so much. That is the Python script I was thinking of.
On 17 April 2018 at 11:15, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
wrote:
> On 04/17/2018 10:56 AM, John Hearns wrote:
>
>> Please can some kind soul remind me what the Python code for manglin
Nicolo, I cannot say what your problem is.
However in the past with problems like this I would
a) look at ps -eaf --forest
Try to see what the parent processes of these job processes are
Clearly if the parent PID is 1 then --forest is not much help. But the
--forest option is my 'goto'
*Caedite eos. Novit enim Dominus qui sunt eius*
https://en.wikipedia.org/wiki/Caedite_eos._Novit_enim_Dominus_qui_sunt_eius.
I have been wanting to use that line in the context of batch systems and
users for ages.
At least now I can make it a play on killing processes. Rather than being
put on a
chine I launched the srun command from (001).
>
> John -- Yes we are heavily invested in the Trick framework and use their
> Monte-Carlo feature quite extensively, in the past we've used PBS to manage
> our compute nodes, but this is the first attempt to integrate Trick
> Monte-Carlo with SLU
Matteo, a stupid question, but if these are single-CPU jobs why is mpirun
being used?
Is your user using these 36 jobs to construct a parallel job to run charmm?
If the mpirun is killed, yes all the other processes which are started by
it on the other compute nodes will be killed.
I suspect your
is killed, why
> would all others go down as well?
>
>
> That would make sense if a single mpirun is running 36 tasks... but the
> user is not doing this.
>
>
> From: slurm-users on behalf of
> John Hearns
> Sent: Friday, June 2
what you think is happening - remember that log
messages take effort to put in the code,
well at least some keystrokes, so they usually mean something!
On Tue, 16 Oct 2018 at 10:04, John Hearns wrote:
> Rather dumb question from me - you have checked those processes are
> running within a
Kirk,
MailProg=/usr/bin/sendmail
MailProg should be the program used to SEND mail, i.e. /bin/mail, not
sendmail.
If I am not wrong, in the jargon MailProg is a MUA, not an MTA (sendmail is
an MTA).
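So in slurm.conf, something like:

  # MailProg wants a user-level mail client, not the MTA itself
  MailProg=/bin/mail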
On Thu, 18 Oct 2018 at 19:01, Kirk Main wrote:
> Hi all,
>
> I'm a new administrator to Slurm
After doing some Googling
https://jvns.ca/blog/2017/02/17/mystery-swap/ Swapping is weird and
confusing (Amen to that!)
https://jvns.ca/blog/2016/12/03/how-much-memory-is-my-process-using-/
(interesting article)
From the Docker documentation, below.
Bill - this is what you are seeing. Twice as
Hi Jordan.
Regarding filling up the nodes look at
https://slurm.schedmd.com/elastic_computing.html
*SelectType* Generally must be "select/linear". If Slurm is configured to
allocate individual CPUs to jobs rather than whole nodes (e.g.
SelectType=select/cons_res rather than
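i.e. for the whole-node case the slurm.conf line is simply:

  SelectType=select/linear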
Chaofeng, I agree with what Chris says. You should be using cgroups.
I did a lot of work with cgroups and GPUs in PBSPro (yes I know...
splitter!)
With cgroups you only get access to the devices which are allocated to that
cgroup, and you get CUDA_VISIBLE_DEVICES set for you.
Remember also to
Ashton, on a compute node with 256Gbytes of RAM I would not
configure any swap at all. None.
I managed an SGI UV1 machine at an F1 team which had 1Tbyte of RAM -
and no swap.
Also our ICE clusters were diskless - SGI very smartly configured swap
over iSCSI - but we disabled this, the reason
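If you want to follow that advice, turning swap off is a one-liner:

  swapoff -a          # and remove/comment the swap entry in /etc/fstab
  swapon --show       # should now print nothing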
I would say that, yes, you have a good workflow here with Slurm.
As another aside - is anyone working with suspending and resuming containers?
I see on the Singularity site that suspend/resume is on the roadmap (I
am not talking about checkpointing here).
Also it is worth saying that these days
We recently had a very good discussion on swap space and job suspension.
I had a look at the Intel pages on Optane memory.
https://www.intel.com/content/www/us/en/architecture-and-technology/intel-optane-technology.html
It is definitely being positioned as a fast file cache, ie for block
oriented
Hi David. I set up DCV on a cluster of workstations at a facility not far
from you a few years ago (in Woking...).
I'm not sure what the relevance of having multiple GPUs is - I thought the
DCV documentation dealt with that ??
One thing you should do is introduce MobaXterm to your users if they
Chris, I have delved deep into the OOM killer code and interaction with
cpusets in the past (*).
That experience is not really relevant!
However I always recommend looking at this sysctl parameter
min_free_kbytes
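For example:

  sysctl vm.min_free_kbytes                 # current reserve
  sysctl -w vm.min_free_kbytes=1048576      # example value only - tune per node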
Going off topic, if you want an ssh client and an X-server on a Windows
workstation or laptop, I highly recommend MobaXterm.
You can open a remote desktop easily.
Session types are ssh, VNC, RDP, Telnet(!) , Mosh and anything else you can
think of.
Including a serial terminal for those times when
Loris said:
Until now I had thought that the most elegant way of setting up Slurm
users would be via a PAM module analogous to pam_mkhomedir, the simplest
option being to use pam_script.
When in Denmark this year (hello Ole!) I looked at pam_mkhomedir quite closely.
The object was to
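For reference, the usual pam_mkhomedir stanza is a one-liner (the exact PAM
file and placement vary by distro):

  # e.g. in /etc/pam.d/common-session
  session required pam_mkhomedir.so skel=/etc/skel umask=0022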
Will, there are some excellent responses here.
I agree that moving data to local fast storage on a node is a great idea.
Regarding the NFS storage, I would look at implementing BeeGFS if you can
get some new hardware or free up existing hardware.
BeeGFS is a skoosh case to set up.
(*) Scottish
Yugendra, the Bright support guys are excellent.
Slurm is their default choice. I would ask again. Yes, Slurm is technically
out of scope for them, but they should help a bit.
By the way, I think your problem is that you have configured authentication
using AD on your head node.
BUT you have not
please have a look at section 6.3 of the Bright Admin Manual
You have run updateprovisioners then rebooted the nodes?
Configuring The Cluster To Authenticate Against An External LDAP Server The
cluster can be configured in different ways to authenticate against an
external LDAP server. For smaller
Bright so no answer to your specific
> question but I hope you can get some support with it. We dumped our BC
> PoC, the sysadmin working on the PoC still has nightmares.
>
> On 2/13/19, 6:54 AM, "slurm-users on behalf of John Hearns" <
> slurm-users-boun...@lists.schedmd.co
OK, I am going to stick my neck out here.
You say a 'remote system' - is this a single server? If it is, for what
purpose do you need Slurm?
If you want to schedule some tasks to run one after the other, simply start
a screen session then put the tasks into a script, as in the sketch below.
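Something as simple as (the script name is made up):

  # run the tasks one after another in a detached screen session
  screen -dmS tasks bash ./run_tasks.sh
  screen -r tasks     # reattach later to check progress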
I am sorry if I sound rude
Think of system administrators like grumpy bears in their caves.
They will growl at you and make fierce noises.
But bring them cookies and they will roll over and let their tummies be
tickled.
On Sun, 26 May 2019 at 05:25, Raymond Wan wrote:
>
>
> On 25/5/2019 7:37 PM, John Hea
Priya, you could set up a cluster on Amazon or another cloud for testing.
Please have a look at this
https://elasticluster.readthedocs.io/en/latest/
If you want to set up some virtual machines on your own laptop or server,
Google for "vagrant slurm". There are several vagrant recipes on the
ific instructions
>>> for
>>> installing on Remote server without root access ?
I agree with Christopher Coffey - look at the sssd caching.
I have had experience with sssd and can help a bit.
Also if you are seeing long waits could you have nested groups?
sssd is notorious for not handling these well, and there are settings in
the configuration file which you can experiment with.
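Two knobs worth trying, plus a cache flush (option names from the sssd man
pages):

  # in /etc/sssd/sssd.conf, [domain/...] section:
  #   ignore_group_members = True   # big speedup if you only need group names
  #   enumerate = False
  sss_cache -E            # flush the sssd cache
  systemctl restart sssd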
Janne, thank you. That FGCI benchmark in a container is pretty smart.
I always say that real application benchmarks beat synthetic benchmarks.
Taking a small mix of applications like that and taking a geometric mean is
great.
Note: *"a reference result run on a Dell PowerEdge C4130"*
In the old
Paul, you refer to banking resources. Which leads me to ask are schemes
such as Gold used these days in Slurm?
Gold was a utility where groups could top up with a virtual amount of money
which would be spent as they consume resources.
Altair also wrote a similar system for PBS, which they offered
Two replies here.
First off for normal user logins you can direct them into a cgroup - I
looked into this about a year ago and it was actually quite easy.
As I remember there is a service or utility available which does just that.
Of course the user cgroup would not have
Expanding on my theme, it
Why are you sshing into the compute node compute-0-2 ???
On the head node named rocks7:
srun -c 1 --partition RUBY --account y8 --mem=1G xclock
On Mon, 20 May 2019 at 16:07, Mahmood Naderan wrote:
> Hi
> Although proper configuration has been defined as below
>
> [root@rocks7 software]# grep
_cgroups.html
On Tue, 21 May 2019 at 01:28, Dave Evans wrote:
> Do you have that resource handy? I looked into the cgroups documentation
> but I see very little on tutorials for modifying the permissions.
>
> On Mon, May 20, 2019 at 2:45 AM John Hearns
> wrote:
>
>> Two repli
My apologies. You do say that the Python program simply prints the rank - so
it is a hello world program.
On Fri, 12 Jul 2019 at 07:45, John Hearns wrote:
> Please try something very simple such as a hello world program or
> srun -N2 -n8 hostname
>
> What is the error message wh
Please try something very simple such as a hello world program or
srun -N2 -n8 hostname
What is the error message you get?
On Fri, 12 Jul 2019 at 07:07, Pär Lundö wrote:
>
> Hi there Slurm-experts!
> I am having trouble using or running a python-mpi program involving more
> than one node.
Pär, by 'poking around' Chris means to use tools such as netstat and lsof.
Also I would look at ps -eaf --forest to make sure there are no 'orphaned'
jobs sitting on that compute node.
Having said that though, I have a dim memory of a classic PBSPro error
message which says something about a
It's a DNS problem, isn't it? Seriously though - how long does srun
hostname take for a single system?
On Fri, 26 Apr 2019 at 15:49, Douglas Jacobsen wrote:
> We have 12,000 nodes in our system, 9,600 of which are KNL. We can
> start a parallel application within a few seconds in most cases