[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Lachlan Musicman
Zhang,

I understand what you mean, but I don't think there is a way to configure
Slurm to use the running node's environment rather than the submit node's -
since, by definition, you don't know in advance which node your job will run
on. Some Slurm clusters have 1000s of nodes.

I think that's another reason for the users to be synchronised across the
cluster.

While it's not the case in every setup, these types of clusters often use a
fully or partially shared filesystem - and /home is one of the shared spaces
for exactly this reason.

Have you tried adding the .bashrc to that user's home directory on each
machine? Also, the batch script might not run as a login shell, so maybe start
the sbatch script with #!/bin/bash -l ?
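
For example, a minimal sketch - the job name and partition here are
placeholders for whatever you actually use:

#!/bin/bash -l
#SBATCH --job-name=env-check
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

# The -l above makes bash behave as a login shell on the compute node, so
# /etc/profile and the user's login dotfiles are read there before the job
# body runs.
env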

I think it's important to understand that shells and environments - all of
them, as a concept - are harder than they seem. They are relatively easy to
explain, but they get complex quickly, which is why you are seeing this issue.

Also, part of the reason for using sbatch with batch scripts is to take some
of that pain away.

Of course, there is another option - you can put the things you would like to
be persistent in /etc/profile.d/env.sh or /etc/environment on all your nodes,
so that they are available in every environment.
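
For example - a sketch only, with made-up variable names and paths:

# /etc/profile.d/env.sh - pushed identically to every node
export SCRATCH=/scratch/$USER        # example variable, not a recommendation
export JAVA_HOME=/opt/java/jre1.8    # example path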

Remember: this problem is unlikely to be a fault in SLURM and more likely
to be that environments and shells are hard.

Cheers
L.



--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857

On 15 September 2017 at 11:39, Chaofeng Zhang  wrote:

> Hi Lachlan,
>
>
>
> I am OK with loading additional variables when I need them - I can use
> module load for that. But /etc/profile and /home/user1/.bashrc already
> define many variables, which I think of as the default environment, and
> currently I still have to source them every time before use, which does
> not seem reasonable to me.
>
> Is there a way to configure Slurm to use the running node's environment,
> not the submit node's?
>
>
>
> Thanks.
>
> From: Lachlan Musicman [mailto:data...@gmail.com]
> Sent: Friday, September 15, 2017 6:55 AM
> To: slurm-dev 
> Subject: [slurm-dev] Re: why the env is the env of submit node, not the
> env of job running node.
>
>
>
> On 14 September 2017 at 19:41, Chaofeng Zhang  wrote:
>
> On node A, I submit a job file using the sbatch command, and the job runs
> on node B. You will find that the output is not the env of node B; it is
> the env of node A.
>
>
>
> #!/bin/bash
> #SBATCH --job-name=mnist10
> #SBATCH --partition=compute
> #SBATCH --workdir=/home/share
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=1
> #SBATCH --cpus-per-task=1
> env
>
>
>
> Zhang,
>
> That is how it's meant to work.
>
> If you need a special env, you are meant to set it up in the sbatch script.
>
> We (where I work) use Environment Modules to do this:
>
> module load python3
>
> module load java/java-1.8-jre
>
> module load samtools/1.5
>
>
>
> All of these are prepended to the PATH in the resulting env.
>
> But you could do anything you want - it is just a shell script after all
>
> SET_VAR="/path/"
>
> NEW_PATH=$SET_VAR:$PATH
>
> etc
>
>
>
> Cheers
>
> L.
>
>
>


[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Chaofeng Zhang
Hi Lachlan,

I am OK with loading additional variables when I need them - I can use module
load for that. But /etc/profile and /home/user1/.bashrc already define many
variables, which I think of as the default environment, and currently I still
have to source them every time before use, which does not seem reasonable to me.
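
In other words, today every job script has to start with something like:

#!/bin/bash
source /etc/profile          # system-wide defaults on the compute node
source /home/user1/.bashrc   # per-user defaults
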
Is there a way to configure Slurm to use the running node's environment, not
the submit node's?

Thanks.
From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Friday, September 15, 2017 6:55 AM
To: slurm-dev 
Subject: [slurm-dev] Re: why the env is the env of submit node, not the env of 
job running node.

On 14 September 2017 at 19:41, Chaofeng Zhang 
> wrote:
On node A, I submit a job file using the sbatch command, and the job runs on
node B. You will find that the output is not the env of node B; it is the env
of node A.

#!/bin/bash
#SBATCH --job-name=mnist10
#SBATCH --partition=compute
#SBATCH --workdir=/home/share
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
env

Zhang,
That is how it's meant to work.

If you need a special env, you are meant to set it up in the sbatch script.
We (where I work) use Environment Modules to do this:
module load python3
module load java/java-1.8-jre
module load samtools/1.5

All of these are prepended to the PATH in the resulting env.
But you could do anything you want - it is just a shell script after all
SET_VAR="/path/"
NEW_PATH=$SET_VAR:$PATH
etc

Cheers
L.



[slurm-dev] Re: why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Lachlan Musicman
On 14 September 2017 at 19:41, Chaofeng Zhang  wrote:

> On node A, I submit a job file using the sbatch command, and the job runs
> on node B. You will find that the output is not the env of node B; it is
> the env of node A.
>
>
>
> #!/bin/bash
> #SBATCH --job-name=mnist10
> #SBATCH --partition=compute
> #SBATCH --workdir=/home/share
> #SBATCH --nodes=1
> #SBATCH --ntasks-per-node=1
> #SBATCH --cpus-per-task=1
> env
>


Zhang,

That is how it's meant to work.

If you need a special env, you are meant to set it up in the sbatch script.

We (where I work) use Environment Modules to do this:

module load python3
module load java/java-1.8-jre
module load samtools/1.5

All of these are prepended to the PATH in the resulting env.

But you could do anything you want - it is just a shell script after all

SET_VAR="/path/"
NEW_PATH=$SET_VAR:$PATH

etc
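
Putting that together, a sketch of a complete script - the module names, job
name and paths below are only examples:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

# build the environment inside the job itself
module load python3
module load samtools/1.5

SET_VAR="/path/to/tools"        # example path
export PATH="$SET_VAR:$PATH"

env | sort                      # check what the job actually sees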

Cheers
L.


[slurm-dev] Re: Accounting estimates/calculations

2017-09-14 Thread Barry Moore
Merlin,

This reminded me I wanted to do something like this too. To my knowledge
there is no command that does this. In fact, when I first started figuring
out the billing weights I used sshare to determine the cost of jobs
submitted one by one (not fun). Anyway, I put together a Python script this
afternoon which might help:

```python
#!/usr/bin/env /ihome/sam/bmooreii/workspace/wrappers/py_wrap.sh
''' crc-job-sus.py -- What does my job cost?
Usage:
    crc-job-sus.py [-hv]
    crc-job-sus.py <job_id>

Positional Arguments:
    <job_id>        The job id you want to estimate SU cost

Options:
    -h --help       Print this screen and exit
    -v --version    Print the version of crc-job-sus.py
'''


def generate_squeue_process(cluster):
    process = Popen(split("squeue -u {} -M {} -j {}".format(environ['USER'], cluster, arguments['<job_id>'])),
                    stdout=PIPE, stderr=PIPE)
    return process.communicate()


def tres_to_map(tres):
    tres_list = [x.split('=') for x in tres.split(',')]
    return {k.lower(): v.lower() for k, v in tres_list}


def correct_map(m):
    n = {}
    for k, v in m.items():
        if k == "mem":
            if "m" in v:
                n[k] = float(v.replace("m", "")) / 1024
            else:
                n[k] = float(v.replace("g", ""))
        else:
            n[k] = float(v.replace("g", ""))
    return n


try:
    # Some imports: functions and libraries
    from docopt import docopt
    from subprocess import Popen, PIPE
    from shlex import split
    from os import environ

    arguments = docopt(__doc__, version='crc-job-sus.py version 0.0.1')

    # Find the job
    clusters = ["smp", "gpu", "mpi"]
    clusters_job_info = []
    for clus in clusters:
        clusters_job_info.append(generate_squeue_process(clus))

    # Which cluster is it?
    loc = [loc for loc, item in enumerate(clusters_job_info)
           if arguments['<job_id>'] in item[0]]
    cluster = clusters[loc[0]]

    # Get the information from the job
    process = Popen(split("scontrol -M {} show job {}".format(cluster, arguments['<job_id>'])),
                    stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    split_out = out.replace('\n', ' ').split()
    partition = [x for x in split_out if 'Partition=' in x][0].split('=')[-1]
    time_limit = [x for x in split_out if 'TimeLimit=' in x][0].split('=')[-1]
    tres = [x for x in split_out if 'TRES=' in x][0].replace("TRES=", "")

    # Get the partition information
    process = Popen(split("scontrol -M {} show partition {}".format(cluster, partition)),
                    stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    split_out = out.replace('\n', ' ').split()
    billing_weights = [x for x in split_out if 'TRESBillingWeights=' in x][0].replace("TRESBillingWeights=", "")

    # Multiply out the tres with weights
    # -> hours * max(cpus * cpu_weight, mem * mem_weight, gpus * gpu_weight)

    # First, find time_limit in hours
    formatted_time = [float(x) for x in time_limit.replace("-", ":").split(":")]
    if len(formatted_time) == 4:
        hours = ((formatted_time[0] * 24.) + formatted_time[1] +
                 (formatted_time[2] / 60.) + (formatted_time[3] / (60. * 60.)))
    else:
        hours = (formatted_time[0] + (formatted_time[1] / 60.) +
                 (formatted_time[2] / (60. * 60.)))

    # Now create maps for weights and tres
    tres_map = tres_to_map(tres)
    billing_map = tres_to_map(billing_weights)

    # Remove the nodes from the tres_map, it is accounted for by TRES
    del tres_map["node"]

    # Convert everything to float, mem should be in GB
    tres_map = correct_map(tres_map)
    billing_map = correct_map(billing_map)

    # Multiply them using a union set
    mult = {k: tres_map.get(k, 0.0) * billing_map.get(k, 0.0) for k in
            set(tres_map) | set(billing_map)}
    print(hours * max(mult.values()))

except KeyboardInterrupt:
    exit('Interrupt detected! exiting...')
```
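
Usage is then just (with a made-up job id, assuming the script is executable
and on your PATH):

crc-job-sus.py 1234567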

It's not well tested by any means, but it is a decent starting point. Note
that people use multiple versions of Python on our cluster, so I have to wrap
the execution of Python (hence the first line). We are using MAX_TRES too, so
that print line should be what you are looking for; otherwise take
sum(mult.values()).

If someone has a Slurm approved way of doing this I would love to know.

Hope it helps,

Barry

On Thu, Sep 14, 2017 at 5:22 AM, Merlin Hartley <
merlin-sl...@mrc-mbu.cam.ac.uk> wrote:

> Greetings
>
> I recently implemented accounting and FairShare on our cluster but I’m not
> totally convinced it is working correctly - the users who are hogging the
> GPU machines seem to still only have a tiny fairshare value - even though I
> configured billing to count gpu as 160 cpus.
>
> Does anyone know of a command for estimating or calculating the ‘cost’ of
> a job? (per hour for example)
> I think this could be a really useful tool for our users - as well as for
> me to check that the accounting is working as expected!
>
> Here follows parts of our slurm.conf:
>
> 

[slurm-dev] why the env is the env of submit node, not the env of job running node.

2017-09-14 Thread Chaofeng Zhang
On node A, I submit a job file using the sbatch command, and the job runs on
node B. You will find that the output is not the env of node B; it is the env
of node A.

#!/bin/bash
#SBATCH --job-name=mnist10
#SBATCH --partition=compute
#SBATCH --workdir=/home/share
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
env


Jeff (ChaoFeng Zhang, 张超锋) PMP(r) 
zhang...@lenovo.com
HPC | Cloud Software Architect   (+86) - 18116117420
Software solution development(+8621) - 20590223
Shanghai, China



[slurm-dev] Re: How user can requeue an old job?

2017-09-14 Thread Benjamin Redling


On 14.09.2017 11:12, Merlin Hartley wrote:
> I wonder: what would be the ramifications of setting this to 0 in
> production? "A value of zero prevents any job record purging"

> Or is that option only really there for debugging?

(Just guessing) it should be horrible: once "MaxJobCount" (see the slurm.conf
help again) is reached, nobody will be able to submit any jobs?


BR,
BR
--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323


[slurm-dev] Accounting estimates/calculations

2017-09-14 Thread Merlin Hartley
Greetings

I recently implemented accounting and FairShare on our cluster but I’m not 
totally convinced it is working correctly - the users who are hogging the GPU 
machines seem to still only have a tiny fairshare value - even though I 
configured billing to count gpu as 160 cpus.

Does anyone know of a command for estimating or calculating the ‘cost’ of a 
job? (per hour for example)
I think this could be a really useful tool for our users - as well as for me to 
check that the accounting is working as expected!

Here follows parts of our slurm.conf:

PriorityType=priority/multifactor
PriorityDecayHalfLife=14-0
PriorityWeightFairshare=10
PriorityWeightAge=1000
PriorityWeightPartition=1
PriorityWeightJobSize=1000
PriorityWeightTRES=CPU=1000,Mem=2000,GRES/gpu=3000
PriorityMaxAge=1-0
PriorityFavorSmall=True
PriorityFlags=CALCULATE_RUNNING,SMALL_RELATIVE_TO_TIME,MAX_TRES
GresTypes=gpu
JobAcctGatherType=jobacct_gather/linux
AccountingStorageTRES=gres/gpu
NodeName=pascal[01-03] Sockets=2 CoresPerSocket=8  ThreadsPerCore=2 
RealMemory=232000 Gres=gpu:pascal:4
PartitionName=DEFAULT  DefaultTime=24:0:0 MaxTime=14-0:0:0 MaxNodes=4 
TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=160.0"
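
To make the intent of those weights concrete, my (untested) understanding of
MAX_TRES billing is that a job is charged for its largest weighted resource.
For a made-up job asking for 4 CPUs, 32G of memory and 1 GPU over 10 hours,
that works out roughly as:

# billing ~ hours * max(cpus*cpu_weight, mem_GB*mem_weight, gpus*gpu_weight)
awk 'BEGIN {
    cpu = 4 * 1.0     # CPU=1.0
    mem = 32 * 0.25   # Mem=0.25G
    gpu = 1 * 160.0   # GRES/gpu=160.0
    m = cpu; if (mem > m) m = mem; if (gpu > m) m = gpu
    print 10 * m      # 10 hours -> 1600
}'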

Many thanks for your time


Merlin
--
Merlin Hartley
Computer Officer
MRC Mitochondrial Biology Unit
Cambridge, CB2 0XY
United Kingdom



[slurm-dev] Re: How user can requeue an old job?

2017-09-14 Thread Benjamin Redling




On 14.09.2017 10:52, Taras Shapovalov wrote:

Hey guys!

As far as I know, there is a built-in 5-minute interval after a job finishes,
after which the job record is removed from Slurm "memory" (not from
accounting). This is fine until users need to requeue the job for some reason:
if 5 minutes have already passed, the requeue command no longer works.


[...]


Is there any way to extend this 5-minute period?


https://slurm.schedmd.com/slurm.conf.html

see MinJobAge
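
e.g., a sketch with a purely illustrative value:

# slurm.conf on the controller
MinJobAge=3600    # keep finished job records for an hour instead of the default 300 seconds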

BR,
BR

--
FSU Jena | JULIELab.de/Staff/Benjamin+Redling.html
☎ +49 3641 9 44323


[slurm-dev] How user can requeue an old job?

2017-09-14 Thread Taras Shapovalov
Hey guys!

As far as I know, there is a built-in 5-minute interval after a job finishes,
after which the job record is removed from Slurm "memory" (not from
accounting). This is fine until users need to requeue the job for some reason:
if 5 minutes have already passed, the requeue command no longer works.

[user1@demo3 ~]$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
   177  defq   run.shuser1 PD   0:00  1
(BeginTime)
[user1@demo3 ~]$ squeue
 JOBID PARTITION NAME USER ST   TIME  NODES
NODELIST(REASON)
[user1@demo3 ~]$
[user1@demo3 ~]$ scontrol requeue 117
Invalid job id specified for job 117
[user1@demo3 ~]$

Is there any way to extend this 5-minute period?


Best regards,

Taras


[slurm-dev] Re: Cores, CPUs, and threads: take 2

2017-09-14 Thread Lachlan Musicman
On 14 September 2017 at 11:06, Lachlan Musicman  wrote:

>
> I've just implemented the change from
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 State=UNKNOWN
>
> to
>
> NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1
> CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
>
> in our test env.
>
> The first thing I noticed was a debug message in the logs:
>
> Node configuration differs from hardware: CPUs=8:8(hw) Boards=1:1(hw)
> SocketsPerBoard=8:1(hw) CoresPerSocket=1:4(hw) ThreadsPerCore=1:2(hw)
>
>
> 1. Not necessarily a problem, could just be that it's a debug message
> 2. Am I misconfiguring my Nodes? Should the fields Sockets, CoresPerSocket
> and ThreadsPerCore be in the format x:y ?
>


Argh! I copied and pasted the wrong way around. Those two configs should be reversed:


from

NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 Sockets=1
CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN

to

NodeName=papr-res-compute[34-36] CPUs=8 RealMemory=31000 State=UNKNOWN

That is the change that led to the debug log message.

It's worth noting that before this change, cgroups couldn't get down to the
thread level. We would only consume at the core level - i.e. all jobs would
get an even number of CPUs; jobs that requested an odd number of CPUs
(threads) would be rounded up to the next even number.
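
A handy cross-check here, by the way, is to run slurmd -C on the compute node
itself - it prints what it detects in slurm.conf syntax, something like:

$ slurmd -C
NodeName=papr-res-compute34 CPUs=8 Boards=1 SocketsPerBoard=1 CoresPerSocket=4 ThreadsPerCore=2 RealMemory=31000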

Cheers
L.



--
"The antidote to apocalypticism is *apocalyptic civics*. Apocalyptic civics
is the insistence that we cannot ignore the truth, nor should we panic
about it. It is a shared consciousness that our institutions have failed
and our ecosystem is collapsing, yet we are still here — and we are
creative agents who can shape our destinies. Apocalyptic civics is the
conviction that the only way out is through, and the only way through is
together. "

*Greg Bloom* @greggish
https://twitter.com/greggish/status/873177525903609857