[slurm-dev] Re: srun to existing allocation, but just a specific node

2015-08-27 Thread John Hearns
srun --output=srunps.%N-job%j-%t.out --jobid=5102027 'ps -ef' So that it puts each node's ps in a separate file delineated by the hostname. Oddly, I can't seem to figure out how to pipe inside that call: issuing 'ps -ef' does the same as 'ps -ef | grep GEOS'. Matt, try using the pgrep
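
A minimal sketch of the two approaches discussed here, on the understanding that srun launches the command itself rather than passing it through a shell, so a pipe has to be wrapped in an explicit shell on the node. NODE and PATTERN are placeholders, not values from the thread, and newer Slurm releases may also need --overlap for a step inside a busy allocation.

```shell
#!/bin/sh
# JOBID is the job ID quoted in the post; NODE and PATTERN are placeholders.
JOBID=5102027
NODE=node001
PATTERN=GEOS

# Options for a one-task step pinned to a single node of the existing allocation.
step_args() {
    echo "--jobid=$JOBID --nodelist=$NODE --nodes=1 --ntasks=1"
}

if command -v srun >/dev/null 2>&1; then
    # No pipe needed: pgrep -f matches against the full command line.
    srun $(step_args) pgrep -l -f "$PATTERN"
    # A pipe must run inside a shell on the node; [G]EOS stops grep
    # from matching its own command line.
    srun $(step_args) sh -c "ps -ef | grep '[G]EOS'"
else
    echo "srun not found; step options would be: $(step_args)"
fi
```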

[slurm-dev] RE: Distribute M jobs on N nodes without duplication

2015-10-02 Thread John Hearns
So far I tried my hands with SRUN, SBATCH and SALLOC, and thought SBATCH will do what I am looking for. However, SBATCH starts with assigning the requested resource configuration but then runs every srun command on every node. For instance, if my script looks like: sbatch is the command

[slurm-dev] RE: Distribute M jobs on N nodes without duplication

2015-10-02 Thread John Hearns
I stand corrected. I find myself in a maze of twisty little passages, all alike. All the examples for SBATCH (in the SLURM manual) use 'SRUN' for execution of runs. There are a lot of other websites which give SBATCH examples and all of them use SRUN, unless using some version of MPI.
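
One common pattern for spreading M independent tasks over an allocation, sketched here as an assumption rather than as the answer given in the thread: commands written directly in the batch script run once, and each backgrounded srun starts a single-task job step. The node and task counts are illustrative, and on recent Slurm releases --exact may be the preferred option instead of --exclusive.

```shell
#!/bin/sh
#SBATCH --nodes=2
#SBATCH --ntasks=4

# One job step per task. The fallback to an empty launcher only exists so
# the sketch can be tried outside Slurm.
LAUNCH="srun --nodes=1 --ntasks=1 --exclusive"
command -v srun >/dev/null 2>&1 || LAUNCH=""

run_all() {
    for i in 1 2 3 4; do
        # '&' backgrounds the step so the next one can start immediately.
        $LAUNCH sh -c 'echo "task $0 ran on $(uname -n)"' "$i" &
    done
    wait    # block until every backgrounded step has finished
}

run_all
```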

[slurm-dev] Re: Problem in loading modules in slurm Batch script.

2015-12-04 Thread John Hearns
From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 04 December 2015 09:56 To: slurm-dev Subject: [slurm-dev] Re: Problem in loading modules in slurm Batch script. Hello Hezi, Have you tried making the shell for the batch script a login shell? #!/bin/bash -l I have not come across
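
A small sketch of the login-shell suggestion. The job name and the module name are hypothetical; the point is only that with -l on the interpreter line bash reads the profile scripts, which is usually where the 'module' function gets defined.

```shell
#!/bin/bash -l
#SBATCH --job-name=modtest     # hypothetical job name

# Report whether this shell is a login shell (shopt is bash-specific).
shell_kind() {
    if shopt -q login_shell 2>/dev/null; then echo login; else echo non-login; fi
}
echo "batch script is running in a $(shell_kind) shell"

# 'module' is normally a shell function set up by the profile scripts,
# so only call it when it exists in this shell.
if type module >/dev/null 2>&1; then
    module load gcc            # hypothetical module name
    module list
fi
```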

[slurm-dev] Re: Jobs stuck in CF state

2015-11-29 Thread John Hearns
Thankyou Werner. The compute nodes were all in idle~ state - I now know this means powered down, but the nodes were up and running. I restarted slurm completely, and things are OK now.

[slurm-dev] Jobs stuck in CF state

2015-11-27 Thread John Hearns
Yesterday I thought to do some investigations of the suspend and resume scripts on my in-house test cluster. As my Mum would have said, "See what thought done..." I have backed out of the changes to slurm.conf (or have I ... ??) I have restarted slurm on the head node and all compute nodes.

[slurm-dev] ReqNodeNotAvail - can't see all info

2016-05-26 Thread John Hearns
I am scheduling an HPCC job on a certain set of nodes using --nodelist. I am getting informed that a node is not available - but for the life of me I cannot expand the NODELIST(REASON) field to show it. JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

[slurm-dev] RE: ReqNodeNotAvail - can't see all info

2016-05-26 Thread John Hearns
. -Original Message- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 26 May 2016 09:21 To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] ReqNodeNotAvail - can't see all info I am scheduling an HPCC job on a certain set of nodes using --nodelist. I am getting in
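
One way to see the whole NODELIST(REASON) value, offered as a sketch since the actual answer is cut off in this preview: give squeue a format string in which %R has no width limit, or read the full job record with scontrol. The job ID 12345 is a placeholder.

```shell
#!/bin/sh
JOBID=${JOBID:-12345}    # placeholder job ID

# Job ID, state and the untruncated NODELIST(REASON) column for one job.
job_reason() {
    squeue --noheader --jobs="$1" --format="%i %t %R"
}

# Full job record, which includes the Reason= and ReqNodeList= fields.
job_detail() {
    scontrol show job "$1"
}

if command -v squeue >/dev/null 2>&1; then
    job_reason "$JOBID"
    job_detail "$JOBID"
    sinfo -R             # why nodes are down or drained
else
    echo "squeue not found; nothing to query"
fi
```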

[slurm-dev] Re: Increase size of running job/correcting incorrect resource allocations?

2016-06-01 Thread John Hearns
Plus from me too! I used the cpusets integration with PbsPro in my last job, and it was a godsend. This was on a large SMP machine, but same lessons apply to clusters. Applications get a defined set of CPUs - which they 'see' as being numbered from 0, and they get a defined amount of memory. If

[slurm-dev] RE: What cluster provisioning system do you use?

2016-03-15 Thread John Hearns
Bjorn, you should definitely be looking at Bright Cluster Manager. I set up a Bright cluster last week with CentOS 7.2 and slurm. Bright works right out of the box with slurm, and it is set up automatically as you provision the nodes. Also have the powersaving scripts etc all set up. Please

[slurm-dev] Re: What cluster provisioning system do you use?

2016-03-15 Thread John Hearns
I am currently setting up a test cluster and shall be looking at - Warewulf If you like Warewulf, you could look at OpenHPC, which uses Warewulf for the provisioning. The slurm version on my OpenHPC server is 15.08.6, and this came from the OpenHPC repositories.

[slurm-dev] RE: checkpoint/restart feature in SLURM

2016-03-19 Thread John Hearns
O I'll we k lo Sent from my Windows Phone From: Husen R Sent: ‎17/‎03/‎2016 05:56 To: slurm-dev Subject: [slurm-dev] checkpoint/restart feature in SLURM Dear Slurm-dev, Does checkpoint/restart feature

[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-12 Thread John Hearns
Thankyou for your help. It turned out I had to run an scontrol update state=RESUME on all nodes also to wake them up. I guess that is something I have to file away in my brain for the future! Thanks once again. From: John Hearns [john.hea...@xma.co.uk
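
A sketch of the wake-up step described above. The node range is a hypothetical name pattern; sinfo -R is included only to show why nodes are marked down before they are resumed.

```shell
#!/bin/sh
NODES=${NODES:-"comp[01-16]"}    # hypothetical node range

# Ask slurmctld to return the given nodes to service.
resume_nodes() {
    scontrol update NodeName="$1" State=RESUME
}

if command -v scontrol >/dev/null 2>&1; then
    sinfo -R                 # list down/drained nodes with their reasons
    resume_nodes "$NODES"
    sinfo -N -l              # per-node state after the update
else
    echo "scontrol not found; nothing done"
fi
```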

[slurm-dev] Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
I am working on an OpenHPC/Warewulf cluster. When I start the slurmd service on the compute nodes the systemctl sits there for a long time, then reports: Starting slurm (via systemctl): Job for slurm.service failed because a timeout was exceeded. See "systemctl status slurm.service" and

[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
done it this way." - Grace Hopper On 12 April 2016 at 13:04, John Hearns <john.hea...@xma.co.uk> wrote: I am working on an OpenHPC/Warewulf cluster. When I st

[slurm-dev] Re: Slurm service timeout - hints on diagnostics please?

2016-04-11 Thread John Hearns
L. -- The most dangerous phrase in the language is, "We've always done it this way." - Grace Hopper On 12 April 2016 at 13:04, John Hearns <john.hea...@xma.co.uk>

[slurm-dev] Time limit on compute nodes?

2016-05-24 Thread John Hearns
Last night I was running an Openfoam job which failed with a message about a time limit on comp13: Time = 23.2 DILUPBiCG: Solving for Ux, Initial residual = 0.00934715, Final residual = 3.76516e-05, No Iterations 3 DILUPBiCG: Solving for Uy, Initial residual = 0.00214164, Final residual =

[slurm-dev] RE: Time limit on compute nodes?

2016-05-24 Thread John Hearns
-A ds004 The job ran with account dc004 and used that walltime! You learn something new every day -Original Message- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 24 May 2016 07:26 To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] Time limit on compute nodes?

[slurm-dev] RE: Guide for begginers Admin to make prioriries

2016-05-18 Thread John Hearns
Free IPA? Damn. You mean identity management. Not free beer. Sent from my Windows Phone From: Simpson Lachlan Sent: ‎18/‎05/‎2016 01:41 To: slurm-dev Subject: [slurm-dev] RE: Guide for begginers

[slurm-dev] RE: NFSv4

2016-05-25 Thread John Hearns
They've been doing things like this at CERN for donkey's years - with the Andrew File System in the past. Look for Ticket Granting Tickets. Sorry - my memory is getting hazy. -Original Message- From: Mike Johnson [mailto:m.d.john...@durhamonline.org] Sent: 25 May 2016 12:22 To:

[slurm-dev] Re: Storage accounting, with web presentation

2016-07-28 Thread John Hearns
Christian is quite correct to flag up Robinhood, which will be the correct tool for you. However, if you want something you can implement today, which will give you a quick overview of storage use and offer a 'drill down' into each user's area, try agedu. I have used it in the past:

[slurm-dev] RE: SGI UV2000 with SLURM

2016-07-20 Thread John Hearns
As Carlos says. I don’t have direct experience on running slurm on a UV, but did run PbsPro with cpusets on a UV. I might be remembering this wrong, but part of the init script was to move the pbs daemon out of the bootcpuset. If you have root privileges you can move your own cpuset. I might

[slurm-dev] Re: queue routing

2016-07-20 Thread John Hearns
So plenty of scope then for seeing (*) Heisenbugs. I shall get my coat (*) Or not seeing, depending if the jobs are being run in a forest From: Christopher Samuel [sam...@unimelb.edu.au] Sent: 20 July 2016 01:23 To: slurm-dev Subject: [slurm-dev]

[slurm-dev] RE: Remote Visualization and Slurm

2016-08-17 Thread John Hearns
Nicholas, As you say there are several solutions out there. The one I have experience with is NICE Software, which I admit I integrated with PBS Pro. When looking at the code though there are the options to use with Slurm. Please send me an email off list and I can give more information.

[slurm-dev] RE: how to differentiate regular srun and srun with --pty

2017-02-06 Thread John Hearns
Bhanu Regarding recording commands etc. then have a look at this project which was presented at FOSDEM last weekend: https://fosdem.org/2017/schedule/event/ogrt/attachments/slides/1574/export/events/attachments/ogrt/slides/1574/005_ogrt.pdf https://github.com/georg-rath/ogrt This should do

[slurm-dev] Re: Slurm for render farm

2017-01-28 Thread John Hearns
, John Hearns wrote: > The concept at the moment is to run the Renderpal server, which is a Windows > application and it can detect the Linux render clients via a 'heartbeat' > mechanism. > I would spawn the Linux clients as needed via slurm. Sounds reasonable. > Thinking out loud, I

[slurm-dev] Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
Does anyone have a good suggestion for this problem? On a cluster I am implementing I noticed a user is running a code on 16 cores, on one of the login nodes, outside the batch system. What are the accepted techniques to combat this? Other than applying a LART, if you all know what this means.

[slurm-dev] Re: Stopping compute usage on login nodes

2017-02-09 Thread John Hearns
user: > > > $ echo "root * /" >> /etc/cgrules.conf > $ echo "* cpuset,memory users" >> /etc/cgrules.conf > > > Note also that the ''users'' cgroup defined above is inclusive of > **all** users (the * wildcard). So it is not a 4GB RA
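
The rules quoted above belong to the libcgroup tools (cgroup v1). As a rough sketch, this writes example cgconfig.conf and cgrules.conf files into a scratch directory so the layout can be inspected before anything is copied into /etc. The CPU range and the 4G limit are illustrative assumptions, not values confirmed by the thread.

```shell
#!/bin/sh
OUT=${OUT:-$(mktemp -d)}    # scratch directory, not /etc

# cgconfig.conf: defines the 'users' cgroup and its limits (values are examples).
cat > "$OUT/cgconfig.conf" <<'EOF'
group users {
    cpuset {
        cpuset.cpus = "0-3";
        cpuset.mems = "0";
    }
    memory {
        memory.limit_in_bytes = "4G";
    }
}
EOF

# cgrules.conf: root stays in the root cgroup, everyone else lands in 'users'.
cat > "$OUT/cgrules.conf" <<'EOF'
# <user>    <controllers>    <destination>
root        *                /
*           cpuset,memory    users
EOF

echo "wrote example files to $OUT"
```

Because the wildcard rule puts every non-root user into the same cgroup, the limits are shared by all of them together rather than applied per user.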

[slurm-dev] Abaqus with Slurm

2017-02-09 Thread John Hearns
I would guess quite a few sites are using Abaqus with Slurm. I would be grateful for some pointers on the submission scripts for MPI parallel Abaqus runs. I am setting up Abaqus version 6.14-1 on a system with Slurm 16.05 and an Omnipath interconnect. Specifically I am using this script to

[slurm-dev] Re: Abaqus with Slurm

2017-02-09 Thread John Hearns
inking it was poorly designed or implemented in Abaqus at that time too. Regards Sean On Thu, Feb 09, 2017 at 03:16:09AM -0800, John Hearns wrote: > I would guess quite a few sites are using Abaqus with Slurm. I would be > grateful for some pointers on the submission scripts for MPI parallel Abaqus

[slurm-dev] Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
Looks like there are others out there using slurm with sssd authentication, based on a quick mailing list search. Forgive me if I have not understood something here. On the cluster I am configuring at the moment, looking at the slurm daemon on the compute nodes it has the max locked memory

[slurm-dev] RE: Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
My answer is here maybe? https://slurm.schedmd.com/faq.html#memlock RTFM ? From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 16 February 2017 15:13 To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] Slurm with sssd - limits help please Looks like there are others out there

[slurm-dev] RE: Slurm with sssd - limits help please

2017-02-16 Thread John Hearns
should double check the pam config too: https://slurm.schedmd.com/faq.html#pam Thanks, Guy On 16 February 2017 at 15:17, John Hearns <john.hea...@xma.co.uk<mailto:john.hea...@xma.co.uk>> wrote: My answer is here maybe? https://slurm.schedmd.com/faq.html#memlock RTFM ?

[slurm-dev] Re: Abaqus with Slurm

2017-02-13 Thread John Hearns
ost_list}">>abaqus_v6.env Now to find the correct MPI incantation to get the darn thing running over Omnipath. I foresee meetings at a crossroads at midnight and selling my soul to a shadowy figure -----Original Message----- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent:

[slurm-dev] Slurm for render farm

2017-01-16 Thread John Hearns
Is anyone out there using Slurm in conjunction with Renderpal http://www.renderpal.com/ Forestalling the obvious replies... yes I know that a render farm manager and a scheduler do basically the same thing. In a rational universe I would be using one or 'tother.

[slurm-dev] RE: Slurm for render farm

2017-01-20 Thread John Hearns
! From: John Hearns Sent: 16 January 2017 18:01 To: slurm-dev Subject: Slurm for render farm Is anyone out there using Slurm in conjunction with Renderpal http://www.renderpal.com/ Forestalling the obvious replies... yes I know

[slurm-dev] Job temporary directory

2017-01-20 Thread John Hearns
As I remember, in SGE and in PbsPro a job has a directory created for it on the execution host which is a temporary directory, named with the jobid. You can define in the batch system configuration where the root of these directories is. On running srun env, the only TMPDIR I see is /tmp I know

[slurm-dev] RE: defining jobs slots

2016-08-16 Thread John Hearns
Adrian, forgive my asking but are you running this on a laptop 'natively' or using a virtual machine, e.g. on VirtualBox? It could be that if you have a VM it is set to have a different number of cores than your real laptop. I could be very very wrong here (I am working on a VirtualBox VM at

[slurm-dev] gres/mic unable to set OFFLOAD_DEVICES

2017-02-28 Thread John Hearns
Some pointers appreciated please. I suspect this is a common error message. Slurm version 16.05.8 In the slurmd logs on compute nodes I am seeing this: [2017-02-27T20:25:04.886] error: gres/mic unable to set OFFLOAD_DEVICES, no device files configured [2017-02-27T20:25:04.898] _run_prolog:

[slurm-dev] Re: Send notification email

2016-09-30 Thread John Hearns
Fanny, You are getting confused between the mail client – that is the program you use to send email as a user, and the mail server which the system uses to route the email to the recipient’s mail server. These are called the MUA (Mail User Agent) and MTA (Mail Transfer Agent) if I remember

[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
Fany, Many clusters have an internal network which is a private network. However the other interface on the cluster head node, which is normally called the 'external' interface, can have a real, proper IP address on your external network. It will therefore be able to send email. The

[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
efuses to let mails go out of my internal network. I think that's what is happening. Am I wrong? -Original Message- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: Wednesday, 5 October 2016 10:17 To: slurm-dev Subject: [slurm-dev] Re: Send notification email Fany, Many clusters

[slurm-dev] Re: Send notification email

2016-10-05 Thread John Hearns
com> Subject: [slurm-dev] Re: Send notification email Thanks anyway. All the best. Fany -Original Message- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: Wednesday, 5 October 2016 11:20 To: slurm-dev Subject: [slurm-dev] Re: Send notification email Fany, You are c

[slurm-dev] Re: Send notification email

2016-10-06 Thread John Hearns
lays=8869/0.01/0/0, dsn=4.4.1, status=deferred $ -----Original Message----- From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: Wednesday, 5 October 2016 11:33 To: slurm-dev Subject: [slurm-dev] Re: Send notification email Fany, Are you able to send us some of the lines from the /va

[slurm-dev] Re: cpu identifier

2016-09-14 Thread John Hearns
Andrealphus, You should be using cpusets. You allocate cores 1 and 2 (actually I think they count from 0) as the 'boot cpuset' and run the operating system processes in that. You then create a cpuset for each job. I have done this with PBSPro and it works very well.

[slurm-dev] Re: cpu identifier

2016-09-14 Thread John Hearns
Squee.(*) Just a note - for Mellanox IB users, there is the tuning guide which advises using interrupts on the CPU nearest the HBA. I guess it makes sense to eke out that last fraction of performance to make the reserved cores be local to the HBA. hwloc is your friend here. (*) and see my

[slurm-dev] Slurm web dashboards

2016-09-27 Thread John Hearns
Hello all. What are the thoughts on a Slurm 'dashboard'? The purpose being to display cluster status on a large screen monitor. I rather liked the look of this, based on dashing.io https://github.com/julcollas/dashing-slurm/blob/master/README.md Sadly dashing.io is not being supported, and

[slurm-dev] Re: Using slurm to control container images?

2016-11-16 Thread John Hearns
Lachlan, I am sure it has been mentioned on this thread, but look at Singularity http://singularity.lbl.gov/ From: Lachlan Musicman [mailto:data...@gmail.com] Sent: 16 November 2016 01:45 To: slurm-dev Subject: [slurm-dev] Re: Using slurm to control container images?

[slurm-dev] RE: Suggestions on node memory cleaning

2017-03-30 Thread John Hearns
I think this thread has the answer http://askubuntu.com/questions/609226/freeing-page-cache-using-echo-3-proc-sys-vm-drop-caches-doesnt-work echo 3 | sudo tee /proc/sys/vm/drop_caches From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 30 March 2017 17:11 To: slurm-dev <slurm-
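
A small sketch around the drop_caches command quoted above. It only defines the helper and reports the current page cache size; the write itself needs root, and calling it from a job epilog is an assumption about how one might use it rather than something stated in the thread.

```shell
#!/bin/sh
# Size of the page cache in kB, as reported by the kernel.
cached_kb() {
    awk '/^Cached:/ {print $2}' /proc/meminfo
}

# Flush dirty pages, then ask the kernel to drop clean caches.
# 1 = page cache, 2 = dentries and inodes, 3 = both. Needs root.
drop_caches() {
    sync
    echo 3 > /proc/sys/vm/drop_caches
}

if [ -r /proc/meminfo ]; then
    echo "Cached: $(cached_kb) kB"
fi
```

Run drop_caches as root and compare cached_kb before and after to see the effect.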

[slurm-dev] RE: Suggestions on node memory cleaning

2017-03-30 Thread John Hearns
Aha, follow this thread http://www.beowulf.org/pipermail/beowulf/2013-April/031407.html From: John Hearns [mailto:john.hea...@xma.co.uk] Sent: 30 March 2017 17:07 To: slurm-dev <slurm-dev@schedmd.com> Subject: [slurm-dev] RE: Suggestions on node memory cleaning Chad, I did rather

[slurm-dev] RE: Query about web front ends to slurm

2017-03-29 Thread John Hearns
Sean, I cannot say if this satisfies your requirements, however in the past I have worked with EnginFrame. A new version was recently released. It certainly does work with Slurm: https://www.nice-software.com/products/enginframe The LDAP (Active Directory) integration works as a user mapping as

[slurm-dev] Re: Fwd: Scheduling jobs according to the CPU load

2017-03-18 Thread John Hearns
even simplest scheduler and if I had such prior knowledge I would not invest so much time and effort to setup slurm. Best regards, Ketiw On Sat, Mar 18, 2017 at 5:42 PM, John Hearns <john.hea...@xma.co.uk<mailto:john.hea...@xma.co.uk>> wrote: Kesim, what you are saying i

[slurm-dev] RE: Fwd: job requeued in held state

2017-04-03 Thread John Hearns
Chris, can the user start an 'srun' session? From: Chris Woelkers - NOAA Affiliate [chris.woelk...@noaa.gov] Sent: 03 April 2017 20:31 To: slurm-dev Subject: [slurm-dev] Fwd: job requeued in held state I am running a small HPC, only 24 nodes, via slurm

[slurm-dev] RE: Does slurm work well with Supermicro KNL Phi boards?

2017-04-01 Thread John Hearns
Kenneth, I can't answer your question directly. However I have quite a lot of experience recently in using syscfg on Intel brand servers and motherboards, for an Omnipath cluster at a UK university and my own benchmarking cluster. I find syscfg to be an excellent tool. No more BIOS Settings

[slurm-dev] RE: Job-Specific Working Directory on Local Scratch

2017-03-13 Thread John Hearns
Stegfan, regarding the Prolog/Task Prolog option, David Lee Braun sent me a comprehensive reply on that one back in January. The answer is that you have to set the TMPDIR in a separate /etc/profile.d/slurm.sh Is. The Prolog creates the directory OK, but the TMPDIR variable is only set if a
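
A minimal sketch of what such a profile script might contain, assuming the Prolog has already created a per-job directory named after the job ID. SCRATCH_ROOT and the directory layout are assumptions made for illustration.

```shell
# /etc/profile.d/slurm.sh (sketch)
# SCRATCH_ROOT is a hypothetical location; the Prolog is assumed to create
# $SCRATCH_ROOT/$SLURM_JOB_ID before the job starts.
SCRATCH_ROOT=${SCRATCH_ROOT:-/tmp}

set_job_tmpdir() {
    # Only redirect TMPDIR inside a job, and only if the directory exists.
    if [ -n "${SLURM_JOB_ID:-}" ] && [ -d "$SCRATCH_ROOT/$SLURM_JOB_ID" ]; then
        TMPDIR="$SCRATCH_ROOT/$SLURM_JOB_ID"
        export TMPDIR
    fi
}

set_job_tmpdir
```

Outside a job SLURM_JOB_ID is unset, so ordinary logins keep their existing TMPDIR.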

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-15 Thread John Hearns
For the /proc/self you need to start an interactive job under Slurm. (I'm speaking from a PBSPro viewpoint here. What? What? Maud - release the dogs! Fetch my shotgun! Get off my property Sir!) On 15 August 2017 at 05:15, Lachlan Musicman wrote: > On 15 August 2017 at

[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
14/private_tmp.pdf > for a presentation about the plugin. > > > 2017-08-10 9:31 GMT+02:00 John Hearns <hear...@googlemail.com>: > >> I am sure someone discussed this topic on this list a few months ago... >> if it rings any bells please let me know. >> I am not discus

[slurm-dev] Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
I am sure someone discussed this topic on this list a few months ago... if it rings any bells please let me know. I am not discussing setting the TMPDIR environment variable and creating a new TMPDIR directory on a per job basis - though thankyou for the help I did get when discussing this.

[slurm-dev] Re: Per-job tmp directories and namespaces

2017-08-10 Thread John Hearns
We use the spank-private-tmp plugin developed at HPC2N in Sweden: > > https://github.com/hpc2n/spank-private-tmp > > > > See also: https://slurm.schedmd.com/SUG14/private_tmp.pdf > for a presentation about the plugin. > > > > > 2017-08-10 9:31 GMT+02:

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread John Hearns
am wrong forgive me) Also top is your friend here. And more usefully 'htop' Just look at top with the flag to show threads -H and 'j' to show last used cpu On 14 August 2017 at 08:12, John Hearns <hear...@googlemail.com> wrote: > Lachlan, forgive me if I am teaching granny to

[slurm-dev] Re: CGroups, Threads as CPUs, TaskPlugins

2017-08-14 Thread John Hearns
Lachlan, forgive me if I am teaching granny to suck eggs... I have recently been working with cgroups. If you run an interactive job what do you see when you cat /proc/self/cgroup ? Also have you explored in /sys/fs/cgroup and checked what resources are in the cgroups which a job has? On 14 August
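
The checks suggested above can be bundled into a short script to run inside an interactive job. This is a sketch: it only reads standard Linux /proc files, so the output depends on how cgroups are mounted on the node.

```shell
#!/bin/sh
# Show which cgroups the current process belongs to.
show_cgroups() {
    if [ -r /proc/self/cgroup ]; then
        cat /proc/self/cgroup
    else
        echo "no /proc/self/cgroup on this system"
    fi
}

# Show which CPUs the current process is allowed to run on.
show_allowed_cpus() {
    if [ -r /proc/self/status ]; then
        grep Cpus_allowed_list /proc/self/status
    fi
}

show_cgroups
show_allowed_cpus
```

Comparing the output from a login shell with the output from inside a job shows whether the job is actually being confined.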

[slurm-dev] Re: Slurm and Environments and aliases

2017-08-16 Thread John Hearns
Lachlan, I will have to check when I get into work in the morning. I am also sorry if I led you down the wrong path here, however this does feel to be an issue of login versus non-login shells. Try a bash -l (dash lower case L) I am sure the login/non-login thing has been discussed on here

[slurm-dev] RE: [Non-DoD Source] Re: General Post-Processing Question (UNCLASSIFIED)

2017-07-20 Thread John Hearns
Anthony, I back up what Peter says. I had a project recently where we had a render farm deployed with Slurm. There were 'data mover' jobs needed which ran once a render was complete and we used job dependencies for these. I guess though from what you say that you will have to monitor how long the
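
The dependency approach can be sketched as below. The script names are hypothetical; --parsable makes sbatch print just the job ID (optionally followed by ';cluster'), and afterok releases the second job only if the first one finishes successfully.

```shell
#!/bin/sh
# Submit a render job, then a data mover job that waits for it.
# $1 and $2 are batch script paths (hypothetical names in the usage below).
submit_chain() {
    render_id=$(sbatch --parsable "$1") || return 1
    render_id=${render_id%%;*}        # strip an optional ';cluster' suffix
    sbatch --parsable --dependency="afterok:$render_id" "$2"
}

# Usage (prints the job ID of the mover job):
#   submit_chain render.sh mover.sh
```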

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
> After Installing nmap, it let me realize that some ports were blocked even > with firewall daemon stopped and disabled. Turned out that iptables was on > and enabled. After stopping iptables everything work just fine. > > > > Best Regards, > > > Said. > -

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
services are running on the compute node when the controller says >> it's down >> - TCP connections are not being dropped >> - Ports are accessible that are to be used for communication, >> specifically response ports >> - Check the routing rules if any >> - Clocks are synced

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-05 Thread John Hearns
Said, a problem like this always has a simple cause. We share your frustration, and several people here have offered help. So please do not get discouraged. We have all been in your situation! The only way to handle problems like this is a) start at the beginning and read the manuals and

[slurm-dev] Re: RebootProgram - who uses it?

2017-08-07 Thread John Hearns
Lachlan, in the Name of the Wee Man, so 'reboot' is now a 'legacy tool' https://access.redhat.com/solutions/1580343 Jeez... Look HPC compute node - I'm in charge, gottit? Yeah, fight back all you like with systemd, but I can pull the power plug. Let's see you deal with that one. On 7 August 2017

[slurm-dev] Re: How to get pids of a job

2017-05-11 Thread John Hearns
A good tool to use on the nodes when you have the list of nodes is 'pgrep' https://linux.die.net/man/1/pgrep On 11 May 2017 at 15:44, Jason Bacon wrote: > > > Parse the node names from squeue output (-o can help if you want to > automate this) and then run ps or top on
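
A sketch of automating the steps described above: expand the job's node list, then look at processes on each node. Using scontrol listpids instead of pgrep is a swapped-in alternative, and the ssh loop in the usage comment assumes passwordless ssh to the compute nodes.

```shell
#!/bin/sh
# Print one hostname per line for the nodes allocated to job $1.
job_nodes() {
    nodelist=$(squeue --noheader --jobs="$1" --format=%N)
    [ -n "$nodelist" ] && scontrol show hostnames "$nodelist"
}

# PIDs that slurmd on the local node has recorded for job $1.
job_pids_here() {
    scontrol listpids "$1"
}

# Usage (assumes ssh access to the nodes):
#   for n in $(job_nodes 12345); do ssh "$n" scontrol listpids 12345; done
```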

[slurm-dev] Re: Multinode setup trouble

2017-05-17 Thread John Hearns
Ben, a stupid question, however - have you installed and configured Munge authentication on the slave node? On 17 May 2017 at 02:59, Ben Mann wrote: > Hello Slurm dev, > > I just set up a small test cluster on two Ubuntu 14.04 machines, installed > SLURM 17.02 from source. I

[slurm-dev] Re: Issue to startup slurm daemon on Compute nodes

2017-05-09 Thread John Hearns
Following on from Maik's response, it would be worth mentioning the compat-glibc package for CentOS https://centos-packages.com/7/package/compat-glibc/ https://www.centos.org/forums/viewtopic.php?t=22250 Big get out of jail card - I have never built any version of Slurm on a CentOS 7 system using

[slurm-dev] Re: Is there anyway to commit job with different user?

2017-05-16 Thread John Hearns
Sun, as the others have responded, you should make sure your userids are the same across the cluster. You really must put in the effort to do that. However - SGE does have a usermapping feature https://linux.die.net/man/5/sge_usermapping I do not know if there is something similar in Slurm.

[slurm-dev] Re: Launching a VMWare Virtual Machine

2017-06-02 Thread John Hearns
Sean, this sounds like the difference between interactive and non-interactive shells. When you log in directly to the node, you have an interactive shell and the environment is set up, and /etc/profile.d scripts are sourced. Someone will be along in a minute with the correct answer, however try

[slurm-dev] Re: Question about default mail path command

2017-06-14 Thread John Hearns
Bas, you should be able to set that value in slurm.conf when you install/customise your Slurm setup MailProg=/usr/bin/mail On 14 June 2017 at 15:24, Bas van der Vlies wrote: > > I am just starting with slurm and notice that the default mail path > command is
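
After editing slurm.conf and reconfiguring, the value the controller is actually using can be checked. A small sketch, assuming scontrol is on the PATH:

```shell
#!/bin/sh
# Print the MailProg value reported by the running configuration.
mailprog_setting() {
    scontrol show config | awk -F' *= *' '$1 == "MailProg" {print $2}'
}

if command -v scontrol >/dev/null 2>&1; then
    echo "MailProg is: $(mailprog_setting)"
else
    echo "scontrol not found; cannot query the configuration"
fi
```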

[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread John Hearns
Ole, a small ask. Is it possible to put the 'pestat' utility for Slurm and for PBS on a site which uses http? The reason is many (most ?) corporate networks block ftp access. Thankyou On 3 May 2017 at 09:06, Ole Holm Nielsen wrote: > > I'm announcing an updated

[slurm-dev] Re: Announce: Node status tool "pestat" for Slurm updated to version 0.30

2017-05-03 Thread John Hearns
ing the FTP files via HTTP also, please see: > > http://ftp.fysik.dtu.dk/Slurm/ > > Does that work for you? > > /Ole > > On 05/03/2017 12:02 PM, John Hearns wrote: > >> Ole, >> a small ask. Is it possible to put the 'pestat' utility for Slurm and >> for PBS o

[slurm-dev] RE: Two different GPUs in a compute node (my own answer)

2017-05-04 Thread John Hearns
Daniel, I think that you do not need the CPUs= at all. Also look at specifying the use of cgroups. Then when you run a job and request one GPU, that GPU will be made available to you as CUDA_VISIBLE_DEVICES. The other GPU will not be available to you - but can be used by another batch job. On

[slurm-dev] Re: Communication error

2017-05-08 Thread John Hearns
Jason, note that compute-2018 is in IDLE* status - which means that it is not reachable. As Felip suggests, log into that compute node and tail -f /var/log/slurmd.log I would also suggest on your master node running an scontrol to set that node as DRAIN then RESUME, then log into the node and

[slurm-dev] Re: job allocation lag

2017-10-11 Thread John Hearns
Vladimir, in cases where you have a 'hairs on the back of your neck' feeling it is often the case that these indicate something real. However, you do have to be scientific about this. If you think that uptime is an influence, you have to record job startup times each hour, and plot these. Be

[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, are you able to tell us more about this scratch filesystem? You could arrange that the compute nodes mount it directly, so you get the performance you need. This can be achieved by putting a routing node onto the cluster network. Or you could route through the cluster head node. Also

[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, thankyou. That makes sense. I guess that there is an NFS re-export of the scratch filesystem. I know this is not an answer to the problem at the moment, but maybe you should look at BeeOND (BeeGFS On Demand) for the future. https://www.beegfs.io/wiki/BeeOND With the disclaimer that I have not

[slurm-dev] Re: MPI-Jobs on cluster - how to set batchhost

2017-09-28 Thread John Hearns
Brigitte, I understand what you are trying to achieve. But may I ask - is there local storage on your compute nodes? You could run a job where the results are written to local storage, then transferred to your scratch filesystem at the end of the job. It is normal on HPC clusters to have the

[slurm-dev] Re: Interaction between cgroups and NFS

2017-09-03 Thread John Hearns
No, have never seen anything similar. A small bit of help - the 'nfswatch' utility is useful for tracking down NFS problems. Less relevant, but on a system which is running low on memory 'watch cat /proc/meminfo' is often good for shining a light. On 2 September 2017 at 00:16, Brendan Moloney

[slurm-dev] Re: Selecting a network interface with srun

2017-10-25 Thread John Hearns
ent purposes - e.g., mpirun! A quick grep of the > mailing list logs will reveal all the woes that created. > > On Oct 25, 2017, at 8:22 AM, John Hearns <hear...@googlemail.com> wrote: > > When using “mpirun” we can specify “-iface ib0” - this is true, and the > exact synt

[slurm-dev] RE: Selecting a network interface with srun

2017-10-25 Thread John Hearns
When using “mpirun” we can specify “-iface ib0” - this is true, and the exact syntax depends on your MPI of choice, as noted above. However, don't get confused between IPoIB and Infiniband itself. IPoIB is of course sending IP traffic over Infiniband. An Infiniband network can perfectly