Re: [slurm-users] Limit nodes of a partition without managing users

2020-08-17 Thread Brian Andrus
, the devil is in the details on how to define/get what you want. Brian Andrus On 8/17/2020 10:13 AM, Gerhard Strangar wrote: Hello, I'm wondering if it's possible to have slurm 19 run two partitions (low and high prio) that share all the nodes and limit the high prio partition in number of nodes

Re: [slurm-users] is there a way to delay the scheduling.

2020-08-28 Thread Brian Andrus
to schedule them in that fashion outweighs the resources needed by far. Brian Andrus On 8/28/2020 3:30 AM, navin srivastava wrote: Hi Team, facing one issue. several users submitting 2 job in a single batch job which is very short jobs( says 1-2 sec). so while submitting more job slurmctld

Re: [slurm-users] Alternatives for MailProg

2020-08-28 Thread Brian Andrus
That is where you have it call a bash script and within the script you do as needed. Like Ahmet's suggested script. So use his as a template and add the headers you desire. Brian Andrus On 8/28/2020 11:36 AM, Chris Samuel wrote: On 8/27/20 3:42 pm, Brian Andrus wrote: Actually, you can add

Re: [slurm-users] Alternatives for MailProg

2020-08-27 Thread Brian Andrus
Actually, you can add headers of all kinds: Quick search of "sendmail add headers" discovers: https://serverfault.com/questions/347602/sending-e-mail-from-sendmail-with-headers Brian Andrus On 8/26/2020 10:02 PM, Andrew Elwell wrote: Hi folks, I'm getting fed up receiving out

Re: [slurm-users] [EXT] Slurmd problem on client

2020-08-24 Thread Brian Andrus
IIRC, that is because it is trying to do the 'configless' feature of slurm 20 where it uses DNS entries to find the config. This will happen if /etc/slurm.conf does not exist on the node. Check that you have that and that it is the same as the one on the master. Brian Andrus On 8/24/2020 7

Re: [slurm-users] error: user not found

2020-09-29 Thread Brian Andrus
een places where that can take 24 hours. Brian Andrus On 9/29/2020 6:18 AM, Diego Zuccato wrote: Hello all. One of the users is unable to submit jobs to our cluster. The first time he tries, he gets $ sbatch test.job sbatch: fatal: Invalid user id: 621049927 then: $ sbatch test.job sbatch: er

Re: [slurm-users] Quickly throttling/limiting a specific user's jobs

2020-09-22 Thread Brian Andrus
on the node waiting to be resumed, but the node resources may get assigned to other jobs while they wait to resume. Brian Andrus On 9/22/2020 2:33 PM, Ransom, Geoffrey M. wrote: Hello    We had a user post a large number of array jobs with a short actual run time (20-80 seconds, but mostly

Re: [slurm-users] RAM "overbooking"

2020-05-27 Thread Brian Andrus
Heh. That is the on-going "user education". You could change the amount of ram requested using a job_submit lua script, but that could bite those that are accurate with their requests. Or set a max ram for the partition. Brian Andrus On 5/27/2020 3:46 PM, Marcelo Z. Silva wrote:

Re: [slurm-users] SLES 15 rpmbuild from 20.02.5 tarball wants munge-libs: system munge RPMs don't provide it

2020-10-20 Thread Brian Andrus
packages. Source control for me is just that spec file. Brian Andrus On 10/20/2020 8:46 AM, Michael Jennings wrote: On Tuesday, 20 October 2020, at 15:49:25 (+0800), Kevin Buckley wrote: On 2020/10/20 11:50, Christopher Samuel wrote: I forgot I do have access to a SLES15 SP1 system, that has

Re: [slurm-users] Slurm MySQL database configuration

2020-07-21 Thread Brian Andrus
slurm daemons going down. Brian Andrus On 7/21/2020 7:44 AM, Peter Mayes wrote: Hi, My first post to the list, so apologies if this is a FAQ, My configuration has two nodes allocated for Slurm masters, with a highly-available NFS server mounting a filesystem across the two nodes. I need advice

Re: [slurm-users] Internet connection loss with srun to a node

2020-08-02 Thread Brian Andrus
This is very likely by design of the cluster and/or network. Otherwise users could use the cluster to mine bitcoin and such. Brian Andrus On 8/2/2020 7:11 AM, Mahmood Naderan wrote: I thought that maybe srun doesn't transfer all settings from the head node to the compute node. The wget

Re: [slurm-users] changes in slurm.

2020-07-09 Thread Brian Andrus
, the partition is used to determine which node(s) and filter/order jobs. You should add the node to the new partition, but also leave it in the 'test' partition. If you are looking to remove the 'test' partition, set it to down and once all the running jobs that are in it finish, then remove it. Brian

Re: [slurm-users] Advice for merging accounting data

2020-07-08 Thread Brian Andrus
you set that in the slurm.conf to continue the numbering from where you left off so there are no entries in accounting that get replaced. Brian Andrus On 7/8/2020 3:15 AM, Simon Kainz wrote: Hello, we have a long-running slurm cluster, accounting into slurmdbd/mysql backend on the cluster

Re: [slurm-users] ssh-keys on compute nodes?

2020-06-19 Thread Brian Andrus
thentication <https://en.wikibooks.org/wiki/OpenSSH/Cookbook/Host-based_Authentication> because normal users have no business on those servers! Brian Andrus On 6/17/2020 1:26 AM, Ole Holm Nielsen wrote: On 6/9/20 5:45 PM, Michael Jennings wrote: On Tuesday, 09 June 2020, at 12:43:34

Re: [slurm-users] Slurm and shared file systems

2020-06-19 Thread Brian Andrus
them outside the cluster. Brian Andrus On 6/19/2020 5:04 AM, David Baker wrote: Hello, We are currently helping a research group to set up their own Slurm cluster. They have asked a very interesting question about Slurm and file systems. That is, they are posing the question -- do you need

[slurm-users] configless DNS entries

2020-06-09 Thread Brian Andrus
/configless_slurm.html Brian Andrus

Re: [slurm-users] GUI application crash on first allocation, but runs fine on second allocation

2020-06-09 Thread Brian Andrus
Sounds like a race condition where slurmd is starting before the node is truly ready. You can try adding dependencies for slurmd so it will not start until some other needed service is running. The benefits of systemd :) Brian Andrus On 6/9/2020 10:53 AM, Dumont, Joey wrote: Hi, I

Re: [slurm-users] Problem with permisions. CentOS 7.8

2020-06-02 Thread Brian Andrus
are running. slurmd should be running as root. It needs to be able to do a few things including run the job as the user that submitted it. Things that only root should be doing. Brian Andrus On 6/2/2020 2:00 PM, Ferran Planas Padros wrote: Hi Ole, I run the same version of slurm in all

[slurm-users] Jenkins integration

2020-07-24 Thread Brian Andrus
root could be quite useful. Especially for service accounts. Yes, there can be a workaround using sudo, but it seems better if we could track things in slurm to know a job was run 'on behalf of' another user. Thoughts, suggestions, current approaches? Thanks, Brian Andrus

Re: [slurm-users] Running two multiprocessing jobs in one sbatch

2020-07-25 Thread Brian Andrus
Is there a reason to run them as a single job? It may be easier to just have 2 separate jobs of 16 cores each. If there are dependency requirements, that is addressed by adding any dependencies to the job submission. Brian Andrus On 7/25/2020 2:50 AM, Даниил Вахрамеев wrote: Hi everyone

[slurm-users] know time limit from inside job

2020-07-27 Thread Brian Andrus
calls as too many of them can tip a system over. Brian Andrus

Re: [slurm-users] know time limit from inside job

2020-07-27 Thread Brian Andrus
lua, if I may ask? Brian Andrus On 7/27/2020 9:52 AM, Baer, Troy wrote: There's an outstanding feature request for that: https://bugs.schedmd.com/show_bug.cgi?id=8383 While waiting on that, we've taken to injecting it into the job's environment ourselves in the Lua submit filter. --Troy

Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Brian Andrus
You are trying to use sbatch with the "--uid" option which is only allowed by root. Either run sbatch as the user doing the request (which should be the same user that is running rstudio) or use 'sudo -u ' to run sbatch. Brian Andrus On 7/20/2020 7:50 AM, Sidhu, Khushwant wrote:

Re: [slurm-users] slurm & rstudio

2020-07-20 Thread Brian Andrus
Ah, They are assuming you are running the web interface as root. If your environment is secure enough, you can do that. Or, grant your web server user privileges in slurm to be allowed to use the "--uid" option. Brian Andrus On 7/20/2020 8:39 AM, Sidhu, Khushwant wrote: H

Re: [slurm-users] Trouble installing slurm-20.02.4-1.amzn2.x86_64 libnvidia-ml.so.1

2020-12-05 Thread Brian Andrus
That package looks to be built for a system with an nvidia gpu installed. Look for (or build) different packages if you are not going to use a gpu-based node. Brian Andrus On 12/4/2020 11:32 AM, Mullen, Drew wrote: Howdy Im getting this error installing slurm 20.02.4: Error: Package

[slurm-users] MinJobAge

2020-11-23 Thread Brian Andrus
in a completed state for a period of time, but they are not showing up at all on our cluster. How does one have jobs show up that are completed? Brian Andrus

Re: [slurm-users] Burst to AWS cloud

2020-12-15 Thread Brian Andrus
over a direct-connect or VPN. Brian Andrus On 12/15/2020 12:02 PM, Sajesh Singh wrote: We are currently investigating the use of the cloud scheduling features within an on-site Slurm installation and was wondering if anyone had any experiences that they wish to share of trying to use

Re: [slurm-users] slurmctld daemon error

2020-12-14 Thread Brian Andrus
Check your hosts file and ensure 'localhost' does not have an IPV6 address associated with it. Brian Andrus On 12/14/2020 4:19 PM, Alpha Experiment wrote: Hi, I am trying to run slurm on Fedora 33. Upon boot the slurmd daemon is running correctly; however the slurmctld daemon always errors
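One way to carry out the hosts-file check above is a small shell helper. This is only a sketch: the function name `ipv6_localhost_lines` and the sample file are made up for illustration, and it treats any first field containing a colon as an IPv6 address.

```shell
#!/bin/sh
# Print the lines of a hosts file that map an IPv6 address to "localhost".
# Usage: ipv6_localhost_lines [hosts-file]   (defaults to /etc/hosts)
ipv6_localhost_lines() {
    # Strip comments, then keep lines whose first field contains ':'
    # (assumed to be IPv6) and that list "localhost" among the names.
    sed 's/#.*//' "${1:-/etc/hosts}" |
        awk '$1 ~ /:/ { for (i = 2; i <= NF; i++) if ($i == "localhost") { print; break } }'
}

# Try it on a sample file rather than the live /etc/hosts.
sample=$(mktemp)
printf '%s\n' '127.0.0.1 localhost' '::1 localhost ip6-localhost' > "$sample"
ipv6_localhost_lines "$sample"
```

Any line it prints is a candidate to edit so that 'localhost' resolves to the IPv4 address only.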

Re: [slurm-users] Using hyperthreaded processors

2020-11-04 Thread Brian Andrus
to more fetches, wasting effort. This is a VERY simplistic description, but the point is that hyperthreading is not a silver bullet that will improve HPC performance if you are maximizing your resource utilization. Ok, I will get off my soapbox :) Brian Andrus On 11/4/2020 7:30 AM, Jean

Re: [slurm-users] cpu core exclusion?

2021-01-20 Thread Brian Andrus
We would need more information. At a minimum, what client is it? As this is not a slurm issue, you would need to dig into what is causing that behavior with your storage system. Brian Andrus On 1/20/2021 10:53 AM, John McCulloch wrote: Our shared storage client daemon is utilizing 100

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-25 Thread Brian Andrus
customers for Tim to keep things running as well as he has. I'm pretty sure most folks that use slurm for any period of time have received more value than a small support contract would be. Brian Andrus On 1/25/2021 7:35 AM, Jeffrey T Frey wrote: ...I would say having SLURM rpms in EPEL could be very

Re: [slurm-users] Cluster nodes on multiple cluster networks

2021-01-22 Thread Brian Andrus
You would need to have a direct connect/vpn so the cloud nodes can connect to your head node. Brian Andrus On 1/22/2021 10:37 AM, Sajesh Singh wrote: We are looking at rolling out cloud bursting to our on-prem Slurm cluster and I am wondering how to deal with the slurm.conf variable

Re: [slurm-users] Parent account in AllowAccounts

2021-01-15 Thread Brian Andrus
mean their child can :) Brian Andrus On 1/15/2021 6:38 AM, Durai Arasan wrote: Hi, As you know for each partition you can specify AllowAccounts=account1,account2... I have a parent account say "parent1" with two child accounts "child1" and "child2" I expected that

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-01-27 Thread Brian Andrus
have been able to deploy completely to cloud using only slurm. It has the ability to integrate into any cloud cli, so nothing else has been needed. Just for the heck of it, I am thinking of integrating it into Terraform, although not necessary. Brian Andrus On 1/26/2021 11:48 AM, Robert Kudyba

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
Ahh. On one of the new nodes do: slurmd -C The output of that will tell you what those settings should be. I suspect they are off, which forces them into drain mode. Brian Andrus On 1/28/2021 12:25 PM, Chandler wrote: Andy Riebs wrote on 1/28/21 07:53: If the only changes to your system

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
it, slurm assumes all memory on the node for the job. So, even if you are only using 1 cpu, all the memory is allocated, leaving none for any other job to run on the unallocated cpus. Brian Andrus On 1/28/2021 2:15 PM, Chandler wrote: Brian Andrus wrote on 1/28/21 13:59: What

Re: [slurm-users] [EXT]Re: only 1 job running

2021-01-28 Thread Brian Andrus
You are getting close :) You can see why n010 is able to have multiple jobs. It shows more resources available. What are the specific requests for resources from a job? Nodes, Cores, Memory, threads, etc? Brian Andrus On 1/28/2021 12:52 PM, Chandler wrote: OK I'm getting this same output

Re: [slurm-users] Using "Environment Modules"

2021-01-26 Thread Brian Andrus
The net effect is that the environment gets setup the same as if the user had opened a shell console. Brian Andrus On 1/26/2021 2:13 AM, Gestió Servidors wrote: Hi, My environment is this: * Users are using “bash” as the default shell * A sample of one of my environment modules

Re: [slurm-users] only 1 job running

2021-01-28 Thread Brian Andrus
Heh. Your nodes are drained. do: scontrol update state=resume nodename=n[011-013] If they go back into a drained state, you need to look into why. That will be in the slurmctld log. You can also see it with 'sinfo -R' Brian Andrus On 1/27/2021 10:18 PM, Chandler wrote: Made a little bit

Re: [slurm-users] Exclude Slurm packages from the EPEL yum repository

2021-02-03 Thread Brian Andrus
they can do a thing doesn't mean they should do a thing. There are many ways to achieve what is desired, most of which do not require anyone other than the system admin. If your issue can be solved without affecting others, leave them alone and fix your issue. Brian Andrus

Re: [slurm-users] Slurmrestd unspecified errors.

2021-06-14 Thread Brian Andrus
Using v20.11.7 I have 8081 because that is the port I am running slurmrestd on. How are you starting slurmrestd? If you are using systemd and have the service file, look inside it. Brian Andrus On 6/14/2021 9:48 AM, Heitor wrote: On Mon, 14 Jun 2021 08:30:51 -0700 Brian Andrus wrote

Re: [slurm-users] Slurmrestd unspecified errors.

2021-06-14 Thread Brian Andrus
You don't use the prefix. This works for me on the node running slurmrestd on port 8081: user=someuser; curl --header "X-SLURM-USER-NAME: ${user}" --header "X-SLURM-USER-TOKEN: $(sudo scontrol token username=${user}|cut -d'=' -f2-)" http://localhost:8081/slurm/v0.0.36
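The token-extraction step in that command can be tried on its own. The sketch below assumes `scontrol token` prints a single line of the form `SLURM_JWT=<token>`; the helper name `extract_token` and the token value are made up for illustration.

```shell
#!/bin/sh
# Read a line of the form "SLURM_JWT=<token>" and print only the token.
# -f2- keeps everything after the first '=', so any '=' characters inside
# the token itself are preserved.
extract_token() {
    cut -d= -f2-
}

# Made-up token; on a real cluster the input would come from
# `scontrol token username=...` instead of printf.
printf 'SLURM_JWT=abc.def==\n' | extract_token
```

The printed value is what would go into the X-SLURM-USER-TOKEN header.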

Re: [slurm-users] Slurmrestd unspecified errors.

2021-06-14 Thread Brian Andrus
No problem. You may want to set your variables in your /etc/sysconfig/slurmrestd file. That is where you can set that variable along with others (SLURMRESTD_LISTEN, SLURMRESTD_DEBUG, SLURMRESTD_OPTIONS) and your service file will pick them up. Brian Andrus On 6/14/2021 12:05 PM, Heitor

Re: [slurm-users] Slurm interactive job not populating all groups

2021-05-10 Thread Brian Andrus
Ah. You should put files first. Otherwise, if it finds an entry in SSS, that takes precedence and the local groups/users will not be seen. Brian Andrus On 5/10/2021 1:09 PM, Russell Jones wrote: Thanks! No, we are not. The compute nodes are also properly configured in /etc/nsswitch.conf
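A quick way to check the lookup order on a node is sketched below. The helper name `files_before_sss` and the sample file are hypothetical; the function only compares the position of "files" and "sss" on one database line and does not interpret any other nsswitch syntax.

```shell
#!/bin/sh
# Succeed if "files" is listed before "sss" (or "sss" is absent) for a
# database in an nsswitch-style file.
# Usage: files_before_sss <database> [file]   (file defaults to /etc/nsswitch.conf)
files_before_sss() {
    awk -v db="$1:" '
        $1 == db {
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f) f = i   # position of "files"
                if ($i == "sss" && !s) s = i     # position of "sss"
            }
            exit                                 # jump to END after the first match
        }
        END { exit !(f && (!s || f < s)) }
    ' "${2:-/etc/nsswitch.conf}"
}

# Sample file standing in for /etc/nsswitch.conf.
conf=$(mktemp)
printf '%s\n' 'passwd: files sss' 'group: sss files' > "$conf"
if files_before_sss group "$conf"; then
    echo "group: files is consulted first"
else
    echo "group: sss is consulted before files"
fi
```

With the sample file, the group line lists sss first, so the second message is printed.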

Re: [slurm-users] job_submit_lua improvement

2021-05-10 Thread Brian Andrus
As a solution,  I recommend you leverage the ".forward" file You can put anything you want in there and that is where it will go if the user doesn't specify an email. Brian Andrus On 5/10/2021 1:38 PM, Luke Yeager wrote: Contributions are usually handled through Bugzilla. He

Re: [slurm-users] Running vnc after srun fails but works after a direct ssh

2021-05-15 Thread Brian Andrus
er logged out." exit Simplified, but works well. We can do additional tasks once they start the vncserver (eg stage data) and once they log out (clean up files). Brian Andrus On 5/15/2021 5:02 AM, Jeremy Fix wrote: Hello ! I'm facing a weird issue. With one user, call it gpupro_u

[slurm-users] Purge deleted accounts from slurmdbd

2021-05-12 Thread Brian Andrus
purged, but the account records stay and build up over time. Brian Andrus

Re: [slurm-users] One node, two partitions (gpu and cpu), can SLURM map cpu cores well?

2021-05-08 Thread Brian Andrus
limit what is allowed to be requested in the partition definition and/or a QOS (if you are using accounting). Brian Andrus On 5/7/2021 8:11 PM, Cristóbal Navarro wrote: Hi community, I am unable to tell if SLURM is handling the following situation efficiently in terms of CPU affinities at each

Re: [slurm-users] Slurm interactive job not populating all groups

2021-05-10 Thread Brian Andrus
tml>for more information. That could explain it. Brian Andrus On 5/10/2021 7:57 AM, Russell Jones wrote: Hello, We have a few users we are needing to add to the local "video" group of a specific set of compute nodes. When submitting a job, slurm appears to not be populating

Re: [slurm-users] nodes going to down* and getting stuck in that state

2021-05-20 Thread Brian Andrus
for the operating system and so things don't choke if it comes up a bit lower because some driver took more memory when it loaded. Brian Andrus On 5/19/2021 9:15 PM, Herc Silverstein wrote: Hi, We have a cluster (in Google gcp) which has a few partitions set up to auto-scale, but one partition is set

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Brian Andrus
of slurm), but was wondering if I had misunderstood the slurm docs and there was a simpler way. Best, Mark On Mon, 24 May 2021, Brian Andrus wrote: Not sure I can understand how it can only be detected from inside the job environment for a failed node. That description is more of "

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-25 Thread Brian Andrus
be executed. We have HPC_Setup.sh in there where we create ssh keys, setup their .forward file and other setup tasks. Brian Andrus On 5/25/2021 5:09 AM, Loris Bennett wrote: Hi everyone, Thanks for all the replies. I think my main problem is that I expect logging in to a node with a job

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-21 Thread Brian Andrus
Umm.. Your keys are password protected. If they were not, you would be getting what you expect: Enter passphrase for key '/home/loris/.ssh/id_rsa': Brian Andrus On 5/21/2021 5:53 AM, Loris Bennett wrote: Hi, We have set up pam_slurm_adopt using the official Slurm documentation and Ole's

Re: [slurm-users] pam_slurm_adopt not working for all users

2021-05-21 Thread Brian Andrus
Oh, you could also use the ssh-agent to manage the keys, then use 'ssh-add ~/.ssh/id_rsa' to type the passphrase once for your whole session (from that system). Brian Andrus On 5/21/2021 5:53 AM, Loris Bennett wrote: Hi, We have set up pam_slurm_adopt using the official Slurm documentation

Re: [slurm-users] unable to Hold and release the job using scontrol

2021-05-23 Thread Brian Andrus
Yep, job 28 is already running. If you want it to be on hold to start, use 'sbatch -H test.sh' and it will start out in a hold state. Brian Andrus On 5/22/2021 11:36 PM, Chris Samuel wrote: On Saturday, 22 May 2021 11:05:54 PM PDT Zainul Abiddin wrote: i am trying to hold the job from

Re: [slurm-users] Different max number of jobs in individual and array jobs

2021-06-03 Thread Brian Andrus
Array jobs are individual jobs that have been grouped. Underneath, they each have their own jobid besides the grouped array jobid. Not sure there is an easy way to pull what you are looking to do. Brian Andrus On 6/3/2021 8:12 AM, Shaohao Chen wrote: Hi, We use Slurm on our cluster and set

Re: [slurm-users] nodes going to down* and getting stuck in, that state

2021-06-04 Thread Brian Andrus
Sounds like a firewall issue. When you log on to the 'down' node, can you run 'sinfo' or 'squeue' there? Also, verify munge is configured/running properly on the node. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00

Re: [slurm-users] nodes going to down* and getting stuck in, that state

2021-06-04 Thread Brian Andrus
Oh, also ensure the dns is working properly on the node. It could be that it isn't able to map the name to ip of the master. Brian Andrus On 6/4/2021 9:31 AM, Herc Silverstein wrote: Hi, The slurmctld.log shows (for this node): ... [2021-05-25T00:12:27.481] sched: Allocate JobId=3402729

Re: [slurm-users] Drain node from TaskProlog / TaskEpilog

2021-05-24 Thread Brian Andrus
Not sure I can understand how it can only be detected from inside the job environment for a failed node. That description is more of "our application is behaving badly, but not so bad, the node quits responding." For that situation, your app or job should have something that it is doing to

Re: [slurm-users] Conflicting --nodes and --nodelist

2021-06-01 Thread Brian Andrus
and then request that feature when submitting your job. Brian Andrus On 6/1/2021 4:15 AM, Diego Zuccato wrote: Hello all. I just found that if a user tries to specify a nodelist (say including 2 nodes) and --nodes=1, the job gets rejected with sbatch: error: invalid number of nodes (-N 2-1

Re: [slurm-users] 答复: Is there bug in PrivateData=jobs option of slurmdbd?

2021-07-01 Thread Brian Andrus
Ok. You may want to check your slurmdbd host(s) and ensure the users are known there. If it does not know who a user is, it will not allow access to the data. If you are running sssd, clear the cache and such too. Brian Andrus On 7/1/2021 1:12 AM, taleinterve...@sjtu.edu.cn wrote: I can

[slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus
a feature constraint, but that seems to only apply to those that want the feature. Since we have so many other users, it isn't feasible to have them modify their scripts, so having it avoid by default would work. Any ideas how to do that? Submit LUA perhaps? Brian Andrus

Re: [slurm-users] How to avoid a feature?

2021-07-01 Thread Brian Andrus
cally set in the lua script. I do have it in place already to ensure time and account are set, but that is about it. Brian Andrus On 7/1/2021 9:39 AM, Lyn Gerner wrote: Hey, Brian, Neither I nor you are going to like what I'm about to say (but I think it's where you're headed). :) We have an

Re: [slurm-users] nodes that finished calculation do not become idle

2021-06-27 Thread Brian Andrus
will be such that each part can be run independently of the others. This allows the resources for that part to be released when that part is complete. Bottom line: Resources are not released when they are not being used, they are released when the job is done. Brian Andrus On 6/26/2021 11:59 PM

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
try: export SLURM_OVERLAP=1 export SLURM_WHOLE=1 before your salloc and see if that helps. I have seen some mpi issues that were resolved with that. You can also try it using just the regular mpirun on the nodes allocated. That will help with a datapoint as well. Brian Andrus On 2/4/2021

Re: [slurm-users] Submitting jobs across multiple nodes fails

2021-02-04 Thread Brian Andrus
Did you compile slurm with mpi support? Your mpi libraries should be the same as that version and they should be available in the same locations for all nodes. Also, ensure they are accessible (PATH, LD_LIBRARY_PATH, etc are set) Brian Andrus On 2/4/2021 1:20 PM, Andrej Prsa wrote: Gentle

Re: [slurm-users] Suspended and released job continues running in a "down" partition

2021-03-24 Thread Brian Andrus
, then cancel to be resumed on another node). Brian Andrus On 3/24/2021 7:31 AM, Gestió Servidors wrote: Hi, I have got this new question for you: In my cluster there is a running job. Then, I change a partition state from “up” to “down”. Then, that job continues “running” because it was already

Re: [slurm-users] Slurm cloud scheduling/power saving

2021-04-01 Thread Brian Andrus
Run 'sinfo -R' to see if any of your nodes are out of the mix. If so, resume them and see if things work. Brian Andrus On 4/1/2021 1:53 AM, Steve Brasier wrote: Hi all, anyone have suggestions for debugging cloud nodes not resuming? I've had this working before but I'm now using "confi

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Brian Andrus
For this one, you want to look closely at the job. Is it targeting a specific partition/nodelist? See what resources it is looking for (scontrol show job ) Also look at the partition limits as well as any QOS items (if you are using them). Brian Andrus On 4/1/2021 10:00 AM, Sajesh Singh

Re: [slurm-users] Limit on number of nodes user able to request

2021-04-01 Thread Brian Andrus
How are you taking them offline? I would expect a SuspendProgram script that is running the command that shuts them down. Also, one of your SlurmctldParameters should be "idle_on_node_suspend" Brian Andrus On 4/1/2021 12:25 PM, Sajesh Singh wrote: Brian,   Targeting the correct

Re: [slurm-users] Is it possible to define multiple partitions for the same node, but each one having a different subset of GPUs?

2021-03-31 Thread Brian Andrus
jobs that request resources you do not want used in a particular queue. Both would take some research to find the best approach, but I think those are the two options available that may do what you are looking for. Brian Andrus On 3/31/2021 8:21 AM, Cristóbal Navarro wrote: Hi Community, I

Re: [slurm-users] Limit on number of nodes user able to request

2021-03-24 Thread Brian Andrus
Do 'sinfo -R' and see if you have any down or drained nodes. Brian Andrus On 3/24/2021 6:31 PM, Sajesh Singh wrote: Slurm 20.02 CentOS 8 I just recently noticed a strange behavior when using the powersave plugin for bursting to AWS. I have a queue configured with 60 nodes, but if I submit

Re: [slurm-users] How can I get complete field values with without specify the length

2021-03-10 Thread Brian Andrus
. Brian Andrus On 3/10/2021 5:05 AM, Marcus Boden wrote: Yeah, I wondered something like that too, as it makes some of my scripts quite fragile. I just tried your name on a test system and now calling squeue paints my cli yellow :D You could write a job_submit plugin to catch 'malicious' input

Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-22 Thread Brian Andrus
=/dev/shm Then have them use $SCRATCH after something like SCRATCH=$FAST_SCRATCH Just set SCRATCH to the one you want to use. Brian Andrus On 3/21/2021 11:32 PM, Loris Bennett wrote: Brian Andrus writes: The method I use for jobs is to make /scratch a symlink to wherever it may be best

Re: [slurm-users] Slurm version 20.11.5 is now available

2021-03-19 Thread Brian Andrus
The method I use for jobs is to make /scratch a symlink to wherever it may be best suited. Then all users just use /scratch eg: /scratch -> /dev/shm for a ramdisk or /scratch -> /mnt/ssd for local ssd, etc Brian Andrus On 3/19/2021 6:25 AM, Paul Edmon wrote: I was about to ask this a
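The symlink approach can be tried safely in a throwaway directory. In this sketch the `ramdisk` and `ssd` directories are stand-ins for /dev/shm and /mnt/ssd, and `scratch` stands in for /scratch; a real setup would create the link as root on each node.

```shell
#!/bin/sh
# Demonstration in a temporary directory so nothing on the system is touched.
demo=$(mktemp -d)
mkdir "$demo/ramdisk" "$demo/ssd"    # stand-ins for /dev/shm and /mnt/ssd

# Point "scratch" at the ramdisk stand-in.
ln -s "$demo/ramdisk" "$demo/scratch"

# Repoint it at the SSD stand-in. -n replaces the link itself instead of
# creating a new link inside the directory it currently points to.
ln -sfn "$demo/ssd" "$demo/scratch"

# Jobs keep using the same path regardless of where it points.
echo "job output" > "$demo/scratch/result.txt"
readlink "$demo/scratch"
```

The job-side path never changes, so scripts do not need editing when the backing storage does.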

Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-17 Thread Brian Andrus
That is looking like your /run folder does not have world execute permissions, making it impossible for anything to access sub-directories. Brian Andrus On 3/17/2021 1:05 PM, Sven Duscha wrote: Hi, On 17.03.21 19:54, Brian Andrus wrote: Be that as it may, you can see it is a permissions

Re: [slurm-users] SLURM slurmctld error on Ubuntu20.04 starting through systemctl

2021-03-17 Thread Brian Andrus
ctld user to one that can write there or change the permissions on the directory to allow the slurmctld user write access. Brian Andrus On 3/17/2021 11:16 AM, Sven Duscha wrote: Hi, I experience with SLURM slurmctld an error on Ubuntu20.04, when starting the service (through systemctl):

Re: [slurm-users] Using Cloud bursting/ powersave options

2021-03-09 Thread Brian Andrus
No issue. In fact that is the default/normal. The 'slurm' user gets created with a shell when you install the rpms. Brian Andrus On 3/9/2021 6:24 AM, Sajesh Singh wrote: I am looking to enable the cloud scheduling feature of Slurm and was wondering if there are any issues with changing

Re: [slurm-users] About sacct --format: how can I get info about the fields

2021-03-03 Thread Brian Andrus
man sacct shows us: -e, --helpformat Print a list of fields that can be specified with the --format option. Brian Andrus On 3/3/2021 5:42 PM, xiaojingh...@163.com wrote: Hello, guys, I am doing a parsing job on the output of the sacct command and I know that the --format option can

Re: [slurm-users] job submit location :: restricted to HOME?

2021-03-03 Thread Brian Andrus
Looks like the job ran. You should look at the output logs. My guess: The node the job ran on does not have access to that path. Log on to that node and check it out. Brian Andrus On 3/3/2021 1:21 AM, Adrian Sevcenco wrote: Hi! I just encountered the situation that i cannot submit jobs from

Re: [slurm-users] prolog not passing env var to job

2021-03-04 Thread Brian Andrus
--export=ALL,MYVAR=othervalue do 'man srun' and look at the --export option Brian Andrus On 3/3/2021 9:28 PM, Chin,David wrote: ahmet.mer...@uhem.itu.edu.tr wrote: > Prolog and TaskProlog are different parameters and scripts. You should > use the TaskProlog script to set env. variables

Re: [slurm-users] fix missing accounting entries

2021-03-01 Thread Brian Andrus
runaway: sacctmgr show RunawayJobs *From:* slurm-users on behalf of Brian Andrus *Sent:* Monday, March 1, 2021 11:14 AM *To:* slurm-users@lists.schedmd.com *Subject:* [slurm-users] fix missing accounting entries All

[slurm-users] fix missing accounting entries

2021-03-01 Thread Brian Andrus
All, IIRC, there was a command that would repair the accounting tables when a job had no endtime. I can't seem to find the info for that. Does anyone recall such a thing? Brian Andrus

Re: [slurm-users] User id inconsistency

2021-04-19 Thread Brian Andrus
caused by that. Brian Andrus On 4/19/2021 4:41 AM, Bruno Gomes Pessanha wrote: That is showing that I'm in different groups depending on how I run the command id. PS: I'm running the controller and workers in docker containers using privileged mode. Bruno On Mon, 19 Apr 2021 at 13:24

Re: [slurm-users] prolog not passing env var to job

2021-02-12 Thread Brian Andrus
Your prolog script is run by/as the same user as slurmd, so any environment variables you set there will not be available to the job being run. See: https://slurm.schedmd.com/prolog_epilog.html for info. Brian Andrus On 2/12/2021 1:27 PM, mercan wrote: Hi; Prolog and TaskProlog

Re: [slurm-users] Preemption not working for jobs in higher priority partition

2021-08-20 Thread Brian Andrus
IIRC, Preemption is determined by partition first, not node. Since your pending job is in the 'day' partition, it will not preempt something in the 'night' partition (even if the node is in both). Brian Andrus On 8/19/2021 2:49 PM, Russell Jones wrote: Hi all, I could use some help

Re: [slurm-users] slurmd startup problem

2021-08-16 Thread Brian Andrus
I suspect you may have set some "frontendname" or "frontendaddr" in your slurm.conf that triggered that. A FrontEnd is a node that is used to execute batch scripts rather than compute nodes (Cray ALPS systems). If that is not you, you should not set it. Brian Andrus

Re: [slurm-users] max_script_size

2021-09-13 Thread Brian Andrus
a cluster. This may be a good option for you. Brian Andrus On 9/13/2021 7:14 AM, Ozeryan, Vladimir wrote: *max_script_size=#* Specify the maximum size of a batch script, in bytes. The default value is 4 megabytes. Larger values may adversely impact system performance. I have us

Re: [slurm-users] [External] How can I do to prevent a specific job from being prempted?

2021-09-16 Thread Brian Andrus
Modify it and raise the priority to something very, very high. scontrol update job=JOBID priority=999 Brian Andrus On 9/16/2021 8:39 AM, 顏文 wrote: Dear users Thanks for the immediate replies. I currently have one important job running. How to prevent the running job from being preempted

Re: [slurm-users] using sacctmgr to change the parent of an account

2021-09-08 Thread Brian Andrus
Yep. I do it all the time when I forget to add a parent. Also when a project/account changes who owns it. sacctmgr will also tell you what it is going to change and gives you 30 seconds to say yes, else it doesn't make the change. Brian Andrus On 9/8/2021 3:41 AM, byron wrote: Hi I've
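
A small sketch of what that looks like, assuming two hypothetical account names (projA, deptB). The helper only prints the sacctmgr command so it can be reviewed first; sacctmgr itself will still ask for confirmation when you run it.

```shell
#!/bin/bash
# Print (not run) the sacctmgr command that moves an account under a new parent.
# Account names passed in are placeholders for whatever exists in your database.
reparent_account_cmd() {
    local account="$1" new_parent="$2"
    echo "sacctmgr modify account where name=${account} set parent=${new_parent}"
}

# Example: show the command for moving projA under deptB.
reparent_account_cmd projA deptB
```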

Re: [slurm-users] Down nodes

2021-07-30 Thread Brian Andrus
That 'not responding' is the issue and usually means 1 of 2 things: 1) slurmd is not running on the node 2) something on the network is stopping the communication between the node and the master (firewall, SELinux, congestion, bad NIC, routes, etc) Brian Andrus On 7/30/2021 3:51 PM, Soichi
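
A rough sketch of how those two checks can be triaged from the controller. It relies on sinfo marking a non-responding node with a trailing '*' on its state (for example "down*"); the helper names are made up for this example.

```shell
#!/bin/bash
# In sinfo output, a trailing '*' on the node state (e.g. "down*")
# marks a node that is currently not responding.
is_not_responding() {
    case "$1" in
        *\*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Print which checks to do next for a given node and state.
suggest_checks() {
    local node="$1" state="$2"
    if is_not_responding "$state"; then
        echo "$node: check 'systemctl status slurmd' on the node, then firewall/SELinux/routes to the controller"
    else
        echo "$node: state '$state' is not a communication problem; see 'scontrol show node $node' for the Reason"
    fi
}

# Usage on a live cluster (requires Slurm):
#   sinfo -N -h -o '%N %t' | while read -r node state; do
#       suggest_checks "$node" "$state"
#   done
```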

Re: [slurm-users] update in place/db compatibility 19.05 vs 20.11

2021-08-03 Thread Brian Andrus
they are also updated, but that is expected and not to be worried about. It will go away once you also update your compute nodes. Brian Andrus On 8/2/2021 12:34 PM, Adrian Sevcenco wrote: Hi! can a 19.05 cluster be directly upgraded to 20.11? Thank you! Adrian

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Brian Andrus
You may also want to look at node weights. By setting them at different levels for each node, you can give a preference to one over the other. That may be a way to do a "try this node first" method of job placement. Brian Andrus On 8/10/2021 9:19 AM, Jack Chen wrote: Thanks for
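
As an illustration only, a slurm.conf fragment with hypothetical node names and Gres counts (other node fields such as CPUs and RealMemory are left out). All else being equal, Slurm allocates the nodes with the lowest Weight first, so gpu01 would fill before gpu02:

```
# slurm.conf (node names and Gres are assumptions for this example)
NodeName=gpu01 Gres=gpu:4 Weight=10
NodeName=gpu02 Gres=gpu:4 Weight=20
```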

Re: [slurm-users] how to temporarily avoid node being suspended by SuspendProgram

2021-08-10 Thread Brian Andrus
Certainly, set SuspendExcNodes: "List of nodes to never place in power saving mode. Use Slurm's hostlist expression format. By default, no nodes are excluded." Then do 'scontrol reconfigure'. Repeat when you want them to be included. Brian Andrus On 8/10/2021 5:46 AM, Josef Dvoracek
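
For example, assuming four hypothetical nodes named node01 through node04, the slurm.conf line could look like this (hostlist expression format):

```
# slurm.conf (node names are assumptions for this example)
SuspendExcNodes=node[01-04]
```

Remove or shorten the list and reconfigure again once the nodes may be suspended as usual.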

Re: [slurm-users] Compact scheduling strategy for small GPU jobs

2021-08-10 Thread Brian Andrus
You may want to look at your resources. If the memory allocation adds up such that there isn't enough left for any job to run, it won't matter that there are still GPUs available. Similar for any other resource (CPUs, cores, etc) Brian Andrus On 8/10/2021 8:07 AM, Jack Chen wrote: Does

Re: [slurm-users] job is pending but resources are available

2021-10-12 Thread Brian Andrus
Something is very odd when you have the node reporting: RealMemory=1 AllocMem=0 FreeMem=47563 Sockets=2 Boards=1 What do you get when you run ‘slurmd -C’ on the node? Brian Andrus From: Adam Xu Sent: Tuesday, October 12, 2021 6:07 PM To: slurm-users@lists.schedmd.com Subject: Re: [slurm-users] job
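
A small sketch for comparing the two values, assuming the usual KEY=VALUE token layout that both 'slurmd -C' and 'scontrol show node' print. The helper name get_field and the node name n1 are made up for this example.

```shell
#!/bin/bash
# Read text containing space-separated KEY=VALUE tokens on stdin and print
# the value of the first token whose key matches $1.
get_field() {
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Usage (requires Slurm; n1 is a placeholder node name):
#   slurmd -C | get_field RealMemory              # what the hardware reports, run on the node
#   scontrol show node n1 | get_field RealMemory  # what the controller has configured
```

If the configured value is far below what 'slurmd -C' reports, the RealMemory entry for that node in slurm.conf is the first thing to look at.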

Re: [slurm-users] "Low RealMem" after upgrade

2021-10-01 Thread Brian Andrus
. Also helps with OOM killer situations. Brian Andrus On 10/1/2021 1:22 AM, Diego Zuccato wrote: Hello all. I just upgraded to Debian 11 that brings Slurm 21.08 and the newer nodes upgraded w/o too many issues (just minor config changes, one being RealMemory value in slurm.conf, since

Re: [slurm-users] Possible bug with Prologslurmctld and Epilogslurmctld scripts?

2021-09-27 Thread Brian Andrus
Those would be considered separate for each job. You may want to have your prolog check to see if there is an epilog running and wait for the epilog to be done before starting its prolog work. Brian Andrus On 9/27/2021 9:15 AM, Joe Teumer wrote: Should the Prologslurmctld script only
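
One way to sketch that wait, under the assumption that your epilog creates a marker file when it starts and removes it when it finishes. The marker path and the function name are hypothetical; Slurm does not provide either.

```shell
#!/bin/bash
# Wait until the marker file left by a running epilog disappears.
# $1 = marker file path (hypothetical convention between your own scripts)
# $2 = timeout in seconds (default 60)
# Returns 0 once the marker is gone, 1 if it is still there at the timeout.
wait_for_epilog() {
    local marker="$1" timeout="${2:-60}" waited=0
    while [ -e "$marker" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage inside the prolog (path is an assumption):
#   wait_for_epilog /var/run/my_epilog.running 120 || exit 1
```

Since PrologSlurmctld and EpilogSlurmctld both run on the controller host, a local file is enough there; for node-side prolog/epilog the marker would need to live on each node.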

Re: [slurm-users] [EXT] Re: slurmdbd does not work

2021-12-03 Thread Brian Andrus
Which version of Mariadb are you using? Brian Andrus On 12/3/2021 4:20 PM, Giuseppe G. A. Celano wrote: After installation of libmariadb-dev, I have reinstalled the entire slurm with ./configure + options, make, and make install. Still, accounting_storage_mysql.so is missing. On Sat, Dec

Re: [slurm-users] slurmdbd does not work

2021-12-03 Thread Brian Andrus
:41.022] fatal: You are running with a database but for some reason we have no TRES from it.  This should only happen if the database is down and you don't have any state files. On Thu, Dec 2, 2021 at 10:36 PM Brian Andrus wrote: Your slurm needs to be built with the support. If you have mysql
