[slurm-users] Hung tasks and high load when cancelling jobs
Hi,

Sometimes when jobs are cancelled I see a spike in system load and hung
task errors. It appears to be related to NFS and cgroups. The slurmstepd
process gets hung cleaning up cgroups:

INFO: task slurmstepd:11222 blocked for more than 120 seconds.
      Not tainted 4.4.0-119-generic #143-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
slurmstepd      D 8817b1d47808     0 11222      1 0x0004
 8817b1d47808 0246 880c48dd1c00 881842de3800
 8817b1d48000 880c4f9972c0 7fff 8184bd60
 8817b1d47960 8817b1d47820 8184b565
Call Trace:
 [] ? bit_wait+0x60/0x60
 [] schedule+0x35/0x80
 [] schedule_timeout+0x1b6/0x270
 [] ? hash_ipport6_add+0x6c0/0x6c0 [ip_set_hash_ipport]
 [] ? ktime_get+0x3e/0xb0
 [] ? bit_wait+0x60/0x60
 [] io_schedule_timeout+0xa4/0x110
 [] bit_wait_io+0x1b/0x70
 [] __wait_on_bit+0x5f/0x90
 [] wait_on_page_bit+0xcb/0xf0
 [] ? autoremove_wake_function+0x40/0x40
 [] shrink_page_list+0x78d/0x7a0
 [] shrink_inactive_list+0x209/0x520
 [] shrink_lruvec+0x583/0x740
 [] ? __queue_work+0x139/0x3c0
 [] shrink_zone+0xef/0x2e0
 [] do_try_to_free_pages+0x15b/0x3b0
 [] try_to_free_mem_cgroup_pages+0xba/0x1a0
 [] mem_cgroup_force_empty_write+0x70/0xd0
 [] cgroup_file_write+0x42/0x110
 [] kernfs_fop_write+0x120/0x170
 [] __vfs_write+0x1b/0x40
 [] vfs_write+0xa9/0x1a0
 [] ? do_sys_open+0x1bf/0x2a0
 [] SyS_write+0x55/0xc0
 [] entry_SYSCALL_64_fastpath+0x1c/0xbb

The actual process being submitted seems to always be hung on NFS I/O:

INFO: task wb_command:11247 blocked for more than 120 seconds.
      Not tainted 4.4.0-119-generic #143-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
wb_command      D 880aa39fb9f8     0 11247      1 0x0004
 880aa39fb9f8 880c4fc90440 880c48847000 880c43f91c00
 880aa39fc000 880c4fc972c0 7fff 8184bd60
 880aa39fbb58 880aa39fba10 8184b565
Call Trace:
 [] ? bit_wait+0x60/0x60
 [] schedule+0x35/0x80
 [] schedule_timeout+0x1b6/0x270
 [] ? dequeue_entity+0x41b/0xa80
 [] ? bit_wait+0x60/0x60
 [] io_schedule_timeout+0xa4/0x110
 [] bit_wait_io+0x1b/0x70
 [] __wait_on_bit+0x5f/0x90
 [] ? bit_wait+0x60/0x60
 [] out_of_line_wait_on_bit+0x82/0xb0
 [] ? autoremove_wake_function+0x40/0x40
 [] nfs_wait_on_request+0x37/0x40 [nfs]
 [] nfs_writepage_setup+0x103/0x600 [nfs]
 [] nfs_updatepage+0xda/0x380 [nfs]
 [] nfs_write_end+0x13d/0x4b0 [nfs]
 [] ? iov_iter_copy_from_user_atomic+0x8d/0x220
 [] generic_perform_write+0x11b/0x1d0
 [] __generic_file_write_iter+0x1a2/0x1e0
 [] generic_file_write_iter+0xe5/0x1e0
 [] nfs_file_write+0x9a/0x170 [nfs]
 [] new_sync_write+0xa5/0xf0
 [] __vfs_write+0x29/0x40
 [] vfs_write+0xa9/0x1a0
 [] SyS_write+0x55/0xc0
 [] entry_SYSCALL_64_fastpath+0x1c/0xbb

I upgraded somewhat recently from 17.02 to 17.11, but I am not positive
whether this bug is new or just went unnoticed previously.

Thanks,
Brendan
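For reference, the first trace shows slurmstepd inside
mem_cgroup_force_empty_write, i.e. blocked while writing the job's
memory.force_empty file during cgroup teardown; that write forces the
kernel to reclaim the cgroup's page cache, and reclaim cannot finish
while the job's dirty NFS pages are still being written back. A sketch
of the operation in question follows; the cgroup path is a guess at a
typical Slurm cgroup hierarchy, not taken from this report:

    # Roughly what slurmstepd's cgroup cleanup does (path is hypothetical):
    echo 0 > /sys/fs/cgroup/memory/slurm/uid_1000/job_11222/memory.force_empty
    # The write blocks until reclaim completes, and reclaim of dirty NFS
    # pages waits on NFS writeback -- hence hung-task warnings for both
    # slurmstepd and the job's own process (wb_command above).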
Re: [slurm-users] "Low socket*core*thre" - solution?
Hi Caleb,

I noticed the same thing. If you configure a host with more memory than it
really has, slurm will think that the host has something wrong with it and
put it in drain status. At least that is my theory. The vendor can likely
give you a better, more detailed answer.

-jfk

On Wed, May 2, 2018 at 6:23 PM, Caleb Smith wrote:
> Hi all,
>
> Out of curiosity, what causes that? It'd be good to know for the future --
> I ran into the same issue and just edited the memory down and it works fine
> now, but I'd like to know why/what causes that error. I'm assuming low
> resources, ie memory or CPU or whatever. Mind clarifying?
>
> [earlier messages in the thread trimmed; quoted in full below]
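A quick way to test that theory is to compare what the daemon detects
with what the controller was told; the node name below is the one from
this thread, and the FastSchedule note is my understanding of 17.x
defaults rather than anything verified against this cluster:

    # Hardware as slurmd actually detects it on the node:
    slurmd -C
    # What slurmctld currently believes, and why the node is drained:
    scontrol show node odin | grep -iE 'sockets|threads|realmemory|reason'
    # With FastSchedule=1 (the default), slurm.conf is taken at face value,
    # and a node that registers with fewer sockets*cores*threads -- or less
    # RealMemory -- than configured is drained with a "Low ..." reason.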
Re: [slurm-users] "Low socket*core*thre" - solution?
Hi all,

Out of curiosity, what causes that? It'd be good to know for the future --
I ran into the same issue and just edited the memory down and it works fine
now, but I'd like to know why/what causes that error. I'm assuming low
resources, ie memory or CPU or whatever. Mind clarifying?

On Wed, May 2, 2018, 7:11 PM John Kelly wrote:
> Hi matt
>
> scontrol update nodename=odin state=resume
> scontrol update nodename=odin state=idle
>
> -jfk
>
> [Matt Hohmeister's original message trimmed; quoted in full below in
> this thread]
Re: [slurm-users] "Low socket*core*thre" - solution?
Hi matt

scontrol update nodename=odin state=resume
scontrol update nodename=odin state=idle

-jfk

On Wed, May 2, 2018 at 5:28 PM, Matt Hohmeister wrote:
> I have a two-node cluster: the server/compute node is a Dell PowerEdge
> R730; the compute node, a Dell PowerEdge R630. On both of these nodes,
> slurmd -C gives me the exact same line:
>
> [me@odin slurm]$ slurmd -C
> NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=128655
>
> [me@thor slurm]$ slurmd -C
> NodeName=thor CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=128655
>
> So I edited my slurm.conf appropriately:
>
> NodeName=odin CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=128655
> NodeName=thor CPUs=40 Boards=1 SocketsPerBoard=2 CoresPerSocket=10
> ThreadsPerCore=2 RealMemory=128655
>
> …and it looks good, except for the drain on my server/compute node:
>
> [me@odin slurm]$ sinfo
> PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
> debug*       up   infinite      1  drain odin
> debug*       up   infinite      1   idle thor
>
> …for the following reason:
>
> [me@odin slurm]$ sinfo -R
> REASON               USER      TIMESTAMP           NODELIST
> Low socket*core*thre slurm     2018-05-02T11:55:38 odin
>
> Any ideas?
>
> Thanks!
>
> Matt Hohmeister
> Systems and Network Administrator
> Department of Psychology
> Florida State University
> PO Box 3064301
> Tallahassee, FL 32306-4301
> Phone: +1 850 645 1902
> Fax: +1 850 644 7739
[slurm-users] Odd sacct behavior?
Hello all,

I'm just wondering if anyone is able to reproduce the behavior I'm seeing
with `sacct`, or if anyone has experienced it previously.

In a nutshell, I usually can query jobs from specified nodes, similar to
the following:

`sacct -o $OPTIONLIST -N nodename -S START -E END -s r`

Up until today, it has never failed and the results are what I expect.
However, I noticed that when I attempted to query a job from ~10 days ago
using the formula above, I get zero output and the following message in
the slurmdbd log:

error: Problem getting jobs for cluster $CLUSTERNAME

Looking through my history, it seems that I've always queried jobs in an
exact fashion, but usually the jobs I'm looking at are less than a week
old, and output is returned.

If I exclude a nodename or nodelist, and keep the start and end times, I
get results, and no error is returned:

`sacct -o $OPTIONLIST -S START -E END -s r`

I was able to query the DB itself and was able to retrieve information,
so it doesn't appear to be an issue with purged records. I also restarted
MySQL and the slurmdbd and didn't see any improvement.

I'm using slurm 16.05.10-2 and slurmdbd 16.05.10-2.

Thanks,
John DeSantis
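For completeness, "query the DB itself" here means something along these
lines; the database name, the <cluster>_job_table naming, and the column
names are assumptions about slurmdbd's usual MySQL schema, so adjust to
your install:

    # Hypothetical direct check of slurmdbd's store (names are assumptions):
    mysql slurm_acct_db -e "
      SELECT id_job, job_name, nodelist,
             FROM_UNIXTIME(time_start), FROM_UNIXTIME(time_end)
      FROM mycluster_job_table
      WHERE nodelist LIKE '%nodename%'
        AND time_start >= UNIX_TIMESTAMP('2018-04-20')
      LIMIT 10;"

If rows come back here but `sacct -N nodename` returns nothing for the
same window, that points at the query slurmdbd builds rather than at
purged records.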
Re: [slurm-users] GPU / cgroup challenges
So there is a patch?

-- Original message --
From: Fulcomer, Samuel
Date: Wed, May 2, 2018 11:14
To: Slurm User Community List
Subject: Re: [slurm-users] GPU / cgroup challenges

This came up around 12/17, I think, and as I recall the fixes were added
to the src repo then; however, they weren't added to any of the 17.x
releases.

On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote:
> [quoted message trimmed; R. Paul Wiegand's message appears in full below
> in this thread]
Re: [slurm-users] GPU / cgroup challenges
This came up around 12/17, I think, and as I recall the fixes were added
to the src repo then; however, they weren't added to any of the 17.x
releases.

On Wed, May 2, 2018 at 6:04 AM, R. Paul Wiegand wrote:
> [quoted message trimmed; R. Paul Wiegand's message appears in full below
> in this thread]
Re: [slurm-users] GPU / cgroup challenges
I dug into the logs on both the slurmctld side and the slurmd side.
For the record, I have debug2 set for both and DebugFlags=CPU_BIND,Gres.

I cannot see much that is terribly relevant in the logs. There's a
known parameter error reported with the memory cgroup specifications,
but I don't think that is germane.

When I set "--gres=gpu:1", the slurmd log does have encouraging lines
such as:

[2018-05-02T08:47:04.916] [203.0] debug: Allowing access to device /dev/nvidia0 for job
[2018-05-02T08:47:04.916] [203.0] debug: Not allowing access to device /dev/nvidia1 for job

However, I can still "see" both devices from nvidia-smi, and I can
still access both if I manually unset CUDA_VISIBLE_DEVICES.

When I do *not* specify --gres at all, there is no reference to gres,
gpu, nvidia, or anything similar in any log at all. And, of course, I
have full access to both GPUs.

I am happy to attach the snippets of the relevant logs, if someone
more knowledgeable wants to pore through them. I can also set the
debug level higher, if you think that would help.

Assuming upgrading will solve our problem, in the meantime: Is there
a way to ensure that the *default* request always has "--gres=gpu:1"?
That is, this situation is doubly bad for us not just because there is
*a way* around the resource management of the device but also because
the *DEFAULT* behavior if a user issues an srun/sbatch without
specifying a Gres is to go around the resource manager.

On Tue, May 1, 2018 at 8:29 PM, Christopher Samuel wrote:
> On 02/05/18 10:15, R. Paul Wiegand wrote:
>
>> Yes, I am sure they are all the same. Typically, I just scontrol
>> reconfig; however, I have also tried restarting all daemons.
>
> Understood. Any diagnostics in the slurmd logs when trying to start
> a GPU job on the node?
>
>> We are moving to 7.4 in a few weeks during our downtime. We had a
>> QDR -> OFED version constraint -> Lustre client version constraint
>> issue that delayed our upgrade.
>
> I feel your pain.. BTW RHEL 7.5 is out now so you'll need that if
> you need current security fixes.
>
>> Should I just wait and test after the upgrade?
>
> Well 17.11.6 will be out then that will include a fix for a deadlock
> that some sites hit occasionally, so that will be worth throwing
> into the mix too. Do read the RELEASE_NOTES carefully though,
> especially if you're using slurmdbd!
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
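One thing worth double-checking before the upgrade: those "Allowing/Not
allowing access to device" log lines only translate into actual
enforcement when the devices cgroup is constrained. A minimal sketch of
the relevant pieces, assuming TaskPlugin=task/cgroup is already set; the
file paths and the device list below are illustrative assumptions, not
taken from this cluster:

    # cgroup.conf (sketch)
    CgroupAutomount=yes
    ConstrainDevices=yes
    AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

    # /etc/slurm/cgroup_allowed_devices_file.conf (sketch):
    # devices every job may touch; GPUs deliberately omitted so that only
    # the gres-granted /dev/nvidiaN ends up allowed for a given job
    /dev/null
    /dev/zero
    /dev/urandom
    /dev/cpu/*/*
    /dev/pts/*

With ConstrainDevices left unset, the debug messages still appear but
nvidia-smi will see every GPU, which matches the behavior described above.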
Re: [slurm-users] wckey specification error
On Wednesday, 2 May 2018 8:50:12 PM AEST John Hearns wrote:

> One learning point: grep -i is a good default option. This ignores the
> case of the search, so you would have found WCKey a bit faster.

Also if you need to search recursively below a point then:

git grep --no-index -i ${PATTERN}

will do git's grep with no need to have a git repository. Plus it
paginates, etc, for you. Also pretty fast. :-)

-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
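As a concrete usage example for the thread's original problem (the
/etc/slurm path is an assumption about where the config lives):

    # case-insensitive recursive search, no git repository required:
    cd /etc/slurm && git grep --no-index -i wckey
    # plain-grep equivalent:
    grep -ri wckey /etc/slurm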
Re: [slurm-users] wckey specification error
Mahmood, good to hear you have a solution.

One learning point: grep -i is a good default option. This ignores the
case of the search, so you would have found WCKey a bit faster.

On 2 May 2018 at 04:26, Mahmood Naderan wrote:

> Thanks Trevor for pointing out that there is an option for such a thing
> in slurm.conf. Although I previously grepped for *wc* and found
> nothing, the correct name is TrackWCKey, which is set to "yes" by
> default. After setting that to "no", the error disappeared.
>
> About the comments on Rocks and the Slurm roll... in my experience,
> Rocks 7 is very good and the unofficial Slurm roll provided by Werner
> is also very good. It is worth giving them a try. Although I had some
> experience with manual Slurm installation on an Ubuntu cluster some
> years ago, the automatic installation of the roll was very nice
> indeed! All the commands and configurations can be extracted from the
> roll. So there is no dark point about that. Limited issues about
> Slurm, e.g. installation, are directly related to Werner. Most of the
> other questions are related to Slurm itself, for example accounting
> and other things.
>
> Regards,
> Mahmood
>
> On Tue, May 1, 2018 at 9:35 PM, Cooper, Trevor wrote:
> >
> >> On May 1, 2018, at 2:58 AM, John Hearns wrote:
> >>
> >> Rocks 7 is now available, which is based on CentOS 7.4.
> >> I hate to be uncharitable, but I am not a fan of Rocks. I speak from
> >> experience, having installed my share of Rocks clusters.
> >> The philosophy just does not fit in with the way I look at the world.
> >>
> >> Anyway, to install extra software on Rocks you need a 'Roll'. Mahmood,
> >> it looks like you are using this Roll:
> >> https://sourceforge.net/projects/slurm-roll/
> >> It seems pretty modern as it installs Slurm 17.11.3
> >>
> >> On 1 May 2018 at 11:40, Chris Samuel wrote:
> >> On Tuesday, 1 May 2018 2:45:21 PM AEST Mahmood Naderan wrote:
> >>
> >> > The wckey explanation in the manual [1] is not meaningful at the
> >> > moment. Can someone explain that?
> >>
> >> I've never used it, but it sounds like you've configured your system
> >> to require it (or perhaps Rocks has done that?).
> >>
> >> https://slurm.schedmd.com/wckey.html
> >>
> >> Good luck,
> >> Chris
> >> --
> >> Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC
> >
> > The slurm-roll hosted on SourceForge is developed and supported by
> > Werner Saar, not by developers of Rocks and/or other Rocks application
> > rolls (e.g. SDSC).
> >
> > There is ample documentation on SourceForge[1] on how to configure
> > your Rocks cluster to properly deploy the slurm-roll components and
> > update your Slurm configuration.
> >
> > There is also an active discussion group for the slurm-roll on
> > SourceForge[2] where Werner supports users of the slurm-roll for Rocks.
> >
> > While we don't use Werner's slurm-roll on our Rocks/Slurm based
> > systems, I have installed it on a test system and can say that it
> > works as expected/documented.
> >
> > In the default configuration WCKeys were NOT enabled, so this is
> > something that you must have added to your Slurm configuration.
> >
> > If you don't need the WCKeys capability of Slurm, perhaps you could
> > simply disable it in your Slurm configuration.
> >
> > Hope this helps,
> > Trevor
> >
> > [1] - https://sourceforge.net/projects/slurm-roll/files/release-7.0.0-17.11.05/slurm-roll.pdf
> > [2] - https://sourceforge.net/p/slurm-roll/discussion/
> >
> > --
> > Trevor Cooper
> > HPC Systems Programmer
> > San Diego Supercomputer Center, UCSD
> > 9500 Gilman Drive, 0505
> > La Jolla, CA 92093-0505
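For anyone finding this thread later, the fix Mahmood describes is a
one-line slurm.conf change; restarting the daemons afterwards is a
conservative assumption (a plain reconfigure may or may not pick it up):

    # slurm.conf -- stop requiring/tracking WCKeys:
    TrackWCKey=no

    # then, assuming systemd-managed daemons:
    systemctl restart slurmctld    # on the controller
    systemctl restart slurmd       # on each compute node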
[slurm-users] sacct fields AllocCPUS and ReqMem are empty
Hi,

I have Slurm 17.02.10 installed in a test environment. When I use

sacct -o "JobID,JobName,AllocCPUs,ReqMem,Elapsed"

and AccountingStorageType = accounting_storage/filetxt, the fields
AllocCPUS and ReqMem are empty:

       JobID    JobName  AllocCPUS     ReqMem    Elapsed
------------ ---------- ---------- ---------- ----------
371          stress_20s          0         0n   00:00:21
372          stress_20s          0         0n   00:00:21
373          stress_20s          0         0n   00:00:21

When I switch to AccountingStorageType = accounting_storage/slurmdbd and
start the same jobs, the output works fine:

       JobID    JobName  AllocCPUS     ReqMem    Elapsed
------------ ---------- ---------- ---------- ----------
382          stress_20s          1    32004Mn   00:00:20
383          stress_20s          1     2000Mn   00:00:20
384          stress_20s          1     2000Mn   00:00:20

Also, when I set the --starttime filter, it works only with the database.

Does anyone have an explanation for this?

Marcel
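A plausible explanation (worth confirming against the accounting
documentation) is that the filetxt plugin records only a minimal subset
of job accounting data, so fields like AllocCPUS and ReqMem, and
time-based filters like --starttime, only work against the database
backend. The working configuration above boils down to something like
this sketch, where the host names and credentials are placeholders:

    # slurm.conf (sketch)
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=localhost   # placeholder slurmdbd host

    # slurmdbd.conf (sketch)
    StorageType=accounting_storage/mysql
    StorageHost=localhost
    StorageUser=slurm
    StoragePass=changeme              # placeholder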