This bug in Slurm will be fixed in version 2.6.5 when it is released. You
can get an early patch for it here:
https://github.com/SchedMD/slurm/commit/1ae427dd88133ae62183dba0444d2c68afccb55c.patch
Quoting Oliver Fortmeier <[email protected]>:
Dear slurm-dev,
We have a question on the value of gres_alloc which is reported to
the slurm accounting. Either we do not understand the meaning of
this parameter, or there may be a bug.
The observation is the following:
When "srunning" a job 3572543 using one gres GPU on a node, we
observe in slurm's accounting database that the job has requested
one GPU and one GPU has been allocated:
+------------+---------+-------------+-------------+----------+------------+-----------+
| job_db_inx | id_job  | nodelist    | nodes_alloc | gres_req | gres_alloc | gres_used |
+------------+---------+-------------+-------------+----------+------------+-----------+
| 3928247    | 3572543 | <--NODE1--> | 1           | gpu:1    | gpu:1      |           |
+------------+---------+-------------+-------------+----------+------------+-----------+
So far, so good. However, when submitting a second job 3572544
(also using one GPU), we observe in the accounting that the
second job has requested one GPU (correct) but that two GPUs are
allocated (wrong?):
+------------+---------+-------------+-------------+----------+------------+-----------+
| job_db_inx | id_job  | nodelist    | nodes_alloc | gres_req | gres_alloc | gres_used |
+------------+---------+-------------+-------------+----------+------------+-----------+
| 3928249    | 3572544 | <--NODE1--> | 1           | gpu:1    | gpu:2      |           |
+------------+---------+-------------+-------------+----------+------------+-----------+
Please note that the two jobs are running simultaneously on the same
node as the same user. Following this approach and submitting a
third job, we observe that the value of gres_alloc is "gpu:3".
Looking at the Slurm code (function _build_gres_alloc_string in
node_scheduler.c), I do not see any dependency on the job when the
allocated generic resources are collected. Thus, I have two
questions:
1) What does the parameter gres_alloc exactly describe?
2) Why is there no dependency on the job when collecting the value
of gres_alloc?
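
To make the second question concrete, here is a small toy model in Python.
It is not Slurm code: the names `running`, `node_wide_gres`, `per_job_gres`
and `submit` are hypothetical, and the third job id is assumed for
illustration. It only shows that counting every GPU in use on the node,
rather than the GPUs held by the job being recorded, would produce the
gpu:1, gpu:2, gpu:3 sequence described above.

```python
# Toy model (assumption, not Slurm's implementation): each entry is
# (job_id, node, gpus) for a job that is currently running.
running = []


def node_wide_gres(node):
    """Sum the GPUs of ALL running jobs on the node (no job dependency)."""
    return sum(gpus for _, n, gpus in running if n == node)


def per_job_gres(job_id, node):
    """Sum only the GPUs held by the given job on the node."""
    return sum(gpus for j, n, gpus in running if j == job_id and n == node)


def submit(job_id, node, gpus):
    """Start a job and return both candidate gres_alloc strings for it."""
    running.append((job_id, node, gpus))
    return {
        "id_job": job_id,
        "node_wide": f"gpu:{node_wide_gres(node)}",
        "per_job": f"gpu:{per_job_gres(job_id, node)}",
    }


# Three simultaneous one-GPU jobs on the same node; the third id is assumed.
records = [submit(job_id, "node1", 1) for job_id in (3572543, 3572544, 3572545)]

for rec in records:
    print(rec["id_job"], rec["node_wide"], rec["per_job"])
```

In this sketch the node-wide count grows with every concurrent job, while
the per-job count stays at gpu:1, which is the value one would expect in
the accounting record if gres_alloc is meant to describe the job itself.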
Best regards,
Oliver
--
Dr. Oliver Fortmeier
Technical Analyst High-Performance Computing,
Bull GmbH, Germany
Phone: +49 (0) 2203 / 305 2465
Mobile: +49 (0) 173 / 5887589
E-mail: [email protected]