Dear Slurm developers,

We came across the following error message in the slurmctld log when
using non-consumable generic resources (GRES):

error: gres/potion: job 39 dealloc of node node1 bad node_offset 0 count is 0

The error comes from _job_dealloc():

#0  _job_dealloc (job_gres_data=0x19ea550, node_gres_data=0x7f8a18000b70,
    node_offset=0, gres_name=0x1999e00 "potion", job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:3980
#1  0x0000000000600d8b in gres_plugin_job_dealloc (job_gres_list=0x199b7c0,
    node_gres_list=0x199bc38, node_offset=0, job_id=46,
    node_name=0x1987ab0 "node1") at gres.c:4190
#2  0x00007f8a3134e9db in _rm_job_from_nodes (cr_ptr=0x7f8a18001050,
    job_ptr=0x19e9d50, pre_err=0x7f8a31353cb0 "_will_run_test",
    remove_all=true) at select_linear.c:2091
#3  0x00007f8a313518b3 in _will_run_test (job_ptr=0x7f8a24001390,
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, max_share=1,
    req_nodes=1, preemptee_candidates=0x0,
    preemptee_job_list=0x7f8a2f910c40) at select_linear.c:3176
#4  0x00007f8a31351f7b in select_p_job_test (job_ptr=0x7f8a24001390,
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
    exc_core_bitmap=0x0) at select_linear.c:3390
#5  0x0000000000515468 in select_g_job_test (job_ptr=0x7f8a24001390,
    bitmap=0x7f8a18001ad0, min_nodes=1, max_nodes=1, req_nodes=1, mode=2,
    preemptee_candidates=0x0, preemptee_job_list=0x7f8a2f910c40,
    exc_core_bitmap=0x0) at node_select.c:588
#6  0x00007f8a2f91510f in _try_sched (job_ptr=0x7f8a24001390,
    avail_bitmap=0x7f8a2f910d38, min_nodes=1, max_nodes=1, req_nodes=1,
    exc_core_bitmap=0x0) at backfill.c:367
#7  0x00007f8a2f9175fd in _attempt_backfill () at backfill.c:1197
#8  0x00007f8a2f915cef in backfill_agent (args=0x0) at backfill.c:634
#9  0x0000003c82e07ee5 in start_thread () from /lib64/libpthread.so.0
#10 0x0000003c82af4b8d in clone () from /lib64/libc.so.6

The cause of this problem is that _node_state_dup() in gres.c does not
copy the no_consume flag. The cr_ptr passed to _rm_job_from_nodes() is
created by _dup_cr(), which calls _node_state_dup(), so the duplicated
node state loses the flag and the GRES is then handled as if it were
consumable.
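
To illustrate the failure mode, here is a minimal sketch outside of Slurm.
The struct demo_node_state_t and the function demo_dup_missing_flag() are
hypothetical stand-ins, not Slurm code; calloc() is used here on the
assumption that it mimics the zero-filled memory returned by xmalloc().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for gres_node_state_t. */
typedef struct {
	bool     no_consume;
	uint64_t gres_cnt_avail;
	uint64_t gres_cnt_alloc;
} demo_node_state_t;

/*
 * Field-by-field duplicate that forgets no_consume, mirroring the
 * pattern described above. The caller owns the result and must free() it.
 */
static demo_node_state_t *demo_dup_missing_flag(const demo_node_state_t *src)
{
	demo_node_state_t *dst;

	assert(src != NULL);
	dst = calloc(1, sizeof(*dst));	/* zero-filled allocation */
	if (dst == NULL)
		return NULL;
	dst->gres_cnt_avail = src->gres_cnt_avail;
	dst->gres_cnt_alloc = src->gres_cnt_alloc;
	/* no_consume is never assigned, so it stays false in the copy */
	return dst;
}
```

Duplicating a state with no_consume set to true yields a copy in which
the flag is false, while the counters are copied as expected.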

Below is a simple patch to fix the problem. A more "future-proof"
alternative might be to memcpy() the whole struct from gres_ptr to
new_gres and then handle only the pointer members separately, so that
scalar fields added later are copied automatically.
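
A rough sketch of that memcpy() idea, again outside of Slurm: the struct
sketch_node_state_t, its bit_words/bit_alloc members and the function
sketch_node_state_dup() are assumptions made for illustration, and plain
malloc() stands in for xmalloc(). The real gres_node_state_t has more
members and uses Slurm's bitstring type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for gres_node_state_t with one owned pointer. */
typedef struct {
	bool      no_consume;
	uint64_t  gres_cnt_avail;
	uint64_t  gres_cnt_alloc;
	size_t    bit_words;	/* number of words in bit_alloc */
	uint64_t *bit_alloc;	/* owned buffer, must be deep-copied */
} sketch_node_state_t;

/* Copy all scalar fields in one go, then fix up pointer members. */
static sketch_node_state_t *sketch_node_state_dup(const sketch_node_state_t *src)
{
	sketch_node_state_t *dst;

	assert(src != NULL);
	dst = malloc(sizeof(*dst));
	if (dst == NULL)
		return NULL;
	memcpy(dst, src, sizeof(*dst));	/* includes no_consume */

	/* Never keep the shallow pointer copy: reset, then deep-copy. */
	dst->bit_alloc = NULL;
	if (src->bit_alloc != NULL && src->bit_words > 0) {
		size_t bytes = src->bit_words * sizeof(*dst->bit_alloc);

		dst->bit_alloc = malloc(bytes);
		if (dst->bit_alloc == NULL) {
			free(dst);
			return NULL;
		}
		memcpy(dst->bit_alloc, src->bit_alloc, bytes);
	}
	return dst;
}
```

The trade-off is that memcpy() also copies every pointer member
shallowly, so each pointer (including any added in the future) has to be
reset or deep-copied explicitly; otherwise the copy and the original
would share, and possibly double-free, the same buffer.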


Best regards,
Dorian


diff --git a/src/common/gres.c b/src/common/gres.c
index 765c2cb..4833b4f 100644
--- a/src/common/gres.c
+++ b/src/common/gres.c
@@ -2106,6 +2106,7 @@ static void *_node_state_dup(void *gres_data)
        new_gres = xmalloc(sizeof(gres_node_state_t));
        new_gres->gres_cnt_found  = gres_ptr->gres_cnt_found;
        new_gres->gres_cnt_config = gres_ptr->gres_cnt_config;
+       new_gres->no_consume      = gres_ptr->no_consume;
        new_gres->gres_cnt_avail  = gres_ptr->gres_cnt_avail;
        new_gres->gres_cnt_alloc  = gres_ptr->gres_cnt_alloc;
        if (gres_ptr->gres_bit_alloc)




