Yes. We have a QoS set up for each PI account, limited to 80K CPU-hours per
month. But I have experimented with --time, and I never get this error with
fewer than 7 nodes no matter how much time I specify, and with 7 or more nodes
I never fail to get it, even when the requested time is very low (seconds).
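
To illustrate, the pattern from my tests looks roughly like this (node counts
and times are just examples; the partition is the user's default):

    $ srun -N6 --time=14-00:00:00 hostname    # 6 nodes, two weeks: works
    $ srun -N7 --time=00:00:10 hostname       # 7 nodes, 10 seconds: fails
    hostlist.c:1007: hostrange shift: malloc failed
    Aborted (core dumped)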

Also, FWIW, our slurmdbd setup hasn't changed.
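
For anyone who wants to double-check the accounting side, the limits are
visible with sacctmgr; the 80K cpu-hour/month cap would appear as cpu=4800000
under a field like GrpTRESMins (80,000 hours x 60, since TRES limits are
stored in minutes). For example (the user name is a placeholder):

    $ sacctmgr show qos format=Name,GrpTRESMins
    $ sacctmgr show assoc user=<someuser> format=Account,User,QOS

Next step on my end is to raise the core limit (ulimit -c unlimited) and pull
a backtrace from the srun core with gdb, to see exactly where hostlist.c is
dying.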

------ Original message ------
From: Uwe Sauter
Date: Mon, Dec 19, 2016 09:57
To: slurm-dev
Subject: [slurm-dev] Re: Strange hostlist/malloc error


Do you have limits (per partition/group), QoS (with per-user limits), etc.
configured?



On 19.12.2016 at 15:52, Wiegand, Paul wrote:
> Greetings,
>
>
> We were running Slurm 16.05.0 and just upgraded to 16.05.7 during our Fall
> maintenance cycle, along with other changes.
> Now we are having a very strange problem:
>
>
> * When a regular user requests 6 or fewer nodes, the request succeeds
> without issue;
>
>
> * When a regular user requests 7 or more nodes, they get the following error:
>
>
> hostlist.c:1007: hostrange shift: malloc failed
> Aborted (core dumped)
>
>
> * It doesn't matter which nodes are being requested ... I've verified that I
> *can* allocate across any node ... just not more than 6;
>
>
> * The root user can request any number of nodes.
>
>
>
> I should say that I do *not* believe this is due to the upgrade per se.
> There were other changes made during downtime, including to how groups are
> named and handled. I suspect a permissions problem somewhere, but the logs
> are not helpful. Indeed, I rolled back to 16.05.0 and still get this error.
>
>
> Any pointers on where to look would be appreciated.
>
>
> Thanks,
>
> Paul.
>
>
