Good afternoon,

I have set up the simulator branch from the SchedMD fork and I have been
running into some issues with the simulator. Some I could solve myself,
but I am still having trouble with others. If anybody has had similar
experiences, I think it would be good for all of us to share them. Let's
start with the one I have not been able to solve:

When a group of jobs runs longer than its time limit, slurmctld sends the
corresponding "REQUEST_KILL_TIMELIMIT" RPC, which triggers the creation of
a number of threads. The first one arrives at slurmd, but then it tries to
create more threads than the available proto-threads, and for some reason
this causes slurmctld to block. Any hints on this problem? Maybe I should
run it on a bigger VM? Right now I am using a single-core VM.

Now a list of the problems I observed and could solve:
- I observed that unless I added a "sleep(1)" to some of the threads
derived from slurmctld, newly created threads would make
"_checking_for_new_threads" go into an infinite loop. In particular: the
agent thread, the backfill agent, the _slurmctld_rpc_mgr loop, the
_slurmctld_background loop, and in general all the agent loops of the
plugins used. (I know it is not a clean solution, but it worked.)
- In sim_lib: I observed that the return type of get_new_thread_id was
changed from int to uint. That broke the code in pthread_create that
detects the case in which there are no available threads (it returns -1).
- I had to rewrite the way the sleep wrapper and the _time_mgr
communicate through thread_sem and thread_sem_back. As it was, it kept
blocking all the time.
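
To illustrate the get_new_thread_id item above, here is a minimal sketch in
C of why an unsigned return type can hide the -1 sentinel. This is not the
sim_lib code: MAX_SIM_THREADS, NO_THREAD_AVAILABLE, slot_in_use,
get_new_thread_id_unsigned, release_thread_id and pool_is_exhausted are
made-up names, and the slot allocation logic is only an assumption about
how such a pool could work.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_SIM_THREADS 4          /* hypothetical pool size */
#define NO_THREAD_AVAILABLE (-1)   /* hypothetical name for the -1 sentinel */

/* Hypothetical bookkeeping: one flag per proto-thread slot. */
static bool slot_in_use[MAX_SIM_THREADS];

/* Signed version: returns a free slot index, or -1 when the pool is full. */
static int get_new_thread_id(void)
{
    for (int i = 0; i < MAX_SIM_THREADS; i++) {
        if (!slot_in_use[i]) {
            slot_in_use[i] = true;
            return i;
        }
    }
    return NO_THREAD_AVAILABLE;
}

/*
 * Unsigned version: same logic, but the -1 sentinel is converted to
 * UINT_MAX on return. A caller that tests "id < 0" on this value never
 * sees the failure, because an unsigned value is never negative.
 */
static unsigned int get_new_thread_id_unsigned(void)
{
    return (unsigned int)get_new_thread_id();
}

/* Give a slot back to the pool. */
static void release_thread_id(int id)
{
    assert(id >= 0 && id < MAX_SIM_THREADS);
    slot_in_use[id] = false;
}

/* Caller-side check that works as long as the id stays signed. */
static bool pool_is_exhausted(int id)
{
    return id < 0;
}
```

With the signed return type, a pthread_create wrapper can call
pool_is_exhausted(get_new_thread_id()) and bail out cleanly. With the
unsigned one, the same situation yields UINT_MAX, which looks like a very
large but valid id unless the caller compares against that value
explicitly.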

Encountering these problems made me wonder whether I am working with the
correct branch (schedmd/simulator), or whether the evolution of Slurm is
making the simulator code rot. When the solutions are more stable and
clean I will prepare a patch and post it here.

Also, a note to someone who was asking about the test.traces file (I am
answering here because the Google Groups interface would not allow me to
reply to that message): you can find it, together with a synthetic user
list, in the original tar file distributed by the BSC. Just put it in the
sbin dir:
http://www.bsc.es/marenostrum-support-services/services/slurm-simulator

Thanks in advance!

//Gonzalo
