Good afternoon,

I have set up the simulator branch from the SchedMD fork and have been running into some issues with the simulator. Some I could solve myself, but I am still having trouble with others. If anybody has had similar experiences, I think it would be good for everyone to share them. Let's start with the one I have not been able to solve:
When a group of jobs runs longer than its time limit, slurmctld sends the corresponding "REQUEST_KILL_TIMELIMIT" RPC, which triggers the creation of a number of threads. The first one reaches slurmd. However, it then tries to create more threads than there are proto-threads available, and for some reason this causes slurmctld to block. Any hint on this problem? Maybe I should run it on a bigger VM? Right now I am using a single-core VM.

Now a list of things I observed and could solve:

- Unless I added a "sleep(1)" in some of the threads derived from slurmctld, newly created threads would make "_checking_for_new_threads" go into an infinite loop. In particular: the agent thread, the backfill agent, the _slurmctld_rpc_mgr loop, the _slurmctld_background loop, and in general all the agent loops of the plugins used. (I know it is not a clean solution, but it worked.)

- In sim_lib: the return type of get_new_thread_id was changed from int to uint. That broke the code in pthread_create that detects the case in which there are no available threads (it returns -1).

- I had to rewrite the way the sleep wrapper and the _time_mgr communicate through thread_sem and thread_sem_back. As it was, it kept blocking all the time.

Encountering these problems made me wonder whether I am working with the correct branch (schedmd/simulator), or whether the evolution of Slurm is making the simulator code rot.

When the solutions are more stable and clean I will make a patch and post it here.

Also, a note to someone who was asking about the test.traces file (I am answering here because the Google Groups interface would not let me reply to that message): you can find it, together with a synthetic user list, in the original tar file distributed by the BSC. Just put it in the sbin dir:
http://www.bsc.es/marenostrum-support-services/services/slurm-simulator

Thanks in advance!

//Gonzalo
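
Sketch of the get_new_thread_id point (illustrative only, not the actual sim_lib code): the pool size, the slot array and every function name below are hypothetical. It only shows the general C behaviour that a -1 sentinel returned through an unsigned type becomes UINT_MAX, so a caller that tests for a negative value can no longer see the "no free thread" case, and a bounds check that is independent of signedness may be a safer way to detect it.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

#define MAX_SIM_THREADS 4   /* hypothetical pool size, not the simulator's value */

static int slot_used[MAX_SIM_THREADS];

/* Mark every slot as free again. */
void reset_pool(void)
{
    memset(slot_used, 0, sizeof(slot_used));
}

/* Signed variant: index of a free slot, or -1 when the pool is exhausted. */
int new_thread_id_signed(void)
{
    for (int i = 0; i < MAX_SIM_THREADS; i++) {
        if (!slot_used[i]) {
            slot_used[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Unsigned variant: same logic, but -1 is converted to UINT_MAX on return. */
unsigned int new_thread_id_unsigned(void)
{
    return (unsigned int)new_thread_id_signed();
}

/* Check that works for either return type: valid ids are below the pool size. */
int id_is_valid(unsigned int id)
{
    return id < (unsigned int)MAX_SIM_THREADS;
}

/* Exhaust the pool, then show what each variant reports. */
void demo(void)
{
    reset_pool();
    for (int i = 0; i < MAX_SIM_THREADS; i++)
        new_thread_id_signed();

    int s = new_thread_id_signed();
    unsigned int u = new_thread_id_unsigned();

    assert(s == -1);        /* a "< 0" test catches this */
    assert(u == UINT_MAX);  /* a "< 0" test can never be true here */

    printf("signed:   %d, negative: %s\n", s, s < 0 ? "yes" : "no");
    printf("unsigned: %u, valid: %s\n", u, id_is_valid(u) ? "yes" : "no");
}
```

Calling demo() from a main function prints both results. Either restoring the int return type or switching the caller to a check like id_is_valid should bring back the detection of an exhausted pool, assuming the caller tested for a negative id in the first place.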
