Hi Gonzalo,

I am not an expert, so please take the rest of this mail as an opinion
and not as completely reliable information.
As far as I know, the Slurm simulator was developed by a single person
and was abandoned at some point, so it is no longer supported. It was
developed for an old branch of Slurm. Since Slurm has been deeply
modified since then, the simulator does not work with current versions.

Marina Zapater has collected some information about it and put it on
her GitHub:

https://github.com/marinazapater/slurm-sim

There you can download the simulator, input data and the correct Slurm
version to make it work. It is designed to be executed on Ubuntu
12.something or 14 (LTS), and I don't know whether it runs on any other
distro. I think this is currently the best starting point to get a
running simulator.

The code is, however, not perfect. It still has some scalability issues
(memory leaks? concurrency?) that make it fail when executing
simulations that are large in terms of nodes or tasks. There is also no
documentation at all, besides a high-level description.

I am just starting to get familiar with the simulator, so I cannot give
you any more in-depth information. I would also welcome any further
information, current versions or whatever you can find about this, so
please feel free to send any update to this list (or to me).

Best regards,

2015-08-03 22:26 GMT+02:00 Gonzalo Rodrigo Alvarez <[email protected]>:

> Good afternoon,
>
> I have set up the simulator branch from the SchedMD fork and I have
> been experiencing some issues with using the simulator. Some I could
> solve myself, but I am still having trouble with others. If anybody
> has had similar experiences, I think it would be a good thing for all
> to share. Let's start with the one I have not been able to solve:
>
> When a group of jobs run longer than their runtime, slurmctld sends
> the corresponding "REQUEST_KILL_TIMELIMIT" RPC, which triggers the
> creation of a number of threads. The first one arrives at slurmd, but
> it tries to create more threads than the available proto-threads, and
> for some reason this causes slurmctld to block.
>
> Any hint on this problem? Maybe I should run it on a bigger VM? Right
> now I am using a single-core VM.
>
> Now a list of things that I observed and could solve:
>
> - I observed that unless I added a "sleep(1)" in some threads derived
>   from slurmctld, newly created threads would make
>   "_checking_for_new_threads" go into an infinite loop. In particular:
>   agent thread, backfill agent, _slurmctld_rpc_mgr loop,
>   _slurmctld_background loop, and in general all the agent loops of
>   the plugins used. (I know it is not a clean solution, but it
>   worked.)
> - In sim_lib: I observed that the return type of get_new_thread_id was
>   changed from int to uint. That broke the code in pthread_create that
>   detects the case in which there are no available threads (it
>   returns -1).
> - I had to rewrite the way the sleep wrapper and the _time_mgr were
>   communicating through thread_sem and thread_sem_back. As it was, it
>   kept blocking all the time.
>
> Encountering these problems made me wonder whether I am working with
> the correct branch (schedmd/simulator), or whether the evolution of
> Slurm is making the simulator code rot. When the solutions are more
> stable and clean I will make a patch and post it here.
>
> Also a note to someone who was asking about the test.traces file (I am
> answering here because the Google Groups interface would not allow me
> to answer there): you can find it, together with a synthetic user
> list, in the original tar file distributed by the BSC. Just put it in
> the sbin dir:
> http://www.bsc.es/marenostrum-support-services/services/slurm-simulator
>
> Thanks in advance!
>
> //Gonzalo

-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
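P.S. On the get_new_thread_id point in the quoted mail: I have not checked how sim_lib actually performs the test, so the following is only a minimal sketch of how an int-to-unsigned change can silently disable a "returns -1" check. All names here (POOL_SIZE, alloc_id_unsigned, alloc_id_signed, exhausted_broken, exhausted_fixed) are hypothetical and are not taken from the simulator sources.

```c
#include <limits.h>
#include <stdbool.h>

#define POOL_SIZE 2   /* hypothetical number of available thread slots */

/* Hypothetical allocator with an unsigned return type: the -1 used as
 * the "no free slot" sentinel is converted to UINT_MAX on return. */
static unsigned int alloc_id_unsigned(int in_use)
{
    if (in_use >= POOL_SIZE)
        return -1;                 /* becomes UINT_MAX */
    return (unsigned int)in_use;
}

/* A caller that still tests for a negative value: an unsigned value is
 * never below zero, so exhaustion is never detected. */
static bool exhausted_broken(int in_use)
{
    unsigned int id = alloc_id_unsigned(in_use);
    return id < 0;                 /* always false */
}

/* Same allocator with a signed return type, so -1 survives intact. */
static int alloc_id_signed(int in_use)
{
    if (in_use >= POOL_SIZE)
        return -1;
    return in_use;
}

/* With the signed type the negative check works as intended. */
static bool exhausted_fixed(int in_use)
{
    return alloc_id_signed(in_use) < 0;
}
```

With the pool full, exhausted_broken(POOL_SIZE) reports false while exhausted_fixed(POOL_SIZE) reports true. Restoring the signed return type is one option; comparing against an explicit unsigned sentinel would be another.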
