Hi Manuel,

Thanks for the info and the link. I will try that version and compare.

So far I have the same feeling about the code, but I have been able to make progress. Right now I no longer get deadlocks. To get there I did the following:

1) Re-did the way thread_sem and thread_sem_back communicate between sim_mgr and the sleep wrapper.
2) Updated rpc_threads.pl (BTW, it does not get installed): the script that identifies the function calls associated with threads that the simulator should ignore. With the correct list I could remove the sleeps that I had added in some scripts. It covers:
   - from binary slurmctld: _slurmctld_rpc_mgr, _slurmctld_signal_hand, _agent, agent, _wdog, _thread_per_group_rpc
   - from binary accounting_storage_slurmdbd.so: _set_db_inx_thread, _agent

The simulator seems to work without blocking. However, there are some issues with the way the actual duration of jobs is communicated to slurmd:
- First, sim_mgr sends a REQUEST_SIM_JOB RPC to slurmd with a job-id and a job duration.
- Then it uses sbatch to submit the job.

The problem is that that first job-id is calculated by taking sbatch failures into account, and there are cases in which sbatch reports failure but the job gets submitted anyway. In the end, this results in slurmd having inconsistent information associated with the same job-id (or no information at all).

I will keep posting updates in case someone else is trying to bring the simulator to life.

Thanks!

//Gonzalo

On Wed, Aug 5, 2015 at 7:25 AM, Manuel Rodríguez Pascual <[email protected]> wrote:

> Hi Gonzalo,
>
> I am not an expert, so please take the rest of the mail as an opinion and
> not completely reliable information.
>
> As far as I know, Slurm simulator was developed by a single guy, and it
> was abandoned at some point. Now it is not supported anymore. It was
> developed for some old branch of Slurm. As Slurm has been deeply modified
> from then, it is now not usable.
>
> Marina Zapater has collected some information about it and put into her
> github.
>
> https://github.com/marinazapater/slurm-sim
>
> there you can download the simulator, input data and the correct Slurm
> version to make it work. It is designed to be executed on Ubuntu
> 12.something or 14 (TLS), and I also don't know whether it runs on any
> other distro. I think this is currently the best starting point to get a
> running simulator.
>
> The code is however not perfect. It still presents some scalability issues
> (memory leaks? concurrency?) that make it fail when executing large
> simulations in terms of nodes or tasks. There is also no documentation at
> all, besides a high level description.
>
> I am just starting to get familiar with the simulator, so I cannot give
> you any more in-depth information. Also, I would welcome any more
> information, current versions or whatever you can find about this, so
> please feel free to submit any update to this list (or myself).
>
>
> Best regards,
>
> 2015-08-03 22:26 GMT+02:00 Gonzalo Rodrigo Alvarez <
> [email protected]>:
>
>>
>>
>> Good afternoon,
>>
>> I have set up the simulator-branch from the SchedMd fork and I have been
>> experiencing some issues with using the simulator. Some I could solve
>> myself, but I am having trouble with them, if anybody had similar
>> experiences, I think it would be a good thing for all to share. Let's start
>> first with the one I have not been able to solve:
>>
>> When a group of jobs run longer than their runtime, slurmctld sends the
>> corresponding "REQUEST_KILL_TIMELIMIT" rpc, which triggers the creation of
>> a number of threads. The first one arrives to slurmd. But It tries to
>> create more than the available proto-threads and for some reason this leads
>> to slurmctld to block. Any hint on this problem? Maybe I should run it on a
>> bigger VM?
>> Now I am using a single core VM,
>>
>> Now a list of things that I observed I could solve:
>> - I observed that unless I would add a "sleep(1)" in some threads derived
>> from slurmctld, newly created threads would make
>> "_checking_for_new_threads" go on an infinite loop. In particular: agent
>> thread, backfil agent, _slurmctld_rpc_mgr loop, _slurmctld_background loop,
>> and in general all the agent loops of the plugins used. (I know it is not a
>> clean solution, but it worked).
>> - In sim_lib: I observed that get_new_thread_id return type was changed
>> form int to uint. that broke the code in pthread_create that detects the
>> case in which there are no available threads (it returns -1).
>> - I had to re-write the way the sleep wrapper and the _time_mgr were
>> communicating with thread_sem and thread_sem_back. As it is it kept
>> blocking all the time.
>>
>> Encountering these problems made me wonder if I am working with the
>> correct branch (schedmd/simulator). or that the evolution of slurm is
>> making the simulator code rot. When the solutions are more stable and
>> clean I will do a patch and poste it here
>>
>> Also a note to someone who was asking about the test.traces file (I am
>> answering here because the google groups interface would not allow me to
>> answer to it): you can find it, together with a synthetic user list at the
>> original tar file distributed by the BSC, just put it in the sbin dir:
>> http://www.bsc.es/marenostrum-support-services/services/slurm-simulator
>>
>> Thanks in advance!
>>
>> //Gonzalo
>>
>>
>>
>
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN
>
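
For anyone trying to follow the thread_sem / thread_sem_back discussion in this thread, here is a minimal sketch of the kind of two-semaphore handshake I have in mind between a time manager and a sleep wrapper. It is written in Python only to keep it short and runnable; the real code is C with POSIX semaphores. Only the names thread_sem, thread_sem_back and sim_mgr come from the simulator. Everything else (the state dictionary, sim_sleep, worker, the single worker thread) is hypothetical and much simpler than the real sim_lib, so take it as an illustration of the pattern and not as the actual implementation.

```python
import threading

# Names borrowed from the simulator; the semantics below are an assumption.
thread_sem = threading.Semaphore(0)       # manager -> sleeper: "your wake-up time has come"
thread_sem_back = threading.Semaphore(0)  # sleeper -> manager: "I am parked again (or done)"

# Hypothetical shared state. It is only touched by whichever side is
# currently running, because the other side is blocked on a semaphore.
state = {"now": 0, "wake_at": None, "done": False}
log = []

def sim_sleep(seconds):
    """Sleep-wrapper sketch: register a wake-up time and yield to the manager."""
    state["wake_at"] = state["now"] + seconds
    thread_sem_back.release()  # tell the manager this thread is parked
    thread_sem.acquire()       # block until the manager has advanced the clock

def worker():
    """Stand-in for a slurmctld thread that does some work, then sleeps."""
    for step in range(3):
        log.append((state["now"], f"worker step {step}"))
        sim_sleep(10)
    state["done"] = True
    thread_sem_back.release()  # final hand-back so the manager can exit

def sim_mgr():
    """Time-manager sketch: jump the simulated clock to the next wake-up."""
    t = threading.Thread(target=worker)
    t.start()
    thread_sem_back.acquire()            # wait until the worker parks once
    while not state["done"]:
        state["now"] = state["wake_at"]  # advance simulated time
        thread_sem.release()             # wake the sleeper
        thread_sem_back.acquire()        # wait until it parks again or finishes
    t.join()

sim_mgr()
```

The point of the second semaphore is that the manager never advances the clock while a woken thread is still running. Without that hand-back, the manager and the sleeper can both end up waiting on each other, which is the kind of blocking I was seeing.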

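
To make the job-id problem from the top of this mail more concrete, here is a small sketch of how the id that sim_mgr sends in REQUEST_SIM_JOB can drift away from the id that the job really gets. It rests on two assumptions that I have not verified in the code: that slurmctld hands out job-ids sequentially, and that sim_mgr only advances its own counter when sbatch reports success. The function name and the (reported_ok, actually_submitted) tuples are hypothetical and exist only for this illustration.

```python
def predicted_vs_real(first_id, outcomes):
    """Return (predicted_id, real_id) pairs for a sequence of submissions.

    outcomes is a list of (reported_ok, actually_submitted) booleans:
    what sbatch told the caller, and what really happened in slurmctld.
    real_id is None when no job was created.
    """
    next_predicted = first_id  # assumed sim_mgr-side counter
    next_real = first_id       # assumed slurmctld-side counter
    pairs = []
    for reported_ok, actually_submitted in outcomes:
        real = next_real if actually_submitted else None
        pairs.append((next_predicted, real))
        if reported_ok:
            next_predicted += 1  # sim_mgr believes an id was consumed
        if actually_submitted:
            next_real += 1       # an id really was consumed
    return pairs

# Second submission: sbatch reports failure, but the job was created anyway.
pairs = predicted_vs_real(100, [(True, True), (False, True), (True, True)])
print(pairs)
```

In this example the second and third RPCs both carry job-id 101, so slurmd would end up with two durations for job 101 and none for job 102, which matches the inconsistent (or missing) information described above.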