Sounds great, thanks for the update :) My team is now preparing a trace extracted from 5 years of usage of our cluster (2000 cores, 15M jobs), so as soon as the simulator (old or new) is working, it will have some real data to simulate. We will of course release the trace and post the download link here.
Best regards,
Manuel

On Wednesday, August 5, 2015, Gonzalo Rodrigo Alvarez <[email protected]> wrote:
> Hi Manuel,
>
> Thank you for the info and the link. I will try that version and compare.
>
> So far I have had the same impression of the code, but I have been able to
> make progress. Right now I don't have deadlocks. To get there I did the
> following:
> 1) Reworked the way thread_sem and thread_sem_back communicate between
> sim_mgr and the sleep wrapper.
> 2) Updated rpc_threads.pl (BTW, it does not get installed): the script that
> identifies the function calls associated with threads that the simulator
> should ignore. With the correct list I could remove the sleeps that I had
> added in some scripts. It covers:
> - from binary slurmctld: _slurmctld_rpc_mgr, _slurmctld_signal_hand,
> _agent, agent, _wdog, _thread_per_group_rpc
> - from binary accounting_storage_slurmdbd.so: _set_db_inx_thread, _agent
>
> The simulator seems to work without blocking. However, there are some
> issues with the way the actual duration of jobs is communicated to
> slurmd:
> - First, sim_mgr sends a REQUEST_SIM_JOB RPC to slurmd with a job-id and a
> job duration.
> - Then it uses sbatch to submit the job.
> The problem is that the first job-id is calculated by taking sbatch
> failures into account, and there are cases in which sbatch reports failure
> but the job gets submitted anyway. In the end this results in slurmd having
> incoherent information associated with the same job-id (or no information
> at all). I will keep posting updates in case someone else is trying to
> bring the simulator back to life.
>
> Thanks!
>
> //Gonzalo
>
> On Wed, Aug 5, 2015 at 7:25 AM, Manuel Rodríguez Pascual <[email protected]>
> wrote:
>
>> Hi Gonzalo,
>>
>> I am not an expert, so please take the rest of the mail as an opinion and
>> not completely reliable information.
>>
>> As far as I know, the Slurm simulator was developed by a single person and
>> was abandoned at some point. It is no longer supported. It was developed
>> against an old branch of Slurm, and since Slurm has been heavily modified
>> since then, it is no longer usable.
>>
>> Marina Zapater has collected some information about it and put it on her
>> GitHub:
>>
>> https://github.com/marinazapater/slurm-sim
>>
>> There you can download the simulator, input data and the correct Slurm
>> version to make it work. It is designed to be executed on Ubuntu
>> 12.something or 14 (LTS), and I don't know whether it runs on any
>> other distro. I think this is currently the best starting point to get a
>> running simulator.
>>
>> The code is, however, not perfect. It still has some scalability
>> issues (memory leaks? concurrency?) that make it fail when executing
>> simulations that are large in terms of nodes or tasks. There is also no
>> documentation at all, besides a high-level description.
>>
>> I am just starting to get familiar with the simulator, so I cannot give
>> you any more in-depth information. Also, I would welcome any more
>> information, current versions or whatever you can find about this, so
>> please feel free to send any update to this list (or to me).
>>
>> Best regards,
>>
>> 2015-08-03 22:26 GMT+02:00 Gonzalo Rodrigo Alvarez <[email protected]>:
>>
>>> Good afternoon,
>>>
>>> I have set up the simulator branch from the SchedMD fork and I have
>>> been experiencing some issues with using the simulator. Some I could solve
>>> myself, but I am having trouble with others. If anybody has had similar
>>> experiences, I think it would be a good thing for all to share.
>>> Let's start first with the one I have not been able to solve:
>>>
>>> When a group of jobs run longer than their runtime, slurmctld sends the
>>> corresponding "REQUEST_KILL_TIMELIMIT" RPC, which triggers the creation of
>>> a number of threads. The first one arrives at slurmd. But it tries to
>>> create more than the available proto-threads, and for some reason this
>>> leads slurmctld to block. Any hint on this problem? Maybe I should run it
>>> on a bigger VM? Right now I am using a single-core VM.
>>>
>>> Now a list of things that I observed and could solve:
>>> - I observed that unless I added a "sleep(1)" in some threads
>>> derived from slurmctld, newly created threads would make
>>> "_checking_for_new_threads" go into an infinite loop. In particular: the
>>> agent thread, the backfill agent, the _slurmctld_rpc_mgr loop, the
>>> _slurmctld_background loop, and in general all the agent loops of the
>>> plugins used. (I know it is not a clean solution, but it worked.)
>>> - In sim_lib: I observed that the return type of get_new_thread_id was
>>> changed from int to uint. That broke the code in pthread_create that
>>> detects the case in which there are no available threads (it returns -1).
>>> - I had to rewrite the way the sleep wrapper and the _time_mgr were
>>> communicating with thread_sem and thread_sem_back. As it was, it kept
>>> blocking all the time.
>>>
>>> Encountering these problems made me wonder whether I am working with the
>>> correct branch (schedmd/simulator), or whether the evolution of Slurm is
>>> making the simulator code rot.
>>> When the solutions are more stable and clean, I will prepare a patch and
>>> post it here.
>>>
>>> Also a note to someone who was asking about the test.traces file (I am
>>> answering here because the Google Groups interface would not allow me to
>>> reply to that message): you can find it, together with a synthetic user
>>> list, in the original tar file distributed by the BSC. Just put it in the
>>> sbin dir:
>>> http://www.bsc.es/marenostrum-support-services/services/slurm-simulator
>>>
>>> Thanks in advance!
>>>
>>> //Gonzalo
>>
>> --
>> Dr. Manuel Rodríguez-Pascual
>> skype: manuel.rodriguez.pascual
>> phone: (+34) 913466173 // (+34) 679925108
>>
>> CIEMAT-Moncloa
>> Edificio 22, desp. 1.25
>> Avenida Complutense, 40
>> 28040- MADRID
>> SPAIN
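
On the get_new_thread_id point in the quoted message, here is a minimal sketch of why moving a return type from int to an unsigned type can silently defeat a "-1 means no free thread" check. It is a Python illustration that uses ctypes fixed-width integers to mimic C storage, not the sim_lib code. The names find_free_slot, allocate_slot_signed, allocate_slot_unsigned and caller_sees_exhaustion are hypothetical, and it assumes the caller tests for a negative value, since the thread does not show the actual check.

```python
import ctypes

NO_SLOT = -1  # sentinel meaning "no free thread slot", as described in the thread


def find_free_slot(in_use):
    """Return the index of the first free slot, or NO_SLOT if all are taken."""
    for index, busy in enumerate(in_use):
        if not busy:
            return index
    return NO_SLOT


def allocate_slot_signed(in_use):
    """Hypothetical allocator whose result is stored in a C int."""
    return ctypes.c_int(find_free_slot(in_use)).value


def allocate_slot_unsigned(in_use):
    """Hypothetical allocator whose result is stored in a C unsigned int."""
    # ctypes does not range-check, so -1 wraps to the largest unsigned value.
    return ctypes.c_uint(find_free_slot(in_use)).value


def caller_sees_exhaustion(slot):
    """Assumed caller-side check, written for the signed version."""
    return slot < 0


all_busy = [True, True, True]
signed_result = allocate_slot_signed(all_busy)
unsigned_result = allocate_slot_unsigned(all_busy)

print("signed:  ", signed_result, caller_sees_exhaustion(signed_result))
print("unsigned:", unsigned_result, caller_sees_exhaustion(unsigned_result))
```

With the unsigned variant the exhausted case comes back as a very large positive number, so a "less than zero" test never fires and the caller goes on to use an invalid slot. In C, a direct comparison against -1 on an unsigned int would still match after the usual conversions, so whether the real code breaks depends on how the check and the receiving variable are written.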

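
The job-id mismatch described in the quoted message (a duration is announced for a predicted job-id, then sbatch is called) can be sketched with a toy model. This is only an assumption about how the drift arises, not the sim_mgr logic: the function names replay and mismatches, the sequential id assignment, and the rule "advance the prediction only when the submit command reports success" are all made up for illustration.

```python
def replay(jobs, first_job_id=1):
    """Replay submissions given as (duration, reported_ok, actually_submitted).

    Returns (slurmd_durations, real_jobs), both keyed by job-id.
    """
    next_real_id = first_job_id   # id the controller will assign next
    predicted_id = first_job_id   # id the simulator manager expects next
    slurmd_durations = {}         # job-id -> duration announced before submission
    real_jobs = {}                # job-id -> duration the job should really have

    for duration, reported_ok, actually_submitted in jobs:
        # Step 1: announce (job-id, duration) ahead of the submission.
        slurmd_durations[predicted_id] = duration

        # Step 2: submit. An id is consumed whenever the job is really accepted.
        if actually_submitted:
            real_jobs[next_real_id] = duration
            next_real_id += 1

        # Assumed rule: the prediction advances only on a reported success.
        if reported_ok:
            predicted_id += 1

    return slurmd_durations, real_jobs


def mismatches(slurmd_durations, real_jobs):
    """Map job-id -> (real duration, duration known to slurmd or None)."""
    return {
        job_id: (real, slurmd_durations.get(job_id))
        for job_id, real in real_jobs.items()
        if slurmd_durations.get(job_id) != real
    }


# The second submission reports failure although the job was accepted.
jobs = [(100, True, True), (200, False, True), (300, True, True)]
announced, real = replay(jobs)
print(mismatches(announced, real))
```

In this model a single "failed but submitted" job leaves one job-id with the wrong duration and a later one with none, which is the same pair of symptoms reported above. Reading the assigned job-id back from the submission and announcing the duration afterwards would avoid the prediction, but whether the simulator can be reordered that way is not something the thread establishes.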