Sounds great, thanks for the update :)

My team is now preparing a trace extracted from 5 years of usage of our
cluster (2000 cores, 15M jobs), so as soon as the simulator (old or new) is
working, it will have some real data to simulate. We will of course
release it and post the download link here.

Best regards,


Manuel

On Wednesday, August 5, 2015, Gonzalo Rodrigo Alvarez <
[email protected]> wrote:

> Hi Manuel,
>
> Thank you for the info and the link. I will try that version and compare.
>
> So far, I have had the same impression of the code, but I have been able to
> make progress. Right now I don't have deadlocks. To get there I did the
> following:
> 1) Reworked the way thread_sem and thread_sem_back communicate between
> sim_mgr and the sleep wrapper.
> 2) Updated rpc_threads.pl (BTW, it does not get installed): the script that
> identifies the function calls associated with threads that the
> simulator should ignore. With the correct list I could remove the sleeps
> that I added in some scripts. It covers:
> - from binary slurmctld: _slurmctld_rpc_mgr, _slurmctld_signal_hand,
> _agent, agent, _wdog, _thread_per_group_rpc
> - from binary accounting_storage_slurmdbd.so: _set_db_inx_thread, _agent
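
For readers unfamiliar with the pattern in item 1, here is a minimal sketch of a two-semaphore handshake between a manager and a sleeping thread. Only the names thread_sem and thread_sem_back come from the message above. The direction of each signal, the sim_time counter, and the helpers sim_sleep_wrapper, worker and run_handshake are assumptions made for illustration, not the simulator's actual code.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t thread_sem;       /* assumed: worker -> manager, "I am sleeping" */
static sem_t thread_sem_back;  /* assumed: manager -> worker, "time advanced" */
static unsigned int sim_time;  /* hypothetical simulated clock */

/* Stand-in for the sleep wrapper: hand control over, then block until released. */
static void sim_sleep_wrapper(void)
{
    sem_post(&thread_sem);
    sem_wait(&thread_sem_back);
}

/* Hypothetical worker thread that "sleeps" a fixed number of times. */
static void *worker(void *arg)
{
    int rounds = *(int *)arg;
    for (int i = 0; i < rounds; i++)
        sim_sleep_wrapper();
    return NULL;
}

/* Manager side: one clock tick per sleep request, then release the worker. */
static unsigned int run_handshake(int rounds)
{
    pthread_t tid;
    int rc;

    sim_time = 0;
    rc = sem_init(&thread_sem, 0, 0);
    assert(rc == 0);
    rc = sem_init(&thread_sem_back, 0, 0);
    assert(rc == 0);
    rc = pthread_create(&tid, NULL, worker, &rounds);
    assert(rc == 0);

    for (int i = 0; i < rounds; i++) {
        sem_wait(&thread_sem);       /* wait until the worker is parked */
        sim_time++;                  /* advance simulated time */
        sem_post(&thread_sem_back);  /* let the worker continue */
    }

    pthread_join(tid, NULL);
    sem_destroy(&thread_sem);
    sem_destroy(&thread_sem_back);
    return sim_time;
}
```

Because every post is matched by exactly one wait on the other side, neither thread can run ahead of the other. The sketch ignores EINTR from sem_wait and handles a single worker only.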
>
> The simulator seems to work without blocking. However, there are some
> issues with the way the actual duration of jobs is communicated to
> slurmd:
> - First, sim_mgr sends a REQUEST_SIM_JOB RPC to slurmd with a job-id and a
> job duration.
> - Then it uses sbatch to submit the job.
> The problem is that the job-id sent in the first step is calculated by
> taking sbatch failures into account, and there are cases in which sbatch
> reports failure but the job gets submitted. In the end this results in
> slurmd having incoherent information associated with the same job-id (or
> no information at all). I will keep posting updates in case someone else
> is trying to bring the simulator to life.
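
To make the mismatch described above concrete, here is a small model of how a predicted job-id can drift from the one slurmctld really assigns. It does not use any Slurm API. The struct sim_ids, the function submit_one and both counters are hypothetical and only mirror the two steps listed in the message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping: not taken from the simulator sources. */
struct sim_ids {
    unsigned int predicted_next;  /* job-id sim_mgr expects to come next */
    unsigned int actual_next;     /* job-id slurmctld will really assign */
};

/*
 * Model one submission. The duration is announced under the predicted id
 * (the REQUEST_SIM_JOB step), then sbatch runs. Returns 1 if the job was
 * accepted and its duration was announced under the id it really received.
 */
static int submit_one(struct sim_ids *s, int sbatch_reported_ok,
                      int job_really_accepted)
{
    unsigned int announced;
    unsigned int assigned;

    assert(s != NULL);
    announced = s->predicted_next;
    assigned = s->actual_next;

    if (sbatch_reported_ok)
        s->predicted_next++;   /* sim_mgr only counts reported successes */
    if (job_really_accepted)
        s->actual_next++;      /* the controller counts accepted jobs */

    return job_really_accepted && announced == assigned;
}
```

In this model, one submission that sbatch reports as failed but the controller accepts leaves predicted_next one behind. The next duration is then announced under an id that already has one, and the newly assigned id gets none, which matches both symptoms described above.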
>
>
> Thanks!
>
> //Gonzalo
>
>
>
>
>
>
>
>
> On Wed, Aug 5, 2015 at 7:25 AM, Manuel Rodríguez Pascual <
> [email protected]> wrote:
>
>> Hi Gonzalo,
>>
>> I am not an expert, so please take the rest of the mail as an opinion and
>> not completely reliable information.
>>
>> As far as I know, the Slurm simulator was developed by a single person, and
>> it was abandoned at some point. It is no longer supported. It was
>> developed for an old branch of Slurm. As Slurm has been deeply modified
>> since then, the simulator is no longer usable.
>>
>> Marina Zapater has collected some information about it and put it on her
>> GitHub.
>>
>> https://github.com/marinazapater/slurm-sim
>>
>> There you can download the simulator, input data and the correct Slurm
>> version to make it work. It is designed to be executed on Ubuntu
>> 12.something or 14 (LTS), and I don't know whether it runs on any
>> other distro. I think this is currently the best starting point to get a
>> running simulator.
>>
>> The code is however not perfect. It still presents some scalability
>> issues (memory leaks? concurrency?) that make it fail when executing large
>> simulations in terms of nodes or tasks. There is also no documentation at
>> all, besides a high level description.
>>
>> I am just starting to get familiar with the simulator, so I cannot give
>> you any more in-depth information. Also, I would welcome any further
>> information, current versions or whatever you can find about this, so
>> please feel free to send any update to this list (or to me).
>>
>>
>> Best regards,
>>
>> 2015-08-03 22:26 GMT+02:00 Gonzalo Rodrigo Alvarez <
>> [email protected]>:
>>
>>>
>>>
>>> Good afternoon,
>>>
>>> I have set up the simulator branch from the SchedMD fork and I have
>>> been experiencing some issues using the simulator. Some I could solve
>>> myself, but I am having trouble with others. If anybody has had similar
>>> experiences, I think it would be good for everyone to share them. Let's
>>> start with the one I have not been able to solve:
>>>
>>> When a group of jobs runs longer than their runtime, slurmctld sends the
>>> corresponding "REQUEST_KILL_TIMELIMIT" RPC, which triggers the creation of
>>> a number of threads. The first one arrives at slurmd. But it tries to
>>> create more than the available proto-threads, and for some reason this
>>> causes slurmctld to block. Any hint on this problem? Maybe I should run it
>>> on a bigger VM? Right now I am using a single-core VM.
>>>
>>> Now a list of things that I observed and could solve:
>>> - I observed that unless I added a "sleep(1)" in some threads
>>> derived from slurmctld, newly created threads would make
>>> "_checking_for_new_threads" go into an infinite loop. In particular: agent
>>> thread, backfill agent, _slurmctld_rpc_mgr loop, _slurmctld_background loop,
>>> and in general all the agent loops of the plugins used. (I know it is not a
>>> clean solution, but it worked.)
>>> - In sim_lib: I observed that the get_new_thread_id return type was changed
>>> from int to uint. That broke the code in pthread_create that detects the
>>> case in which there are no available threads (it returns -1).
>>> - I had to rewrite the way the sleep wrapper and the _time_mgr were
>>> communicating with thread_sem and thread_sem_back. As it was, it kept
>>> blocking all the time.
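
The int to uint problem in the second item is easy to reproduce in isolation. The sketch below assumes the caller tested for a negative return value. The actual check in the simulator's pthread_create is not shown in this thread, so the pool size, the SIM_NO_THREAD sentinel and both check helpers are hypothetical.

```c
#include <assert.h>
#include <limits.h>

#define SIM_MAX_THREADS 2         /* hypothetical pool size */
#define SIM_NO_THREAD   UINT_MAX  /* hypothetical explicit sentinel */

static unsigned int sim_threads_used;

/* Unsigned variant: the "-1" failure value silently becomes UINT_MAX. */
static unsigned int get_new_thread_id(void)
{
    assert(sim_threads_used <= SIM_MAX_THREADS);
    if (sim_threads_used >= SIM_MAX_THREADS)
        return -1;                /* converted to UINT_MAX on return */
    return sim_threads_used++;
}

/*
 * Assumed old-style check, written for an int return ("negative means no
 * thread available"). The id is widened to a signed type to keep the
 * comparison meaningful to the compiler; UINT_MAX stays positive, so this
 * never reports exhaustion.
 */
static int pool_exhausted_old_check(unsigned int id)
{
    long long wide = id;
    return wide < 0;
}

/* One possible fix: compare against an explicit unsigned sentinel. */
static int pool_exhausted_fixed_check(unsigned int id)
{
    return id == SIM_NO_THREAD;
}
```

Restoring the int return type would work just as well. The point is only that a "less than zero" test cannot succeed once the value is unsigned.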
>>>
>>> Encountering these problems made me wonder whether I am working with the
>>> correct branch (schedmd/simulator), or whether the evolution of Slurm is
>>> making the simulator code rot. When the solutions are more stable and
>>> clean I will make a patch and post it here.
>>>
>>> Also, a note to someone who was asking about the test.traces file (I am
>>> answering here because the Google Groups interface would not allow me to
>>> reply to it): you can find it, together with a synthetic user list, in the
>>> original tar file distributed by the BSC. Just put it in the sbin dir:
>>> http://www.bsc.es/marenostrum-support-services/services/slurm-simulator
>>>
>>> Thanks in advance!
>>>
>>> //Gonzalo
>>>
>>>
>>>
>>
>>
>> --
>> Dr. Manuel Rodríguez-Pascual
>> skype: manuel.rodriguez.pascual
>> phone: (+34) 913466173 // (+34) 679925108
>>
>> CIEMAT-Moncloa
>> Edificio 22, desp. 1.25
>> Avenida Complutense, 40
>> 28040- MADRID
>> SPAIN
>>
>
>

-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
