[slurm-dev] Re: Simulator sync issues

2015-08-05 Thread Manuel Rodríguez Pascual
Hi Gonzalo,

I am not an expert, so please take the rest of the mail as an opinion and
not completely reliable information.

As far as I know, the Slurm simulator was developed by a single person, and it
was abandoned at some point. It is not supported anymore. It was developed
for an old branch of Slurm; as Slurm has been deeply modified since then,
it is no longer usable.

Marina Zapater has collected some information about it and put it on her
GitHub:

https://github.com/marinazapater/slurm-sim

There you can download the simulator, input data and the correct Slurm
version to make it work. It is designed to run on Ubuntu
12.something or 14 (LTS), and I don't know whether it runs on any
other distro. I think this is currently the best starting point to get a
running simulator.

The code is however not perfect. It still presents some scalability issues
(memory leaks? concurrency?) that make it fail when executing large
simulations in terms of nodes or tasks. There is also no documentation at
all, besides a high-level description.

I am just starting to get familiar with the simulator, so I cannot give you
any more in-depth information. Also, I would welcome any further information,
current versions or whatever else you can find about this, so please feel free
to post any update to this list (or send it to me).


Best regards,

2015-08-03 22:26 GMT+02:00 Gonzalo Rodrigo Alvarez gprodrigoalva...@lbl.gov:



 Good afternoon,

 I have set up the simulator branch from the SchedMD fork and I have been
 experiencing some issues with the simulator. Some I could solve
 myself, but I am still having trouble with others. If anybody has had similar
 experiences, I think it would be good for everyone to share them. Let's start
 with the one I have not been able to solve:

 When a group of jobs run longer than their time limit, slurmctld sends the
 corresponding REQUEST_KILL_TIMELIMIT RPC, which triggers the creation of
 a number of threads. The first one reaches slurmd, but it then tries to
 create more threads than the available proto-threads, and for some reason this
 causes slurmctld to block. Any hints on this problem? Maybe I should run it on a
 bigger VM? Right now I am using a single-core VM.

 Now a list of things that I observed and could solve:
 - Unless I added a sleep(1) in some threads derived
 from slurmctld, newly created threads would make
 _checking_for_new_threads go into an infinite loop. In particular: the agent
 thread, the backfill agent, the _slurmctld_rpc_mgr loop, the
 _slurmctld_background loop, and in general all the agent loops of the
 plugins used. (I know it is not a clean solution, but it worked.)
 - In sim_lib: the return type of get_new_thread_id was changed
 from int to uint. That broke the code in pthread_create that detects the
 case in which there are no available threads (it returns -1).
 - I had to rewrite the way the sleep wrapper and the _time_mgr
 communicate through thread_sem and thread_sem_back. As it was, it kept
 blocking all the time.
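
The int-to-uint problem in the second item can be pictured with a small model. The real code is C inside sim_lib; the Python sketch below only imitates C's integer conversion with ctypes. The function names other than get_new_thread_id are hypothetical, and it assumes the caller tested for a negative value, which the message does not actually show.

```python
import ctypes

NO_THREAD = -1  # sentinel assumed to mean "no free proto-thread slot"


def get_new_thread_id_signed(free_slots):
    """Model of the original behaviour: value returned as a C int."""
    value = free_slots[0] if free_slots else NO_THREAD
    return ctypes.c_int32(value).value


def get_new_thread_id_unsigned(free_slots):
    """Model of the changed behaviour: same logic, returned as a C unsigned int."""
    value = free_slots[0] if free_slots else NO_THREAD
    # ctypes does no overflow checking, so -1 wraps around like it does in C.
    return ctypes.c_uint32(value).value


def caller_detects_exhaustion(thread_id):
    """Hypothetical caller-side check, written with the signed version in mind."""
    return thread_id < 0


# With no free slots, the signed version reports the sentinel as expected.
print(caller_detects_exhaustion(get_new_thread_id_signed([])))    # True
# The unsigned version turns -1 into 4294967295, so the check never fires.
print(caller_detects_exhaustion(get_new_thread_id_unsigned([])))  # False
```

In C, a direct comparison against -1 would still match after the implicit conversion, so whether the real check broke depends on how it was written; a "less than zero" test or a store into a wider type would fail in the way shown here.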

 Encountering these problems made me wonder whether I am working with the
 correct branch (schedmd/simulator), or whether the evolution of Slurm is
 making the simulator code rot. When the solutions are more stable and
 clean I will prepare a patch and post it here.

 Also a note to someone who was asking about the test.traces file (I am
 answering here because the Google Groups interface would not allow me to
 reply there): you can find it, together with a synthetic user list, in the
 original tar file distributed by the BSC. Just put it in the sbin dir:
 http://www.bsc.es/marenostrum-support-services/services/slurm-simulator

 Thanks in advance!

 //Gonzalo





-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN


[slurm-dev] Re: Simulator sync issues

2015-08-05 Thread Gonzalo Rodrigo Alvarez
Hi Manuel,

Thank you for the info and the link. I will try that version and compare.

So far I have had the same feeling about the code, but I have been able to
make progress. Right now I don't have deadlocks. To get there I did the following:
1) Redid the way thread_sem and thread_sem_back communicate between sim_mgr
and the sleep wrapper.
2) Updated rpc_threads.pl (BTW, it does not get installed): the script that
identifies the function calls associated with threads that the
simulator should ignore. With the correct list I could remove the sleeps
that I had added in some scripts. It covers:
- from binary slurmctld: _slurmctld_rpc_mgr, _slurmctld_signal_hand,
_agent, agent, _wdog, _thread_per_group_rpc
- from binary accounting_storage_slurmdbd.so: _set_db_inx_thread, _agent
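
For readers unfamiliar with this kind of handshake, point 1 can be pictured roughly as follows. This is only a sketch in Python, not the simulator's C code: the semaphore names thread_sem and thread_sem_back come from the messages, while run_simulation, sim_sleep and the exact protocol (one tick per hand-over, a single worker thread) are assumptions made for illustration.

```python
import threading


def run_simulation(sleep_seconds):
    """Run one worker that sleeps in simulated time; return the logged events."""
    thread_sem = threading.Semaphore(0)       # manager -> worker: "time has advanced"
    thread_sem_back = threading.Semaphore(0)  # worker -> manager: "I am blocked again"
    state = {"now": 0, "done": False}
    events = []

    def sim_sleep(seconds):
        # Hypothetical sleep wrapper: instead of really sleeping, hand control
        # back to the manager until simulated time reaches the deadline.
        deadline = state["now"] + seconds
        while state["now"] < deadline:
            thread_sem_back.release()
            thread_sem.acquire()

    def worker():
        thread_sem.acquire()                  # wait for the first tick
        events.append(("start", state["now"]))
        sim_sleep(sleep_seconds)
        events.append(("woke", state["now"]))
        state["done"] = True
        thread_sem_back.release()             # final hand-over so the manager can exit

    thread = threading.Thread(target=worker)
    thread.start()
    # Manager loop: advance simulated time, wake the worker, wait for it to yield.
    while not state["done"]:
        state["now"] += 1
        thread_sem.release()
        thread_sem_back.acquire()
    thread.join()
    return events


print(run_simulation(3))  # [('start', 1), ('woke', 4)]
```

Every release on one semaphore is matched by exactly one acquire on the other side, so the two threads strictly alternate. If either side skips a hand-over, the other blocks forever, which is the kind of blocking described in the original report.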

The simulator seems to work without blocking. However, there are some
issues with the way the actual duration of jobs is communicated to
slurmd:
- First, sim_mgr sends a REQUEST_SIM_JOB RPC to slurmd with a job-id and a
job duration.
- Then it uses sbatch to submit the job.
The problem is that the job-id in the first step is calculated by taking
sbatch failures into account, and there are cases in which sbatch reports a
failure but the job still gets submitted. In the end this results in slurmd
having incoherent information associated with the same job-id (or no
information at all). I will keep posting updates in case someone else is
trying to bring the simulator back to life.
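
The job-id mismatch can be reproduced with a toy model. The sketch below is an assumption about the mechanism, not the simulator's code: it supposes that sim_mgr predicts the next job-id with a counter that only advances when sbatch reports success, while the controller assigns an id to every job it actually queues. The function name replay and its input format are hypothetical.

```python
def replay(submissions, first_job_id=1):
    """Compare predicted and real job ids for a sequence of sbatch calls.

    submissions: list of (reported_ok, actually_queued) tuples, one per call.
    Returns a list of (predicted_id, real_id) pairs; real_id is None when the
    job was not queued at all.
    """
    predicted = first_job_id   # id that would be sent in REQUEST_SIM_JOB
    real_next = first_job_id   # id the controller would assign next
    pairs = []
    for reported_ok, actually_queued in submissions:
        real_id = None
        if actually_queued:
            real_id = real_next
            real_next += 1
        pairs.append((predicted, real_id))
        if reported_ok:
            # The prediction only advances when sbatch claims success.
            predicted += 1
    return pairs


# Second call: sbatch reports a failure, but the job is queued anyway.
print(replay([(True, True), (False, True), (True, True)]))
# [(1, 1), (2, 2), (2, 3)]
```

In this run the duration announced for job-id 2 is sent twice with different values, and real job 3 never gets one, which matches the symptom described above.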


Thanks!

//Gonzalo









[slurm-dev] Re: Simulator sync issues

2015-08-05 Thread Manuel Rodríguez Pascual
Sounds great, thanks for the update :)

My team is now preparing a trace extracted from 5 years of usage of our
cluster (2000 cores, 15M jobs), so as soon as the simulator (old or new) is
working, it will have some real data to simulate. We will of course
release it and post the download link here.

Best regards,


Manuel
