Hi,
the gstack command will show you what each thread in the slurmctld
process is doing.
Here is an example:
david@prometeo ~>gstack 14432
Thread 8 (Thread 0x7fa9c9190700 (LWP 14433)):
#0 0x00000035b90acb8d in nanosleep () from /lib64/libc.so.6
#1 0x00000035b90aca00 in sleep () from /lib64/libc.so.6
#2 0x00007fa9c919414f in _set_db_inx_thread () from /opt/slurm/26/linux/lib/slurm/accounting_storage_slurmdbd.so
#3 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#4 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7fa9c908f700 (LWP 14434)):
#0 0x00000035b94080ad in pthread_join () from /lib64/libpthread.so.0
#1 0x00007fa9c9194174 in _cleanup_thread () from /opt/slurm/26/linux/lib/slurm/accounting_storage_slurmdbd.so
#2 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7fa9c8d8b700 (LWP 14437)):
#0 0x00000035b940b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000059251f in _agent ()
#2 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7fa9c35f0700 (LWP 14438)):
#0 0x00000035b940b7bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007fa9c35f3abf in _my_sleep () from /opt/slurm/26/linux/lib/slurm/sched_backfill.so
#2 0x00007fa9c35f3ee2 in backfill_agent () from /opt/slurm/26/linux/lib/slurm/sched_backfill.so
#3 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#4 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fa9c2946700 (LWP 14439)):
#0 0x00000035b90e14f3 in select () from /lib64/libc.so.6
#1 0x00000000004329e3 in _slurmctld_rpc_mgr ()
#2 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fa9c2845700 (LWP 14440)):
#0 0x00000035b940f2a5 in sigwait () from /lib64/libpthread.so.0
#1 0x0000000000432462 in _slurmctld_signal_hand ()
#2 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fa9c2744700 (LWP 14441)):
#0 0x00000035b940b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000049b6b6 in slurmctld_state_save ()
#2 0x00000035b9407851 in start_thread () from /lib64/libpthread.so.0
#3 0x00000035b90e890d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fa9c95aa700 (LWP 14432)):
#0 0x00000035b90acb8d in nanosleep () from /lib64/libc.so.6
#1 0x00000035b90aca00 in sleep () from /lib64/libc.so.6
#2 0x0000000000433a3f in _slurmctld_background ()
#3 0x0000000000431f6b in main ()
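To make a long dump like this easier to scan, here is a rough throwaway
helper (not part of Slurm or gstack, just a sketch) that prints each
thread's number together with the function in frame #1. The sample input
is embedded in a heredoc purely for illustration:

```shell
# summarize: print "thread-number: function-in-frame-#1" for each thread
# of a gstack dump. Frame #1 is often, though not always, the most
# informative frame (for sleeping threads it may land in libc).
summarize() {
  awk '/^Thread /{printf "%s: ", $2} /^#1 /{print $4}'
}

# Two threads from the dump above, embedded here only so the sketch is
# self-contained; in real use you would pipe live gstack output instead.
summary=$(summarize <<'EOF'
Thread 2 (Thread 0x7fa9c2744700 (LWP 14441)):
#0 0x00000035b940b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000049b6b6 in slurmctld_state_save ()
Thread 1 (Thread 0x7fa9c95aa700 (LWP 14432)):
#0 0x00000035b90acb8d in nanosleep () from /lib64/libc.so.6
#1 0x00000035b90aca00 in sleep () from /lib64/libc.so.6
EOF
)
echo "$summary"
```

In real use you would run something like gstack "$(pidof slurmctld)" |
summarize, and perhaps grab the first non-library frame instead of #1.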
/David
On Wed, Jun 12, 2013 at 9:43 AM, Paul Edmon <[email protected]> wrote:
>
> I'm also interested in this, as I've only ever seen one slurmctld
> process, and only ever at 100% on a single CPU. It would be good if
> making Slurm multithreaded were on the path for the future. I know we
> will have hundreds of thousands of jobs in flight for our config, so it
> would be good to have something that can take that load.
>
> -Paul Edmon-
>
> On 06/12/2013 12:30 PM, Alan V. Cowles wrote:
> > Hey Guys,
> >
> > I've seen a few references to slurmctld as a multithreaded process,
> > but it doesn't seem to behave that way.
> >
> > We had a user submit 18000 jobs to our cluster (512 slots). Slurm
> > shows all 512 slots fully loaded with running jobs and about 9800 jobs
> > pending, but her submission started throwing errors at around job 16500:
> >
> > Submitted batch job 16589
> > Submitted batch job 16590
> > Submitted batch job 16591
> > sbatch: error: Slurm temporarily unable to accept job, sleeping and
> > retrying.
> > sbatch: error: Batch job submission failed: Resource temporarily
> > unavailable.
> >
> > The thing we noticed at the time on our master host is that slurmctld
> > was pegged at 100% on one CPU quite regularly and was using 16 GB of
> > virtual memory, while all the other CPUs were completely idle.
> >
> > We wondered whether the pegged control daemon is what led to the
> > submission failures. We haven't found any limits set anywhere for any
> > specific job or user, and wondered whether we missed a configure
> > option when we did our original install.
> >
> > Any thoughts or ideas? We're running Slurm 2.5.4 on RHEL6.
> >
> > AC
>
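
On the sbatch errors Alan saw: one possibility worth checking (a guess on
my part, not something confirmed in this thread) is the MaxJobCount
parameter in slurm.conf, which caps how many jobs (running plus pending)
slurmctld will keep in memory and defaults to 10000; submissions beyond
the cap are refused until jobs drain, and 18000 submissions would exceed
the default. A hypothetical slurm.conf fragment raising it:

```
# slurm.conf (fragment, hypothetical value)
# MaxJobCount limits the number of jobs slurmctld keeps in memory
# (running + pending). The default of 10000 would be exceeded by
# a burst of 18000 submissions.
MaxJobCount=50000
```

I believe changing this parameter requires restarting slurmctld rather
than just running "scontrol reconfigure", so check the slurm.conf man
page for your version.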