Hi,

I've been working on a new tool for monitoring slurm behaviour: sdiag.
Mainly, sdiag is useful for seeing how slurm scheduling is working. The
initial reason to implement it was to get a better idea of what
backfilling is doing.

The sdiag patch for slurm-2.3.1 is attached, with a man page included.

This is the sdiag output:

*******************************************************
sdiag output at Mon Dec  5 11:39:05 2011
Data since    Mon Dec  5 10:57:09 2011
*******************************************************
Server thread count: 3
Agent queue size: 0

Jobs submitted: 8
Jobs started: 8
Jobs completed: 7
Jobs canceled: 1
Jobs failed: 0

Main schedule statistics (microseconds):
        Last cycle: 1065
        Max cycle: 1823
        Total cycles: 60
        Mean cycle: 1294
        Mean depth cycle: 144
        Cycles per minute: 1
        Last queue length: 164

Backfilling stats
        Total backfilled jobs (since last slurm start): 3
        Total backfilled jobs (since last stats cycle start): 3
        Total cycles: 29
        Last cycle when: Mon Dec  5 11:38:38 2011
        Last cycle: 805
        Max cycle: 1463
        Mean cycle: 866
        Last depth cycle: 164
        Last depth cycle (try sched): 0
        Depth Mean: 164
        Depth Mean (try depth): 0
        Last queue length: 164
        Queue length Mean: 164

diff -X avoid -Naur slurm-2.3.1/configure.ac slurm-2.3.1.sdiag-accounting/configure.ac
--- slurm-2.3.1/configure.ac	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/configure.ac	2011-11-17 19:20:03.000000000 +0100
@@ -529,6 +529,7 @@
 		 testsuite/slurm_unit/api/Makefile
 		 testsuite/slurm_unit/api/manual/Makefile
 		 testsuite/slurm_unit/common/Makefile
+		 src/sdiag/Makefile
 		 ]
 )
 
diff -X avoid -Naur slurm-2.3.1/doc/man/Makefile.am slurm-2.3.1.sdiag-accounting/doc/man/Makefile.am
--- slurm-2.3.1/doc/man/Makefile.am	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/doc/man/Makefile.am	2011-12-05 11:27:22.302808084 +0100
@@ -21,7 +21,8 @@
 	man1/sshare.1 \
 	man1/sstat.1 \
 	man1/strigger.1 \
-	man1/sview.1
+	man1/sview.1 \
+	man1/sdiag.1
 
 man3_MANS = man3/slurm_hostlist_create.3 \
 	man3/slurm_hostlist_destroy.3 \
diff -X avoid -Naur slurm-2.3.1/doc/man/man1/sdiag.1 slurm-2.3.1.sdiag-accounting/doc/man/man1/sdiag.1
--- slurm-2.3.1/doc/man/man1/sdiag.1	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/doc/man/man1/sdiag.1	2011-12-05 11:32:00.150897941 +0100
@@ -0,0 +1,165 @@
+.TH "sdiag" "1" "SLURM 2.3" "Dec 2011" "SLURM Commands"
+.SH "NAME"
+.LP
+sdiag \- Diagnostic tool for SLURM
+
+.SH "SYNOPSIS"
+.LP
+sdiag
+
+.SH "DESCRIPTION"
+.LP
+sdiag shows information related to slurmctld execution: threads, agents, jobs,
+and scheduling algorithms. The goal is to obtain data about slurmctld behaviour that helps to adjust configuration
+parameters or queue policies. The main motivation is to understand SLURM behaviour on systems with high throughput.
+.LP
+It has two execution modes. The default mode, \fB\-\-all\fR, shows the counters and statistics explained below, and
+the other execution option, \fB\-\-reset\fR, resets those values.
+.LP
+Values are reset at midnight UTC time by default.
+.LP
+The first block of information is related to global slurmctld execution:
+.TP
+\fBServer Thread Count
+The number of currently active slurmctld threads. A high number means a high load processing events like job submissions, job dispatching, and job completions. If this is often close to MAX_SERVER_THREADS it could point to a potential bottleneck.
+
+.TP
+\fBAgent queue size
+Slurm is designed with scalability in mind, and sending messages to hundreds or thousands of nodes is not a trivial task. The agent mechanism helps to control communication between the slurm daemons and the controller on a best-effort basis. If this value is close to MAX_AGENT_CNT there could be delays affecting job management.
+
+.TP
+\fBJobs Submitted
+Number of jobs submitted since the last reset.
+
+.TP
+\fBJobs Started
+Number of jobs started since last reset. This includes backfilled jobs.
+
+.TP
+\fBJobs Completed
+Number of jobs completed since last reset.
+
+.TP
+\fBJobs Canceled
+Number of jobs canceled since last reset.
+
+.TP
+\fBJobs Failed
+Number of jobs failed since last reset.
+
+.LP
+The second block of information is related to the main scheduling algorithm, based on job priorities. A
+scheduling cycle implies taking the job_write_lock, then trying to get resources for pending jobs,
+starting from the highest priority one and going in descending order. Once a job cannot get resources,
+the loop keeps going, but only for jobs requesting other partitions. Jobs with dependencies or affected
+by account limits are not processed.
+.TP
+\fBLast Cycle
+Time in microseconds for last scheduling cycle. 
+
+.TP
+\fBMax Cycle
+Time in microseconds for the maximum scheduling cycle since last reset.
+
+.TP
+\fBTotal Cycles
+Number of scheduling cycles since the last reset. Scheduling is done periodically and when a job is submitted
+or a job completes.
+
+.TP
+\fBMean cycle
+Mean time of scheduling cycles, in microseconds, since the last reset.
+
+.TP
+\fBMean Depth Cycle
+Mean cycle depth. Depth is the number of jobs processed in a scheduling cycle.
+
+.TP
+\fBCycles per minute
+Number of scheduling executions per minute.
+
+.TP
+\fBLast queue length
+Length of the pending-jobs queue.
+
+.LP
+The third block of information is related to the backfilling scheduling algorithm. A backfilling scheduling cycle implies
+getting locks for the job, node and partition objects, then trying to get resources for pending jobs. Jobs are processed
+based on priorities. If a job cannot get resources, the algorithm calculates when it could get them, obtaining
+a future start time for the job. Then the next job is processed, and the algorithm tries to get resources for that
+job without affecting the \fIprevious ones\fR, again calculating a future start time if no current
+resources are available. The backfilling algorithm takes more time for each new job it processes, since higher priority jobs
+cannot be affected. The algorithm itself takes measures to avoid a long execution cycle and to avoid holding all the
+locks for too long.
+
+.TP
+\fBTotal backfilled jobs (since last slurm start)
+Number of jobs started thanks to backfilling since last slurm start.
+
+.TP
+\fBTotal backfilled jobs (since last stats cycle start)
+Number of jobs started thanks to backfilling since the last time stats were reset. By default these values are reset at
+midnight UTC.
+
+.TP
+\fBTotal cycles
+Number of backfilling scheduling cycles since the last reset.
+
+.TP
+\fBLast cycle
+Time in microseconds of the last backfilling cycle. It counts only execution time, subtracting any sleep time taken inside a scheduling cycle
+when it runs too long.
+
+.TP
+\fBMax cycle
+Time in microseconds of the maximum backfilling cycle execution since the last reset.
+
+.TP
+\fBMean cycle
+Mean time of backfilling scheduling cycles, in microseconds, since the last reset.
+
+.TP
+\fBLast cycle when
+Time when the last execution cycle happened, in the format "weekday Month MonthDay hour:minute:seconds year".
+
+.TP
+\fBLast depth cycle
+Number of jobs processed during the last backfilling scheduling cycle. It counts every job, even those with
+no chance to execute due to dependencies or limits.
+
+.TP
+\fBLast depth cycle (try sched)
+Number of jobs processed during the last backfilling scheduling cycle. It counts only jobs with a chance
+to run that are waiting for available resources. These are the jobs that make the backfilling algorithm heavier.
+
+.TP
+\fBDepth Mean
+Mean number of jobs processed during backfilling scheduling cycles since the last reset.
+
+.TP
+\fBDepth Mean (try sched)
+Mean number of jobs processed during backfilling scheduling cycles since the last reset. It counts only jobs with a chance
+to run that are waiting for available resources. These are the jobs that make the backfilling algorithm heavier.
+
+.TP
+\fBLast queue length
+Number of jobs pending to be processed by the backfilling algorithm. A job appears as many times as the number of partitions it requested.
+
+.TP
+\fBQueue length Mean
+Mean number of jobs pending to be processed by the backfilling algorithm.
+
+.SH "COPYING"
+SLURM is free software; you can redistribute it and/or modify it under
+the terms of the GNU General Public License as published by the Free
+Software Foundation; either version 2 of the License, or (at your option)
+any later version.
+.LP
+SLURM is distributed in the hope that it will be useful, but WITHOUT ANY
+WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more
+details.
+
+.SH "SEE ALSO"
+.LP
+sinfo(1), squeue(1), scontrol(1), slurm.conf(5)
diff -X avoid -Naur slurm-2.3.1/slurm/slurm.h.in slurm-2.3.1.sdiag-accounting/slurm/slurm.h.in
--- slurm-2.3.1/slurm/slurm.h.in	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/slurm/slurm.h.in	2011-12-05 10:48:11.052731135 +0100
@@ -1994,6 +1994,40 @@
 /* Opaque data type for slurm_step_ctx_* functions */
 typedef struct slurm_step_ctx_struct slurm_step_ctx_t;
 
+
+typedef struct diag_stats {
+
+	int proc_req_threads;
+	int proc_req_raw;
+	uint32_t schedule_cycle_max;
+	uint32_t schedule_cycle_last;
+	uint64_t schedule_cycle_sum;
+	uint64_t schedule_cycle_counter;
+	uint64_t schedule_cycle_depth;
+	uint32_t schedule_queue_len;
+
+	uint32_t jobs_submitted;
+	uint32_t jobs_started;
+	uint32_t jobs_completed;
+	uint32_t jobs_canceled;
+	uint32_t jobs_failed;
+
+	uint64_t backfilled_jobs;
+	uint64_t last_backfilled_jobs;
+	uint64_t bf_cycle_counter;
+	uint64_t bf_cycle_last;
+	uint64_t bf_cycle_max;
+	uint64_t bf_cycle_sum;
+	uint32_t bf_last_depth;
+	uint32_t bf_last_depth_try;
+	uint32_t bf_depth_sum;
+	uint32_t bf_depth_try_sum;
+	uint32_t bf_queue_len;
+	uint32_t bf_queue_len_sum;
+	time_t   bf_when_last_cycle;
+	uint32_t bf_active;
+} diag_stats_t;
+
 #define TRIGGER_RES_TYPE_JOB            0x0001
 #define TRIGGER_RES_TYPE_NODE           0x0002
 #define TRIGGER_RES_TYPE_SLURMCTLD      0x0003
diff -X avoid -Naur slurm-2.3.1/src/api/Makefile.am slurm-2.3.1.sdiag-accounting/src/api/Makefile.am
--- slurm-2.3.1/src/api/Makefile.am	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/api/Makefile.am	2011-11-17 19:19:01.000000000 +0100
@@ -105,7 +105,8 @@
 	topo_info.c      \
 	triggers.c       \
 	reconfigure.c    \
-	update_config.c
+	update_config.c  \
+	slurm_get_statistics.c
 
 common_dir = $(top_builddir)/src/common
 
diff -X avoid -Naur slurm-2.3.1/src/api/slurm_get_statistics.c slurm-2.3.1.sdiag-accounting/src/api/slurm_get_statistics.c
--- slurm-2.3.1/src/api/slurm_get_statistics.c	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/src/api/slurm_get_statistics.c	2011-12-02 18:49:54.699786059 +0100
@@ -0,0 +1,80 @@
+#ifdef HAVE_CONFIG_H
+#  include "config.h"
+#endif                /* HAVE_CONFIG_H */
+
+
+#include <slurm/slurm.h>
+#include <slurm/slurm_errno.h>
+
+#include "src/common/read_config.h"
+#include "src/common/slurm_protocol_api.h"
+
+
+int slurm_reset_statistics(stats_info_request_msg_t *req){
+
+	int rc;
+	slurm_msg_t req_msg;
+	slurm_msg_t resp_msg;
+
+	slurm_msg_t_init(&req_msg);
+	slurm_msg_t_init(&resp_msg);
+
+	req_msg.msg_type = REQUEST_STATS_INFO;
+	req_msg.data     = req;
+
+	rc = slurm_send_recv_controller_msg(&req_msg, &resp_msg);
+
+	if (rc == SLURM_SOCKET_ERROR)
+		return SLURM_ERROR;
+
+	switch (resp_msg.msg_type) {
+		case RESPONSE_STATS_INFO:
+			break;
+		case RESPONSE_SLURM_RC:
+			rc = ((return_code_msg_t *) resp_msg.data)->return_code;
+			if (rc)
+				slurm_seterrno_ret(rc);
+			break;
+		default:
+			slurm_seterrno_ret(SLURM_UNEXPECTED_MSG_ERROR);
+	}
+
+	return SLURM_PROTOCOL_SUCCESS;
+
+}
+
+int slurm_get_statistics(stats_info_response_msg_t **buf, stats_info_request_msg_t *req){
+    
+	int rc;
+	slurm_msg_t req_msg;
+	slurm_msg_t resp_msg;
+
+	slurm_msg_t_init(&req_msg);
+	slurm_msg_t_init(&resp_msg);
+
+	req_msg.msg_type = REQUEST_STATS_INFO;
+	req_msg.data     = req;
+
+	rc = slurm_send_recv_controller_msg(&req_msg, &resp_msg);
+
+	if (rc == SLURM_SOCKET_ERROR)
+		return SLURM_ERROR;
+
+	switch (resp_msg.msg_type) {
+		case RESPONSE_STATS_INFO:
+			*buf = (stats_info_response_msg_t *)resp_msg.data;
+			break;
+		case RESPONSE_SLURM_RC:
+			rc = ((return_code_msg_t *) resp_msg.data)->return_code;
+			if (rc)
+				slurm_seterrno_ret(rc);
+			*buf = NULL;
+			break;
+		default:
+			slurm_seterrno_ret(SLURM_UNEXPECTED_MSG_ERROR);
+	}
+
+	return SLURM_PROTOCOL_SUCCESS;
+
+}
+
diff -X avoid -Naur slurm-2.3.1/src/common/slurm_protocol_defs.c slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_defs.c
--- slurm-2.3.1/src/common/slurm_protocol_defs.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_defs.c	2011-12-05 10:28:27.203732188 +0100
@@ -841,6 +841,16 @@
 	xfree(msg);
 }
 
+/*extern void slurm_free_stats_request_msg(stats_desc_msg_t *msg)
+{
+	xfree(msg);
+}*/
+
+extern void slurm_free_stats_response_msg(stats_info_response_msg_t *msg)
+{
+	xfree(msg);
+}
+
 extern void slurm_free_spank_env_request_msg(spank_env_request_msg_t *msg)
 {
 	xfree(msg);
@@ -2238,6 +2248,14 @@
 	}
 }
 
+
+inline void slurm_free_stats_info_request_msg(
+	stats_info_request_msg_t *msg)
+{
+	xfree(msg);
+}
+
+
 extern void slurm_destroy_priority_factors_object(void *object)
 {
 	priority_factors_object_t *obj_ptr =
diff -X avoid -Naur slurm-2.3.1/src/common/slurm_protocol_defs.h slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_defs.h
--- slurm-2.3.1/src/common/slurm_protocol_defs.h	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_defs.h	2011-12-05 10:41:12.321732260 +0100
@@ -210,6 +210,10 @@
 	RESPONSE_FRONT_END_INFO,
 	REQUEST_SPANK_ENVIRONMENT,
 	RESPONCE_SPANK_ENVIRONMENT,
+	REQUEST_STATS_INFO,
+	RESPONSE_STATS_INFO,
+	REQUEST_STATS_RESET,
+	RESPONSE_STATS_RESET,
 
 	REQUEST_UPDATE_JOB = 3001,
 	REQUEST_UPDATE_NODE,
@@ -509,6 +513,52 @@
 	uint16_t show_flags;
 } part_info_request_msg_t;
 
+typedef struct stats_info_request_msg {
+	uint16_t command_id;
+} stats_info_request_msg_t;
+
+typedef struct stats_info_response_msg {
+	uint32_t parts_packed;
+	time_t req_time;
+	time_t req_time_start;
+	uint32_t server_thread_count;
+	uint32_t agent_queue_size;
+
+	uint32_t schedule_cycle_max;
+	uint32_t schedule_cycle_last;
+	uint64_t schedule_cycle_sum;
+	uint64_t schedule_cycle_counter;
+	uint64_t schedule_cycle_depth;
+	uint64_t schedule_queue_len;
+
+	uint32_t jobs_submitted;
+	uint32_t jobs_started;
+	uint32_t jobs_completed;
+	uint32_t jobs_canceled;
+	uint32_t jobs_failed;
+
+	/* Backfilling stats */
+	uint64_t bf_backfilled_jobs;
+	uint64_t bf_last_backfilled_jobs;
+	uint64_t bf_cycle_counter;
+	uint64_t bf_cycle_sum;
+	uint64_t bf_cycle_last;
+	uint64_t bf_cycle_max;
+	time_t   bf_last_depth;
+	time_t   bf_last_depth_try;
+	uint64_t bf_depth_sum;
+	uint64_t bf_depth_try_sum;
+	uint32_t bf_queue_len;
+	uint64_t bf_queue_len_sum;
+	time_t   bf_when_last_cycle;
+	uint32_t bf_active;
+} stats_info_response_msg_t;
+
+/*typedef struct stats_desc_msg {
+	int level;
+} stats_desc_msg_t;*/
+
+
 typedef struct resv_info_request_msg {
         time_t last_update;
 } resv_info_request_msg_t;
@@ -971,6 +1021,7 @@
 		front_end_info_request_msg_t *msg);
 extern void slurm_free_node_info_request_msg(node_info_request_msg_t *msg);
 extern void slurm_free_part_info_request_msg(part_info_request_msg_t *msg);
+extern void slurm_free_stats_info_request_msg(stats_info_request_msg_t *msg);
 extern void slurm_free_resv_info_request_msg(resv_info_request_msg_t *msg);
 extern void slurm_free_set_debug_flags_msg(set_debug_flags_msg_t *msg);
 extern void slurm_free_set_debug_level_msg(set_debug_level_msg_t *msg);
diff -X avoid -Naur slurm-2.3.1/src/common/slurm_protocol_pack.c slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_pack.c
--- slurm-2.3.1/src/common/slurm_protocol_pack.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/common/slurm_protocol_pack.c	2011-12-05 10:44:36.034798062 +0100
@@ -73,6 +73,7 @@
 #define _pack_front_end_info_msg(msg,buf)	_pack_buffer_msg(msg,buf)
 #define _pack_node_info_msg(msg,buf)		_pack_buffer_msg(msg,buf)
 #define _pack_partition_info_msg(msg,buf)	_pack_buffer_msg(msg,buf)
+#define _pack_stats_response_msg(msg,buf)	_pack_buffer_msg(msg,buf)
 #define _pack_reserve_info_msg(msg,buf)		_pack_buffer_msg(msg,buf)
 
 static void _pack_assoc_shares_object(void *in, Buf buffer,
@@ -599,6 +600,12 @@
 static int _unpack_spank_env_responce_msg(spank_env_responce_msg_t ** msg_ptr,
 					  Buf buffer, uint16_t protocol_version);
 
+
+static void _pack_stats_request_msg(stats_info_request_msg_t *msg, Buf buffer);
+static int  _unpack_stats_request_msg(stats_info_request_msg_t **msg_ptr, Buf buffer);
+static int  _unpack_stats_response_msg(stats_info_response_msg_t **msg_ptr, Buf buffer);
+static int  _unpack_stats_reset_response_msg(stats_info_response_msg_t **msg_ptr, Buf buffer);
+
 /* pack_header
  * packs a slurm protocol header that precedes every slurm message
  * IN header - the header structure to pack
@@ -1159,6 +1166,15 @@
 			(spank_env_responce_msg_t *)msg->data, buffer,
 			msg->protocol_version);
 		break;
+
+	case REQUEST_STATS_INFO:
+		_pack_stats_request_msg((stats_info_request_msg_t *)msg->data, buffer);
+		break;
+
+	case RESPONSE_STATS_INFO:
+		_pack_stats_response_msg((slurm_msg_t *)msg, buffer);
+		break;
+
 	default:
 		debug("No pack method for msg type %u", msg->msg_type);
 		return EINVAL;
@@ -1703,6 +1719,15 @@
 			(spank_env_responce_msg_t **)&msg->data, buffer,
 			msg->protocol_version);
 		break;
+
+	case REQUEST_STATS_INFO:
+		_unpack_stats_request_msg((stats_info_request_msg_t **)&msg->data, buffer);
+		break;
+
+	case RESPONSE_STATS_INFO:
+		_unpack_stats_response_msg((stats_info_response_msg_t **)&msg->data, buffer);
+		break;
+ 
 	default:
 		debug("No unpack method for msg type %u", msg->msg_type);
 		return EINVAL;
@@ -9984,3 +10009,81 @@
    return SLURM_ERROR;
    }
 */
+
+
+static void _pack_stats_request_msg(stats_info_request_msg_t *msg, Buf buffer)
+{
+	xassert ( msg != NULL );
+
+	pack16((uint16_t)msg->command_id, buffer);
+}
+
+static int  _unpack_stats_request_msg(stats_info_request_msg_t **msg_ptr, Buf buffer)
+{
+	stats_info_request_msg_t * msg;
+	xassert ( msg_ptr != NULL );
+
+	msg = xmalloc ( sizeof (stats_info_request_msg_t) );
+	*msg_ptr = msg ;
+
+	safe_unpack16(&msg->command_id ,      buffer ) ;
+	return SLURM_SUCCESS;
+
+unpack_error:
+	info("unpack_stats_request_msg error");
+	*msg_ptr = NULL;
+	slurm_free_stats_info_request_msg(msg);
+	return SLURM_ERROR;
+}
+
+static int  _unpack_stats_response_msg(stats_info_response_msg_t **msg_ptr, Buf buffer)
+{
+	stats_info_response_msg_t * msg;
+	xassert ( msg_ptr != NULL );
+
+	msg = xmalloc ( sizeof (stats_info_response_msg_t) );
+	*msg_ptr = msg ;
+
+	safe_unpack32(&msg->parts_packed ,      buffer ) ;
+	if(msg->parts_packed){
+		safe_unpack_time(&msg->req_time ,      buffer ) ;
+		safe_unpack_time(&msg->req_time_start ,      buffer ) ;
+		safe_unpack32(&msg->server_thread_count ,      buffer ) ;
+		safe_unpack32(&msg->agent_queue_size ,      buffer ) ;
+		safe_unpack32(&msg->jobs_submitted ,      buffer ) ;
+		safe_unpack32(&msg->jobs_started ,      buffer ) ;
+		safe_unpack32(&msg->jobs_completed ,      buffer ) ;
+		safe_unpack32(&msg->jobs_canceled ,      buffer ) ;
+		safe_unpack32(&msg->jobs_failed ,      buffer ) ;
+		safe_unpack32(&msg->schedule_cycle_max ,      buffer ) ;
+		safe_unpack32(&msg->schedule_cycle_last ,      buffer ) ;
+		safe_unpack64(&msg->schedule_cycle_sum ,      buffer ) ;
+		safe_unpack64(&msg->schedule_cycle_counter ,      buffer ) ;
+		safe_unpack64(&msg->schedule_cycle_depth ,      buffer ) ;
+		safe_unpack32(&msg->schedule_queue_len ,      buffer ) ;
+
+		safe_unpack64(&msg->bf_backfilled_jobs ,      buffer ) ;
+		safe_unpack64(&msg->bf_last_backfilled_jobs ,      buffer ) ;
+		safe_unpack64(&msg->bf_cycle_counter ,      buffer ) ;
+		safe_unpack64(&msg->bf_cycle_sum ,      buffer ) ;
+		safe_unpack64(&msg->bf_cycle_last ,      buffer ) ;
+		safe_unpack64(&msg->bf_last_depth ,      buffer ) ;
+		safe_unpack64(&msg->bf_last_depth_try ,      buffer ) ;
+		safe_unpack32(&msg->bf_queue_len ,      buffer ) ;
+		safe_unpack64(&msg->bf_cycle_max ,      buffer ) ;
+		safe_unpack_time(&msg->bf_when_last_cycle ,      buffer ) ;
+		safe_unpack64(&msg->bf_depth_sum ,      buffer ) ;
+		safe_unpack64(&msg->bf_depth_try_sum ,      buffer ) ;
+		safe_unpack64(&msg->bf_queue_len_sum ,      buffer ) ;
+		safe_unpack32(&msg->bf_active,      buffer ) ;
+	}
+
+	return SLURM_SUCCESS;
+
+unpack_error:
+	info("unpack_stats_response_msg error");
+	*msg_ptr = NULL;
+	slurm_free_stats_response_msg(msg);
+	return SLURM_ERROR;
+}
+
diff -X avoid -Naur slurm-2.3.1/src/Makefile.am slurm-2.3.1.sdiag-accounting/src/Makefile.am
--- slurm-2.3.1/src/Makefile.am	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/Makefile.am	2011-11-17 19:17:07.000000000 +0100
@@ -2,7 +2,7 @@
 	slurmctld slurmd slurmdbd plugins sbcast \
 	scontrol scancel squeue sinfo smap sview salloc \
 	sbatch sattach strigger sacct sacctmgr sreport sstat \
-	sshare sprio
+	sshare sprio sdiag
 
 if !BUILD_SRUN2APRUN
 if !REAL_BG_L_P_LOADED
diff -X avoid -Naur slurm-2.3.1/src/plugins/sched/backfill/backfill.c slurm-2.3.1.sdiag-accounting/src/plugins/sched/backfill/backfill.c
--- slurm-2.3.1/src/plugins/sched/backfill/backfill.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/plugins/sched/backfill/backfill.c	2011-12-02 18:32:29.855732233 +0100
@@ -98,7 +98,10 @@
 	bitstr_t *avail_bitmap;
 	int next;	/* next record, by time, zero termination */
 } node_space_map_t;
-int backfilled_jobs = 0;
+
+/* Diag statistics */
+extern diag_stats_t slurmctld_diag_stats;
+int bf_last_ints = 0;
 
 /*********************** local variables *********************/
 static bool stop_backfill = false;
@@ -373,6 +376,26 @@
 	config_flag = true;
 }
 
+void _do_diag_stats(struct timeval *tv1, struct timeval *tv2){
+
+	long delta_t;
+	long bf_interval_usecs = backfill_interval * 1000000;
+
+	delta_t  = (tv2->tv_sec  - tv1->tv_sec) * 1000000;
+	delta_t +=  tv2->tv_usec - tv1->tv_usec;
+
+	slurmctld_diag_stats.bf_cycle_counter++;
+	slurmctld_diag_stats.bf_cycle_sum += (delta_t - (bf_last_ints * bf_interval_usecs));
+   	slurmctld_diag_stats.bf_cycle_last = delta_t - (bf_last_ints * bf_interval_usecs);
+	slurmctld_diag_stats.bf_depth_sum += slurmctld_diag_stats.bf_last_depth;
+	slurmctld_diag_stats.bf_depth_try_sum += slurmctld_diag_stats.bf_last_depth_try;
+	if(slurmctld_diag_stats.bf_cycle_last > slurmctld_diag_stats.bf_cycle_max)
+			slurmctld_diag_stats.bf_cycle_max = slurmctld_diag_stats.bf_cycle_last;
+
+	slurmctld_diag_stats.bf_active = 0;
+}
+
+
 /* backfill_agent - detached thread periodically attempts to backfill jobs */
 extern void *backfill_agent(void *args)
 {
@@ -426,6 +449,7 @@
 	part_update = last_part_update;
 
 	unlock_slurmctld(all_locks);
+    	bf_last_ints++;
 	_my_sleep(backfill_interval);
 	lock_slurmctld(all_locks);
 
@@ -453,6 +477,7 @@
 	bitstr_t *avail_bitmap = NULL, *resv_bitmap = NULL;
 	time_t now = time(NULL), sched_start, later_start, start_res;
 	node_space_map_t *node_space;
+	struct timeval tv1, tv2;
 	static int sched_timeout = 0;
 	int this_sched_timeout = 0, rc = 0;
 
@@ -486,6 +511,16 @@
 		return 0;
 	}
 
+	gettimeofday(&tv1, NULL);
+
+	slurmctld_diag_stats.bf_queue_len = list_count(job_queue);
+	slurmctld_diag_stats.bf_queue_len_sum += slurmctld_diag_stats.bf_queue_len;
+	slurmctld_diag_stats.bf_last_depth = 0;
+	slurmctld_diag_stats.bf_last_depth_try = 0;
+	slurmctld_diag_stats.bf_when_last_cycle = now;
+	bf_last_ints = 0;
+	slurmctld_diag_stats.bf_active = 1;
+
 	node_space = xmalloc(sizeof(node_space_map_t) *
 			     (max_backfill_job_cnt + 3));
 	node_space[0].begin_time = sched_start;
@@ -503,11 +538,14 @@
 		xfree(job_queue_rec);
 		if (!IS_JOB_PENDING(job_ptr))
 			continue;	/* started in other partition */
+
 		job_ptr->part_ptr = part_ptr;
 
 		if (debug_flags & DEBUG_FLAG_BACKFILL)
 			info("backfill test for job %u", job_ptr->job_id);
 
+		 slurmctld_diag_stats.bf_last_depth++;
+
 		if ((job_ptr->state_reason == WAIT_ASSOC_JOB_LIMIT) ||
 		    (job_ptr->state_reason == WAIT_ASSOC_RESOURCE_LIMIT) ||
 		    (job_ptr->state_reason == WAIT_ASSOC_TIME_LIMIT) ||
@@ -523,7 +561,7 @@
 			       job_ptr->priority);
 			continue;
 		}
-
+	
 		if (((part_ptr->state_up & PARTITION_SCHED) == 0) ||
 		    (part_ptr->node_bitmap == NULL))
 		 	continue;
@@ -650,6 +688,8 @@
 		/* this is the time consuming operation */
 		debug2("backfill: entering _try_sched for job %u.",
 		       job_ptr->job_id);
+
+		slurmctld_diag_stats.bf_last_depth_try++;
 		j = _try_sched(job_ptr, &avail_bitmap,
 			       min_nodes, max_nodes, req_nodes);
 		debug2("backfill: finished _try_sched for job %u.",
@@ -748,6 +788,8 @@
 	}
 	xfree(node_space);
 	list_destroy(job_queue);
+	gettimeofday(&tv2, NULL);
+	_do_diag_stats(&tv1,&tv2);
 	return rc;
 }
 
@@ -779,10 +821,11 @@
 			srun_allocate(job_ptr->job_id);
 		else if (job_ptr->details->prolog_running == 0)
 			launch_job(job_ptr);
-		backfilled_jobs++;
+		slurmctld_diag_stats.backfilled_jobs++;
+		slurmctld_diag_stats.last_backfilled_jobs++;
 		if (debug_flags & DEBUG_FLAG_BACKFILL) {
 			info("backfill: Jobs backfilled since boot: %d",
-			     backfilled_jobs);
+			     slurmctld_diag_stats.backfilled_jobs);
 		}
 	} else if ((job_ptr->job_id != fail_jobid) &&
 		   (rc != ESLURM_ACCOUNTING_POLICY)) {
diff -X avoid -Naur slurm-2.3.1/src/sdiag/Makefile.am slurm-2.3.1.sdiag-accounting/src/sdiag/Makefile.am
--- slurm-2.3.1/src/sdiag/Makefile.am	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/src/sdiag/Makefile.am	2011-11-16 16:29:30.000000000 +0100
@@ -0,0 +1,19 @@
+#
+# Makefile for sdiag
+
+AUTOMAKE_OPTIONS = foreign
+
+INCLUDES = -I$(top_srcdir) $(BG_INCLUDES)
+bin_PROGRAMS = sdiag
+
+sdiag_LDADD = $(top_builddir)/src/api/libslurm.o -ldl
+
+#noinst_HEADERS = sinfo.h print.h
+sdiag_SOURCES = sdiag.c opts.c
+
+force:
+$(sdiag_LDADD) : force
+	@cd `dirname $@` && $(MAKE) `basename $@`
+
+sdiag_LDFLAGS = -export-dynamic $(CMD_LDFLAGS)
+
diff -X avoid -Naur slurm-2.3.1/src/sdiag/opts.c slurm-2.3.1.sdiag-accounting/src/sdiag/opts.c
--- slurm-2.3.1/src/sdiag/opts.c	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/src/sdiag/opts.c	2011-12-02 19:00:11.836752572 +0100
@@ -0,0 +1,94 @@
+#if HAVE_CONFIG_H
+#  include "config.h"
+#endif
+
+#ifndef _GNU_SOURCE
+#  define _GNU_SOURCE
+#endif
+
+#if HAVE_GETOPT_H
+#  include <getopt.h>
+#else
+#  include "src/common/getopt.h"
+#endif
+
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "src/common/xstring.h"
+#include "src/common/proc_args.h"
+
+#define OPT_LONG_USAGE 0x101
+
+static void  _help( void );
+static int   _parse_format( char* );
+static void  _parse_token( char *token, char *field, int *field_size,
+                           bool *right_justify, char **suffix);
+static void  _print_options( void );
+static void  _usage( void );
+
+extern int sdiag_param;
+
+/*
+ * parse_command_line, fill in params data structure with data
+ */
+extern void parse_command_line(int argc, char *argv[])
+{
+	char *env_val = NULL;
+	int opt_char;
+	int option_index;
+	int verbose = 0;
+	static struct option long_options[] = {
+		{"all",       no_argument,       0, 'a'},
+		{"reset",        no_argument,       0, 'r'},
+		{"help",        no_argument,       0, 'h'},
+		{"usage",        no_argument,       0, OPT_LONG_USAGE},
+		{NULL,        0,                 0, 0}
+	};
+
+	while((opt_char = getopt_long(argc, argv, "arh", long_options, &option_index)) != -1) {
+		
+		switch (opt_char) {
+			case (int)'?':
+				fprintf(stderr,
+						"Try \"sdiag --help\" for more information\n");
+				exit(1);
+				break;
+
+			case (int)'a':
+				printf("Using all option\n");
+				sdiag_param = 1;
+				break;
+			case (int)'r':
+				printf("Using reset option\n");
+				sdiag_param = 0;
+				break;
+			case (int)'h':
+				verbose = 1;
+				printf("Using help option\n");
+				break;
+			case (int)OPT_LONG_USAGE:
+				verbose = 1;
+				break;
+		}
+		if(verbose)
+			_help();
+	}
+}
+
+
+static void _usage( void )
+{
+	printf("\nUsage: sdiag [-ar] \n");
+}
+
+static void _help( void )
+{
+	printf ("\
+Usage: sdiag [OPTIONS]\n\
+  -a, --all              all statistics\n\
+  -r, --reset            reset statistics\n\
+\nHelp options:\n\
+  --help                     show this help message\n\
+  --usage                    display brief usage message\n");
+}
diff -X avoid -Naur slurm-2.3.1/src/sdiag/sdiag.c slurm-2.3.1.sdiag-accounting/src/sdiag/sdiag.c
--- slurm-2.3.1/src/sdiag/sdiag.c	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/src/sdiag/sdiag.c	2011-12-05 10:50:32.254751735 +0100
@@ -0,0 +1,110 @@
+#if HAVE_CONFIG_H
+#  include "config.h"
+#endif
+
+#include <slurm.h>
+#include "src/common/xstring.h"
+#include "src/common/macros.h"
+#include "src/common/slurm_protocol_defs.h"
+
+/********************
+ * Global Variables *
+ ********************/
+int sdiag_param = 1;
+
+stats_info_response_msg_t *buf;
+
+static int get_info();
+static int print_info();
+
+stats_info_request_msg_t req;
+
+extern void parse_command_line(int argc, char *argv[]);
+extern int slurm_get_statistics(stats_info_response_msg_t **buf, stats_info_request_msg_t *req);
+extern int slurm_reset_statistics(stats_info_request_msg_t *req);
+
+int main(int argc, char *argv[])
+{
+	log_options_t opts = LOG_OPTS_STDERR_ONLY;
+	int rc = 0;
+
+	parse_command_line(argc, argv);
+
+	if(sdiag_param == 0){
+	       	req.command_id = 0;
+	       	slurm_reset_statistics((stats_info_request_msg_t *)&req);
+	       	exit(rc);
+       	}
+	else
+		get_info();
+
+	if(req.command_id)
+		print_info();    
+
+	exit(rc);
+}
+
+static int get_info()
+{
+	req.command_id = 1;
+	slurm_get_statistics(&buf, (stats_info_request_msg_t *)&req);
+}
+
+static int print_info()
+{
+	if(!buf){
+		printf("No data available. Probably slurmctld is not working\n");
+		return  0;
+	}
+
+	printf("*******************************************************\n");
+	printf("sdiag output at %s", ctime(&buf->req_time));
+	printf("Data since      %s", ctime(&buf->req_time_start));
+	printf("*******************************************************\n");
+
+	printf("Server thread count: %d\n", buf->server_thread_count);
+	printf("Agent queue size: %d\n\n", buf->agent_queue_size);
+	printf("Jobs submitted: %d\n", buf->jobs_submitted);
+	printf("Jobs started: %d\n", buf->jobs_started + buf->bf_last_backfilled_jobs);
+	printf("Jobs completed: %d\n", buf->jobs_completed);
+	printf("Jobs canceled: %d\n", buf->jobs_canceled);
+	printf("Jobs failed: %d\n", buf->jobs_failed);
+	printf("\nMain schedule statistics (microseconds):\n");
+	printf("\tLast cycle: %u\n", buf->schedule_cycle_last);
+	printf("\tMax cycle: %u\n", buf->schedule_cycle_max);
+	printf("\tTotal cycles: %u\n", buf->schedule_cycle_counter);
+	if(buf->schedule_cycle_counter > 0){
+		printf("\tMean cycle: %u\n", buf->schedule_cycle_sum / buf->schedule_cycle_counter);
+		printf("\tMean depth cycle: %u\n", buf->schedule_cycle_depth / buf->schedule_cycle_counter);
+	}
+	if((buf->req_time - buf->req_time_start) > 60)
+		printf("\tCycles per minute: %u\n", buf->schedule_cycle_counter / ((buf->req_time - buf->req_time_start) / 60));
+	
+	printf("\tLast queue length: %u\n", buf->schedule_queue_len);
+
+	if(buf->bf_active)
+		printf("\nBackfilling stats (WARNING: data obtained in the middle of a backfilling cycle)\n");
+	else
+		printf("\nBackfilling stats\n");
+
+	printf("\tTotal backfilled jobs (since last slurm start): %u\n", buf->bf_backfilled_jobs);
+	printf("\tTotal backfilled jobs (since last stats cycle start): %u\n", buf->bf_last_backfilled_jobs);
+	printf("\tTotal cycles: %u\n", buf->bf_cycle_counter);
+	printf("\tLast cycle when: %s", ctime(&buf->bf_when_last_cycle));
+	printf("\tLast cycle: %u\n", buf->bf_cycle_last);
+	printf("\tMax cycle: %u\n", buf->bf_cycle_max);
+	if(buf->bf_cycle_counter > 0){
+		printf("\tMean cycle: %u\n", buf->bf_cycle_sum / buf->bf_cycle_counter);
+	}
+	printf("\tLast depth cycle: %u\n", buf->bf_last_depth);
+	printf("\tLast depth cycle (try sched): %u\n", buf->bf_last_depth_try);
+	if(buf->bf_cycle_counter > 0){
+		printf("\tDepth Mean: %u\n", buf->bf_depth_sum / buf->bf_cycle_counter);
+		printf("\tDepth Mean (try depth): %u\n", buf->bf_depth_try_sum / buf->bf_cycle_counter);
+	}
+	printf("\tLast queue length: %u\n", buf->bf_queue_len);
+	if(buf->bf_cycle_counter > 0){
+		printf("\tQueue length Mean: %u\n", buf->bf_queue_len_sum / buf->bf_cycle_counter);
+	}
+
+	return 0;
+}
+
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/agent.c slurm-2.3.1.sdiag-accounting/src/slurmctld/agent.c
--- slurm-2.3.1/src/slurmctld/agent.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/agent.c	2011-11-17 19:07:24.000000000 +0100
@@ -1120,6 +1120,16 @@
 }
 
 
+/* retry_list_size - number of pending RPCs in the agent retry queue
+ * (advisory count; retry_list is read without taking retry_mutex) */
+extern int retry_list_size(void)
+{
+	if(retry_list == NULL)
+		return 0;
+	return list_count(retry_list);
+}
+
 /*
  * agent_retry - Agent for retrying pending RPCs. One pending request is
  *	issued if it has been pending for at least min_wait seconds
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/controller.c slurm-2.3.1.sdiag-accounting/src/slurmctld/controller.c
--- slurm-2.3.1/src/slurmctld/controller.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/controller.c	2011-12-02 19:01:39.443752668 +0100
@@ -162,6 +162,9 @@
 uint32_t      cluster_cpus = 0;
 int   with_slurmdbd = 0;
 
+/* Next used for stats/diagnostics */
+diag_stats_t slurmctld_diag_stats;
+
 /* Local variables */
 static int	daemonize = DEFAULT_DAEMONIZE;
 static int	debug_level = 0;
@@ -217,6 +220,11 @@
 	int newsockfd;
 } connection_arg_t;
 
+time_t last_proc_req_start = 0;
+time_t next_stats_reset = 0;
+
+extern int reset_stats(void);
+
 /* main - slurmctld main function, start various threads and process RPCs */
 int main(int argc, char *argv[])
 {
@@ -967,8 +975,9 @@
 			no_thread = 0;
 
 		if (no_thread) {
-			_service_connection((void *) conn_arg);
-		}
+			slurmctld_diag_stats.proc_req_raw++;
+			_service_connection((void *) conn_arg);
+		}
 	}
 
 	debug3("_slurmctld_rpc_mgr shutting down");
@@ -1501,6 +1510,21 @@
 			last_node_acct = now;
 			_accounting_cluster_ready();
 		}
+
+		if(last_proc_req_start == 0){
+			/* Schedule the first stats reset for the next
+			 * midnight (UTC vs. local time is not important;
+			 * we just want a nightly reset). */
+			last_proc_req_start = now;
+			next_stats_reset = last_proc_req_start - (last_proc_req_start % 86400) + 86400;
+		}
+
+		if((next_stats_reset > 0) && (now > next_stats_reset)){
+			/* Resetting stats values */
+			last_proc_req_start = now;
+			next_stats_reset = now - (now % 86400) + 86400;
+			reset_stats();
+		}
 
 		/* Reassert this machine as the primary controller.
 		 * A network or security problem could result in
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/job_scheduler.c slurm-2.3.1.sdiag-accounting/src/slurmctld/job_scheduler.c
--- slurm-2.3.1/src/slurmctld/job_scheduler.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/job_scheduler.c	2011-12-05 10:54:19.624787060 +0100
@@ -90,6 +90,8 @@
 
 static int	save_last_part_update = 0;
 
+extern diag_stats_t slurmctld_diag_stats;
+
 /*
  * _build_user_job_list - build list of jobs for a given user
  *			  and an optional job name
@@ -313,6 +315,17 @@
 	return false;
 }
 
+int do_diag_stats(struct timeval tv1, struct timeval tv2)
+{
+	uint32_t delta_t = slurm_diff_tv(&tv1, &tv2);
+
+	if(delta_t > slurmctld_diag_stats.schedule_cycle_max)
+		slurmctld_diag_stats.schedule_cycle_max = delta_t;
+
+	slurmctld_diag_stats.schedule_cycle_sum += delta_t;
+	slurmctld_diag_stats.schedule_cycle_last = delta_t;
+	slurmctld_diag_stats.schedule_cycle_counter++;
+
+	return SLURM_SUCCESS;
+}
+
+
 /*
  * schedule - attempt to schedule all pending jobs
  *	pending jobs for each partition will be scheduled in priority
@@ -426,6 +439,7 @@
 
 	debug("sched: Running job scheduler");
 	job_queue = build_job_queue(false);
+	slurmctld_diag_stats.schedule_queue_len = list_count(job_queue);
 	while ((job_queue_rec = list_pop_bottom(job_queue, sort_job_queue2))) {
 		job_ptr  = job_queue_rec->job_ptr;
 		part_ptr = job_queue_rec->part_ptr;
@@ -439,6 +453,9 @@
 			       job_depth);
 			break;
 		}
+
+		slurmctld_diag_stats.schedule_cycle_depth++;
+
 		if (!IS_JOB_PENDING(job_ptr))
 			continue;	/* started in other partition */
 		if (job_ptr->priority == 0)	{ /* held */
@@ -588,6 +605,7 @@
 		} else if (error_code == SLURM_SUCCESS) {
 			/* job initiated */
 			debug3("sched: JobId=%u initiated", job_ptr->job_id);
+			slurmctld_diag_stats.jobs_started++;
 			last_job_update = now;
 #ifdef HAVE_BG
 			select_g_select_jobinfo_get(job_ptr->select_jobinfo,
@@ -639,6 +657,9 @@
 	list_destroy(job_queue);
 	unlock_slurmctld(job_write_lock);
 	END_TIMER2("schedule");
+
+	do_diag_stats(tv1, tv2);
+
 	return job_cnt;
 }
 
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/Makefile.am slurm-2.3.1.sdiag-accounting/src/slurmctld/Makefile.am
--- slurm-2.3.1/src/slurmctld/Makefile.am	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/Makefile.am	2011-11-17 19:18:31.000000000 +0100
@@ -57,7 +57,8 @@
 	state_save.h	\
 	step_mgr.c	\
 	trigger_mgr.c	\
-	trigger_mgr.h
+	trigger_mgr.h   \
+	statistics.c
 
 
 sbin_PROGRAMS = slurmctld
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/proc_req.c slurm-2.3.1.sdiag-accounting/src/slurmctld/proc_req.c
--- slurm-2.3.1/src/slurmctld/proc_req.c	2011-10-24 19:15:42.000000000 +0200
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/proc_req.c	2011-12-02 19:04:10.200787127 +0100
@@ -157,9 +157,13 @@
 inline static void  _slurm_rpc_update_partition(slurm_msg_t * msg);
 inline static void  _slurm_rpc_update_block(slurm_msg_t * msg);
 inline static void _slurm_rpc_dump_spank(slurm_msg_t * msg);
+inline static void  _slurm_rpc_dump_stats(slurm_msg_t * msg);
 
 inline static void  _update_cred_key(void);
 
+extern int reset_stats(void);
+
+extern diag_stats_t slurmctld_diag_stats;
 
 /*
  * slurmctld_req  - Process an individual RPC request
@@ -426,6 +430,10 @@
 		_slurm_rpc_dump_spank(msg);
 		slurm_free_spank_env_request_msg(msg->data);
 		break;
+	case REQUEST_STATS_INFO:
+		_slurm_rpc_dump_stats(msg);
+		slurm_free_stats_info_request_msg(msg->data);
+		break;
 	default:
 		error("invalid RPC msg_type=%d", msg->msg_type);
 		slurm_send_rc_msg(msg, EINVAL);
@@ -1341,6 +1349,7 @@
 			if (job_step_kill_msg->signal == SIGKILL) {
 				info("sched: Cancel of JobId=%u by UID=%u, %s",
 				     job_step_kill_msg->job_id, uid, TIME_STR);
+				slurmctld_diag_stats.jobs_canceled++;
 			} else {
 				info("Signal %u of JobId=%u by UID=%u, %s",
 				     job_step_kill_msg->signal,
@@ -1539,6 +1548,7 @@
 		      comp_msg->job_id,
 		      msg_title, nodes,
 		      slurm_strerror(comp_msg->slurm_rc));
+		slurmctld_diag_stats.jobs_failed++;
 		if (error_code == SLURM_SUCCESS) {
 #ifdef HAVE_BG
 			if (job_ptr) {
@@ -1607,6 +1617,7 @@
 		debug2("_slurm_rpc_complete_batch_script JobId=%u %s",
 		       comp_msg->job_id, TIME_STR);
 		slurm_send_rc_msg(msg, SLURM_SUCCESS);
+		slurmctld_diag_stats.jobs_completed++;
 		dump_job = true;
 	}
 	if (dump_job)
@@ -2634,6 +2645,7 @@
 		response_msg.msg_type = RESPONSE_SUBMIT_BATCH_JOB;
 		response_msg.data = &submit_msg;
 		slurm_send_node_msg(msg->conn_fd, &response_msg);
+		slurmctld_diag_stats.jobs_submitted++;
 		schedule(0);		/* has own locks */
 		schedule_job_save();	/* has own locks */
 		schedule_node_save();	/* has own locks */
@@ -4136,3 +4148,39 @@
 	slurm_send_node_msg(msg->conn_fd, &response_msg);
 	slurm_free_spank_env_responce_msg(spank_resp_msg);
 }
+
+
+/* _slurm_rpc_dump_stats - process RPC for statistics information */
+static void _slurm_rpc_dump_stats(slurm_msg_t * msg)
+{
+	char *dump;
+	int dump_size;
+	stats_info_request_msg_t *request_msg;
+	slurm_msg_t response_msg;
+
+	request_msg = (stats_info_request_msg_t *)msg->data;
+
+	info("Processing RPC: REQUEST_STATS_INFO (command: %d)", request_msg->command_id);
+
+	slurm_msg_t_init(&response_msg);
+	response_msg.protocol_version = msg->protocol_version;
+	response_msg.address = msg->address;
+	response_msg.msg_type = RESPONSE_STATS_INFO;
+
+	if(request_msg->command_id == 0)
+		reset_stats();
+
+	pack_all_stat((request_msg->command_id == 0) ? 0 : 1, &dump,
+		      &dump_size, msg->protocol_version);
+	response_msg.data = dump;
+	response_msg.data_size = dump_size;
+
+	/* send message */
+	slurm_send_node_msg(msg->conn_fd, &response_msg);
+	xfree(dump);
+}
+
diff -X avoid -Naur slurm-2.3.1/src/slurmctld/statistics.c slurm-2.3.1.sdiag-accounting/src/slurmctld/statistics.c
--- slurm-2.3.1/src/slurmctld/statistics.c	1970-01-01 01:00:00.000000000 +0100
+++ slurm-2.3.1.sdiag-accounting/src/slurmctld/statistics.c	2011-12-05 10:42:46.177877125 +0100
@@ -0,0 +1,111 @@
+#ifdef HAVE_CONFIG_H
+#  include "config.h"
+#endif
+
+#include <ctype.h>
+#include <errno.h>
+#include <stdio.h>
+#include <time.h>
+
+#include "src/slurmctld/slurmctld.h"
+#include "src/common/pack.h"
+#include "src/common/xstring.h"
+#include "src/common/list.h"
+
+extern slurmctld_config_t slurmctld_config;
+
+extern int retry_list_size(void);
+
+extern diag_stats_t slurmctld_diag_stats;
+extern time_t last_proc_req_start;
+
+extern void pack_all_stat(int resp, char **buffer_ptr, int *buffer_size,
+			  uint16_t protocol_version)
+{
+	Buf buffer;
+	int parts_packed;
+	int agent_queue_size;
+	time_t now = time(NULL);
+
+	buffer_ptr[0] = NULL;
+	*buffer_size = 0;
+	
+	buffer = init_buf(BUF_SIZE);
+	
+	parts_packed = resp;
+	pack32(parts_packed, buffer);
+	
+	if(resp){
+		pack_time(now, buffer);
+		info("pack_all_stat: time = %ld", (long) last_proc_req_start);
+		pack_time(last_proc_req_start, buffer);
+			
+		info("pack_all_stat: server_thread_count = %d", slurmctld_config.server_thread_count);
+		pack32(slurmctld_config.server_thread_count,buffer);
+			
+		agent_queue_size = retry_list_size();
+		pack32(agent_queue_size,buffer);
+			
+		pack32(slurmctld_diag_stats.jobs_submitted,buffer);
+		pack32(slurmctld_diag_stats.jobs_started,buffer);
+		pack32(slurmctld_diag_stats.jobs_completed,buffer);
+		pack32(slurmctld_diag_stats.jobs_canceled,buffer);
+		pack32(slurmctld_diag_stats.jobs_failed,buffer);
+
+		pack32(slurmctld_diag_stats.schedule_cycle_max,buffer);
+		pack32(slurmctld_diag_stats.schedule_cycle_last,buffer);
+		pack64(slurmctld_diag_stats.schedule_cycle_sum,buffer);
+		pack64(slurmctld_diag_stats.schedule_cycle_counter,buffer);
+		pack64(slurmctld_diag_stats.schedule_cycle_depth,buffer);
+		pack32(slurmctld_diag_stats.schedule_queue_len,buffer);
+		
+		pack64(slurmctld_diag_stats.backfilled_jobs, buffer);
+		pack64(slurmctld_diag_stats.last_backfilled_jobs, buffer);
+		pack64(slurmctld_diag_stats.bf_cycle_counter, buffer);
+		pack64(slurmctld_diag_stats.bf_cycle_sum, buffer);
+		pack64(slurmctld_diag_stats.bf_cycle_last, buffer);
+		pack64(slurmctld_diag_stats.bf_last_depth, buffer);
+		pack64(slurmctld_diag_stats.bf_last_depth_try, buffer);
+		pack32(slurmctld_diag_stats.bf_queue_len, buffer);
+		pack64(slurmctld_diag_stats.bf_cycle_max, buffer);
+		pack_time(slurmctld_diag_stats.bf_when_last_cycle, buffer);
+		pack64(slurmctld_diag_stats.bf_depth_sum, buffer);
+		pack64(slurmctld_diag_stats.bf_depth_try_sum, buffer);
+		pack64(slurmctld_diag_stats.bf_queue_len_sum, buffer);
+		pack32(slurmctld_diag_stats.bf_active, buffer);
+	}
+
+
+	*buffer_size = get_buf_offset(buffer);
+	buffer_ptr[0] = xfer_buf_data(buffer);
+}
+
+int reset_stats(void)
+{
+
+	slurmctld_diag_stats.proc_req_raw = 0;
+	slurmctld_diag_stats.proc_req_threads = 0;
+	slurmctld_diag_stats.schedule_cycle_max = 0;
+	slurmctld_diag_stats.schedule_cycle_sum = 0;
+	slurmctld_diag_stats.schedule_cycle_counter = 0;
+	slurmctld_diag_stats.schedule_cycle_depth = 0;
+	slurmctld_diag_stats.jobs_submitted = 0;
+	slurmctld_diag_stats.jobs_started = 0;
+	slurmctld_diag_stats.jobs_completed = 0;
+	slurmctld_diag_stats.jobs_canceled = 0;
+	slurmctld_diag_stats.jobs_failed = 0;
+
+	slurmctld_diag_stats.backfilled_jobs = 0;
+	slurmctld_diag_stats.last_backfilled_jobs = 0;
+	slurmctld_diag_stats.bf_cycle_counter = 0;
+	slurmctld_diag_stats.bf_cycle_sum = 0;
+	slurmctld_diag_stats.bf_cycle_last = 0;
+	slurmctld_diag_stats.bf_depth_sum = 0;
+	slurmctld_diag_stats.bf_depth_try_sum = 0;
+	slurmctld_diag_stats.bf_queue_len = 0;
+	slurmctld_diag_stats.bf_queue_len_sum = 0;
+	slurmctld_diag_stats.bf_cycle_max = 0;
+	slurmctld_diag_stats.bf_last_depth = 0;
+	slurmctld_diag_stats.bf_last_depth_try = 0;
+	slurmctld_diag_stats.bf_active = 0;
+
+	return SLURM_SUCCESS;
+}
