I'm interested in hearing opinions on this, if any. Basically, I think there is an easy solution to the problem that a user who uses few CPUs but a lot of memory is not reflected well in the CPU-centric usage stats.

Below is my proposal. There are likely some other good approaches out there too (Don and Janne presented some) so feel free to tell me that you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU equivalents" * time) instead of just (CPUs * time). The "CPU equivalent" would be a MAX() of the job's CPUs, memory, nodes, GPUs, energy, or whatever over that time period, each multiplied by a corresponding charge rate that an admin can configure on a per-partition basis.
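
Roughly, in the spirit of the attached patch (parameter names are illustrative):

    cpu_equivalents = MAX(total_cpus   * ChargePerCPU,
                          total_nodes  * ChargePerNode,
                          total_mem_gb * ChargePerMemGB
                          /* , GPUs, energy, ... */);
    usage_raw += decayed_run_time * cpu_equivalents;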

I wrote a simple proof of concept patch to demonstrate this (see "Proof of Concept" below for details).


Longer version

The CPU equivalent would be used in place of total_cpus for calculating usage_raw. I propose that the default charge rate be 1.0 for each CPU in a job and 0.0 for everything else. This is the current behavior, so there are no behavior changes if you choose not to define different charge rates. The reason I think this should be done on a partition basis is that different partitions may have nodes with different memory/core ratios, etc.: one partition may have 2 GB/core nodes and another 8 GB/core nodes, and you may want to charge differently on each.
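
As a sketch, using the parameter names from the proof-of-concept patch (which for now actually expects these values multiplied by 1000; see "Proof of Concept" below):

    # 2 GB/core nodes: 2 GB of memory charges like 1 CPU
    PartitionName=small ... ChargePerCPU=1.0 ChargePerMemGB=0.5
    # 8 GB/core nodes: 8 GB of memory charges like 1 CPU
    PartitionName=big ... ChargePerCPU=1.0 ChargePerMemGB=0.125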

If you define the charge rate for each CPU to be 1.0 and the charge rate per GB of memory to be 0.5, that says 2 GB of memory is equivalent to 1 CPU. 4 GB of memory would be equivalent to 2 CPUs (4 GB * 0.5/GB). Since it is a MAX() of all the available (resource * charge_rate) combinations, the largest value is chosen. If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the user gets charged for using all the RAM. If a user uses 16 CPUs and 1 MB, the user gets charged for 16 CPUs.


Downsides

The problem that is not completely solved: if a user uses 1 CPU but 3/4 of the memory on a node, they only get billed for 3/4 of the node but may make it unusable for others who need a whole or half node. I'm not sure of a great way to solve that besides modifying the request in a job submit plugin or requiring exclusive node access.
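
To illustrate with assumed numbers: on a 16-core, 64 GB (4 GB/core) node with ChargePerCPU=1.0 and ChargePerMemGB=0.25, a job using 1 CPU and 48 GB is charged MAX(1 * 1.0, 48 * 0.25) = 12 CPU equivalents, i.e. 3/4 of the node, yet the 15 idle cores only have 16 GB left between them.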

One other complication is resources that are measured by a counter rather than a static allocation value, such as network bandwidth or energy. This is a problem because the current approach is to begin decaying the cputime (aka usage) immediately as it accumulates. You would have to keep a delta value for each counter-based resource: track that, say, 5 GB have been transmitted since the last decay thread iteration, then add only that 5 GB. This could get messy when comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb) each iteration, since the bandwidth may never reach a high enough value to matter between iterations but might when considered over the entire job.
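
A minimal sketch of that delta tracking (not part of the attached patch; names are hypothetical):

    #include <stdint.h>

    /* Hypothetical per-job state for one counter-based resource. */
    typedef struct {
        uint64_t last_counter; /* value at the previous decay iteration */
    } counter_state_t;

    /* Return how much the counter grew since the last decay iteration
     * and remember the new value for next time. */
    static uint64_t _counter_delta(counter_state_t *state, uint64_t current)
    {
        uint64_t delta = current - state->last_counter;
        state->last_counter = current;
        return delta;
    }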

I don't think this proposal would be too bad for something like energy. You could define a charge rate per joule (or kilojoule or whatever) equal to the node's core count divided by its minimum power, then look at the energy delta for that time period. If a user were allocated all cores and used minimum power, they would be charged 1.0 * core count. If they were allocated all cores and used maximum power, they would effectively be charged that base plus the difference between the node's max and min energy for the period times the energy charge rate. This calculation, as with the others, would occur once per decay thread iteration.
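
A worked example with assumed numbers: on a 16-core node with 100 W minimum and 400 W maximum power, the charge rate would be 16 / 100 = 0.16 CPU-seconds per joule. Over a 60-second decay iteration at minimum power, the energy delta is 6,000 J, charging 6,000 * 0.16 = 960 CPU-seconds, exactly 16 CPUs * 60 s. At maximum power the delta is 24,000 J, charging 3,840 CPU-seconds: the same 960 base plus (24,000 - 6,000) * 0.16 = 2,880 for the extra energy.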


User Education

The reason I like this approach is that it is incredibly simple to implement, and I don't think it takes much effort to explain to users. It would be easy to add other resources you want to charge for (it would require a code addition, though a pretty simple one if the data is available in the right structs). It doesn't require any RPC changes. sshare, etc. only need man page clarifications to say that the usage data is in "CPU equivalents". No new fields are required.

As for user education, you just need to explain the concept of "CPU equivalents", something that can be easily done in the documentation. The slurm.conf partition lines would be relatively easy to read too. If you don't need to change the behavior, no slurm.conf changes or explanations to users are required.


Proof of Concept

I did a really quick proof of concept (attached) based on the master branch. It is very simple to charge for most things as long as the data is there in the existing structs. One caveat for the test patch: I didn't see a float handler in the config parser, so I skipped that for now. Instead, each charge parameter in slurm.conf should be set to (desired_value * 1000). Proper float handling can be added if this is the route people want to take. The patch currently implements charging for CPUs, memory (GB), and nodes.
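
For the charge rates used in the examples above, the test patch would therefore expect partition lines along these lines (sketch):

    # PoC integer scaling: configured value = desired_rate * 1000
    PartitionName=p1 ... ChargePerCPU=1000 ChargePerMemGB=500 ChargePerNode=0

i.e. 1.0 per CPU, 0.5 per GB of memory, and 0.0 per node.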

Note: I saw a similar idea in a bug report from the University of Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858.

Ryan

On 07/25/2014 10:31 AM, Ryan Cox wrote:

Bill and Don,

We have wondered about this ourselves. I just came up with this idea and haven't thought it through completely, but option two seems like the easiest. For example, you could modify lines like https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 to have a MAX() of a few different types.

I seem to recall seeing this on the list or in a bug report somewhere already, but you could have different charge rates for memory or GPUs compared to a CPU, maybe on a per partition basis. You could give each of them a charge rate like: PartitionName=p1 ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 ...

So the line I referenced would be something like the following (except using real code and real struct members, etc): real_decay = run_decay * MAX(CPUs*ChargePerCPU, TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);

In this case, each CPU is 1.0 but each GB of RAM is 0.5. Assuming no GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting usage is 1.0. But if they use 4 GB of RAM and 1 CPU, it is 2.0 just like they had been using 2 CPUs. Essentially you define every 2 GB of RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with "cpu equivalents".

It might be harder to explain to users but I don't think it would be too bad.

Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:
Bill,

As I understand the dilemma you presented, you want to maximize the utilization of node resources when running with Slurm configured for SelectType=select/cons_res. To do this, you would like to nudge users into requesting only the amount of memory they will need for their jobs. The nudge would be in the form of decreased fair-share priority for users' jobs that request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists. I can only offer alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor priority plugin. This would be a substantial undertaking as it touches code not just in multifactor/priority_multifactor.c but also in structures that are defined in common/assoc_mgr.h as well as sshare itself.

A second, less invasive option would be to redefine the multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and memory usage. These changes could be more localized to the multifactor/priority_multifactor.c module. However, you would have a harder time justifying a user's sshare report because the usage numbers would no longer track jobs' historical cpu usage. Your response to a user who asked you to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come to mind.

Don Lipari

-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there. But we
have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot where
we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that if this value was too high
that they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring, say,
half the node's memory.  These are mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where this is causing issues.

Fairshare has a number of other issues as well, which I will send under
a different email.

Now maybe this is just a matter of constant monitoring of user jobs and
proactively going after those users having small memory per core
requirements.  We have attempted this in the past and have found that
the first job they run which crashes due to insufficient memory results
in all scripts being increased and so the process is never ending.

Another solution is to simply trust the users and just keep reminding
them about allocations.  They are usually a smart bunch and are quite
creative when it comes to getting jobs to run!  So maybe I am concerned
over nothing at all and things will just work out.

Bill


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University

diff --git a/src/common/read_config.c b/src/common/read_config.c
index e7bfcd1..71b48c2 100644
--- a/src/common/read_config.c
+++ b/src/common/read_config.c
@@ -1029,12 +1029,19 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 	s_p_hashtbl_t *tbl, *dflt;
 	slurm_conf_partition_t *p;
 	char *tmp = NULL;
+	uint32_t tmpu32 = 0;
 	static s_p_options_t _partition_options[] = {
 		{"AllocNodes", S_P_STRING},
 		{"AllowAccounts",S_P_STRING},
 		{"AllowGroups", S_P_STRING},
 		{"AllowQos", S_P_STRING},
 		{"Alternate", S_P_STRING},
+		/* FIXME: add float handler. For proof of concept,
+	 * config file value == 1000 * desired value. */
+		{"ChargePerCPU", S_P_UINT32},
+		{"ChargePerNode", S_P_UINT32},
+		{"ChargePerMemGB", S_P_UINT32},
+
 		{"DefMemPerCPU", S_P_UINT32},
 		{"DefMemPerNode", S_P_UINT32},
 		{"Default", S_P_BOOLEAN}, /* YES or NO */
@@ -1081,6 +1088,26 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 
 		p->name = xstrdup(value);
 
+		/* FIXME: add float handler. For proof of concept,
+	 * config file value == 1000 * desired value. */
+		if (!s_p_get_uint32(&tmpu32, "ChargePerCPU", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerCPU", dflt))
+			p->charge_cpu = 1.0;
+		else
+			p->charge_cpu = (double)tmpu32 / 1000.0;
+
+		if (!s_p_get_uint32(&tmpu32, "ChargePerNode", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerNode", dflt))
+			p->charge_node = 0.0;
+		else
+			p->charge_node = (double)tmpu32 / 1000.0;
+
+		if (!s_p_get_uint32(&tmpu32, "ChargePerMemGB", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerMemGB", dflt))
+			p->charge_mem_gb = 0.0;
+		else
+			p->charge_mem_gb = (double)tmpu32 / 1000.0;
+
 		if (!s_p_get_string(&p->allow_accounts, "AllowAccounts",tbl))
 			s_p_get_string(&p->allow_accounts, "AllowAccounts", dflt);
 		if (p->allow_accounts &&
diff --git a/src/common/read_config.h b/src/common/read_config.h
index b733daa..9233d9a 100644
--- a/src/common/read_config.h
+++ b/src/common/read_config.h
@@ -242,6 +242,12 @@ typedef struct slurm_conf_partition {
 	char *alternate;	/* name of alternate partition */
 	uint16_t cr_type;	/* Custom CR values for partition (supported
 				 * by select/cons_res plugin only) */
+
+	/* not sure if these should be doubles or floats */
+	double charge_cpu;
+	double charge_node;
+	double charge_mem_gb;
+
 	uint32_t def_mem_per_cpu; /* default MB memory per allocated CPU */
 	bool default_flag;	/* Set if default partition */
 	uint32_t default_time;	/* minutes or INFINITE */
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 6a3bbd0..f675822 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -604,6 +604,33 @@ extern void set_priority_factors(time_t start_time, struct job_record *job_ptr)
 }
 
 
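+/* Return the job's "CPU equivalents": the MAX over each allocated
+ * resource multiplied by its per-partition charge rate. */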
+static double _cpu_equivalents(struct job_record *job_ptr)
+{
+	double equiv = 0.0;
+	struct part_record *part_ptr = job_ptr->part_ptr;
+	uint32_t total_memory;
+
+	if (job_ptr->details->pn_min_memory & MEM_PER_CPU)
+		total_memory = (job_ptr->details->pn_min_memory ^ MEM_PER_CPU) *
+			       job_ptr->total_cpus;
+	else
+		total_memory = job_ptr->details->pn_min_memory *
+			       job_ptr->total_nodes;
+
+	info("_cpu_equivalents: job %u", job_ptr->job_id);
+	equiv = job_ptr->total_cpus * part_ptr->charge_cpu;
+	info("_cpu_equivalents: job_ptr->total_cpus * part_ptr->charge_cpu: %u * %f", job_ptr->total_cpus, part_ptr->charge_cpu);
+	equiv = MAX(equiv, job_ptr->total_nodes * part_ptr->charge_node);
+	info("_cpu_equivalents: job_ptr->total_nodes * part_ptr->charge_node: %u * %f", job_ptr->total_nodes, part_ptr->charge_node);
+	equiv = MAX(equiv, (total_memory/1024.0) * part_ptr->charge_mem_gb);
+	info("_cpu_equivalents: total_memory * part_ptr->charge_mem_gb: %u MB * %f", total_memory, part_ptr->charge_mem_gb);
+	info("_cpu_equivalents: FINAL: %f", equiv);
+	return equiv;
+}
+
+
 /*
  * apply decay factor to all associations usage_raw
  * IN: real_decay - decay to be applied to each associations' used
@@ -1388,7 +1413,7 @@ static int _apply_new_usage(struct job_record *job_ptr,
 	/* get the time in decayed fashion */
 	run_decay = run_delta * pow(decay_factor, run_delta);
 
-	real_decay = run_decay * (double)job_ptr->total_cpus;
+	real_decay = run_decay * _cpu_equivalents(job_ptr);
 
 	assoc_mgr_lock(&locks);
 	/* Just to make sure we don't make a
diff --git a/src/slurmctld/read_config.c b/src/slurmctld/read_config.c
index 4f8c09e..8f35054 100644
--- a/src/slurmctld/read_config.c
+++ b/src/slurmctld/read_config.c
@@ -692,6 +692,9 @@ static int _build_single_partitionline_info(slurm_conf_partition_t *part)
 	part_ptr->state_up       = part->state_up;
 	part_ptr->grace_time     = part->grace_time;
 	part_ptr->cr_type        = part->cr_type;
+	part_ptr->charge_cpu     = part->charge_cpu;
+	part_ptr->charge_node    = part->charge_node;
+	part_ptr->charge_mem_gb  = part->charge_mem_gb;
 
 	if (part->allow_accounts) {
 		xfree(part_ptr->allow_accounts);
diff --git a/src/slurmctld/slurmctld.h b/src/slurmctld/slurmctld.h
index a9a40c5..509ddb9 100644
--- a/src/slurmctld/slurmctld.h
+++ b/src/slurmctld/slurmctld.h
@@ -314,6 +314,12 @@ struct part_record {
 	bitstr_t *allow_qos_bitstr; /* (DON'T PACK) assocaited with
 				 * char *allow_qos but used internally */
 	char *alternate; 	/* name of alternate partition */
+
+	/* double or float for these? */
+	double charge_cpu;
+	double charge_node;
+	double charge_mem_gb;
+
 	uint32_t def_mem_per_cpu; /* default MB memory per allocated CPU */
 	uint32_t default_time;	/* minutes, NO_VAL or INFINITE */
 	char *deny_accounts;	/* comma delimited list of denied accounts */
