I'm interested in hearing opinions on this, if any. Basically, I think there is an easy solution to the problem that a user who uses few CPUs but a lot of memory is not reflected well in the CPU-centric usage stats.

Below is my proposal. There are likely some other good approaches out there too (Don and Janne presented some) so feel free to tell me that you don't like this idea :)


Short version

I propose that the Raw Usage be modified to *optionally* be ("CPU equivalents" * time) instead of just (CPUs * time). The "CPU equivalent" would be a MAX() of the job's CPUs, memory, nodes, GPUs, energy, or whatever over that time period, each multiplied by a corresponding charge rate that an admin can configure on a per-partition basis.
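
Roughly, in the spirit of the attached patch (parameter names are illustrative):

    cpu_equivalents = MAX(total_cpus   * ChargePerCPU,
                          total_nodes  * ChargePerNode,
                          total_mem_gb * ChargePerMemGB
                          /* , GPUs, energy, ... */);
    usage_raw += decayed_run_time * cpu_equivalents;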

I wrote a simple proof of concept patch to demonstrate this (see "Proof of Concept" below for details).


Longer version

The CPU equivalent would be used in place of total_cpus for calculating usage_raw. I propose that the default charge rate be 1.0 for each CPU in a job and 0.0 for everything else. This is the current behavior, so there are no behavior changes if you choose not to define different charge rates. The reason I think this should be done on a partition basis is that different partitions may have nodes with different memory/core ratios, etc.: one partition may have 2 GB/core nodes and another 8 GB/core nodes, and you may want to charge differently on each.
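
As a sketch, using the parameter names from the proof-of-concept patch (which for now actually expects these values multiplied by 1000; see "Proof of Concept" below):

    # 2 GB/core nodes: 2 GB of memory charges like 1 CPU
    PartitionName=small ... ChargePerCPU=1.0 ChargePerMemGB=0.5
    # 8 GB/core nodes: 8 GB of memory charges like 1 CPU
    PartitionName=big ... ChargePerCPU=1.0 ChargePerMemGB=0.125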

If you define the charge rate for each CPU to be 1.0 and the charge rate per GB of memory to be 0.5, that says 2 GB of memory is equivalent to 1 CPU. 4 GB of memory would be equivalent to 2 CPUs (4 GB * 0.5/GB). Since it is a MAX() of all the available (resource * charge_rate) combinations, the largest value is chosen. If a user uses 1 CPU and 1 TB of RAM out of a 1 TB node, the user gets charged for using all the RAM. If a user uses 16 CPUs and 1 MB, the user gets charged for 16 CPUs.


Downsides

The problem that is not completely solved: if a user uses 1 CPU but 3/4 of the memory on a node, they only get billed for 3/4 of the node but may make it unusable for others who need a whole or half node. I'm not sure of a great way to solve that besides modifying the request in a job submit plugin or requiring exclusive node access.
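
To illustrate with assumed numbers: on a 16-core, 64 GB (4 GB/core) node with ChargePerCPU=1.0 and ChargePerMemGB=0.25, a job using 1 CPU and 48 GB is charged MAX(1 * 1.0, 48 * 0.25) = 12 CPU equivalents, i.e. 3/4 of the node, yet the 15 idle cores only have 16 GB left between them.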

One other complication is resources that are measured by a counter rather than a static allocation value, such as network bandwidth or energy. This is a problem because the current approach is to begin decaying the cputime (aka usage) immediately as it accumulates. You would have to keep a delta value for each counter-based resource: track that, say, 5 GB have been transmitted since the last decay thread iteration, then add only that 5 GB. This could get messy when comparing MAX(total_cpus * charge_per_cpu, total_bw * charge_bw_per_gb) each iteration, since the bandwidth may never reach a high enough value to matter between iterations but might when considered over the entire job.
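
A minimal sketch of that delta tracking (not part of the attached patch; names are hypothetical):

    #include <stdint.h>

    /* Hypothetical per-job state for one counter-based resource. */
    typedef struct {
        uint64_t last_counter; /* value at the previous decay iteration */
    } counter_state_t;

    /* Return how much the counter grew since the last decay iteration
     * and remember the new value for next time. */
    static uint64_t _counter_delta(counter_state_t *state, uint64_t current)
    {
        uint64_t delta = current - state->last_counter;
        state->last_counter = current;
        return delta;
    }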

I don't think this proposal would be too bad for something like energy. You could define a charge rate per joule (or kilojoule or whatever) equal to the node's core count divided by its minimum power, then look at the energy delta for that time period. If a user were allocated all cores and used minimum power, they would be charged 1.0 * core count. If they were allocated all cores and used maximum power, they would effectively be charged that base plus the difference between the node's max and min energy for the period times the energy charge rate. This calculation, as with the others, would occur once per decay thread iteration.
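
A worked example with assumed numbers: on a 16-core node with 100 W minimum and 400 W maximum power, the charge rate would be 16 / 100 = 0.16 CPU-seconds per joule. Over a 60-second decay iteration at minimum power, the energy delta is 6,000 J, charging 6,000 * 0.16 = 960 CPU-seconds, exactly 16 CPUs * 60 s. At maximum power the delta is 24,000 J, charging 3,840 CPU-seconds: the same 960 base plus (24,000 - 6,000) * 0.16 = 2,880 for the extra energy.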


User Education

The reason I like this approach is that it is incredibly simple to implement, and I don't think it takes much effort to explain to users. It would be easy to add other resources you want to charge for (it would require a code addition, though a pretty simple one if the data is available in the right structs). It doesn't require any RPC changes. sshare, etc. only need man page clarifications to say that the usage data is in "CPU equivalents". No new fields are required.

As for user education, you just need to explain the concept of "CPU equivalents", something that can be easily done in the documentation. The slurm.conf partition lines would be relatively easy to read too. If you don't need to change the behavior, no slurm.conf changes or explanations to users are required.


Proof of Concept

I did a really quick proof of concept (attached) based on the master branch. It is very simple to charge for most things as long as the data is there in the existing structs. One caveat for the test patch: I didn't see a float handler in the config parser, so I skipped that for now. Instead, each charge parameter in slurm.conf should be set to (desired_value * 1000). Proper float handling can be added if this is the route people want to take. The patch currently implements charging for CPUs, memory (GB), and nodes.
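
For the charge rates used in the examples above, the test patch would therefore expect partition lines along these lines (sketch):

    # PoC integer scaling: configured value = desired_rate * 1000
    PartitionName=p1 ... ChargePerCPU=1000 ChargePerMemGB=500 ChargePerNode=0

i.e. 1.0 per CPU, 0.5 per GB of memory, and 0.0 per node.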

Note: I saw a similar idea in a bug report from the University of Chicago: http://bugs.schedmd.com/show_bug.cgi?id=858.

Ryan

On 07/25/2014 10:31 AM, Ryan Cox wrote:

Bill and Don,

We have wondered about this ourselves. I just came up with this idea and haven't thought it through completely, but option two seems like the easiest. For example, you could modify lines like https://github.com/SchedMD/slurm/blob/8a1e1384bacf690aed4c1f384da77a0cd978a63f/src/plugins/priority/multifactor/priority_multifactor.c#L952 to have a MAX() of a few different types.

I seem to recall seeing this on the list or in a bug report somewhere already, but you could have different charge rates for memory or GPUs compared to a CPU, maybe on a per partition basis. You could give each of them a charge rate like: PartitionName=p1 ChargePerCPU=1.0 ChargePerGB=0.5 ChargePerGPU=2.0 ...

So the line I referenced would be something like the following (except using real code and real struct members, etc): real_decay = run_decay * MAX(CPUs*ChargePerCPU, TotalJobMemory*ChargePerGB, GPUs*ChargePerGPU);

In this case, each CPU is 1.0 but each GB of RAM is 0.5. Assuming no GPUs used, if the user requests 1 CPU and 2 GB of RAM the resulting usage is 1.0. But if they use 4 GB of RAM and 1 CPU, it is 2.0 just like they had been using 2 CPUs. Essentially you define every 2 GB of RAM to be equal to 1 CPU, so raw_usage could be redefined to deal with "cpu equivalents".

It might be harder to explain to users but I don't think it would be too bad.

Ryan

On 07/25/2014 10:05 AM, Lipari, Don wrote:
Bill,

As I understand the dilemma you presented, you want to maximize the utilization of node resources when running with Slurm configured for SelectType=select/cons_res. To do this, you would like to nudge users into requesting only the amount of memory they will need for their jobs. The nudge would be in the form of decreased fair-share priority for users' jobs that request only one core but lots of memory.

I don't know of a way for Slurm to do this as it exists. I can only offer alternatives that have their pros and cons.

One alternative would be to add memory usage support to the multifactor priority plugin. This would be a substantial undertaking as it touches code not just in multifactor/priority_multifactor.c but also in structures that are defined in common/assoc_mgr.h as well as sshare itself.

A second, less invasive option would be to redefine the multifactor/priority_multifactor.c's raw_usage to make it a configurable blend of cpu and memory usage. These changes could be more localized to the multifactor/priority_multifactor.c module. However, you would have a harder time justifying a user's sshare report because the usage numbers would no longer track jobs' historical cpu usage. Your response to a user who asked you to justify their sshare usage report would be, "trust me, it's right".

A third alternative (as I'm sure you know) is to give up on perfectly packed nodes and make every 4G of memory requested cost 1 cpu of allocation.

Perhaps there are other options, but those are the ones that immediately come to mind.

Don Lipari

-----Original Message-----
From: Bill Wichser [mailto:b...@princeton.edu]
Sent: Friday, July 25, 2014 6:14 AM
To: slurm-dev
Subject: [slurm-dev] fairshare - memory resource allocation


I'd like to revisit this...


After struggling with memory allocations in some flavor of PBS for over
20 years, it was certainly a wonderful thing to have cgroup support
right out of the box with Slurm.  No longer do we have a shared node's
jobs eating all the memory and killing everything running there. But we
have found that there is a cost to this and that is a failure to
adequately feed back this information to the fairshare mechanism.

In looking at running jobs over the past 4 months, we found a spot where
we could reduce the DefMemPerCPU allocation in slurm.conf to a value
about 1G less than the actual G/core available.  This meant that we had
to notify the users close to this max value so that they could adjust
their scripts. We also notified users that if this value was too high
that they'd do best to reduce that limit to exactly what they require.
This has proven much less successful.

So our default is 3G/core with an actual node having 4G/core available.
This allows some bigger memory jobs and some smaller memory jobs to
make use of the node as there are available cores but not enough memory
for the default case.

Now that is good. It allows higher utilization of nodes, all the while
protecting the memory of each other's processes.  But the problem of
fairshare comes about pretty quickly when there are jobs requiring, say,
half the node's memory.  These are mostly serial jobs requesting a single
core.  So this leaves about 11 cores with only about 2G/core left.
Worse, when it comes to fairshare calculations it appears that these
jobs are only using a single core when in fact they are using half a
node.  You can see where this is causing issues.

Fairshare has a number of other issues as well, which I will send under
a different email.

Now maybe this is just a matter of constant monitoring of user jobs and
proactively going after those users having small memory per core
requirements.  We have attempted this in the past and have found that
the first job they run which crashes due to insufficient memory results
in all scripts being increased and so the process is never ending.

Another solution is to simply trust the users and just keep reminding
them about allocations.  They are usually a smart bunch and are quite
creative when it comes to getting jobs to run!  So maybe I am concerned
over nothing at all and things will just work out.

Bill


--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University

diff --git a/src/common/read_config.c b/src/common/read_config.c
index e7bfcd1..71b48c2 100644
--- a/src/common/read_config.c
+++ b/src/common/read_config.c
@@ -1029,12 +1029,19 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 	s_p_hashtbl_t *tbl, *dflt;
 	slurm_conf_partition_t *p;
 	char *tmp = NULL;
+	uint32_t tmpu32 = 0;
 	static s_p_options_t _partition_options[] = {
 		{"AllocNodes", S_P_STRING},
 		{"AllowAccounts",S_P_STRING},
 		{"AllowGroups", S_P_STRING},
 		{"AllowQos", S_P_STRING},
 		{"Alternate", S_P_STRING},
+		/* FIXME: add float handler. For proof of concept,
+	 * config file value == 1000 * desired value. */
+		{"ChargePerCPU", S_P_UINT32},
+		{"ChargePerNode", S_P_UINT32},
+		{"ChargePerMemGB", S_P_UINT32},
+
 		{"DefMemPerCPU", S_P_UINT32},
 		{"DefMemPerNode", S_P_UINT32},
 		{"Default", S_P_BOOLEAN}, /* YES or NO */
@@ -1081,6 +1088,26 @@ static int _parse_partitionname(void **dest, slurm_parser_enum_t type,
 
 		p->name = xstrdup(value);
 
+		/* FIXME: add float handler. For proof of concept,
+	 * config file value == 1000 * desired value. */
+		if (!s_p_get_uint32(&tmpu32, "ChargePerCPU", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerCPU", dflt))
+			p->charge_cpu = 1.0;
+		else
+			p->charge_cpu = (double)tmpu32 / 1000.0;
+
+		if (!s_p_get_uint32(&tmpu32, "ChargePerNode", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerNode", dflt))
+			p->charge_node = 0.0;
+		else
+			p->charge_node = (double)tmpu32 / 1000.0;
+
+		if (!s_p_get_uint32(&tmpu32, "ChargePerMemGB", tbl) &&
+		    !s_p_get_uint32(&tmpu32, "ChargePerMemGB", dflt))
+			p->charge_mem_gb = 0.0;
+		else
+			p->charge_mem_gb = (double)tmpu32 / 1000.0;
+
 		if (!s_p_get_string(&p->allow_accounts, "AllowAccounts",tbl))
 			s_p_get_string(&p->allow_accounts, "AllowAccounts", dflt);
 		if (p->allow_accounts &&
diff --git a/src/common/read_config.h b/src/common/read_config.h
index b733daa..9233d9a 100644
--- a/src/common/read_config.h
+++ b/src/common/read_config.h
@@ -242,6 +242,12 @@ typedef struct slurm_conf_partition {
 	char *alternate;	/* name of alternate partition */
 	uint16_t cr_type;	/* Custom CR values for partition (supported
 				 * by select/cons_res plugin only) */
+
+	/* not sure if these should be doubles or floats */
+	double charge_cpu;
+	double charge_node;
+	double charge_mem_gb;
+
 	uint32_t def_mem_per_cpu; /* default MB memory per allocated CPU */
 	bool default_flag;	/* Set if default partition */
 	uint32_t default_time;	/* minutes or INFINITE */
diff --git a/src/plugins/priority/multifactor/priority_multifactor.c b/src/plugins/priority/multifactor/priority_multifactor.c
index 6a3bbd0..f675822 100644
--- a/src/plugins/priority/multifactor/priority_multifactor.c
+++ b/src/plugins/priority/multifactor/priority_multifactor.c
@@ -604,6 +604,33 @@ extern void set_priority_factors(time_t start_time, struct job_record *job_ptr)
 }
 
 
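+/* Return the job's "CPU equivalents": the MAX over each allocated
+ * resource multiplied by its per-partition charge rate. */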
+static double _cpu_equivalents(struct job_record *job_ptr)
+{
+	double equiv = 0.0;
+	struct part_record *part_ptr = job_ptr->part_ptr;
+	uint32_t total_memory;
+
+	if (job_ptr->details->pn_min_memory & MEM_PER_CPU)
+		total_memory = (job_ptr->details->pn_min_memory ^ MEM_PER_CPU) *
+			       job_ptr->total_cpus;
+	else
+		total_memory = job_ptr->details->pn_min_memory *
+			       job_ptr->total_nodes;
+
+	info("_cpu_equivalents: job %u", job_ptr->job_id);
+	equiv = job_ptr->total_cpus * part_ptr->charge_cpu;
+	info("_cpu_equivalents: job_ptr->total_cpus * part_ptr->charge_cpu: %u * %f", job_ptr->total_cpus, part_ptr->charge_cpu);
+	equiv = MAX(equiv, job_ptr->total_nodes * part_ptr->charge_node);
+	info("_cpu_equivalents: job_ptr->total_nodes * part_ptr->charge_node: %u * %f", job_ptr->total_nodes, part_ptr->charge_node);
+	equiv = MAX(equiv, (total_memory/1024.0) * part_ptr->charge_mem_gb);
+	info("_cpu_equivalents: total_memory * part_ptr->charge_mem_gb: %u MB * %f", total_memory, part_ptr->charge_mem_gb);
+	info("_cpu_equivalents: FINAL: %f", equiv);
+	return equiv;
+}
+
+
 /*
  * apply decay factor to all associations usage_raw
  * IN: real_decay - decay to be applied to each associations' used
@@ -1388,7 +1413,7 @@ static int _apply_new_usage(struct job_record *job_ptr,
 	/* get the time in decayed fashion */
 	run_decay = run_delta * pow(decay_factor, run_delta);
 
-	real_decay = run_decay * (double)job_ptr->total_cpus;
+	real_decay = run_decay * _cpu_equivalents(job_ptr);
 
 	assoc_mgr_lock(&locks);
 	/* Just to make sure we don't make a
diff --git a/src/slurmctld/read_config.c b/src/slurmctld/read_config.c
index 4f8c09e..8f35054 100644
--- a/src/slurmctld/read_config.c
+++ b/src/slurmctld/read_config.c
@@ -692,6 +692,9 @@ static int _build_single_partitionline_info(slurm_conf_partition_t *part)
 	part_ptr->state_up       = part->state_up;
 	part_ptr->grace_time     = part->grace_time;
 	part_ptr->cr_type        = part->cr_type;
+	part_ptr->charge_cpu     = part->charge_cpu;
+	part_ptr->charge_node    = part->charge_node;
+	part_ptr->charge_mem_gb  = part->charge_mem_gb;
 
 	if (part->allow_accounts) {
 		xfree(part_ptr->allow_accounts);
diff --git a/src/slurmctld/slurmctld.h b/src/slurmctld/slurmctld.h
index a9a40c5..509ddb9 100644
--- a/src/slurmctld/slurmctld.h
+++ b/src/slurmctld/slurmctld.h
@@ -314,6 +314,12 @@ struct part_record {
 	bitstr_t *allow_qos_bitstr; /* (DON'T PACK) assocaited with
 				 * char *allow_qos but used internally */
 	char *alternate; 	/* name of alternate partition */
+
+	/* double or float for these? */
+	double charge_cpu;
+	double charge_node;
+	double charge_mem_gb;
+
 	uint32_t def_mem_per_cpu; /* default MB memory per allocated CPU */
 	uint32_t default_time;	/* minutes, NO_VAL or INFINITE */
 	char *deny_accounts;	/* comma delimited list of denied accounts */
