[PATCH 3/6] sched,numa: preparations for complex topology placement

2014-10-17 Thread riel
From: Rik van Riel 

Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()
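
For illustration, the weight computed here is just the fraction of faults on
the node, scaled to 0-1000; splitting the calculation into explicit steps is
what lets group_weight() check total_faults before dividing. A standalone
user-space sketch of that calculation (invented numbers, not the kernel code):

#include <stdio.h>

/* weight = 1000 * faults / total_faults, guarding against a zero total */
static unsigned long weight(unsigned long faults, unsigned long total_faults)
{
	if (!total_faults)	/* the potential divide by zero */
		return 0;
	return 1000 * faults / total_faults;
}

int main(void)
{
	printf("%lu\n", weight(250, 1000));	/* 250: a quarter of the faults */
	printf("%lu\n", weight(100, 0));	/* 0: no faults recorded yet */
	return 0;
}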

Signed-off-by: Rik van Riel 
Tested-by: Chegu Vinod 
---
 kernel/sched/fair.c | 57 ++---
 1 file changed, 37 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d44052..1c540f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,9 +930,10 @@ static inline unsigned long group_faults_cpu(struct 
numa_group *group, int nid)
  * larger multiplier, in order to group tasks together that are almost
  * evenly spread out between numa nodes.
  */
-static inline unsigned long task_weight(struct task_struct *p, int nid)
+static inline unsigned long task_weight(struct task_struct *p, int nid,
+   int dist)
 {
-   unsigned long total_faults;
+   unsigned long faults, total_faults;
 
if (!p->numa_faults_memory)
return 0;
@@ -942,15 +943,25 @@ static inline unsigned long task_weight(struct 
task_struct *p, int nid)
if (!total_faults)
return 0;
 
-   return 1000 * task_faults(p, nid) / total_faults;
+   faults = task_faults(p, nid);
+   return 1000 * faults / total_faults;
 }
 
-static inline unsigned long group_weight(struct task_struct *p, int nid)
+static inline unsigned long group_weight(struct task_struct *p, int nid,
+int dist)
 {
-   if (!p->numa_group || !p->numa_group->total_faults)
+   unsigned long faults, total_faults;
+
+   if (!p->numa_group)
+   return 0;
+
+   total_faults = p->numa_group->total_faults;
+
+   if (!total_faults)
return 0;
 
-   return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+   faults = group_faults(p, nid);
+   return 1000 * faults / total_faults;
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
@@ -1083,6 +1094,7 @@ struct task_numa_env {
struct numa_stats src_stats, dst_stats;
 
int imbalance_pct;
+   int dist;
 
struct task_struct *best_task;
long best_imp;
@@ -1162,6 +1174,7 @@ static void task_numa_compare(struct task_numa_env *env,
long load;
long imp = env->p->numa_group ? groupimp : taskimp;
long moveimp = imp;
+   int dist = env->dist;
 
rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
@@ -1185,8 +1198,8 @@ static void task_numa_compare(struct task_numa_env *env,
 * in any group then look only at task weights.
 */
if (cur->numa_group == env->p->numa_group) {
-   imp = taskimp + task_weight(cur, env->src_nid) -
- task_weight(cur, env->dst_nid);
+   imp = taskimp + task_weight(cur, env->src_nid, dist) -
+ task_weight(cur, env->dst_nid, dist);
/*
 * Add some hysteresis to prevent swapping the
 * tasks within a group over tiny differences.
@@ -1200,11 +1213,11 @@ static void task_numa_compare(struct task_numa_env *env,
 * instead.
 */
if (cur->numa_group)
-   imp += group_weight(cur, env->src_nid) -
-  group_weight(cur, env->dst_nid);
+   imp += group_weight(cur, env->src_nid, dist) -
+  group_weight(cur, env->dst_nid, dist);
else
-   imp += task_weight(cur, env->src_nid) -
-  task_weight(cur, env->dst_nid);
+   imp += task_weight(cur, env->src_nid, dist) -
+  task_weight(cur, env->dst_nid, dist);
}
}
 
@@ -1303,7 +1316,7 @@ static int task_numa_migrate(struct task_struct *p)
};
struct sched_domain *sd;
unsigned long taskweight, groupweight;
-   int nid, ret;
+   int nid, ret, dist;
long taskimp, groupimp;
 
/*
@@ -1331,12 +1344,13 @@ static int task_numa_migrate(struct task_struct *p)
return -EINVAL;
}
 
-   taskweight = task_weight(p, env.src_nid);
-   groupweight = group_weight(p, env.src_nid);
-   update_numa_stats(&env.src_stats, env.src_nid);
env.dst_nid = p->numa_preferred_nid;
-   taskimp = task_weight(p, env.dst_nid) - taskweight;
-   groupimp = group_weight(p, env.dst_nid) - groupweight;
+   dist = env.dist = node_distance(env.src_nid, env.dst_nid);
+   taskweight = task_weight(p, env.src_nid, dist);
+   groupweight = group_weight(p, env.src_nid, dist);

[PATCH 0/6] sched,numa: weigh nearby nodes for task placement on complex NUMA topologies (v2)

2014-10-17 Thread riel
This patch set integrates two algorithms I have previously tested,
one for glueless mesh NUMA topologies, where NUMA nodes communicate
with far-away nodes through intermediary nodes, and one for backplane
topologies, where communication with far-away NUMA nodes happens
through backplane controllers (which cannot run tasks).

Due to the unavailability of 8 node systems, and the fact that I
am flying out to Linuxcon Europe / Plumbers / KVM Forum on Friday,
I have not tested these patches yet. However, with a conference (and
many familiar faces) coming up, it seemed like a good idea to get
the code out there, anyway.

Vinod tested the v1 series + patch 6 on a backplane topology
system, and I tested v2 on a glueless mesh topology system.
The patches appear to behave.

Placement of tasks on smaller, directly connected, NUMA systems
should not be affected at all by this patch series.


v2: remove the hops table and use numa_distance instead; tested
on a system with a glueless mesh topology, not yet tested on
a backplane-style system (but the algorithm should be unchanged)




[PATCH 2/6] sched,numa: classify the NUMA topology of a system

2014-10-17 Thread riel
From: Rik van Riel 

Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, i.e. a non-NUMA system.

Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.

On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.

On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance to anywhere else in the system.

Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.
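
To make the tests above concrete, here is a small user-space sketch (not the
kernel code) that applies the same classification to two invented 4-node
SLIT-style distance tables; the kernel version in the diff below walks
node_distance() over the online nodes instead:

#include <stdio.h>

#define N 4

/* Nodes in a chain 0-1-2-3: distant nodes reachable via intermediaries. */
static const int chain[N][N] = {
	{ 10, 20, 30, 40 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 40, 30, 20, 10 },
};

/* Two islands {0,1} and {2,3} behind a backplane: no usable intermediary. */
static const int islands[N][N] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static const char *classify(const int d[N][N])
{
	int a, b, c, max = 0, min_remote = 255;

	for (a = 0; a < N; a++) {
		for (b = 0; b < N; b++) {
			if (a == b)
				continue;
			if (d[a][b] > max)
				max = d[a][b];
			if (d[a][b] < min_remote)
				min_remote = d[a][b];
		}
	}

	/* Only one remote distance: every node is one hop from every other. */
	if (max == min_remote)
		return "NUMA_DIRECT";

	/* Find two nodes that are furthest apart. */
	for (a = 0; a < N; a++) {
		for (b = 0; b < N; b++) {
			if (d[a][b] < max)
				continue;
			/* Is there an intermediary node closer to both? */
			for (c = 0; c < N; c++)
				if (d[a][c] < max && d[b][c] < max)
					return "NUMA_GLUELESS_MESH";
			return "NUMA_BACKPLANE";
		}
	}
	return "NUMA_DIRECT";
}

int main(void)
{
	printf("chain table   -> %s\n", classify(chain));	/* glueless mesh */
	printf("islands table -> %s\n", classify(islands));	/* backplane */
	return 0;
}

With these tables the chain is reported as a glueless mesh and the islands
as a backplane topology.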

Signed-off-by: Rik van Riel 
Tested-by: Chegu Vinod 
---
 include/linux/topology.h |  7 +++
 kernel/sched/core.c  | 53 
 2 files changed, 60 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index dda6ee5..40d6cea 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -48,6 +48,13 @@
 
 int arch_update_cpu_topology(void);
 
+enum numa_topology_type {
+   NUMA_DIRECT,
+   NUMA_GLUELESS_MESH,
+   NUMA_BACKPLANE,
+};
+extern enum numa_topology_type sched_numa_topology_type;
+
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE 10
#define REMOTE_DISTANCE    20
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed427f9..19e6c1b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6075,6 +6075,7 @@ static void claim_allocations(int cpu, struct 
sched_domain *sd)
 
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
+enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
 int sched_max_numa_distance;
 static struct cpumask ***sched_domains_numa_masks;
@@ -6263,6 +6264,56 @@ bool find_numa_distance(int distance)
return false;
 }
 
+/*
+ * A system can have three types of NUMA topology:
+ * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
+ * NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
+ * NUMA_BACKPLANE: nodes can reach other nodes through a backplane
+ *
+ * The difference between a glueless mesh topology and a backplane
+ * topology lies in whether communication between not directly
+ * connected nodes goes through intermediary nodes (where programs
+ * could run), or through backplane controllers. This affects
+ * placement of programs.
+ *
+ * The type of topology can be discerned with the following tests:
+ * - If the maximum distance between any nodes is 1 hop, the system
+ *   is directly connected.
+ * - If for two nodes A and B, located N > 1 hops away from each other,
+ *   there is an intermediary node C, which is < N hops away from both
+ *   nodes A and B, the system is a glueless mesh.
+ */
+static void init_numa_topology_type(void)
+{
+   int a, b, c, n;
+
+   n = sched_max_numa_distance;
+
+   if (n <= 1)
+   sched_numa_topology_type = NUMA_DIRECT;
+
+   for_each_online_node(a) {
+   for_each_online_node(b) {
+   /* Find two nodes furthest removed from each other. */
+   if (node_distance(a, b) < n)
+   continue;
+
+   /* Is there an intermediary node between a and b? */
+   for_each_online_node(c) {
+   if (node_distance(a, c) < n &&
+   node_distance(b, c) < n) {
+   sched_numa_topology_type =
+   NUMA_GLUELESS_MESH;
+   return;
+   }
+   }
+
+   sched_numa_topology_type = NUMA_BACKPLANE;
+   return;
+   }
+   }
+}
+
 static void sched_init_numa(void)
 {
int next_distance, curr_distance = node_distance(0, 0);
@@ -6396,6 +6447,8 @@ static void sched_init_numa(void)
 
sched_domains_numa_levels = level;
sched_max_numa_distance = sched_domains_numa_distance[level - 1];
+
+   init_numa_topology_type();
 }
 
 static void sched_domains_numa_masks_set(int cpu)
-- 
1.9.3



[PATCH 4/6] sched,numa: calculate node scores in complex NUMA topologies

2014-10-17 Thread riel
From: Rik van Riel 

In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by scaling down the faults on
nearby nodes in proportion to their distance from the node being scored:
the further away a node is, the less its faults count. In this context,
a nearby node is any node closer than the maximum distance in the system.
Nodes at the maximum distance are skipped purely for efficiency reasons;
there is no real policy reason to do so.
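
As an illustration of the glueless mesh weighting, with invented distances
(maximum node distance 40, LOCAL_DISTANCE 10), a nearby node's faults are
scaled by (40 - dist) / (40 - 10), which is the same arithmetic used by
score_nearby_nodes() in the diff below:

#include <stdio.h>

int main(void)
{
	unsigned long faults = 900;	/* faults recorded on a nearby node */
	int max_dist = 40, local = 10;	/* invented distance values */
	int dist;

	for (dist = local; dist < max_dist; dist += 10)
		printf("distance %2d -> counted as %lu faults\n", dist,
		       faults * (max_dist - dist) / (max_dist - local));
	return 0;
}

Running it prints 900, 600 and 300 counted faults for distances 10, 20 and
30, so the closest nodes dominate the score.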

Placement policy on directly connected NUMA systems is not affected.

Signed-off-by: Rik van Riel 
Tested-by: Chegu Vinod 
---
 kernel/sched/fair.c | 74 +
 1 file changed, 74 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c540f5..6df460f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -924,6 +924,71 @@ static inline unsigned long group_faults_cpu(struct 
numa_group *group, int nid)
group->faults_cpu[task_faults_idx(nid, 1)];
 }
 
+/* Handle placement on systems where not all nodes are directly connected. */
+static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
+   int maxdist, bool task)
+{
+   unsigned long score = 0;
+   int node;
+
+   /*
+* All nodes are directly connected, and the same distance
+* from each other. No need for fancy placement algorithms.
+*/
+   if (sched_numa_topology_type == NUMA_DIRECT)
+   return 0;
+
+   /*
+* This code is called for each node, introducing N^2 complexity,
+* which should be ok given the number of nodes rarely exceeds 8.
+*/
+   for_each_online_node(node) {
+   unsigned long faults;
+   int dist = node_distance(nid, node);
+
+   /*
+* The furthest away nodes in the system are not interesting
+* for placement; nid was already counted.
+*/
+   if (dist == sched_max_numa_distance || node == nid)
+   continue;
+
+   /*
+* On systems with a backplane NUMA topology, compare groups
+* of nodes, and move tasks towards the group with the most
+* memory accesses. When comparing two nodes at distance
+* "hoplimit", only nodes closer by than "hoplimit" are part
+* of each group. Skip other nodes.
+*/
+   if (sched_numa_topology_type == NUMA_BACKPLANE &&
+   dist > maxdist)
+   continue;
+
+   /* Add up the faults from nearby nodes. */
+   if (task)
+   faults = task_faults(p, node);
+   else
+   faults = group_faults(p, node);
+
+   /*
+* On systems with a glueless mesh NUMA topology, there are
+* no fixed "groups of nodes". Instead, nodes that are not
+* directly connected bounce traffic through intermediate
+* nodes; a numa_group can occupy any set of nodes.
+* The further away a node is, the less the faults count.
+* This seems to result in good task placement.
+*/
+   if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
+   faults *= (sched_max_numa_distance - dist);
+   faults /= (sched_max_numa_distance - LOCAL_DISTANCE);
+   }
+
+   score += faults;
+   }
+
+   return score;
+}
+
 /*
  * These return the fraction of accesses done by a particular task, or
  * task group, on a particular numa node.  The group weight is given a
@@ -944,6 +1009,8 @@ static inline unsigned long task_weight(struct task_struct 
*p, int nid,
return 0;
 
faults = task_faults(p, nid);
+   faults += score_nearby_nodes(p, nid, dist, true);
+
return 1000 * faults / total_faults;
 }
 
@@ -961,6 +1028,8 @@ static inline unsigned long group_weight(struct 
task_struct *p, int nid,
  

[RFC PATCH 11/11] nohz,kvm,time: teach account_process_tick about guest time

2015-06-24 Thread riel
From: Rik van Riel 

When tick based accounting is run from a remote CPU, it is actually
possible to encounter a task with PF_VCPU set. Make sure to account
those as guest time.

Signed-off-by: Rik van Riel 
---
 kernel/sched/cputime.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 593f97b0fe3c..6295679fe5f5 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -488,7 +488,9 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
if (steal_account_process_tick(cpu))
return;
 
-   if (user_tick)
+   if (p->flags & PF_VCPU)
+   account_guest_time(p, cputime_one_jiffy, one_jiffy_scaled);
+   else if (user_tick)
account_user_time(p, cputime_one_jiffy, one_jiffy_scaled);
else if ((p != rq->idle) || (irq_count() != HARDIRQ_OFFSET))
account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
-- 
2.1.0



[RFC PATCH 08/11] nohz,timer: have housekeeper call account_process_tick for nohz cpus

2015-06-24 Thread riel
From: Rik van Riel 

Have the housekeeper CPU call account_process_tick to do tick based
accounting for remote nohz_full CPUs.

Signed-off-by: Rik van Riel 
---
 kernel/time/timer.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 2ece3aa5069c..6adebb373317 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include "../sched/sched.h"
 
 #include 
 #include 
@@ -1382,6 +1383,29 @@ unsigned long get_next_timer_interrupt(unsigned long now)
 }
 #endif
 
+#ifdef CONFIG_NO_HZ_FULL
+static void account_remote_process_ticks(void)
+{
+   int cpu;
+
+   /*
+* The current task on another CPU can get rescheduled while
+* we are updating the statistics. The rcu read lock ensures
+* the task does not get freed, so at worst the statistics will
+* be off a little bit, which is expected with tick based sampling.
+*/
+   rcu_read_lock();
+   for_each_cpu_and(cpu, tick_nohz_full_mask, cpu_online_mask) {
+   struct task_struct *p = cpu_curr(cpu);
+   int user_tick = (per_cpu(context_tracking.state, cpu) ==
+   CONTEXT_USER);
+   
+   account_process_tick(p, user_tick);
+   }
+   rcu_read_unlock();
+}
+#endif
+
 /*
  * Called from the timer interrupt handler to charge one tick to the current
  * process.  user_tick is 1 if the tick is user time, 0 for system.
@@ -1392,6 +1416,10 @@ void update_process_times(int user_tick)
 
/* Note: this timer irq context must be accounted for as well. */
account_process_tick(p, user_tick);
+#ifdef CONFIG_NO_HZ_FULL
+   if (is_timer_housekeeping_cpu(smp_processor_id()))
+   account_remote_process_ticks();
+#endif
run_local_timers();
rcu_check_callbacks(user_tick);
 #ifdef CONFIG_IRQ_WORK
-- 
2.1.0



[RFC PATCH 03/11] time,nohz: add cpu parameter to irqtime_account_process_tick

2015-06-24 Thread riel
From: Rik van Riel 

Add a cpu parameter to irqtime_account_process_tick, to specify what
cpu to run the statistics for.

In order for this to actually work on a different cpu, all the functions
called by irqtime_account_process_tick need to be able to handle working
for another CPU.

Signed-off-by: Rik van Riel 
---
 kernel/sched/cputime.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 84b2d24a2238..7df761cd6dfc 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -336,12 +336,14 @@ void thread_group_cputime(struct task_struct *tsk, struct 
task_cputime *times)
  * p->stime and friends are only updated on system time and not on irq
  * softirq as those do not count in task exec_runtime any more.
  */
-static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
+static void irqtime_account_process_tick(struct task_struct *p, int cpu,
+int user_tick,
 struct rq *rq, int ticks)
 {
cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
+   struct kernel_cpustat *kstat = &kcpustat_cpu(cpu);
u64 cputime = (__force u64) cputime_one_jiffy;
-   u64 *cpustat = kcpustat_this_cpu->cpustat;
+   u64 *cpustat = kstat->cpustat;
 
if (steal_account_process_tick())
return;
@@ -374,12 +376,14 @@ static void irqtime_account_process_tick(struct 
task_struct *p, int user_tick,
 static void irqtime_account_idle_ticks(int ticks)
 {
struct rq *rq = this_rq();
+   int cpu = smp_processor_id();
 
-   irqtime_account_process_tick(current, 0, rq, ticks);
+   irqtime_account_process_tick(current, cpu, 0, rq, ticks);
 }
 #else /* CONFIG_IRQ_TIME_ACCOUNTING */
 static inline void irqtime_account_idle_ticks(int ticks) {}
-static inline void irqtime_account_process_tick(struct task_struct *p, int 
user_tick,
+static inline void irqtime_account_process_tick(struct task_struct *p, int cpu,
+   int user_tick,
struct rq *rq, int nr_ticks) {}
 #endif /* CONFIG_IRQ_TIME_ACCOUNTING */
 
@@ -475,7 +479,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
return;
 
if (sched_clock_irqtime) {
-   irqtime_account_process_tick(p, user_tick, rq, 1);
+   irqtime_account_process_tick(p, cpu, user_tick, rq, 1);
return;
}
 
-- 
2.1.0



[RFC PATCH 01/11] nohz,time: make account_process_tick work on the task's CPU

2015-06-24 Thread riel
From: Rik van Riel 

Teach account_process_tick to work on the CPU of the task
specified in the function argument. This allows us to do
remote tick based sampling of a nohz_full cpu from a
housekeeping CPU.

Signed-off-by: Rik van Riel 
---
 kernel/sched/cputime.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 8394b1ee600c..97077c282626 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -463,8 +463,14 @@ void thread_group_cputime_adjusted(struct task_struct *p, 
cputime_t *ut, cputime
 void account_process_tick(struct task_struct *p, int user_tick)
 {
cputime_t one_jiffy_scaled = cputime_to_scaled(cputime_one_jiffy);
-   struct rq *rq = this_rq();
+   int cpu = task_cpu(p);
+   struct rq *rq = cpu_rq(cpu);
 
+   /*
+* Tests current CPU, not "cpu", to see whether account_process_tick()
+* should do work on this invocation, or whether time keeping for
+* this CPU is done in some other way.
+*/
if (vtime_accounting_enabled())
return;
 
-- 
2.1.0



[RFC INCOMPLETE] tick based timekeeping from a housekeeping CPU

2015-06-24 Thread riel
This series seems to make basic tick based time sampling from a
housekeeping CPU work, allowing us to have tick based accounting
on a nohz_full CPU, and no longer doing vtime accounting on those
CPUs.

It still needs a major  cleanup, and steal time accounting and irq
accounting are still missing.

Just posting this to get a sense of whether I am headed in the right
direction, or whether we need some major overhaul/cleanup of part of
the code first.



[RFC PATCH 04/11] time,nohz: add cpu parameter to steal_account_process_tick

2015-06-24 Thread riel
From: Rik van Riel 

Add a cpu parameter to steal_account_process_tick, so it can
be used to do CPU time accounting for another CPU.

Signed-off-by: Rik van Riel 
---
 kernel/sched/cputime.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 7df761cd6dfc..9717d86cf2ab 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -254,15 +254,15 @@ void account_idle_time(cputime_t cputime)
cpustat[CPUTIME_IDLE] += (__force u64) cputime;
 }
 
-static __always_inline bool steal_account_process_tick(void)
+static __always_inline bool steal_account_process_tick(int cpu)
 {
 #ifdef CONFIG_PARAVIRT
if (static_key_false(&paravirt_steal_enabled)) {
u64 steal;
cputime_t steal_ct;
 
-   steal = paravirt_steal_clock(smp_processor_id());
-   steal -= this_rq()->prev_steal_time;
+   steal = paravirt_steal_clock(cpu);
+   steal -= cpu_rq(cpu)->prev_steal_time;
 
/*
 * cputime_t may be less precise than nsecs (eg: if it's
@@ -270,7 +270,7 @@ static __always_inline bool steal_account_process_tick(void)
 * granularity and account the rest on the next rounds.
 */
steal_ct = nsecs_to_cputime(steal);
-   this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);
+   cpu_rq(cpu)->prev_steal_time += cputime_to_nsecs(steal_ct);
 
account_steal_time(steal_ct);
return steal_ct;
@@ -345,7 +345,7 @@ static void irqtime_account_process_tick(struct task_struct 
*p, int cpu,
u64 cputime = (__force u64) cputime_one_jiffy;
u64 *cpustat = kstat->cpustat;
 
-   if (steal_account_process_tick())
+   if (steal_account_process_tick(cpu))
return;
 
cputime *= ticks;
@@ -483,7 +483,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
return;
}
 
-   if (steal_account_process_tick())
+   if (steal_account_process_tick(cpu))
return;
 
if (user_tick)
-- 
2.1.0



[RFC PATCH 09/11] nohz,time: add tick_accounting_remote macro

2015-06-24 Thread riel
From: Rik van Riel 

With the introduction of remote tick based sampling, we now have
three ways of gathering time statistics:
- local tick based sampling
- vtime accounting (used natively on some architectures)
- remote tick based sampling

On a system with remote tick based sampling, the housekeeping
CPUs will still do local tick based sampling. This results in
needing two macros for switching the timekeeping code.

Signed-off-by: Rik van Riel 
---
 include/linux/vtime.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index 4f5c1a3712e7..a587058c7967 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -17,6 +17,7 @@ struct task_struct;
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline bool tick_accounting_disabled(void) { return true; }
+static inline bool tick_accounting_remote(void) { return false; }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
@@ -29,10 +30,12 @@ static inline bool tick_accounting_disabled(void)
 
return false;
 }
+static inline bool tick_accounting_remote(void) { return true; }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
 static inline bool tick_accounting_disabled(void) { return false; }
+static inline bool tick_accounting_remote(void) { return false; }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 
-- 
2.1.0



[RFC PATCH 02/11] time,nohz: rename vtime_accounting_enabled to tick_accounting_disabled

2015-06-24 Thread riel
From: Rik van Riel 

Rename vtime_accounting_enabled to tick_accounting_disabled, because it
can mean either that vtime accounting is enabled, or that the system
is doing tick based sampling from a housekeeping CPU for nohz_full CPUs.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h |  4 ++--
 include/linux/vtime.h| 17 ++---
 kernel/sched/cputime.c   |  2 +-
 kernel/time/tick-sched.c |  2 +-
 4 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index dc3b169b2b70..d7ee7eb9e0bc 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -90,7 +90,7 @@ static inline void context_tracking_init(void) { }
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static inline void guest_enter(void)
 {
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
@@ -104,7 +104,7 @@ static inline void guest_exit(void)
if (context_tracking_is_enabled())
__context_tracking_exit(CONTEXT_GUEST);
 
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
diff --git a/include/linux/vtime.h b/include/linux/vtime.h
index c5165fd256f9..4f5c1a3712e7 100644
--- a/include/linux/vtime.h
+++ b/include/linux/vtime.h
@@ -10,14 +10,17 @@
 struct task_struct;
 
 /*
- * vtime_accounting_enabled() definitions/declarations
+ * tick_accounting_disabled() definitions/declarations
+ *
+ * This indicates that either vtime accounting is used, or that tick
+ * based sampling is done from a housekeeping CPU for nohz_full CPUs.
  */
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-static inline bool vtime_accounting_enabled(void) { return true; }
+static inline bool tick_accounting_disabled(void) { return true; }
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
-static inline bool vtime_accounting_enabled(void)
+static inline bool tick_accounting_disabled(void)
 {
if (context_tracking_is_enabled()) {
if (context_tracking_cpu_is_enabled())
@@ -29,7 +32,7 @@ static inline bool vtime_accounting_enabled(void)
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_GEN */
 
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING
-static inline bool vtime_accounting_enabled(void) { return false; }
+static inline bool tick_accounting_disabled(void) { return false; }
 #endif /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
 
@@ -44,7 +47,7 @@ extern void vtime_task_switch(struct task_struct *prev);
 extern void vtime_common_task_switch(struct task_struct *prev);
 static inline void vtime_task_switch(struct task_struct *prev)
 {
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
vtime_common_task_switch(prev);
 }
 #endif /* __ARCH_HAS_VTIME_TASK_SWITCH */
@@ -59,7 +62,7 @@ extern void vtime_account_irq_enter(struct task_struct *tsk);
 extern void vtime_common_account_irq_enter(struct task_struct *tsk);
 static inline void vtime_account_irq_enter(struct task_struct *tsk)
 {
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
vtime_common_account_irq_enter(tsk);
 }
 #endif /* __ARCH_HAS_VTIME_ACCOUNT */
@@ -78,7 +81,7 @@ extern void vtime_gen_account_irq_exit(struct task_struct 
*tsk);
 
 static inline void vtime_account_irq_exit(struct task_struct *tsk)
 {
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
vtime_gen_account_irq_exit(tsk);
 }
 
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 97077c282626..84b2d24a2238 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -471,7 +471,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
 * should do work on this invocation, or whether time keeping for
 * this CPU is done in some other way.
 */
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
return;
 
if (sched_clock_irqtime) {
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 914259128145..3bb5a7accc9f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -924,7 +924,7 @@ static void tick_nohz_account_idle_ticks(struct tick_sched 
*ts)
 #ifndef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
unsigned long ticks;
 
-   if (vtime_accounting_enabled())
+   if (tick_accounting_disabled())
return;
/*
 * We stopped the tick in idle. Update process times would miss the
-- 
2.1.0


[RFC PATCH 05/11] time,nohz: add cpu parameter to account_steal_time

2015-06-24 Thread riel
From: Rik van Riel 

Simple transformation to allow tick based sampling from a remote
cpu. Additional changes may be needed to actually acquire the
steal time info for remote cpus from the host/hypervisor.

Signed-off-by: Rik van Riel 
---
 include/linux/kernel_stat.h | 2 +-
 kernel/sched/cputime.c  | 9 +
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 25a822f6f000..4490aef2f149 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -80,7 +80,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int 
cpu)
 
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
-extern void account_steal_time(cputime_t);
+extern void account_steal_time(int cpu, cputime_t);
 extern void account_idle_time(cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 9717d86cf2ab..b684c48ad954 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -232,9 +232,10 @@ void account_system_time(struct task_struct *p, int 
hardirq_offset,
  * Account for involuntary wait time.
  * @cputime: the cpu time spent in involuntary wait
  */
-void account_steal_time(cputime_t cputime)
+void account_steal_time(int cpu, cputime_t cputime)
 {
-   u64 *cpustat = kcpustat_this_cpu->cpustat;
+   struct kernel_cpustat *kstat = &kcpustat_cpu(cpu);
+   u64 *cpustat = kstat->cpustat;
 
cpustat[CPUTIME_STEAL] += (__force u64) cputime;
 }
@@ -272,7 +273,7 @@ static __always_inline bool steal_account_process_tick(int 
cpu)
steal_ct = nsecs_to_cputime(steal);
cpu_rq(cpu)->prev_steal_time += cputime_to_nsecs(steal_ct);
 
-   account_steal_time(steal_ct);
+   account_steal_time(cpu, steal_ct);
return steal_ct;
}
 #endif
@@ -502,7 +503,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
  */
 void account_steal_ticks(unsigned long ticks)
 {
-   account_steal_time(jiffies_to_cputime(ticks));
+   account_steal_time(smp_processor_id(), jiffies_to_cputime(ticks));
 }
 
 /*
-- 
2.1.0



[RFC PATCH 06/11] time,nohz: add cpu parameter to account_idle_time

2015-06-24 Thread riel
From: Rik van Riel 

Simple transformation to allow account_idle_time to account the
idle time for another CPU.

Signed-off-by: Rik van Riel 
---
 arch/ia64/kernel/time.c |  2 +-
 arch/powerpc/kernel/time.c  |  2 +-
 arch/s390/kernel/idle.c |  2 +-
 include/linux/kernel_stat.h |  2 +-
 kernel/sched/cputime.c  | 15 ---
 5 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 9a0104a38cd3..61928b01d548 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -140,7 +140,7 @@ EXPORT_SYMBOL_GPL(vtime_account_system);
 
 void vtime_account_idle(struct task_struct *tsk)
 {
-   account_idle_time(vtime_delta(tsk));
+   account_idle_time(task_cpu(tsk), vtime_delta(tsk));
 }
 
 #endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 56f44848b044..f7c4cfdf0157 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -356,7 +356,7 @@ void vtime_account_idle(struct task_struct *tsk)
u64 delta, sys_scaled, stolen;
 
delta = vtime_delta(tsk, &sys_scaled, &stolen);
-   account_idle_time(delta + stolen);
+   account_idle_time(task_cpu(tsk), delta + stolen);
 }
 
 /*
diff --git a/arch/s390/kernel/idle.c b/arch/s390/kernel/idle.c
index 7a55c29b0b33..fc3945e3dc18 100644
--- a/arch/s390/kernel/idle.c
+++ b/arch/s390/kernel/idle.c
@@ -43,7 +43,7 @@ void enabled_wait(void)
idle->clock_idle_enter = idle->clock_idle_exit = 0ULL;
idle->idle_time += idle_time;
idle->idle_count++;
-   account_idle_time(idle_time);
+   account_idle_time(smp_processor_id(), idle_time);
write_seqcount_end(&idle->seqcount);
 }
 NOKPROBE_SYMBOL(enabled_wait);
diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h
index 4490aef2f149..0d6e07406fd0 100644
--- a/include/linux/kernel_stat.h
+++ b/include/linux/kernel_stat.h
@@ -81,7 +81,7 @@ static inline unsigned int kstat_cpu_irqs_sum(unsigned int 
cpu)
 extern void account_user_time(struct task_struct *, cputime_t, cputime_t);
 extern void account_system_time(struct task_struct *, int, cputime_t, 
cputime_t);
 extern void account_steal_time(int cpu, cputime_t);
-extern void account_idle_time(cputime_t);
+extern void account_idle_time(int cpu, cputime_t);
 
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
 static inline void account_process_tick(struct task_struct *tsk, int user)
diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index b684c48ad954..593f97b0fe3c 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -244,10 +244,11 @@ void account_steal_time(int cpu, cputime_t cputime)
  * Account for idle time.
  * @cputime: the cpu time spent in idle wait
  */
-void account_idle_time(cputime_t cputime)
+void account_idle_time(int cpu, cputime_t cputime)
 {
-   u64 *cpustat = kcpustat_this_cpu->cpustat;
-   struct rq *rq = this_rq();
+   struct kernel_cpustat *kstat = &kcpustat_cpu(cpu);
+   u64 *cpustat = kstat->cpustat;
+   struct rq *rq = cpu_rq(cpu);
 
if (atomic_read(&rq->nr_iowait) > 0)
cpustat[CPUTIME_IOWAIT] += (__force u64) cputime;
@@ -366,7 +367,7 @@ static void irqtime_account_process_tick(struct task_struct 
*p, int cpu,
} else if (user_tick) {
account_user_time(p, cputime, scaled);
} else if (p == rq->idle) {
-   account_idle_time(cputime);
+   account_idle_time(cpu, cputime);
} else if (p->flags & PF_VCPU) { /* System time or guest time */
account_guest_time(p, cputime, scaled);
} else {
@@ -493,7 +494,7 @@ void account_process_tick(struct task_struct *p, int 
user_tick)
account_system_time(p, HARDIRQ_OFFSET, cputime_one_jiffy,
one_jiffy_scaled);
else
-   account_idle_time(cputime_one_jiffy);
+   account_idle_time(cpu, cputime_one_jiffy);
 }
 
 /*
@@ -518,7 +519,7 @@ void account_idle_ticks(unsigned long ticks)
return;
}
 
-   account_idle_time(jiffies_to_cputime(ticks));
+   account_idle_time(smp_processor_id(), jiffies_to_cputime(ticks));
 }
 
 /*
@@ -748,7 +749,7 @@ void vtime_account_idle(struct task_struct *tsk)
 {
cputime_t delta_cpu = get_vtime_delta(tsk);
 
-   account_idle_time(delta_cpu);
+   account_idle_time(task_cpu(tsk), delta_cpu);
 }
 
 void arch_vtime_task_switch(struct task_struct *prev)
-- 
2.1.0



[RFC PATCH 07/11] nohz,timer: designate timer housekeeping cpu

2015-06-24 Thread riel
From: Rik van Riel 

The timer housekeeping CPU can do tick based sampling for remote
CPUs. For now this is the first CPU in the housekeeping_mask.

Eventually we could move to having one timer housekeeping cpu per
socket, if needed.

Signed-off-by: Rik van Riel 
---
 include/linux/tick.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f8492da57ad3..2fb51030587b 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -158,6 +158,15 @@ static inline bool is_housekeeping_cpu(int cpu)
return true;
 }
 
+static inline bool is_timer_housekeeping_cpu(int cpu)
+{
+#ifdef CONFIG_NO_HZ_FULL
+   if (tick_nohz_full_running)
+   return (cpumask_first(housekeeping_mask) == cpu);
+#endif
+   return false;
+}
+
 static inline void housekeeping_affine(struct task_struct *t)
 {
 #ifdef CONFIG_NO_HZ_FULL
-- 
2.1.0



[RFC PATCH 10/11] nohz,kvm,time: skip vtime accounting at kernel entry & exit

2015-06-24 Thread riel
From: Rik van Riel 

When timer statistics are sampled from a remote CPU, vtime calculations
at the kernel/user and kernel/guest boundary are no longer necessary.
Skip them.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 4 ++--
 kernel/context_tracking.c| 6 --
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index d7ee7eb9e0bc..e3e7c674543f 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -90,7 +90,7 @@ static inline void context_tracking_init(void) { }
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static inline void guest_enter(void)
 {
-   if (tick_accounting_disabled())
+   if (tick_accounting_disabled() && !tick_accounting_remote())
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
@@ -104,7 +104,7 @@ static inline void guest_exit(void)
if (context_tracking_is_enabled())
__context_tracking_exit(CONTEXT_GUEST);
 
-   if (tick_accounting_disabled())
+   if (tick_accounting_disabled() && !tick_accounting_remote())
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 6da4205c3184..a58cbed13ebd 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -63,7 +63,8 @@ void __context_tracking_enter(enum ctx_state state)
 */
if (state == CONTEXT_USER) {
trace_user_enter(0);
-   vtime_user_enter(current);
+   if (!tick_accounting_remote())
+   vtime_user_enter(current);
}
rcu_user_enter();
}
@@ -135,7 +136,8 @@ void __context_tracking_exit(enum ctx_state state)
 */
rcu_user_exit();
if (state == CONTEXT_USER) {
-   vtime_user_exit(current);
+   if (!tick_accounting_remote())
+   vtime_user_exit(current);
trace_user_exit(0);
}
}
-- 
2.1.0



[PATCH 1/2] show isolated cpus in sysfs

2015-04-24 Thread riel
From: Rik van Riel 

After system bootup, there is no totally reliable way to see
which CPUs are isolated, because the kernel may modify the
CPUs specified on the isolcpus= kernel command line option.

Export the CPU list that actually got isolated in sysfs,
specifically in the file /sys/devices/system/cpu/isolated

This can be used by system management tools like libvirt,
openstack, and others to ensure proper placement of tasks.
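
As a usage sketch (assuming a kernel with this patch applied), a management
tool can read the cpulist straight from the new file:

#include <stdio.h>

int main(void)
{
	char buf[256];
	FILE *f = fopen("/sys/devices/system/cpu/isolated", "r");

	if (!f) {
		perror("isolated");	/* e.g. kernel without this patch */
		return 1;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("isolated cpus: %s", buf);	/* cpulist format, e.g. "1-3,5" */
	fclose(f);
	return 0;
}

The nohz_full file added in patch 2/2 can be read in exactly the same way.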

Suggested-by: Li Zefan 
Signed-off-by: Rik van Riel 
---
 drivers/base/cpu.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index f160ea44a86d..ea23ee7b545b 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -265,6 +265,17 @@ static ssize_t print_cpus_offline(struct device *dev,
 }
 static DEVICE_ATTR(offline, 0444, print_cpus_offline, NULL);
 
+static ssize_t print_cpus_isolated(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   int n = 0, len = PAGE_SIZE-2;
+
+   n = scnprintf(buf, len, "%*pbl\n", cpumask_pr_args(cpu_isolated_map));
+
+   return n;
+}
+static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
+
 static void cpu_device_release(struct device *dev)
 {
/*
@@ -431,6 +442,7 @@ static struct attribute *cpu_root_attrs[] = {
&cpu_attrs[2].attr.attr,
&dev_attr_kernel_max.attr,
&dev_attr_offline.attr,
+   &dev_attr_isolated.attr,
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
 #endif
-- 
2.1.0



[PATCH 0/2 resend] show isolated & nohz_full cpus in sysfs

2015-04-24 Thread riel
Currently there is no good way to get the isolated and nohz_full
CPUs at runtime, because the kernel may have changed the CPUs
specified on the commandline (when specifying all CPUs as
isolated, or CPUs that do not exist, ...)

This series adds two files to /sys/devices/system/cpu, which can
be used by system management tools like libvirt, openstack, etc.
to ensure proper task placement.

These patches were kind of (but not formally) acked by
Mike and Frederic, see https://lkml.org/lkml/2015/3/27/852



[PATCH 2/2] show nohz_full cpus in sysfs

2015-04-24 Thread riel
From: Rik van Riel 

Currently there is no way to query which CPUs are in nohz_full
mode from userspace.

Export the CPU list running in nohz_full mode in sysfs,
specifically in the file /sys/devices/system/cpu/nohz_full

This can be used by system management tools like libvirt,
openstack, and others to ensure proper task placement.

Signed-off-by: Rik van Riel 
---
 drivers/base/cpu.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index ea23ee7b545b..78720e706176 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include <linux/tick.h>
 
 #include "base.h"
 
@@ -276,6 +277,19 @@ static ssize_t print_cpus_isolated(struct device *dev,
 }
 static DEVICE_ATTR(isolated, 0444, print_cpus_isolated, NULL);
 
+#ifdef CONFIG_NO_HZ_FULL
+static ssize_t print_cpus_nohz_full(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   int n = 0, len = PAGE_SIZE-2;
+
+   n = scnprintf(buf, len, "%*pbl\n", 
cpumask_pr_args(tick_nohz_full_mask));
+
+   return n;
+}
+static DEVICE_ATTR(nohz_full, 0444, print_cpus_nohz_full, NULL);
+#endif
+
 static void cpu_device_release(struct device *dev)
 {
/*
@@ -443,6 +457,9 @@ static struct attribute *cpu_root_attrs[] = {
&dev_attr_kernel_max.attr,
&dev_attr_offline.attr,
&dev_attr_isolated.attr,
+#ifdef CONFIG_NO_HZ_FULL
+   &dev_attr_nohz_full.attr,
+#endif
 #ifdef CONFIG_GENERIC_CPU_AUTOPROBE
&dev_attr_modalias.attr,
 #endif
-- 
2.1.0



[PATCH 0/3] reduce nohz_full syscall overhead by 10%

2015-04-30 Thread riel
Profiling reveals that a lot of the overhead from the nohz_full
accounting seems to come not from the accounting itself, but from
disabling and re-enabling interrupts.

This patch series removes the interrupt disabling & re-enabling
from __acct_update_integrals, which is called on both syscall
entry and exit, as well as from the user_exit called on syscall
entry.

Together they speed up a benchmark that calls getpriority() 10 million
times in a row by about 10%:

                run time    system time
vanilla         5.49s       2.08s
__acct patch    5.21s       1.92s
both patches    4.88s       1.71s
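
The benchmark itself is not part of this posting; a minimal equivalent of
the getpriority() loop described above, timed externally with time(1),
would look like this:

#include <sys/resource.h>

int main(void)
{
	long i;

	/* ten million cheap syscalls back to back */
	for (i = 0; i < 10000000; i++)
		getpriority(PRIO_PROCESS, 0);

	return 0;
}

The absolute numbers will of course depend on the machine and on whether
nohz_full= and context tracking are enabled.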



[PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry

2015-04-30 Thread riel
From: Rik van Riel 

On syscall entry with nohz_full on, we enable interrupts, call user_exit,
disable interrupts, do something, re-enable interrupts, and go on our
merry way.

Profiling shows that a large amount of the nohz_full overhead comes
from the extraneous disabling and re-enabling of interrupts. Andy
suggested simply not enabling interrupts until after the context
tracking code has done its thing, which allows us to skip a whole
interrupt disable & re-enable cycle.

This patch builds on top of these patches by Paolo:
https://lkml.org/lkml/2015/4/28/188
https://lkml.org/lkml/2015/4/29/139

Together with this patch I posted earlier this week, the syscall path
on a nohz_full cpu seems to be about 10% faster.
https://lkml.org/lkml/2015/4/24/394

My test is a simple microbenchmark that calls getpriority() in a loop
10 million times:

                run time    system time
vanilla         5.49s       2.08s
__acct patch    5.21s       1.92s
both patches    4.88s       1.71s

Cc: Frederic Weisbecker 
Cc: Ingo Molnar 
Cc: Paolo Bonzini 
Cc: Heiko Carstens 
Cc: Thomas Gleixner 
Suggested-by: Andy Lutomirsky 
Signed-off-by: Rik van Riel 
---
 arch/x86/kernel/entry_32.S   |  4 ++--
 arch/x86/kernel/entry_64.S   |  4 ++--
 arch/x86/kernel/ptrace.c |  6 +-
 include/linux/context_tracking.h | 11 +++
 4 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 1c309763e321..0bdf8c7057e4 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -406,7 +406,6 @@ ENTRY(ia32_sysenter_target)
 
pushl_cfi %eax
SAVE_ALL
-   ENABLE_INTERRUPTS(CLBR_NONE)
 
 /*
  * Load the potential sixth argument from user stack.
@@ -424,6 +423,7 @@ ENTRY(ia32_sysenter_target)
 
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%ebp)
jnz sysenter_audit
+   ENABLE_INTERRUPTS(CLBR_NONE)
 sysenter_do_call:
cmpl $(NR_syscalls), %eax
jae sysenter_badsys
@@ -647,7 +647,7 @@ END(work_pending)
 syscall_trace_entry:
movl $-ENOSYS,PT_EAX(%esp)
movl %esp, %eax
-   call syscall_trace_enter
+   call syscall_trace_enter    /* returns with irqs enabled */
/* What it returned is what we'll actually use.  */
cmpl $(NR_syscalls), %eax
jnae syscall_call
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 02c2eff7478d..f7751da7b53e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -228,7 +228,6 @@ GLOBAL(system_call_after_swapgs)
 * task preemption. We must enable interrupts only after we're done
 * with using rsp_scratch:
 */
-   ENABLE_INTERRUPTS(CLBR_NONE)
pushq_cfi   %r11    /* pt_regs->flags */
pushq_cfi   $__USER_CS  /* pt_regs->cs */
pushq_cfi   %rcx    /* pt_regs->ip */
@@ -248,6 +247,7 @@ GLOBAL(system_call_after_swapgs)
 
testl $_TIF_WORK_SYSCALL_ENTRY, ASM_THREAD_INFO(TI_flags, %rsp, 
SIZEOF_PTREGS)
jnz tracesys
+   ENABLE_INTERRUPTS(CLBR_NONE)
 system_call_fastpath:
 #if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max,%rax
@@ -313,7 +313,7 @@ GLOBAL(system_call_after_swapgs)
 tracesys:
movq %rsp, %rdi
movl $AUDIT_ARCH_X86_64, %esi
-   call syscall_trace_enter_phase1
+   call syscall_trace_enter_phase1 /* returns with interrupts enabled */
test %rax, %rax
jnz tracesys_phase2 /* if needed, run the slow path */
RESTORE_C_REGS_EXCEPT_RAX   /* else restore clobbered regs */
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index a7bc79480719..066c86d0b68c 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -1456,6 +1456,8 @@ static void do_audit_syscall_entry(struct pt_regs *regs, 
u32 arch)
  *
  * NB: We don't have full pt_regs here, but regs->orig_ax and regs->ax
  * are fully functional.
+ * Called with IRQs disabled, to be enabled after the context tracking
+ * code has run.
  *
  * For phase 2's benefit, our return value is:
  * 0:  resume the syscall
@@ -1477,10 +1479,12 @@ unsigned long syscall_trace_enter_phase1(struct pt_regs 
*regs, u32 arch)
 * doing anything that could touch RCU.
 */
if (work & _TIF_NOHZ) {
-   user_exit();
+   user_exit_irqsoff();
work &= ~_TIF_NOHZ;
}
 
+   local_irq_enable();
+
 #ifdef CONFIG_SECCOMP
/*
 * Do seccomp first -- it should minimize exposure of other
diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 5d3719aed958..dc3b169b2b70 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -25,12 +25,23 @@ static inline void user_enter(void)
context_tracking_enter(CONTEXT_USER);
 
 }
+
 static

[PATCH 2/3] remove local_irq_save from __acct_update_integrals

2015-04-30 Thread riel
From: Rik van Riel 

The function __acct_update_integrals() is called both from irq context
and task context. This creates a race where irq context can advance
tsk->acct_timexpd to a value larger than time, leading to a negative
value, which causes a divide error. See commit 6d5b5acca9e5
("Fix fixpoint divide exception in acct_update_integrals")

In 2012, __acct_update_integrals() was changed to get utime and stime
as function parameters. This re-introduced the bug, because an irq
can hit in-between the call to task_cputime() and where irqs actually
get disabled.

However, this race condition was originally reproduced on Hercules,
and I have not seen any reports of it re-occurring since it was
re-introduced 3 years ago.
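
To illustrate why the race matters with an unsigned cputime_t, and why a
signed test catches it cheaply, here is a standalone example using plain
integers instead of the kernel types:

#include <stdio.h>

int main(void)
{
	/* irq context pushed acct_timexpd just past "time" */
	unsigned long time = 1000, acct_timexpd = 1003;
	unsigned long dtime = time - acct_timexpd;	/* wraps around */

	printf("unsigned dtime = %lu\n", dtime);	/* huge bogus delta */
	printf("signed dtime   = %ld\n", (long)dtime);	/* -3: trivial to test for */
	return 0;
}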

On the other hand, the irq disabling and re-enabling, which no longer
even protects us against the race today, show up prominently in the
perf profile of a program that makes a very large number of system calls
in a short period of time, when nohz_full= (and context tracking) is
enabled.

This patch replaces the (now ineffective) irq blocking with a cheaper
way to test for the race condition, and speeds up my microbenchmark
with 10 million iterations, average of 5 runs, tiny stddev:

            run time    system time
vanilla     5.49s       2.08s
patch       5.21s       1.92s

Cc: Andy Lutomirsky 
Cc: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Heiko Carstens 
Cc: Thomas Gleixner 
Signed-off-by: Rik van Riel 
---
 arch/powerpc/include/asm/cputime.h|  3 +++
 arch/s390/include/asm/cputime.h   |  3 +++
 include/asm-generic/cputime_jiffies.h |  2 ++
 include/asm-generic/cputime_nsecs.h   |  3 +++
 kernel/tsacct.c   | 16 
 5 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index e2452550bcb1..e41b32f68a2c 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -32,6 +32,9 @@ static inline void setup_cputime_one_jiffy(void) { }
 typedef u64 __nocast cputime_t;
 typedef u64 __nocast cputime64_t;
 
+typedef s64 signed_cputime_t;
+typedef s64 signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg(ptr, old, new)
 
 #ifdef __KERNEL__
diff --git a/arch/s390/include/asm/cputime.h b/arch/s390/include/asm/cputime.h
index 221b454c734a..2e8c268cc2a7 100644
--- a/arch/s390/include/asm/cputime.h
+++ b/arch/s390/include/asm/cputime.h
@@ -18,6 +18,9 @@
 typedef unsigned long long __nocast cputime_t;
 typedef unsigned long long __nocast cputime64_t;
 
+typedef signed long long signed_cputime_t;
+typedef signed long long signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg64(ptr, old, new)
 
 static inline unsigned long __div(unsigned long long n, unsigned long base)
diff --git a/include/asm-generic/cputime_jiffies.h 
b/include/asm-generic/cputime_jiffies.h
index fe386fc6e85e..b96b6a1b6c97 100644
--- a/include/asm-generic/cputime_jiffies.h
+++ b/include/asm-generic/cputime_jiffies.h
@@ -2,6 +2,7 @@
 #define _ASM_GENERIC_CPUTIME_JIFFIES_H
 
 typedef unsigned long __nocast cputime_t;
+typedef signed long signed_cputime_t;
 
 #define cmpxchg_cputime(ptr, old, new) cmpxchg(ptr, old, new)
 
@@ -11,6 +12,7 @@ typedef unsigned long __nocast cputime_t;
 #define jiffies_to_cputime(__hz)   (__force cputime_t)(__hz)
 
 typedef u64 __nocast cputime64_t;
+typedef s64 signed_cputime64_t;
 
 #define cputime64_to_jiffies64(__ct)   (__force u64)(__ct)
 #define jiffies64_to_cputime64(__jif)  (__force cputime64_t)(__jif)
diff --git a/include/asm-generic/cputime_nsecs.h 
b/include/asm-generic/cputime_nsecs.h
index 0419485891f2..c1ad2f90a4d9 100644
--- a/include/asm-generic/cputime_nsecs.h
+++ b/include/asm-generic/cputime_nsecs.h
@@ -21,6 +21,9 @@
 typedef u64 __nocast cputime_t;
 typedef u64 __nocast cputime64_t;
 
+typedef s64 signed_cputime_t;
+typedef s64 signed_cputime64_t;
+
 #define cmpxchg_cputime(ptr, old, new) cmpxchg64(ptr, old, new)
 
 #define cputime_one_jiffy  jiffies_to_cputime(1)
diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 9e225425bc3a..e497c1c05675 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -125,15 +125,24 @@ static void __acct_update_integrals(struct task_struct 
*tsk,
 {
cputime_t time, dtime;
struct timeval value;
-   unsigned long flags;
u64 delta;
 
if (unlikely(!tsk->mm))
return;
 
-   local_irq_save(flags);
+   /*
+* This code is called both from task context and irq context.
+* There is a rare race where irq context advances tsk->acct_timexpd
+* to a value larger than time, leading to a negative dtime, which
+* could lead to a divide error in cputime_to_jiffies.
+* The statistics updated here are fairly rough estimates; just
+* ignore irq and task double accounting the same timer tick.
+*/
time = stime + utime;

[PATCH 1/3] reduce indentation in __acct_update_integrals

2015-04-30 Thread riel
From: Peter Zijlstra 

Reduce indentation in __acct_update_integrals.

Cc: Andy Lutomirsky 
Cc: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Heiko Carstens 
Cc: Thomas Gleixner 
Signed-off-by: Peter Zijlstra 
Signed-off-by: Rik van Riel 
---
 kernel/tsacct.c | 34 +-
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/kernel/tsacct.c b/kernel/tsacct.c
index 975cb49e32bf..9e225425bc3a 100644
--- a/kernel/tsacct.c
+++ b/kernel/tsacct.c
@@ -123,27 +123,27 @@ void xacct_add_tsk(struct taskstats *stats, struct 
task_struct *p)
 static void __acct_update_integrals(struct task_struct *tsk,
cputime_t utime, cputime_t stime)
 {
-   if (likely(tsk->mm)) {
-   cputime_t time, dtime;
-   struct timeval value;
-   unsigned long flags;
-   u64 delta;
-
-   local_irq_save(flags);
-   time = stime + utime;
-   dtime = time - tsk->acct_timexpd;
-   jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
-   delta = value.tv_sec;
-   delta = delta * USEC_PER_SEC + value.tv_usec;
-
-   if (delta == 0)
-   goto out;
+   cputime_t time, dtime;
+   struct timeval value;
+   unsigned long flags;
+   u64 delta;
+
+   if (unlikely(!tsk->mm))
+   return;
+
+   local_irq_save(flags);
+   time = stime + utime;
+   dtime = time - tsk->acct_timexpd;
+   jiffies_to_timeval(cputime_to_jiffies(dtime), &value);
+   delta = value.tv_sec;
+   delta = delta * USEC_PER_SEC + value.tv_usec;
+
+   if (delta) {
tsk->acct_timexpd = time;
tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm);
tsk->acct_vm_mem1 += delta * tsk->mm->total_vm;
-   out:
-   local_irq_restore(flags);
}
+   local_irq_restore(flags);
 }
 
 /**
-- 
2.1.0


[PATCH RFC 1/5] sched,numa: build table of node hop distance

2014-10-08 Thread riel
From: Rik van Riel 

In order to more efficiently figure out where to place workloads
that span multiple NUMA nodes, it makes sense to estimate how
many hops away nodes are from each other.

Also add some comments to sched_init_numa.

Signed-off-by: Rik van Riel 
Suggested-by: Peter Zijlstra 
---
 include/linux/topology.h |  1 +
 kernel/sched/core.c  | 35 +--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index dda6ee5..33002f4 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -47,6 +47,7 @@
if (nr_cpus_node(node))
 
 int arch_update_cpu_topology(void);
+extern int node_hops(int i, int j);
 
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE 10
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a4ad05..0cf501e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6076,6 +6076,7 @@ static void claim_allocations(int cpu, struct 
sched_domain *sd)
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
 static int *sched_domains_numa_distance;
+static int *sched_domains_numa_hops;
 static struct cpumask ***sched_domains_numa_masks;
 static int sched_domains_curr_level;
 #endif
@@ -6247,6 +6248,19 @@ static void sched_numa_warn(const char *str)
printk(KERN_WARNING "\n");
 }
 
+int node_hops(int i, int j)
+{
+   if (!sched_domains_numa_hops)
+   return 0;
+
+   return sched_domains_numa_hops[i * nr_node_ids + j];
+}
+
+static void set_node_hops(int i, int j, int hops)
+{
+   sched_domains_numa_hops[i * nr_node_ids + j] = hops;
+}
+
 static bool find_numa_distance(int distance)
 {
int i;
@@ -6273,6 +6287,10 @@ static void sched_init_numa(void)
if (!sched_domains_numa_distance)
return;
 
+   sched_domains_numa_hops = kzalloc(sizeof(int) * nr_node_ids * 
nr_node_ids, GFP_KERNEL);
+   if (!sched_domains_numa_hops)
+   return;
+
/*
 * O(nr_nodes^2) deduplicating selection sort -- in order to find the
 * unique distances in the node_distance() table.
@@ -6340,7 +6358,7 @@ static void sched_init_numa(void)
 
/*
 * Now for each level, construct a mask per node which contains all
-* cpus of nodes that are that many hops away from us.
+* cpus of nodes that are that many hops away from us and closer by.
 */
for (i = 0; i < level; i++) {
sched_domains_numa_masks[i] =
@@ -6348,6 +6366,9 @@ static void sched_init_numa(void)
if (!sched_domains_numa_masks[i])
return;
 
+   /* A node is 0 hops away from itself. */
+   set_node_hops(i, i, 0);
+
for (j = 0; j < nr_node_ids; j++) {
struct cpumask *mask = kzalloc(cpumask_size(), 
GFP_KERNEL);
if (!mask)
@@ -6356,10 +6377,20 @@ static void sched_init_numa(void)
sched_domains_numa_masks[i][j] = mask;
 
for (k = 0; k < nr_node_ids; k++) {
-   if (node_distance(j, k) > 
sched_domains_numa_distance[i])
+   int distance = node_distance(j, k);
+   if (distance > sched_domains_numa_distance[i])
continue;
 
+   /* All CPUs at distance or less. */
cpumask_or(mask, mask, cpumask_of_node(k));
+
+   /*
+* The number of hops is one larger than i,
+* because sched_domains_numa_distance[]
+* excludes the local distance.
+*/
+   if (distance == sched_domains_numa_distance[i])
+   set_node_hops(j, k, i+1);
}
}
}
-- 
1.9.3
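
To make the hop table concrete, here is a stand-alone user-space sketch
(not kernel code; the four-node SLIT values are made up) that derives hop
counts the same way set_node_hops() does above: collect the unique remote
distances in increasing order, and the hop count for a pair of nodes is the
index of their distance in that list plus one, with every node 0 hops away
from itself.

#include <stdio.h>
#include <stdlib.h>

#define NR_NODES	4
#define LOCAL_DISTANCE	10

/* Hypothetical SLIT table for a small glueless mesh: 0<->3 and 1<->2 go via intermediaries. */
static const int distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 30, 20 },
	{ 20, 30, 10, 20 },
	{ 30, 20, 20, 10 },
};

static int cmp_int(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}

int main(void)
{
	int uniq[NR_NODES * NR_NODES];
	int levels = 0, i, j, k;

	/* Deduplicate the remote distances, then sort them ascending. */
	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = distance[i][j], seen = 0;

			if (d == LOCAL_DISTANCE)
				continue;
			for (k = 0; k < levels; k++)
				if (uniq[k] == d)
					seen = 1;
			if (!seen)
				uniq[levels++] = d;
		}
	}
	qsort(uniq, levels, sizeof(int), cmp_int);

	/* node_hops(i, j): index of distance(i, j) among the remote distances, plus one. */
	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int hops = 0;

			for (k = 0; k < levels; k++)
				if (distance[i][j] == uniq[k])
					hops = k + 1;
			printf("node_hops(%d, %d) = %d\n", i, j, hops);
		}
	}
	return 0;
}

With this table, the sketch prints node_hops(0, 1) = 1 for directly
connected nodes and node_hops(0, 3) = 2 for the pair that communicates
through an intermediary node.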


[PATCH RFC 0/5] sched,numa: task placement with complex NUMA topologies

2014-10-08 Thread riel
This patch set integrates two algorithms I have previously tested,
one for glueless mesh NUMA topologies, where NUMA nodes communicate
with far-away nodes through intermediary nodes, and one for backplane
topologies, where communication with far-away NUMA nodes happens
through backplane controllers (which cannot run tasks).

Due to the unavailability of 8 node systems, and the fact that I
am flying out to Linuxcon Europe / Plumbers / KVM Forum on Friday,
I have not tested these patches yet. However, with a conference (and
many familiar faces) coming up, it seemed like a good idea to get
the code out there, anyway.

The algorithms have been tested before, on both kinds of system.
What is new about this series is that both algorithms have been
integrated into the same code base, together with new code to select
the preferred_nid for tasks in numa groups.

Placement of tasks on smaller, directly connected, NUMA systems
should not be affected at all by this patch series.

I am interested in reviews, as well as test results on larger
NUMA systems :)


[PATCH RFC 2/5] sched,numa: classify the NUMA topology of a system

2014-10-08 Thread riel
From: Rik van Riel 

Smaller NUMA systems tend to have all NUMA nodes directly connected
to each other. This includes the degenerate case of a system with just
one node, i.e. a non-NUMA system.

Larger systems can have two kinds of NUMA topology, which affects how
tasks and memory should be placed on the system.

On glueless mesh systems, nodes that are not directly connected to
each other will bounce traffic through intermediary nodes. Task groups
can be run closer to each other by moving tasks from a node to an
intermediary node between it and the task's preferred node.

On NUMA systems with backplane controllers, the intermediary hops
are incapable of running programs. This creates "islands" of nodes
that are at an equal distance from anywhere else in the system.
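For example, a machine built from four-node boards connected through such
controllers consists of four-node "islands": every node on one board is
the same number of hops away from every node on any other board.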

Each kind of topology requires a slightly different placement
algorithm; this patch provides the mechanism to detect the kind
of NUMA topology of a system.

Signed-off-by: Rik van Riel 
---
 include/linux/topology.h |  7 +++
 kernel/sched/core.c  | 53 
 2 files changed, 60 insertions(+)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index 33002f4..bf40d46 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -49,6 +49,13 @@
 int arch_update_cpu_topology(void);
 extern int node_hops(int i, int j);
 
+enum numa_topology_type {
+   NUMA_DIRECT,
+   NUMA_GLUELESS_MESH,
+   NUMA_BACKPLANE,
+};
+extern enum numa_topology_type sched_numa_topology_type;
+
 /* Conform to ACPI 2.0 SLIT distance definitions */
 #define LOCAL_DISTANCE 10
 #define REMOTE_DISTANCE20
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0cf501e..1898914 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6075,6 +6075,7 @@ static void claim_allocations(int cpu, struct 
sched_domain *sd)
 
 #ifdef CONFIG_NUMA
 static int sched_domains_numa_levels;
+enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
 static int *sched_domains_numa_hops;
 static struct cpumask ***sched_domains_numa_masks;
@@ -6276,6 +6277,56 @@ static bool find_numa_distance(int distance)
return false;
 }
 
+/*
+ * A system can have three types of NUMA topology:
+ * NUMA_DIRECT: all nodes are directly connected, or not a NUMA system
+ * NUMA_GLUELESS_MESH: some nodes reachable through intermediary nodes
+ * NUMA_BACKPLANE: nodes can reach other nodes through a backplane
+ *
+ * The difference between a glueless mesh topology and a backplane
+ * topology lies in whether communication between not directly
+ * connected nodes goes through intermediary nodes (where programs
+ * could run), or through backplane controllers. This affects
+ * placement of programs.
+ *
+ * The type of topology can be discerned with the following tests:
+ * - If the maximum distance between any nodes is 1 hop, the system
+ *   is directly connected.
+ * - If for two nodes A and B, located N > 1 hops away from each other,
+ *   there is an intermediary node C, which is < N hops away from both
+ *   nodes A and B, the system is a glueless mesh.
+ */
+static void init_numa_topology_type(void)
+{
+   int a, b, c, n;
+
+   n = sched_domains_numa_levels;
+
+   if (n <= 1) {
+   sched_numa_topology_type = NUMA_DIRECT;
+   return;
+   }
+
+   for_each_online_node(a) {
+   for_each_online_node(b) {
+   /* Find two nodes furthest removed from each other. */
+   if (node_hops(a, b) < n)
+   continue;
+
+   /* Is there an intermediary node between a and b? */
+   for_each_online_node(c) {
+   if (node_hops(a, c) < n &&
+   node_hops(b, c) < n) {
+   sched_numa_topology_type =
+   NUMA_GLUELESS_MESH;
+   return;
+   }
+   }
+
+   sched_numa_topology_type = NUMA_BACKPLANE;
+   return;
+   }
+   }
+}
+
 static void sched_init_numa(void)
 {
int next_distance, curr_distance = node_distance(0, 0);
@@ -6425,6 +6476,8 @@ static void sched_init_numa(void)
sched_domain_topology = tl;
 
sched_domains_numa_levels = level;
+
+   init_numa_topology_type();
 }
 
 static void sched_domains_numa_masks_set(int cpu)
-- 
1.9.3


[PATCH RFC 3/5] sched,numa: preparations for complex topology placement

2014-10-08 Thread riel
From: Rik van Riel 

Preparatory patch for adding NUMA placement on systems with
complex NUMA topology. Also fix a potential divide by zero
in group_weight()

Signed-off-by: Rik van Riel 
---
 include/linux/topology.h |  1 +
 kernel/sched/core.c  |  2 +-
 kernel/sched/fair.c  | 57 +++-
 3 files changed, 39 insertions(+), 21 deletions(-)

diff --git a/include/linux/topology.h b/include/linux/topology.h
index bf40d46..f8dfad9 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -47,6 +47,7 @@
if (nr_cpus_node(node))
 
 int arch_update_cpu_topology(void);
+extern int sched_domains_numa_levels;
 extern int node_hops(int i, int j);
 
 enum numa_topology_type {
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1898914..2528f97 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6074,7 +6074,7 @@ static void claim_allocations(int cpu, struct 
sched_domain *sd)
 }
 
 #ifdef CONFIG_NUMA
-static int sched_domains_numa_levels;
+int sched_domains_numa_levels;
 enum numa_topology_type sched_numa_topology_type;
 static int *sched_domains_numa_distance;
 static int *sched_domains_numa_hops;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6d44052..8b3f884 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,9 +930,10 @@ static inline unsigned long group_faults_cpu(struct 
numa_group *group, int nid)
  * larger multiplier, in order to group tasks together that are almost
  * evenly spread out between numa nodes.
  */
-static inline unsigned long task_weight(struct task_struct *p, int nid)
+static inline unsigned long task_weight(struct task_struct *p, int nid,
+   int hops)
 {
-   unsigned long total_faults;
+   unsigned long faults, total_faults;
 
if (!p->numa_faults_memory)
return 0;
@@ -942,15 +943,25 @@ static inline unsigned long task_weight(struct 
task_struct *p, int nid)
if (!total_faults)
return 0;
 
-   return 1000 * task_faults(p, nid) / total_faults;
+   faults = task_faults(p, nid);
+   return 1000 * faults / total_faults;
 }
 
-static inline unsigned long group_weight(struct task_struct *p, int nid)
+static inline unsigned long group_weight(struct task_struct *p, int nid,
+int hops)
 {
-   if (!p->numa_group || !p->numa_group->total_faults)
+   unsigned long faults, total_faults;
+
+   if (!p->numa_group)
+   return 0;
+
+   total_faults = p->numa_group->total_faults;
+
+   if (!total_faults)
return 0;
 
-   return 1000 * group_faults(p, nid) / p->numa_group->total_faults;
+   faults = group_faults(p, nid);
+   return 1000 * faults / total_faults;
 }
 
 bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
@@ -1083,6 +1094,7 @@ struct task_numa_env {
struct numa_stats src_stats, dst_stats;
 
int imbalance_pct;
+   int hops;
 
struct task_struct *best_task;
long best_imp;
@@ -1162,6 +1174,7 @@ static void task_numa_compare(struct task_numa_env *env,
long load;
long imp = env->p->numa_group ? groupimp : taskimp;
long moveimp = imp;
+   int hops = env->hops;
 
rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
@@ -1185,8 +1198,8 @@ static void task_numa_compare(struct task_numa_env *env,
 * in any group then look only at task weights.
 */
if (cur->numa_group == env->p->numa_group) {
-   imp = taskimp + task_weight(cur, env->src_nid) -
- task_weight(cur, env->dst_nid);
+   imp = taskimp + task_weight(cur, env->src_nid, hops) -
+ task_weight(cur, env->dst_nid, hops);
/*
 * Add some hysteresis to prevent swapping the
 * tasks within a group over tiny differences.
@@ -1200,11 +1213,11 @@ static void task_numa_compare(struct task_numa_env *env,
 * instead.
 */
if (cur->numa_group)
-   imp += group_weight(cur, env->src_nid) -
-  group_weight(cur, env->dst_nid);
+   imp += group_weight(cur, env->src_nid, hops) -
+  group_weight(cur, env->dst_nid, hops);
else
-   imp += task_weight(cur, env->src_nid) -
-  task_weight(cur, env->dst_nid);
+   imp += task_weight(cur, env->src_nid, hops) -
+  task_weight(cur, env->dst_nid, hops);
   

[PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies

2014-10-08 Thread riel
From: Rik van Riel 

In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node being scored. Nodes at the maximum distance are
skipped purely for efficiency; there is no real policy reason to do so.
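For example, a node two hops away that has 400 faults contributes
400 >> 2 = 100 faults to the score of the node being examined.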

Placement policy on directly connected NUMA systems is not affected.

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 68 +
 1 file changed, 68 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8b3f884..fb22caf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -924,6 +924,65 @@ static inline unsigned long group_faults_cpu(struct 
numa_group *group, int nid)
group->faults_cpu[task_faults_idx(nid, 1)];
 }
 
+/* Handle placement on systems where not all nodes are directly connected. */
+static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
+   int hoplimit, bool task)
+{
+   unsigned long score = 0;
+   int node;
+
+   /*
+* All nodes are directly connected, and the same distance
+* from each other. No need for fancy placement algorithms.
+*/
+   if (sched_numa_topology_type == NUMA_DIRECT)
+   return 0;
+
+   for_each_online_node(node) {
+   unsigned long faults;
+   int hops = node_hops(nid, node);
+
+   /*
+* The furthest away nodes in the system are not interesting
+* for placement; nid was already counted.
+*/
+   if (hops == sched_domains_numa_levels || node == nid)
+   continue;
+
+   /*
+* On systems with a backplane NUMA topology, compare groups
+* of nodes, and move tasks towards the group with the most
+* memory accesses. When comparing two nodes at distance
+* "hoplimit", only nodes closer by than "hoplimit" are part
+* of each group. Skip other nodes.
+*/
+   if (sched_numa_topology_type == NUMA_BACKPLANE &&
+   hops >= hoplimit)
+   continue;
+
+   /* Add up the faults from nearby nodes. */
+   if (task)
+   faults = task_faults(p, node);
+   else
+   faults = group_faults(p, node);
+
+   /*
+* On systems with a glueless mesh NUMA topology, there are
+* no fixed "groups of nodes". Instead, nodes that are not
+* directly connected bounce traffic through intermediate
+* nodes; a numa_group can occupy any set of nodes. Counting
+* the faults on nearby hops progressively less as distance
+* increases seems to result in good task placement.
+*/
+   if (sched_numa_topology_type == NUMA_GLUELESS_MESH)
+   faults >>= hops;
+
+   score += faults;
+   }
+
+   return score;
+}
+
 /*
  * These return the fraction of accesses done by a particular task, or
  * task group, on a particular numa node.  The group weight is given a
@@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct 
*p, int nid,
return 0;
 
faults = task_faults(p, nid);
+   faults += score_nearby_nodes(p, nid, hops, true);
+
return 1000 * faults / total_faults;
 }
 
@@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct 
task_struct *p, int nid,
return 0;
 
faults = group_faults(p, nid);
+   faults += score_nearby_nodes(p, nid, hops, false);
+
return 1000 * faults / total_faults;
 }
 
@@ -1363,6 +1426,11 @@ static int task_numa_migrate(struct task_struct *p)
continue;
 
 

[PATCH RFC 5/5] sched,numa: find the preferred nid with complex NUMA topology

2014-10-08 Thread riel
From: Rik van Riel 

On systems with complex NUMA topologies, the node scoring is adjusted
to allow workloads to converge on nodes that are near each other.

The way a task group's preferred nid is determined needs to be adjusted,
in order for the preferred_nid to be consistent with group_weight scoring.
This ensures that we actually try to converge workloads on adjacent nodes.
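
As an example of the backplane case: on a system made up of two four-node
islands, the first pass of the search below picks the island on which the
numa_group has the most NUMA hinting faults, and the remaining passes
narrow that choice down to the busiest node inside that island, which
becomes the group's preferred nid.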

Signed-off-by: Rik van Riel 
---
 kernel/sched/fair.c | 83 -
 1 file changed, 82 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb22caf..17ebf41 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1642,6 +1642,87 @@ static u64 numa_get_avg_runtime(struct task_struct *p, 
u64 *period)
return delta;
 }
 
+/*
+ * Determine the preferred nid for a task in a numa_group. This needs to
+ * be done in a way that produces consistent results with group_weight,
+ * otherwise workloads might not converge.
+ */ 
+static int preferred_group_nid(struct task_struct *p, int nid)
+{
+   nodemask_t nodes;
+   int hops;
+
+   /* Direct connections between all NUMA nodes. */
+   if (sched_numa_topology_type == NUMA_DIRECT)
+   return nid;
+
+   /*
+* On a system with glueless mesh NUMA topology, group_weight
+* scores nodes according to the number of NUMA hinting faults on
+* both the node itself, and on nearby nodes.
+*/
+   if (sched_numa_topology_type == NUMA_GLUELESS_MESH) {
+   unsigned long score, max_score = 0;
+   int node, max_node = nid;
+
+   hops = sched_domains_numa_levels;
+
+   for_each_online_node(node) {
+   score = group_weight(p, node, hops);
+   if (score > max_score) {
+   max_score = score;
+   max_node = node;
+   }
+   }
+   return max_node;
+   }
+
+   /*
+* Finding the preferred nid in a system with NUMA backplane
+* interconnect topology is more involved. The goal is to locate
+* tasks from numa_groups near each other in the system, and
+* untangle workloads from different sides of the system. This requires
+* searching down the hierarchy of node groups, recursively searching
+* inside the highest scoring group of nodes. The nodemask tricks
+* keep the complexity of the search down.
+*/
+   nodes = node_online_map;
+   for (hops = sched_domains_numa_levels; hops; hops--) {
+   unsigned long max_faults = 0;
+   nodemask_t max_group;
+   int a, b;
+
+   for_each_node_mask(a, nodes) {
+   unsigned long faults = 0;
+   nodemask_t this_group;
+   nodes_clear(this_group);
+
+   /* Sum group's NUMA faults; includes a==b case. */
+   for_each_node_mask(b, nodes) {
+   if (node_hops(a, b) < hops) {
+   faults += group_faults(p, b);
+   node_set(b, this_group);
+   node_clear(b, nodes);
+   }
+   }
+
+   /* Remember the top group. */
+   if (faults > max_faults) {
+   max_faults = faults;
+   max_group = this_group;
+   /*
+* subtle: once hops==1 there is just one
+* node left, which is the preferred nid.
+*/
+   nid = a;
+   }
+   }
+   /* Next round, evaluate the nodes within max_group. */
+   nodes = max_group;
+   }
+   return nid;
+}
+
 static void task_numa_placement(struct task_struct *p)
 {
int seq, nid, max_nid = -1, max_group_nid = -1;
@@ -1724,7 +1805,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_group) {
update_numa_active_node_mask(p->numa_group);
spin_unlock_irq(group_lock);
-   max_nid = max_group_nid;
+   max_nid = preferred_group_nid(p, max_group_nid);
}
 
if (max_faults) {
-- 
1.9.3


[PATCH 2/8] x86, fpu: unlazy_fpu: don't do __thread_fpu_end() if use_eager_fpu()

2015-02-06 Thread riel
From: Oleg Nesterov 

unlazy_fpu()->__thread_fpu_end() doesn't look right if use_eager_fpu().
Unconditional __thread_fpu_end() is only correct if we know that this
thread can't return to user-mode and use FPU.

Fortunately it has only 2 callers. fpu_copy() checks use_eager_fpu(),
and init_fpu(current) can be only called by the coredumping thread via
regset->get(). But it is exported to modules, and imo this should be
fixed anyway.

And if we check use_eager_fpu() we can use __save_fpu() like fpu_copy()
and save_init_fpu() do.

- It seems that even !use_eager_fpu() case doesn't need the unconditional
  __thread_fpu_end(), we only need it if __save_init_fpu() returns 0.

- It is still not clear to me if __save_init_fpu() can safely nest with
  another save + restore from __kernel_fpu_begin(). If not, we can use
  kernel_fpu_disable() to fix the race.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Rik van Riel 
---
 arch/x86/kernel/i387.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index c3b92c0975cd..8e070a6c30e5 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -120,8 +120,12 @@ void unlazy_fpu(struct task_struct *tsk)
 {
preempt_disable();
if (__thread_has_fpu(tsk)) {
-   __save_init_fpu(tsk);
-   __thread_fpu_end(tsk);
+   if (use_eager_fpu()) {
+   __save_fpu(tsk);
+   } else {
+   __save_init_fpu(tsk);
+   __thread_fpu_end(tsk);
+   }
}
preempt_enable();
 }
-- 
1.9.3


[PATCH 8/8] x86,fpu: also check fpu_lazy_restore when use_eager_fpu

2015-02-06 Thread riel
From: Rik van Riel 

With Oleg's patch "x86, fpu: don't abuse FPU in kernel threads if
use_eager_fpu()", kernel threads no longer have an FPU state, even
on systems with use_eager_fpu().

That in turn means that a task may still have its FPU state loaded
in the FPU registers, if only kernel threads ran on that CPU between
the moment the task went to sleep and the moment it woke up again.

In that case, there is no need to restore the FPU state for
this task, since it is still in the registers.

The kernel can simply use the same logic to determine this as
is used for !use_eager_fpu() systems.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 06af286593d7..723b74da0685 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -457,7 +457,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
task_disable_lazy_fpu_restore(old);
if (fpu.preload) {
new->thread.fpu_counter++;
-   if (!use_eager_fpu() && fpu_lazy_restore(new, cpu))
+   if (fpu_lazy_restore(new, cpu))
fpu.preload = 0;
else
prefetch(new->thread.fpu.state);
-- 
1.9.3


[PATCH 7/8] x86,fpu: use disable_task_lazy_fpu_restore helper

2015-02-06 Thread riel
From: Rik van Riel 

Replace magic assignments of fpu.last_cpu = ~0 with more explicit
disable_task_lazy_fpu_restore calls.

This also fixes the lazy FPU restore disabling in drop_fpu, which
only really works when !use_eager_fpu().  This is fine for now,
because fpu_lazy_restore() is only used when !use_eager_fpu()
currently, but we may want to expand that.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 4 ++--
 arch/x86/kernel/i387.c  | 2 +-
 arch/x86/kernel/process.c   | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 04063751ac80..06af286593d7 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -440,7 +440,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
 new->thread.fpu_counter > 5);
if (__thread_has_fpu(old)) {
if (!__save_init_fpu(old))
-   old->thread.fpu.last_cpu = ~0;
+   task_disable_lazy_fpu_restore(old);
else
old->thread.fpu.last_cpu = cpu;
old->thread.fpu.has_fpu = 0;/* But leave fpu_owner_task! */
@@ -454,7 +454,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
stts();
} else {
old->thread.fpu_counter = 0;
-   old->thread.fpu.last_cpu = ~0;
+   task_disable_lazy_fpu_restore(old);
if (fpu.preload) {
new->thread.fpu_counter++;
if (!use_eager_fpu() && fpu_lazy_restore(new, cpu))
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 8e070a6c30e5..8416b5f85806 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -250,7 +250,7 @@ int init_fpu(struct task_struct *tsk)
if (tsk_used_math(tsk)) {
if (cpu_has_fpu && tsk == current)
unlazy_fpu(tsk);
-   tsk->thread.fpu.last_cpu = ~0;
+   task_disable_lazy_fpu_restore(tsk);
return 0;
}
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dd9a069a5ec5..83480373a642 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -68,8 +68,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct 
task_struct *src)
 
dst->thread.fpu_counter = 0;
dst->thread.fpu.has_fpu = 0;
-   dst->thread.fpu.last_cpu = ~0;
dst->thread.fpu.state = NULL;
+   task_disable_lazy_fpu_restore(dst);
if (tsk_used_math(src)) {
int err = fpu_alloc(&dst->thread.fpu);
if (err)
-- 
1.9.3


[PATCH 0/8] x86,fpu: various small FPU cleanups and optimizations

2015-02-06 Thread riel
This includes the three patches by Oleg that are not in -tip yet,
and five more by myself.

I believe the changes to my patches address all the comments by
reviewers on the previous version.


[PATCH 5/8] x86,fpu: introduce task_disable_lazy_fpu_restore helper

2015-02-06 Thread riel
From: Rik van Riel 

Currently there are a few magic assignments sprinkled through the
code that disable lazy FPU state restoring, some more effective than
others, and all equally mystifying.

It would be easier to have a helper to explicitly disable lazy
FPU state restoring for a task.
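
The conversions in the later patches of this series then boil down to
replacing magic assignments like

	tsk->thread.fpu.last_cpu = ~0;

with

	task_disable_lazy_fpu_restore(tsk);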

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 439ac3921a1e..c1f66261ad12 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -79,6 +79,16 @@ static inline void __cpu_disable_lazy_restore(unsigned int 
cpu)
per_cpu(fpu_owner_task, cpu) = NULL;
 }
 
+/*
+ * Used to indicate that the FPU state in memory is newer than the FPU
+ * state in registers, and the FPU state should be reloaded next time the
+ * task is run. Only safe on the current task, or non-running tasks.
+ */
+static inline void task_disable_lazy_fpu_restore(struct task_struct *tsk)
+{
+   tsk->thread.fpu.last_cpu = ~0;
+}
+
 static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 {
return new == this_cpu_read_stable(fpu_owner_task) &&
-- 
1.9.3


[PATCH 6/8] x86,fpu: use an explicit if/else in switch_fpu_prepare

2015-02-06 Thread riel
From: Rik van Riel 

Use an explicit if/else branch after __save_init_fpu(old) in
switch_fpu_prepare.  This makes substituting the assignment
with a call to task_disable_lazy_fpu() in the next patch easier
to review.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index c1f66261ad12..04063751ac80 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -440,8 +440,9 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
 new->thread.fpu_counter > 5);
if (__thread_has_fpu(old)) {
if (!__save_init_fpu(old))
-   cpu = ~0;
-   old->thread.fpu.last_cpu = cpu;
+   old->thread.fpu.last_cpu = ~0;
+   else
+   old->thread.fpu.last_cpu = cpu;
old->thread.fpu.has_fpu = 0;/* But leave fpu_owner_task! */
 
/* Don't change CR0.TS if we just switch! */
-- 
1.9.3


[PATCH 3/8] x86, fpu: kill save_init_fpu(), change math_error() to use unlazy_fpu()

2015-02-06 Thread riel
From: Oleg Nesterov 

math_error() calls save_init_fpu() after conditional_sti(), this means
that the caller can be preempted. If !use_eager_fpu() we can hit the
WARN_ON_ONCE(!__thread_has_fpu(tsk)) and/or save the wrong FPU state.

Change math_error() to use unlazy_fpu() and kill save_init_fpu().

Signed-off-by: Oleg Nesterov 
Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 18 --
 arch/x86/kernel/traps.c |  2 +-
 2 files changed, 1 insertion(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 0dbc08282291..27d00e04f911 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -520,24 +520,6 @@ static inline void __save_fpu(struct task_struct *tsk)
 }
 
 /*
- * These disable preemption on their own and are safe
- */
-static inline void save_init_fpu(struct task_struct *tsk)
-{
-   WARN_ON_ONCE(!__thread_has_fpu(tsk));
-
-   if (use_eager_fpu()) {
-   __save_fpu(tsk);
-   return;
-   }
-
-   preempt_disable();
-   __save_init_fpu(tsk);
-   __thread_fpu_end(tsk);
-   preempt_enable();
-}
-
-/*
  * i387 state interaction
  */
 static inline unsigned short get_fpu_cwd(struct task_struct *tsk)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index fb4cb6adf225..51c465846f06 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -663,7 +663,7 @@ static void math_error(struct pt_regs *regs, int 
error_code, int trapnr)
/*
 * Save the info for the exception handler and clear the error.
 */
-   save_init_fpu(task);
+   unlazy_fpu(task);
task->thread.trap_nr = trapnr;
task->thread.error_code = error_code;
info.si_signo = SIGFPE;
-- 
1.9.3


[PATCH 1/8] x86, fpu: unlazy_fpu: don't reset thread.fpu_counter

2015-02-06 Thread riel
From: Oleg Nesterov 

It is not clear why the "else" branch clears ->fpu_counter; this makes
no sense.

If use_eager_fpu() then this has no effect. Otherwise, if we actually
wanted to prevent fpu preload after the context switch we would need to
reset it unconditionally, even if __thread_has_fpu().

Signed-off-by: Oleg Nesterov 
Signed-off-by: Rik van Riel 
---
 arch/x86/kernel/i387.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 47348653503a..c3b92c0975cd 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -122,8 +122,7 @@ void unlazy_fpu(struct task_struct *tsk)
if (__thread_has_fpu(tsk)) {
__save_init_fpu(tsk);
__thread_fpu_end(tsk);
-   } else
-   tsk->thread.fpu_counter = 0;
+   }
preempt_enable();
 }
 EXPORT_SYMBOL(unlazy_fpu);
-- 
1.9.3


[PATCH 4/8] x86,fpu: move lazy restore functions up a few lines

2015-02-06 Thread riel
From: Rik van Riel 

We need another lazy restore related function, that will be called
from a function that is above where the lazy restore functions are
now. It would be nice to keep all three functions grouped together.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 27d00e04f911..439ac3921a1e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -67,6 +67,24 @@ extern void finit_soft_fpu(struct i387_soft_struct *soft);
 static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
 #endif
 
+/*
+ * Must be run with preemption disabled: this clears the fpu_owner_task,
+ * on this CPU.
+ *
+ * This will disable any lazy FPU state restore of the current FPU state,
+ * but if the current thread owns the FPU, it will still be saved by.
+ */
+static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+{
+   per_cpu(fpu_owner_task, cpu) = NULL;
+}
+
+static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
+{
+   return new == this_cpu_read_stable(fpu_owner_task) &&
+   cpu == new->thread.fpu.last_cpu;
+}
+
 static inline int is_ia32_compat_frame(void)
 {
return config_enabled(CONFIG_IA32_EMULATION) &&
@@ -400,24 +418,6 @@ static inline void drop_init_fpu(struct task_struct *tsk)
  */
 typedef struct { int preload; } fpu_switch_t;
 
-/*
- * Must be run with preemption disabled: this clears the fpu_owner_task,
- * on this CPU.
- *
- * This will disable any lazy FPU state restore of the current FPU state,
- * but if the current thread owns the FPU, it will still be saved by.
- */
-static inline void __cpu_disable_lazy_restore(unsigned int cpu)
-{
-   per_cpu(fpu_owner_task, cpu) = NULL;
-}
-
-static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
-{
-   return new == this_cpu_read_stable(fpu_owner_task) &&
-   cpu == new->thread.fpu.last_cpu;
-}
-
 static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct 
task_struct *new, int cpu)
 {
fpu_switch_t fpu;
-- 
1.9.3


[PATCH -v3 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-09 Thread riel
When running a KVM guest on a system with NOHZ_FULL enabled, and the
KVM guest running with idle=poll mode, we still get wakeups of the
rcuos/N threads.

This problem has already been solved for user space by telling the
RCU subsystem that the CPU is in an extended quiescent state while
running user space code.

This patch series extends that code a little bit to make it usable
to track KVM guest space, too.

I tested the code by booting a KVM guest with idle=poll, on a system
with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
up several times, while the rcuos thread on the CPU running the bound
and always running VCPU thread never got woken up once.

Thanks to Christian Borntraeger and Paul McKenney for reviewing the
first version of this patch series, and helping optimize patch 4/5.
Thanks to Frederic Weisbecker for further enhancements.

Apologies to Catalin and Will for not fixing up ARM. I am not
familiar with ARM assembly, and not sure how to pass a constant
argument to a function from assembly code on ARM :)


[PATCH 1/6] rcu,nohz: add state parameter to context_tracking_user_enter/exit

2015-02-09 Thread riel
From: Rik van Riel 

Add the expected ctx_state as a parameter to context_tracking_user_enter
and context_tracking_user_exit, allowing the same functions to not just
track kernel <> user space switching, but also kernel <> guest transitions.

Catalin, Will: this patch and the next one break ARM; entry.S needs to be
modified to add the IN_USER parameter to context_tracking_enter & _exit.

Cc: catalin.mari...@arm.com
Cc: will.dea...@arm.com
Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 12 ++--
 kernel/context_tracking.c| 10 +-
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 37b81bd51ec0..bd9f000fc98d 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,21 +10,21 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
-extern void context_tracking_user_enter(void);
-extern void context_tracking_user_exit(void);
+extern void context_tracking_user_enter(enum ctx_state state);
+extern void context_tracking_user_exit(enum ctx_state state);
 extern void __context_tracking_task_switch(struct task_struct *prev,
   struct task_struct *next);
 
 static inline void user_enter(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_enter();
+   context_tracking_user_enter(IN_USER);
 
 }
 static inline void user_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_exit();
+   context_tracking_user_exit(IN_USER);
 }
 
 static inline enum ctx_state exception_enter(void)
@@ -35,7 +35,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit();
+   context_tracking_user_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -44,7 +44,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
if (prev_ctx == IN_USER)
-   context_tracking_user_enter();
+   context_tracking_user_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 937ecdfdf258..4c010787c9ec 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -47,7 +47,7 @@ void context_tracking_cpu_set(int cpu)
  * to execute won't use any RCU read side critical section because this
  * function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(void)
+void context_tracking_user_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -75,7 +75,7 @@ void context_tracking_user_enter(void)
WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
-   if ( __this_cpu_read(context_tracking.state) != IN_USER) {
+   if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
trace_user_enter(0);
/*
@@ -101,7 +101,7 @@ void context_tracking_user_enter(void)
 * OTOH we can spare the calls to vtime and RCU when 
context_tracking.active
 * is false because we know that CPU is not tickless.
 */
-   __this_cpu_write(context_tracking.state, IN_USER);
+   __this_cpu_write(context_tracking.state, state);
}
local_irq_restore(flags);
 }
@@ -118,7 +118,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_user_exit(void)
+void context_tracking_user_exit(enum ctx_state state)
 {
unsigned long flags;
 
@@ -129,7 +129,7 @@ void context_tracking_user_exit(void)
return;
 
local_irq_save(flags);
-   if (__this_cpu_read(context_tracking.state) == IN_USER) {
+   if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
 * We are going to run code that may use RCU. Inform
-- 
1.9.3


[PATCH 4/6] nohz,kvm: export context_tracking_user_enter/exit

2015-02-09 Thread riel
From: Rik van Riel 

Export context_tracking_user_enter/exit so it can be used by KVM.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 2d94147c07b2..8c5f2e939eee 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -108,6 +108,7 @@ void context_tracking_enter(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_enter);
+EXPORT_SYMBOL_GPL(context_tracking_enter);
 
 /**
  * context_tracking_exit - Inform the context tracking that the CPU is
@@ -149,6 +150,7 @@ void context_tracking_exit(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
+EXPORT_SYMBOL_GPL(context_tracking_exit);
 
 /**
  * __context_tracking_task_switch - context switch the syscall callbacks
-- 
1.9.3


[PATCH 6/6] nohz: add stub context_tracking_is_enabled

2015-02-09 Thread riel
From: Rik van Riel 

With code elsewhere doing things conditionally on whether or not
context tracking is enabled, we want stub functions that report
context tracking as disabled when CONFIG_CONTEXT_TRACKING is
not set.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking_state.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index f3ef027af749..90a7bab8779e 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -40,6 +40,8 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_active(void) { return false; }
+static inline bool context_tracking_is_enabled(void) { return false; }
+static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
 #endif
-- 
1.9.3


[PATCH 2/6] rcu,nohz: rename context_tracking_enter & _exit

2015-02-09 Thread riel
From: Rik van Riel 

Rename context_tracking_user_enter & context_tracking_user_exit
to just context_tracking_enter & context_tracking_exit, since it
will be used to track guest state, too.

This also breaks ARM. The rest of the series does not look like
it impacts ARM.

Cc: will.dea...@arm.com
Cc: catalin.mari...@arm.com
Suggested-by: Frederic Weisbecker 
Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 12 ++--
 kernel/context_tracking.c| 31 ---
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index bd9f000fc98d..29d7fecb365a 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,21 +10,21 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
-extern void context_tracking_user_enter(enum ctx_state state);
-extern void context_tracking_user_exit(enum ctx_state state);
+extern void context_tracking_enter(enum ctx_state state);
+extern void context_tracking_exit(enum ctx_state state);
 extern void __context_tracking_task_switch(struct task_struct *prev,
   struct task_struct *next);
 
 static inline void user_enter(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_enter(IN_USER);
+   context_tracking_enter(IN_USER);
 
 }
 static inline void user_exit(void)
 {
if (context_tracking_is_enabled())
-   context_tracking_user_exit(IN_USER);
+   context_tracking_exit(IN_USER);
 }
 
 static inline enum ctx_state exception_enter(void)
@@ -35,7 +35,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit(prev_ctx);
+   context_tracking_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -44,7 +44,7 @@ static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
if (prev_ctx == IN_USER)
-   context_tracking_user_enter(prev_ctx);
+   context_tracking_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 4c010787c9ec..e031e8c0fb91 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -39,15 +39,15 @@ void context_tracking_cpu_set(int cpu)
 }
 
 /**
- * context_tracking_user_enter - Inform the context tracking that the CPU is 
going to
- *   enter userspace mode.
+ * context_tracking_enter - Inform the context tracking that the CPU is going
+ *  to enter user or guest space mode.
  *
  * This function must be called right before we switch from the kernel
- * to userspace, when it's guaranteed the remaining kernel instructions
- * to execute won't use any RCU read side critical section because this
- * function sets RCU in extended quiescent state.
+ * to user or guest space, when it's guaranteed the remaining kernel
+ * instructions to execute won't use any RCU read side critical section
+ * because this function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(enum ctx_state state)
+void context_tracking_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -105,20 +105,21 @@ void context_tracking_user_enter(enum ctx_state state)
}
local_irq_restore(flags);
 }
-NOKPROBE_SYMBOL(context_tracking_user_enter);
+NOKPROBE_SYMBOL(context_tracking_enter);
 
 /**
- * context_tracking_user_exit - Inform the context tracking that the CPU is
- *  exiting userspace mode and entering the kernel.
+ * context_tracking_exit - Inform the context tracking that the CPU is
+ * exiting user or guest mode and entering the kernel.
  *
- * This function must be called after we entered the kernel from userspace
- * before any use of RCU read side critical section. This potentially include
- * any high level kernel code like syscalls, exceptions, signal handling, 
etc...
+ * This function must be called after we entered the kernel from user or
+ * guest space before any use of RCU read side critical section. This
+ * potentially include any high level kernel code like syscalls, exceptions,
+ * signal handling, etc...
  *
  * This call supports re-entrancy. This way it can be called from any exception
- * handler without needing to know if we came from userspace or not.
+ * handler without needing to know if we came from user or guest space or not.
  */
-void context_tracking_user_exit(enum ctx_state state)
+void context_tracking_exit(enum ctx_state state)
 {
unsigned long flags;
 
@@ -143,7 +144,7 @@ void context_tracking_user_exit(enum ctx_state state)
}
local_irq_restore(flags);
 }
-NOKPROBE_SYMBOL(context_tracking_user_exit);
+NOKPROBE_SYMBOL(context_tracking_exit);

[PATCH 3/6] rcu,nohz: run vtime_user_enter/exit only when state == IN_USER

2015-02-09 Thread riel
From: Rik van Riel 

Only run vtime_user_enter, vtime_user_exit, and the user enter & exit
trace points when we are entering or exiting user state, respectively.

The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index e031e8c0fb91..2d94147c07b2 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -77,7 +77,6 @@ void context_tracking_enter(enum ctx_state state)
local_irq_save(flags);
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
-   trace_user_enter(0);
/*
 * At this stage, only low level arch entry code 
remains and
 * then we'll run in userspace. We can assume there 
won't be
@@ -85,7 +84,10 @@ void context_tracking_enter(enum ctx_state state)
 * user_exit() or rcu_irq_enter(). Let's remove RCU's 
dependency
 * on the tick.
 */
-   vtime_user_enter(current);
+   if (state == IN_USER) {
+   trace_user_enter(0);
+   vtime_user_enter(current);
+   }
rcu_user_enter();
}
/*
@@ -137,8 +139,10 @@ void context_tracking_exit(enum ctx_state state)
 * RCU core about that (ie: we may need the tick again).
 */
rcu_user_exit();
-   vtime_user_exit(current);
-   trace_user_exit(0);
+   if (state == IN_USER) {
+   vtime_user_exit(current);
+   trace_user_exit(0);
+   }
}
__this_cpu_write(context_tracking.state, IN_KERNEL);
}
-- 
1.9.3


[PATCH 5/6] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-09 Thread riel
From: Rik van Riel 

The host kernel is not doing anything while the CPU is executing
a KVM guest VCPU, so it can be marked as being in an extended
quiescent state, identical to that used when running user space
code.

The only exception to that rule is when the host handles an
interrupt, which is already handled by the irq code, which
calls rcu_irq_enter and rcu_irq_exit.

The guest_enter and guest_exit functions already switch vtime
accounting independent of context tracking, so leave those calls
where they are, instead of moving them into the context tracking
code.
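
The resulting call pattern around guest execution then looks roughly like
this (a simplified sketch; the exact placement is architecture specific):

	kvm_guest_enter();  /* with context tracking: RCU extended quiescent state */
	/* arch specific code hands the CPU to the guest */
	kvm_guest_exit();   /* back in the kernel, RCU is usable again */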

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h   | 8 +++-
 include/linux/context_tracking_state.h | 1 +
 include/linux/kvm_host.h   | 3 ++-
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 29d7fecb365a..c70d7e760061 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -43,7 +43,7 @@ static inline enum ctx_state exception_enter(void)
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
-   if (prev_ctx == IN_USER)
+   if (prev_ctx != IN_KERNEL)
context_tracking_enter(prev_ctx);
}
 }
@@ -74,6 +74,9 @@ static inline void context_tracking_init(void) { }
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
 static inline void guest_enter(void)
 {
+   if (context_tracking_is_enabled())
+   context_tracking_enter(IN_GUEST);
+
if (vtime_accounting_enabled())
vtime_guest_enter(current);
else
@@ -86,6 +89,9 @@ static inline void guest_exit(void)
vtime_guest_exit(current);
else
current->flags &= ~PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_exit(IN_GUEST);
 }
 
 #else
diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 97a81225d037..f3ef027af749 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -15,6 +15,7 @@ struct context_tracking {
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
+   IN_GUEST,
} state;
 };
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26f106022c88..c7828a6a9614 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   rcu_virt_note_context_switch(smp_processor_id());
+   if (!context_tracking_cpu_is_enabled())
+   rcu_virt_note_context_switch(smp_processor_id());
 }
 
 static inline void kvm_guest_exit(void)
-- 
1.9.3


[PATCH -v4 0/6] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-10 Thread riel
When running a KVM guest on a system with NOHZ_FULL enabled, and the
KVM guest running with idle=poll mode, we still get wakeups of the
rcuos/N threads.

This problem has already been solved for user space by telling the
RCU subsystem that the CPU is in an extended quiescent state while
running user space code.

This patch series extends that code a little bit to make it usable
to track KVM guest space, too.

I tested the code by booting a KVM guest with idle=poll, on a system
with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
up several times, while the rcuos thread on the CPU running the bound
and always running VCPU thread never got woken up once.

Thanks to Christian Borntraeger, Paul McKenney, Paulo Bonzini,
Frederic Weisbecker, and Will Deacon for reviewing and improving
earlier versions of this patch series.


[PATCH 1/6] rcu,nohz: add context_tracking_user_enter/exit wrapper functions

2015-02-10 Thread riel
From: Rik van Riel 

These wrapper functions allow architecture code (e.g. ARM) to keep
calling context_tracking_user_enter & context_tracking_user_exit
the same way it always has, without error-prone tricks like duplicate
defines of argument values in assembly code.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h |  2 ++
 kernel/context_tracking.c| 37 +
 2 files changed, 27 insertions(+), 12 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 37b81bd51ec0..03b9c733eae7 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,6 +10,8 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
+extern void context_tracking_enter(void);
+extern void context_tracking_exit(void);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 937ecdfdf258..bbdc423936e6 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -39,15 +39,15 @@ void context_tracking_cpu_set(int cpu)
 }
 
 /**
- * context_tracking_user_enter - Inform the context tracking that the CPU is 
going to
- *   enter userspace mode.
+ * context_tracking_enter - Inform the context tracking that the CPU is going
+ *  enter user or guest space mode.
  *
  * This function must be called right before we switch from the kernel
- * to userspace, when it's guaranteed the remaining kernel instructions
- * to execute won't use any RCU read side critical section because this
- * function sets RCU in extended quiescent state.
+ * to user or guest space, when it's guaranteed the remaining kernel
+ * instructions to execute won't use any RCU read side critical section
+ * because this function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(void)
+void context_tracking_enter(void)
 {
unsigned long flags;
 
@@ -105,20 +105,27 @@ void context_tracking_user_enter(void)
}
local_irq_restore(flags);
 }
+NOKPROBE_SYMBOL(context_tracking_enter);
+
+void context_tracking_user_enter(void)
+{
+   context_tracking_enter();
+}
 NOKPROBE_SYMBOL(context_tracking_user_enter);
 
 /**
- * context_tracking_user_exit - Inform the context tracking that the CPU is
- *  exiting userspace mode and entering the kernel.
+ * context_tracking_exit - Inform the context tracking that the CPU is
+ * exiting user or guest mode and entering the kernel.
  *
- * This function must be called after we entered the kernel from userspace
- * before any use of RCU read side critical section. This potentially include
- * any high level kernel code like syscalls, exceptions, signal handling, 
etc...
+ * This function must be called after we entered the kernel from user or
+ * guest space before any use of RCU read side critical section. This
+ * potentially include any high level kernel code like syscalls, exceptions,
+ * signal handling, etc...
  *
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_user_exit(void)
+void context_tracking_exit(void)
 {
unsigned long flags;
 
@@ -143,6 +150,12 @@ void context_tracking_user_exit(void)
}
local_irq_restore(flags);
 }
+NOKPROBE_SYMBOL(context_tracking_exit);
+
+void context_tracking_user_exit(void)
+{
+   context_tracking_exit();
+}
 NOKPROBE_SYMBOL(context_tracking_user_exit);
 
 /**
-- 
1.9.3



[PATCH 3/6] nohz: add stub context_tracking_is_enabled

2015-02-10 Thread riel
From: Rik van Riel 

With code elsewhere doing something conditional on whether or not
context tracking is enabled, we want a stub function that reports
context tracking as not enabled when CONFIG_CONTEXT_TRACKING is
not set.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking_state.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 97a81225d037..72ab10fe1e46 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,6 +39,8 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_active(void) { return false; }
+static inline bool context_tracking_is_enabled(void) { return false; }
+static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
 #endif
-- 
1.9.3



[PATCH 4/6] rcu,nohz: run vtime_user_enter/exit only when state == IN_USER

2015-02-10 Thread riel
From: Rik van Riel 

Only run vtime_user_enter, vtime_user_exit, and the user enter & exit
trace points when we are entering or exiting user state, respectively.

The KVM code in guest_enter and guest_exit already takes care of calling
vtime_guest_enter and vtime_guest_exit, respectively.

The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 38e38aeac8b9..0e4e318d5ea4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -77,7 +77,6 @@ void context_tracking_enter(enum ctx_state state)
local_irq_save(flags);
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
-   trace_user_enter(0);
/*
 * At this stage, only low level arch entry code 
remains and
 * then we'll run in userspace. We can assume there 
won't be
@@ -85,7 +84,10 @@ void context_tracking_enter(enum ctx_state state)
 * user_exit() or rcu_irq_enter(). Let's remove RCU's 
dependency
 * on the tick.
 */
-   vtime_user_enter(current);
+   if (state == IN_USER) {
+   trace_user_enter(0);
+   vtime_user_enter(current);
+   }
rcu_user_enter();
}
/*
@@ -143,8 +145,10 @@ void context_tracking_exit(enum ctx_state state)
 * RCU core about that (ie: we may need the tick again).
 */
rcu_user_exit();
-   vtime_user_exit(current);
-   trace_user_exit(0);
+   if (state == IN_USER) {
+   vtime_user_exit(current);
+   trace_user_exit(0);
+   }
}
__this_cpu_write(context_tracking.state, IN_KERNEL);
}
-- 
1.9.3



[PATCH 2/6] rcu,nohz: add state parameter to context_tracking_enter/exit

2015-02-10 Thread riel
From: Rik van Riel 

Add the expected ctx_state as a parameter to context_tracking_enter and
context_tracking_exit, allowing the same functions to not just track
kernel <> user space switching, but also kernel <> guest transitions.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h | 10 +-
 kernel/context_tracking.c| 14 +++---
 2 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 03b9c733eae7..954253283709 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,8 +10,8 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
-extern void context_tracking_enter(void);
-extern void context_tracking_exit(void);
+extern void context_tracking_enter(enum ctx_state state);
+extern void context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -37,7 +37,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit();
+   context_tracking_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -45,8 +45,8 @@ static inline enum ctx_state exception_enter(void)
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
-   if (prev_ctx == IN_USER)
-   context_tracking_user_enter();
+   if (prev_ctx != IN_KERNEL)
+   context_tracking_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index bbdc423936e6..38e38aeac8b9 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -47,7 +47,7 @@ void context_tracking_cpu_set(int cpu)
  * instructions to execute won't use any RCU read side critical section
  * because this function sets RCU in extended quiescent state.
  */
-void context_tracking_enter(void)
+void context_tracking_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -75,7 +75,7 @@ void context_tracking_enter(void)
WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
-   if ( __this_cpu_read(context_tracking.state) != IN_USER) {
+   if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
trace_user_enter(0);
/*
@@ -101,7 +101,7 @@ void context_tracking_enter(void)
 * OTOH we can spare the calls to vtime and RCU when 
context_tracking.active
 * is false because we know that CPU is not tickless.
 */
-   __this_cpu_write(context_tracking.state, IN_USER);
+   __this_cpu_write(context_tracking.state, state);
}
local_irq_restore(flags);
 }
@@ -109,7 +109,7 @@ NOKPROBE_SYMBOL(context_tracking_enter);
 
 void context_tracking_user_enter(void)
 {
-   context_tracking_enter();
+   context_tracking_enter(IN_USER);
 }
 NOKPROBE_SYMBOL(context_tracking_user_enter);
 
@@ -125,7 +125,7 @@ NOKPROBE_SYMBOL(context_tracking_user_enter);
  * This call supports re-entrancy. This way it can be called from any exception
  * handler without needing to know if we came from userspace or not.
  */
-void context_tracking_exit(void)
+void context_tracking_exit(enum ctx_state state)
 {
unsigned long flags;
 
@@ -136,7 +136,7 @@ void context_tracking_exit(void)
return;
 
local_irq_save(flags);
-   if (__this_cpu_read(context_tracking.state) == IN_USER) {
+   if (__this_cpu_read(context_tracking.state) == state) {
if (__this_cpu_read(context_tracking.active)) {
/*
 * We are going to run code that may use RCU. Inform
@@ -154,7 +154,7 @@ NOKPROBE_SYMBOL(context_tracking_exit);
 
 void context_tracking_user_exit(void)
 {
-   context_tracking_exit();
+   context_tracking_exit(IN_USER);
 }
 NOKPROBE_SYMBOL(context_tracking_user_exit);
 
-- 
1.9.3



[PATCH 5/6] nohz,kvm: export context_tracking_user_enter/exit

2015-02-10 Thread riel
From: Rik van Riel 

Export context_tracking_user_enter/exit so it can be used by KVM.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0e4e318d5ea4..5bdf1a342ab3 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -108,6 +108,7 @@ void context_tracking_enter(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_enter);
+EXPORT_SYMBOL_GPL(context_tracking_enter);
 
 void context_tracking_user_enter(void)
 {
@@ -155,6 +156,7 @@ void context_tracking_exit(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
+EXPORT_SYMBOL_GPL(context_tracking_exit);
 
 void context_tracking_user_exit(void)
 {
-- 
1.9.3



[PATCH 6/6] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-10 Thread riel
From: Rik van Riel 

The host kernel is not doing anything while the CPU is executing
a KVM guest VCPU, so it can be marked as being in an extended
quiescent state, identical to that used when running user space
code.

The only exception to that rule is when the host handles an
interrupt, which is already handled by the irq code, which
calls rcu_irq_enter and rcu_irq_exit.

The guest_enter and guest_exit functions already switch vtime
accounting independent of context tracking. Leave those calls
where they are, instead of moving them into the context tracking
code.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h   | 6 ++
 include/linux/context_tracking_state.h | 1 +
 include/linux/kvm_host.h   | 3 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 954253283709..b65fd1420e53 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -80,10 +80,16 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_enter(IN_GUEST);
 }
 
 static inline void guest_exit(void)
 {
+   if (context_tracking_is_enabled())
+   context_tracking_exit(IN_GUEST);
+
if (vtime_accounting_enabled())
vtime_guest_exit(current);
else
diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 72ab10fe1e46..90a7bab8779e 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -15,6 +15,7 @@ struct context_tracking {
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
+   IN_GUEST,
} state;
 };
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26f106022c88..c7828a6a9614 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   rcu_virt_note_context_switch(smp_processor_id());
+   if (!context_tracking_cpu_is_enabled())
+   rcu_virt_note_context_switch(smp_processor_id());
 }
 
 static inline void kvm_guest_exit(void)
-- 
1.9.3



[PATCH 0/3] cleanups to the disable lazy fpu restore code

2015-01-30 Thread riel
These go on top of Oleg's patches from yesterday.

The mechanism to disable lazy FPU restore is inscrutable
in several places, and dubious at best in one.

These patches make things explicit.



[PATCH 1/3] x86,fpu: move lazy restore functions up a few lines

2015-01-30 Thread riel
From: Rik van Riel 

We need another lazy-restore-related function that will be called
from a function above where the lazy restore functions are now.
It would be nice to keep all three functions grouped together.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 36 ++--
 1 file changed, 18 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 27d00e04f911..439ac3921a1e 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -67,6 +67,24 @@ extern void finit_soft_fpu(struct i387_soft_struct *soft);
 static inline void finit_soft_fpu(struct i387_soft_struct *soft) {}
 #endif
 
+/*
+ * Must be run with preemption disabled: this clears the fpu_owner_task,
+ * on this CPU.
+ *
+ * This will disable any lazy FPU state restore of the current FPU state,
+ * but if the current thread owns the FPU, it will still be saved by.
+ */
+static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+{
+   per_cpu(fpu_owner_task, cpu) = NULL;
+}
+
+static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
+{
+   return new == this_cpu_read_stable(fpu_owner_task) &&
+   cpu == new->thread.fpu.last_cpu;
+}
+
 static inline int is_ia32_compat_frame(void)
 {
return config_enabled(CONFIG_IA32_EMULATION) &&
@@ -400,24 +418,6 @@ static inline void drop_init_fpu(struct task_struct *tsk)
  */
 typedef struct { int preload; } fpu_switch_t;
 
-/*
- * Must be run with preemption disabled: this clears the fpu_owner_task,
- * on this CPU.
- *
- * This will disable any lazy FPU state restore of the current FPU state,
- * but if the current thread owns the FPU, it will still be saved by.
- */
-static inline void __cpu_disable_lazy_restore(unsigned int cpu)
-{
-   per_cpu(fpu_owner_task, cpu) = NULL;
-}
-
-static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
-{
-   return new == this_cpu_read_stable(fpu_owner_task) &&
-   cpu == new->thread.fpu.last_cpu;
-}
-
 static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct 
task_struct *new, int cpu)
 {
fpu_switch_t fpu;
-- 
1.9.3



[PATCH 3/3] x86,fpu: use disable_task_lazy_fpu_restore helper

2015-01-30 Thread riel
From: Rik van Riel 

Replace magic assignments of fpu.last_cpu = ~0 with more explicit
task_disable_lazy_fpu_restore() calls.

This also fixes the lazy FPU restore disabling in drop_fpu, which
only really works when !use_eager_fpu().  This is fine for now,
because fpu_lazy_restore() is only used when !use_eager_fpu()
currently, but we may want to expand that.
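
For reference, a small stand-alone C model (simplified, with a made-up
toy_task struct rather than the kernel's task_struct) of why pointing
last_cpu at ~0 defeats the lazy restore check:

#include <stdio.h>

/* Simplified stand-in for the relevant part of the task struct. */
struct toy_task {
        unsigned int last_cpu;
};

/* Mirrors the cpu comparison in fpu_lazy_restore(); the real check also
 * requires the task to still be this CPU's fpu_owner_task. */
static int toy_fpu_lazy_restore(const struct toy_task *t, unsigned int cpu)
{
        return cpu == t->last_cpu;
}

int main(void)
{
        struct toy_task t = { .last_cpu = 2 };

        printf("lazy restore on cpu 2: %d\n", toy_fpu_lazy_restore(&t, 2)); /* 1 */

        t.last_cpu = ~0;        /* what task_disable_lazy_fpu_restore() does */
        printf("after disabling:       %d\n", toy_fpu_lazy_restore(&t, 2)); /* 0 */
        return 0;
}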

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 9 +
 arch/x86/kernel/i387.c  | 2 +-
 arch/x86/kernel/process.c   | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index c1f66261ad12..e2832f9dfed5 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -396,7 +396,7 @@ static inline void drop_fpu(struct task_struct *tsk)
 * Forget coprocessor state..
 */
preempt_disable();
-   tsk->thread.fpu_counter = 0;
+   task_disable_lazy_fpu_restore(tsk);
__drop_fpu(tsk);
clear_used_math();
preempt_enable();
@@ -440,8 +440,9 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
 new->thread.fpu_counter > 5);
if (__thread_has_fpu(old)) {
if (!__save_init_fpu(old))
-   cpu = ~0;
-   old->thread.fpu.last_cpu = cpu;
+   task_disable_lazy_fpu_restore(old);
+   else
+   old->thread.fpu.last_cpu = cpu;
old->thread.fpu.has_fpu = 0;/* But leave fpu_owner_task! */
 
/* Don't change CR0.TS if we just switch! */
@@ -453,7 +454,7 @@ static inline fpu_switch_t switch_fpu_prepare(struct 
task_struct *old, struct ta
stts();
} else {
old->thread.fpu_counter = 0;
-   old->thread.fpu.last_cpu = ~0;
+   task_disable_lazy_fpu_restore(old);
if (fpu.preload) {
new->thread.fpu_counter++;
if (!use_eager_fpu() && fpu_lazy_restore(new, cpu))
diff --git a/arch/x86/kernel/i387.c b/arch/x86/kernel/i387.c
index 8e070a6c30e5..8416b5f85806 100644
--- a/arch/x86/kernel/i387.c
+++ b/arch/x86/kernel/i387.c
@@ -250,7 +250,7 @@ int init_fpu(struct task_struct *tsk)
if (tsk_used_math(tsk)) {
if (cpu_has_fpu && tsk == current)
unlazy_fpu(tsk);
-   tsk->thread.fpu.last_cpu = ~0;
+   task_disable_lazy_fpu_restore(tsk);
return 0;
}
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index dd9a069a5ec5..83480373a642 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -68,8 +68,8 @@ int arch_dup_task_struct(struct task_struct *dst, struct 
task_struct *src)
 
dst->thread.fpu_counter = 0;
dst->thread.fpu.has_fpu = 0;
-   dst->thread.fpu.last_cpu = ~0;
dst->thread.fpu.state = NULL;
+   task_disable_lazy_fpu_restore(dst);
if (tsk_used_math(src)) {
int err = fpu_alloc(&dst->thread.fpu);
if (err)
-- 
1.9.3



[PATCH 2/3] x86,fpu: introduce task_disable_lazy_fpu_restore helper

2015-01-30 Thread riel
From: Rik van Riel 

Currently there are a few magic assignments sprinkled through the
code that disable lazy FPU state restoring, some more effective than
others, and all equally mystifying.

It would be easier to have a helper to explicitly disable lazy
FPU state restoring for a task.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu-internal.h | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/fpu-internal.h 
b/arch/x86/include/asm/fpu-internal.h
index 439ac3921a1e..c1f66261ad12 100644
--- a/arch/x86/include/asm/fpu-internal.h
+++ b/arch/x86/include/asm/fpu-internal.h
@@ -79,6 +79,16 @@ static inline void __cpu_disable_lazy_restore(unsigned int 
cpu)
per_cpu(fpu_owner_task, cpu) = NULL;
 }
 
+/*
+ * Used to indicate that the FPU state in memory is newer than the FPU
+ * state in registers, and the FPU state should be reloaded next time the
+ * task is run. Only safe on the current task, or non-running tasks.
+ */
+static inline void task_disable_lazy_fpu_restore(struct task_struct *tsk)
+{
+   tsk->thread.fpu.last_cpu = ~0;
+}
+
 static inline int fpu_lazy_restore(struct task_struct *new, unsigned int cpu)
 {
return new == this_cpu_read_stable(fpu_owner_task) &&
-- 
1.9.3



[PATCH 2/2] cpusets,isolcpus: add file to show isolated cpus in cpuset

2015-02-23 Thread riel
From: Rik van Riel 

The previous patch makes the code skip over isolcpus when building
scheduler load balancing domains. This makes it hard for a user to
see which of the CPUs in a cpuset are participating in load
balancing, and which ones are isolated cpus.

Add a cpuset.isolcpus file with info on which cpus in a cpuset are
isolated CPUs.

This file is read-only for now. In the future we could extend things
so isolcpus can be changed at run time, for the root (system wide)
cpuset only.
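
For illustration only, a minimal user space reader of the new file;
the mount point and cpuset name below are just an example and depend
on where the cpuset controller is mounted:

#include <stdio.h>

int main(void)
{
        /* Hypothetical path; adjust to the local cpuset mount and cpuset name. */
        const char *path = "/sys/fs/cgroup/cpuset/rt/cpuset.isolcpus";
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return 1;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("isolated cpus in this cpuset: %s", buf);
        fclose(f);
        return 0;
}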

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 kernel/cpuset.c | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1ad63fa37cb4..19ad5d3377f8 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1563,6 +1563,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+   FILE_ISOLCPUS,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype 
*cft,
@@ -1704,6 +1705,23 @@ static ssize_t cpuset_write_resmask(struct 
kernfs_open_file *of,
return retval ?: nbytes;
 }
 
+static size_t cpuset_sprintf_isolcpus(char *s, ssize_t pos, struct cpuset *cs)
+{
+   cpumask_var_t my_isolated_cpus;
+   ssize_t count;
+   
+   if (!alloc_cpumask_var(&my_isolated_cpus, GFP_KERNEL))
+   return 0;
+
+   cpumask_and(my_isolated_cpus, cs->cpus_allowed, cpu_isolated_map);
+
+   count = cpulist_scnprintf(s, pos, my_isolated_cpus);
+
+   free_cpumask_var(my_isolated_cpus);
+
+   return count;
+}
+
 /*
  * These ascii lists should be read in a single call, by using a user
  * buffer large enough to hold the entire map.  If read in smaller
@@ -1738,6 +1756,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, 
void *v)
case FILE_EFFECTIVE_MEMLIST:
s += nodelist_scnprintf(s, count, cs->effective_mems);
break;
+   case FILE_ISOLCPUS:
+   s += cpuset_sprintf_isolcpus(s, count, cs);
+   break;
default:
ret = -EINVAL;
goto out_unlock;
@@ -1906,6 +1927,12 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},
 
+   {
+   .name = "isolcpus",
+   .seq_show = cpuset_common_seq_show,
+   .private = FILE_ISOLCPUS,
+   },
+
{ } /* terminate */
 };
 
-- 
1.9.3



[PATCH 2/2] cpusets,isolcpus: add file to show isolated cpus in cpuset

2015-02-25 Thread riel
From: Rik van Riel 

The previous patch makes the code skip over isolcpus when building
scheduler load balancing domains. This makes it hard for a user to
see which of the CPUs in a cpuset are participating in load
balancing, and which ones are isolated cpus.

Add a cpuset.isolcpus file with info on which cpus in a cpuset are
isolated CPUs.

This file is read-only for now. In the future we could extend things
so isolcpus can be changed at run time, for the root (system wide)
cpuset only.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: David Rientjes 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 kernel/cpuset.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b544e5229d99..94bf59588e23 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1563,6 +1563,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+   FILE_ISOLCPUS,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype 
*cft,
@@ -1704,6 +1705,20 @@ static ssize_t cpuset_write_resmask(struct 
kernfs_open_file *of,
return retval ?: nbytes;
 }
 
+static void cpuset_seq_print_isolcpus(struct seq_file *sf, struct cpuset *cs)
+{
+   cpumask_var_t my_isolated_cpus;
+
+   if (!alloc_cpumask_var(&my_isolated_cpus, GFP_KERNEL))
+   return;
+
+   cpumask_and(my_isolated_cpus, cs->cpus_allowed, cpu_isolated_map);
+
+   seq_printf(sf, "%*pbl\n", nodemask_pr_args(my_isolated_cpus));
+
+   free_cpumask_var(my_isolated_cpus);
+}
+
 /*
  * These ascii lists should be read in a single call, by using a user
  * buffer large enough to hold the entire map.  If read in smaller
@@ -1733,6 +1748,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, 
void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", 
nodemask_pr_args(&cs->effective_mems));
break;
+   case FILE_ISOLCPUS:
+   cpuset_seq_print_isolcpus(sf, cs);
+   break;
default:
ret = -EINVAL;
}
@@ -1893,6 +1911,12 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},
 
+   {
+   .name = "isolcpus",
+   .seq_show = cpuset_common_seq_show,
+   .private = FILE_ISOLCPUS,
+   },
+
{ } /* terminate */
 };
 
-- 
2.1.0



[PATCH 1/2] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-02-25 Thread riel
From: Rik van Riel 

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.
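
The mask arithmetic added to generate_sched_domains() boils down to
the following stand-alone sketch, using plain bitmasks instead of
cpumask_var_t (the CPU numbers are made up):

#include <stdio.h>

int main(void)
{
        unsigned long cpu_possible  = 0xff; /* CPUs 0-7 */
        unsigned long cpu_isolated  = 0x0c; /* isolcpus=2,3 */
        unsigned long cpuset_cpus   = 0x3f; /* this cpuset allows CPUs 0-5 */

        /* cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map) */
        unsigned long non_isolated = cpu_possible & ~cpu_isolated;

        /* cpumask_and() when the sched domain for the cpuset is built */
        unsigned long domain = cpuset_cpus & non_isolated;

        printf("load balanced domain: 0x%lx\n", domain); /* 0x33 -> CPUs 0,1,4,5 */
        return 0;
}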

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
Tested-by: David Rientjes 
---
 include/linux/sched.h |  2 ++
 kernel/cpuset.c   | 13 +++--
 kernel/sched/core.c   |  2 +-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..aeae02435717 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1038,6 +1038,8 @@ static inline struct cpumask *sched_domain_span(struct 
sched_domain *sd)
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
 
+extern cpumask_var_t cpu_isolated_map;
+
 /* Allocate an array of sched domains, for partition_sched_domains(). */
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1d1fe9361d29..b544e5229d99 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -625,6 +625,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
int csn;/* how many cpuset ptrs in csa so far */
int i, j, k;/* indices for partition finding loops */
cpumask_var_t *doms;/* resulting partition; i.e. sched domains */
+   cpumask_var_t non_isolated_cpus;  /* load balanced CPUs */
struct sched_domain_attr *dattr;  /* attributes for custom domains */
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
@@ -634,6 +635,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = NULL;
csa = NULL;
 
+   if (!alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL))
+   goto done;
+   cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -646,7 +651,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
-   cpumask_copy(doms[0], top_cpuset.effective_cpus);
+   cpumask_and(doms[0], top_cpuset.effective_cpus,
+non_isolated_cpus);
 
goto done;
}
@@ -669,7 +675,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * the corresponding sched domain.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
-   !is_sched_load_balance(cp))
+   !(is_sched_load_balance(cp) &&
+ cpumask_intersects(cp->cpus_allowed, non_isolated_cpus)))
continue;
 
if (is_sched_load_balance(cp))
@@ -751,6 +758,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 
if (apn == b->pn) {
cpumask_or(dp, dp, b->effective_cpus);
+   cpumask_and(dp, dp, non_isolated_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, 
b);
 
@@ -763,6 +771,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
BUG_ON(nslot != ndoms);
 
 done:
+   free_cpumask_var(non_isolated_cpus);
kfree(csa);
 
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e8a345..3db1beace19b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5812,7 +5812,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
 }
 
 /* cpus with isolated domains */
-static cpumask_var_t cpu_isolated_map;
+cpumask_var_t cpu_isolated_map;
 
 /* Setup the mask of cpus configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
-- 
2.1.0


[PATCH -v2 0/2] cpusets,isolcpus: resolve conflict between cpusets and isolcpus

2015-02-25 Thread riel
-v2 addresses the conflict David Rientjes spotted between my previous
patches and commit e8e6d97c9b ("cpuset: use %*pb[l] to print bitmaps
including cpumasks and nodemasks")

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.



[PATCH v4 RESEND 0/4] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-03-09 Thread riel
Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

This version fixes the UP compilation issue, in the same way done
for the other cpumasks.



[PATCH 4/4] cpuset,isolcpus: document relationship between cpusets & isolcpus

2015-03-09 Thread riel
From: Rik van Riel 

Document the subtly changed relationship between cpusets and isolcpus.
Turns out the old documentation did not match the code...

Signed-off-by: Rik van Riel 
Suggested-by: Peter Zijlstra 
---
 Documentation/cgroups/cpusets.txt | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index f2235a162529..fdf7dff3f607 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -392,8 +392,10 @@ Put simply, it costs less to balance between two smaller 
sched domains
 than one big one, but doing so means that overloads in one of the
 two domains won't be load balanced to the other one.
 
-By default, there is one sched domain covering all CPUs, except those
-marked isolated using the kernel boot time "isolcpus=" argument.
+By default, there is one sched domain covering all CPUs, including those
+marked isolated using the kernel boot time "isolcpus=" argument. However,
+the isolated CPUs will not participate in load balancing, and will not
+have tasks running on them unless explicitly assigned.
 
 This default load balancing across all CPUs is not well suited for
 the following two situations:
@@ -465,6 +467,10 @@ such partially load balanced cpusets, as they may be 
artificially
 constrained to some subset of the CPUs allowed to them, for lack of
 load balancing to the other CPUs.
 
+CPUs in "cpuset.isolcpus" were excluded from load balancing by the
+isolcpus= kernel boot option, and will never be load balanced regardless
+of the value of "cpuset.sched_load_balance" in any cpuset.
+
 1.7.1 sched_load_balance implementation details.
 
 
-- 
2.1.0



[PATCH 3/4] cpusets,isolcpus: add file to show isolated cpus in cpuset

2015-03-09 Thread riel
From: Rik van Riel 

The previous patch makes the code skip over isolcpus when building
scheduler load balancing domains. This makes it hard for a user to
see which of the CPUs in a cpuset are participating in load
balancing, and which ones are isolated cpus.

Add a cpuset.isolcpus file with info on which cpus in a cpuset are
isolated CPUs.

This file is read-only for now. In the future we could extend things
so isolcpus can be changed at run time, for the root (system wide)
cpuset only.

Acked-by: David Rientjes 
Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: David Rientjes 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 kernel/cpuset.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b544e5229d99..5462e1ca90bd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1563,6 +1563,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+   FILE_ISOLCPUS,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype 
*cft,
@@ -1704,6 +1705,16 @@ static ssize_t cpuset_write_resmask(struct 
kernfs_open_file *of,
return retval ?: nbytes;
 }
 
+/* protected by the lock in cpuset_common_seq_show */
+static cpumask_var_t print_isolated_cpus;
+
+static void cpuset_seq_print_isolcpus(struct seq_file *sf, struct cpuset *cs)
+{
+   cpumask_and(print_isolated_cpus, cs->cpus_allowed, cpu_isolated_map);
+
+   seq_printf(sf, "%*pbl\n", cpumask_pr_args(print_isolated_cpus));
+}
+
 /*
  * These ascii lists should be read in a single call, by using a user
  * buffer large enough to hold the entire map.  If read in smaller
@@ -1733,6 +1744,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, 
void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", 
nodemask_pr_args(&cs->effective_mems));
break;
+   case FILE_ISOLCPUS:
+   cpuset_seq_print_isolcpus(sf, cs);
+   break;
default:
ret = -EINVAL;
}
@@ -1893,6 +1907,12 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},
 
+   {
+   .name = "isolcpus",
+   .seq_show = cpuset_common_seq_show,
+   .private = FILE_ISOLCPUS,
+   },
+
{ } /* terminate */
 };
 
@@ -2070,6 +2090,8 @@ int __init cpuset_init(void)
BUG();
if (!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL))
BUG();
+   if (!alloc_cpumask_var(&print_isolated_cpus, GFP_KERNEL))
+   BUG();
 
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
-- 
2.1.0



[PATCH 2/4] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-03-09 Thread riel
From: Rik van Riel 

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
Tested-by: David Rientjes 
---
 kernel/cpuset.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1d1fe9361d29..b544e5229d99 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -625,6 +625,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
int csn;/* how many cpuset ptrs in csa so far */
int i, j, k;/* indices for partition finding loops */
cpumask_var_t *doms;/* resulting partition; i.e. sched domains */
+   cpumask_var_t non_isolated_cpus;  /* load balanced CPUs */
struct sched_domain_attr *dattr;  /* attributes for custom domains */
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
@@ -634,6 +635,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = NULL;
csa = NULL;
 
+   if (!alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL))
+   goto done;
+   cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -646,7 +651,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
-   cpumask_copy(doms[0], top_cpuset.effective_cpus);
+   cpumask_and(doms[0], top_cpuset.effective_cpus,
+non_isolated_cpus);
 
goto done;
}
@@ -669,7 +675,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * the corresponding sched domain.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
-   !is_sched_load_balance(cp))
+   !(is_sched_load_balance(cp) &&
+ cpumask_intersects(cp->cpus_allowed, non_isolated_cpus)))
continue;
 
if (is_sched_load_balance(cp))
@@ -751,6 +758,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 
if (apn == b->pn) {
cpumask_or(dp, dp, b->effective_cpus);
+   cpumask_and(dp, dp, non_isolated_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, 
b);
 
@@ -763,6 +771,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
BUG_ON(nslot != ndoms);
 
 done:
+   free_cpumask_var(non_isolated_cpus);
kfree(csa);
 
/*
-- 
2.1.0



[PATCH 1/4] sched,isolcpu: make cpu_isolated_map visible outside scheduler

2015-03-09 Thread riel
From: Rik van Riel 

Needed by the next patch. Also makes cpu_isolated_map present
when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
the other cpu masks.

At some point we may want to clean things up so cpumasks do
not exist in UP kernels. Maybe something for the CONFIG_TINY
crowd.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: David Rientjes 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 include/linux/sched.h | 2 ++
 kernel/sched/core.c   | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..ca365d79480c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -329,6 +329,8 @@ extern asmlinkage void schedule_tail(struct task_struct 
*prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern cpumask_var_t cpu_isolated_map;
+
 extern int runqueue_is_locked(int cpu);
 
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e8a345..b578bb23410b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -306,6 +306,9 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 95;
 
+/* cpus with isolated domains */
+cpumask_var_t cpu_isolated_map;
+
 /*
  * this_rq_lock - lock this runqueue and disable interrupts.
  */
@@ -5811,9 +5814,6 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
update_top_cache_domain(cpu);
 }
 
-/* cpus with isolated domains */
-static cpumask_var_t cpu_isolated_map;
-
 /* Setup the mask of cpus configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
 {
-- 
2.1.0



[PATCH v4 0/4] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-03-03 Thread riel
Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

This version fixes the UP compilation issue, in the same way done
for the other cpumasks.



[PATCH 4/4] cpuset,isolcpus: document relationship between cpusets & isolcpus

2015-03-03 Thread riel
From: Rik van Riel 

Document the subtly changed relationship between cpusets and isolcpus.
Turns out the old documentation did not match the code...

Signed-off-by: Rik van Riel 
Suggested-by: Peter Zijlstra 
---
 Documentation/cgroups/cpusets.txt | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt 
b/Documentation/cgroups/cpusets.txt
index f2235a162529..fdf7dff3f607 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -392,8 +392,10 @@ Put simply, it costs less to balance between two smaller 
sched domains
 than one big one, but doing so means that overloads in one of the
 two domains won't be load balanced to the other one.
 
-By default, there is one sched domain covering all CPUs, except those
-marked isolated using the kernel boot time "isolcpus=" argument.
+By default, there is one sched domain covering all CPUs, including those
+marked isolated using the kernel boot time "isolcpus=" argument. However,
+the isolated CPUs will not participate in load balancing, and will not
+have tasks running on them unless explicitly assigned.
 
 This default load balancing across all CPUs is not well suited for
 the following two situations:
@@ -465,6 +467,10 @@ such partially load balanced cpusets, as they may be 
artificially
 constrained to some subset of the CPUs allowed to them, for lack of
 load balancing to the other CPUs.
 
+CPUs in "cpuset.isolcpus" were excluded from load balancing by the
+isolcpus= kernel boot option, and will never be load balanced regardless
+of the value of "cpuset.sched_load_balance" in any cpuset.
+
 1.7.1 sched_load_balance implementation details.
 
 
-- 
2.1.0



[PATCH 1/4] sched,isolcpu: make cpu_isolated_map visible outside scheduler

2015-03-03 Thread riel
From: Rik van Riel 

Needed by the next patch. Also makes cpu_isolated_map present
when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
the other cpu masks.

At some point we may want to clean things up so cpumasks do
not exist in UP kernels. Maybe something for the CONFIG_TINY
crowd.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: David Rientjes 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 include/linux/sched.h | 2 ++
 kernel/sched/core.c   | 6 +++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d77432e14ff..ca365d79480c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -329,6 +329,8 @@ extern asmlinkage void schedule_tail(struct task_struct 
*prev);
 extern void init_idle(struct task_struct *idle, int cpu);
 extern void init_idle_bootup_task(struct task_struct *idle);
 
+extern cpumask_var_t cpu_isolated_map;
+
 extern int runqueue_is_locked(int cpu);
 
 #if defined(CONFIG_SMP) && defined(CONFIG_NO_HZ_COMMON)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f0f831e8a345..b578bb23410b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -306,6 +306,9 @@ __read_mostly int scheduler_running;
  */
 int sysctl_sched_rt_runtime = 95;
 
+/* cpus with isolated domains */
+cpumask_var_t cpu_isolated_map;
+
 /*
  * this_rq_lock - lock this runqueue and disable interrupts.
  */
@@ -5811,9 +5814,6 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
update_top_cache_domain(cpu);
 }
 
-/* cpus with isolated domains */
-static cpumask_var_t cpu_isolated_map;
-
 /* Setup the mask of cpus configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
 {
-- 
2.1.0



[PATCH 3/4] cpusets,isolcpus: add file to show isolated cpus in cpuset

2015-03-03 Thread riel
From: Rik van Riel 

The previous patch makes the code skip over isolcpus when building
scheduler load balancing domains. This makes it hard for a user to
see which of the CPUs in a cpuset are participating in load
balancing, and which ones are isolated cpus.

Add a cpuset.isolcpus file with info on which cpus in a cpuset are
isolated CPUs.

This file is read-only for now. In the future we could extend things
so isolcpus can be changed at run time, for the root (system wide)
cpuset only.

Acked-by: David Rientjes 
Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: David Rientjes 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 kernel/cpuset.c | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index b544e5229d99..5462e1ca90bd 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1563,6 +1563,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+   FILE_ISOLCPUS,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup_subsys_state *css, struct cftype 
*cft,
@@ -1704,6 +1705,16 @@ static ssize_t cpuset_write_resmask(struct 
kernfs_open_file *of,
return retval ?: nbytes;
 }
 
+/* protected by the lock in cpuset_common_seq_show */
+static cpumask_var_t print_isolated_cpus;
+
+static void cpuset_seq_print_isolcpus(struct seq_file *sf, struct cpuset *cs)
+{
+   cpumask_and(print_isolated_cpus, cs->cpus_allowed, cpu_isolated_map);
+
+   seq_printf(sf, "%*pbl\n", cpumask_pr_args(print_isolated_cpus));
+}
+
 /*
  * These ascii lists should be read in a single call, by using a user
  * buffer large enough to hold the entire map.  If read in smaller
@@ -1733,6 +1744,9 @@ static int cpuset_common_seq_show(struct seq_file *sf, 
void *v)
case FILE_EFFECTIVE_MEMLIST:
seq_printf(sf, "%*pbl\n", 
nodemask_pr_args(&cs->effective_mems));
break;
+   case FILE_ISOLCPUS:
+   cpuset_seq_print_isolcpus(sf, cs);
+   break;
default:
ret = -EINVAL;
}
@@ -1893,6 +1907,12 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},
 
+   {
+   .name = "isolcpus",
+   .seq_show = cpuset_common_seq_show,
+   .private = FILE_ISOLCPUS,
+   },
+
{ } /* terminate */
 };
 
@@ -2070,6 +2090,8 @@ int __init cpuset_init(void)
BUG();
if (!alloc_cpumask_var(&top_cpuset.effective_cpus, GFP_KERNEL))
BUG();
+   if (!alloc_cpumask_var(&print_isolated_cpus, GFP_KERNEL))
+   BUG();
 
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
-- 
2.1.0



[PATCH 2/4] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-03-03 Thread riel
From: Rik van Riel 

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: Mike Galbraith 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
Tested-by: David Rientjes 
---
 kernel/cpuset.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 1d1fe9361d29..b544e5229d99 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -625,6 +625,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
int csn;/* how many cpuset ptrs in csa so far */
int i, j, k;/* indices for partition finding loops */
cpumask_var_t *doms;/* resulting partition; i.e. sched domains */
+   cpumask_var_t non_isolated_cpus;  /* load balanced CPUs */
struct sched_domain_attr *dattr;  /* attributes for custom domains */
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
@@ -634,6 +635,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = NULL;
csa = NULL;
 
+   if (!alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL))
+   goto done;
+   cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -646,7 +651,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
-   cpumask_copy(doms[0], top_cpuset.effective_cpus);
+   cpumask_and(doms[0], top_cpuset.effective_cpus,
+non_isolated_cpus);
 
goto done;
}
@@ -669,7 +675,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * the corresponding sched domain.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
-   !is_sched_load_balance(cp))
+   !(is_sched_load_balance(cp) &&
+ cpumask_intersects(cp->cpus_allowed, non_isolated_cpus)))
continue;
 
if (is_sched_load_balance(cp))
@@ -751,6 +758,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 
if (apn == b->pn) {
cpumask_or(dp, dp, b->effective_cpus);
+   cpumask_and(dp, dp, non_isolated_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, 
b);
 
@@ -763,6 +771,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
BUG_ON(nslot != ndoms);
 
 done:
+   free_cpumask_var(non_isolated_cpus);
kfree(csa);
 
/*
-- 
2.1.0



[PATCH 5/5] kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest

2015-02-10 Thread riel
From: Rik van Riel 

The host kernel is not doing anything while the CPU is executing
a KVM guest VCPU, so it can be marked as being in an extended
quiescent state, identical to that used when running user space
code.

The only exception to that rule is when the host handles an
interrupt; that case is already covered by the irq code, which
calls rcu_irq_enter and rcu_irq_exit.

The guest_enter and guest_exit functions already switch vtime
accounting independent of context tracking. Leave those calls
where they are, instead of moving them into the context tracking
code.
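
Roughly, the resulting call ordering looks like the sketch below (a
simplified userspace model with stand-in functions; the real guest_enter
also handles the !vtime_accounting_enabled() case and only calls into
context tracking when it is enabled):

#include <stdio.h>

#define IN_GUEST 2	/* stand-in for the new ctx_state value */

/* Logging stand-ins for the real primitives, so the ordering can be seen. */
static void vtime_guest_enter(void)           { puts("vtime: guest accounting on"); }
static void vtime_guest_exit(void)            { puts("vtime: guest accounting off"); }
static void context_tracking_enter(int state) { puts("rcu: guest EQS begins"); }
static void context_tracking_exit(int state)  { puts("rcu: guest EQS ends"); }

static void guest_enter(void)
{
	vtime_guest_enter();			/* vtime switches here, as before */
	context_tracking_enter(IN_GUEST);	/* new: RCU treats the guest like user space */
}

static void guest_exit(void)
{
	context_tracking_exit(IN_GUEST);	/* leave the EQS first */
	vtime_guest_exit();
}

int main(void)
{
	guest_enter();
	puts("... vcpu runs; host irqs still bracket themselves with rcu_irq_enter/exit ...");
	guest_exit();
	return 0;
}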

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h   | 6 ++
 include/linux/context_tracking_state.h | 1 +
 include/linux/kvm_host.h   | 3 ++-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 954253283709..b65fd1420e53 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -80,10 +80,16 @@ static inline void guest_enter(void)
vtime_guest_enter(current);
else
current->flags |= PF_VCPU;
+
+   if (context_tracking_is_enabled())
+   context_tracking_enter(IN_GUEST);
 }
 
 static inline void guest_exit(void)
 {
+   if (context_tracking_is_enabled())
+   context_tracking_exit(IN_GUEST);
+
if (vtime_accounting_enabled())
vtime_guest_exit(current);
else
diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 72ab10fe1e46..90a7bab8779e 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -15,6 +15,7 @@ struct context_tracking {
enum ctx_state {
IN_KERNEL = 0,
IN_USER,
+   IN_GUEST,
} state;
 };
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 26f106022c88..c7828a6a9614 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -772,7 +772,8 @@ static inline void kvm_guest_enter(void)
 * one time slice). Lets treat guest mode as quiescent state, just like
 * we do with user-mode execution.
 */
-   rcu_virt_note_context_switch(smp_processor_id());
+   if (!context_tracking_cpu_is_enabled())
+   rcu_virt_note_context_switch(smp_processor_id());
 }
 
 static inline void kvm_guest_exit(void)
-- 
1.9.3



[PATCH 3/5] rcu,nohz: run vtime_user_enter/exit only when state == IN_USER

2015-02-10 Thread riel
From: Rik van Riel 

Only run vtime_user_enter, vtime_user_exit, and the user enter & exit
trace points when we are entering or exiting user state, respectively.

The KVM code in guest_enter and guest_exit already takes care of calling
vtime_guest_enter and vtime_guest_exit, respectively.

The RCU code only distinguishes between "idle" and "not idle or kernel".
There should be no need to add an additional (unused) state there.
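
The intended split can be modeled in a few lines of userspace C
(logging stand-ins only; the real context_tracking_enter also handles
irq masking, re-entrancy and the per-CPU active flag):

#include <stdio.h>

enum ctx_state { IN_KERNEL = 0, IN_USER, IN_GUEST };

/* Logging stand-ins for the real hooks. */
static void trace_user_enter(void) { puts("trace: user_enter"); }
static void vtime_user_enter(void) { puts("vtime: user accounting on"); }
static void rcu_user_enter(void)   { puts("rcu: extended quiescent state"); }

static void context_tracking_enter(enum ctx_state state)
{
	/* Only the user transition gets the tracepoint and user vtime... */
	if (state == IN_USER) {
		trace_user_enter();
		vtime_user_enter();
	}
	/* ...while RCU stops watching for both user and guest transitions. */
	rcu_user_enter();
}

int main(void)
{
	puts("-> user:");
	context_tracking_enter(IN_USER);
	puts("-> guest:");
	context_tracking_enter(IN_GUEST);
	return 0;
}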

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 12 
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 38e38aeac8b9..0e4e318d5ea4 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -77,7 +77,6 @@ void context_tracking_enter(enum ctx_state state)
local_irq_save(flags);
if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
-   trace_user_enter(0);
/*
 * At this stage, only low level arch entry code 
remains and
 * then we'll run in userspace. We can assume there 
won't be
@@ -85,7 +84,10 @@ void context_tracking_enter(enum ctx_state state)
 * user_exit() or rcu_irq_enter(). Let's remove RCU's 
dependency
 * on the tick.
 */
-   vtime_user_enter(current);
+   if (state == IN_USER) {
+   trace_user_enter(0);
+   vtime_user_enter(current);
+   }
rcu_user_enter();
}
/*
@@ -143,8 +145,10 @@ void context_tracking_exit(enum ctx_state state)
 * RCU core about that (ie: we may need the tick again).
 */
rcu_user_exit();
-   vtime_user_exit(current);
-   trace_user_exit(0);
+   if (state == IN_USER) {
+   vtime_user_exit(current);
+   trace_user_exit(0);
+   }
}
__this_cpu_write(context_tracking.state, IN_KERNEL);
}
-- 
1.9.3



[PATCH -v5 0/5] rcu,nohz,kvm: use RCU extended quiescent state when running KVM guest

2015-02-10 Thread riel
When running a KVM guest on a system with NOHZ_FULL enabled, and the
KVM guest running with idle=poll mode, we still get wakeups of the
rcuos/N threads.

This problem has already been solved for user space by telling the
RCU subsystem that the CPU is in an extended quiescent state while
running user space code.

This patch series extends that code a little bit to make it usable
to track KVM guest space, too.

I tested the code by booting a KVM guest with idle=poll, on a system
with NOHZ_FULL enabled on most CPUs, and a VCPU thread bound to a
CPU. In a 10 second interval, rcuos/N threads on other CPUs got woken
up several times, while the rcuos thread on the CPU running the bound
and always running VCPU thread never got woken up once.

Thanks to Christian Borntraeger, Paul McKenney, Paulo Bonzini,
Frederic Weisbecker, and Will Deacon for reviewing and improving
earlier versions of this patch series.



[PATCH 1/5] context_tracking: generalize context tracking APIs to support user and guest

2015-02-10 Thread riel
From: Rik van Riel 

Split out the mechanism from context_tracking_user_enter and
context_tracking_user_exit into context_tracking_enter and
context_tracking_exit. Leave the old functions in order to avoid
breaking ARM, which calls these functions from assembler code,
and cannot easily use C enum parameters.

Add the expected ctx_state as a parameter to context_tracking_enter and
context_tracking_exit, allowing the same functions to not just track
kernel <> user space switching, but also kernel <> guest transitions.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking.h |  8 +---
 kernel/context_tracking.c| 43 ++--
 2 files changed, 33 insertions(+), 18 deletions(-)

diff --git a/include/linux/context_tracking.h b/include/linux/context_tracking.h
index 37b81bd51ec0..954253283709 100644
--- a/include/linux/context_tracking.h
+++ b/include/linux/context_tracking.h
@@ -10,6 +10,8 @@
 #ifdef CONFIG_CONTEXT_TRACKING
 extern void context_tracking_cpu_set(int cpu);
 
+extern void context_tracking_enter(enum ctx_state state);
+extern void context_tracking_exit(enum ctx_state state);
 extern void context_tracking_user_enter(void);
 extern void context_tracking_user_exit(void);
 extern void __context_tracking_task_switch(struct task_struct *prev,
@@ -35,7 +37,7 @@ static inline enum ctx_state exception_enter(void)
return 0;
 
prev_ctx = this_cpu_read(context_tracking.state);
-   context_tracking_user_exit();
+   context_tracking_exit(prev_ctx);
 
return prev_ctx;
 }
@@ -43,8 +45,8 @@ static inline enum ctx_state exception_enter(void)
 static inline void exception_exit(enum ctx_state prev_ctx)
 {
if (context_tracking_is_enabled()) {
-   if (prev_ctx == IN_USER)
-   context_tracking_user_enter();
+   if (prev_ctx != IN_KERNEL)
+   context_tracking_enter(prev_ctx);
}
 }
 
diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 937ecdfdf258..38e38aeac8b9 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -39,15 +39,15 @@ void context_tracking_cpu_set(int cpu)
 }
 
 /**
- * context_tracking_user_enter - Inform the context tracking that the CPU is 
going to
- *   enter userspace mode.
+ * context_tracking_enter - Inform the context tracking that the CPU is going
+ *  enter user or guest space mode.
  *
  * This function must be called right before we switch from the kernel
- * to userspace, when it's guaranteed the remaining kernel instructions
- * to execute won't use any RCU read side critical section because this
- * function sets RCU in extended quiescent state.
+ * to user or guest space, when it's guaranteed the remaining kernel
+ * instructions to execute won't use any RCU read side critical section
+ * because this function sets RCU in extended quiescent state.
  */
-void context_tracking_user_enter(void)
+void context_tracking_enter(enum ctx_state state)
 {
unsigned long flags;
 
@@ -75,7 +75,7 @@ void context_tracking_user_enter(void)
WARN_ON_ONCE(!current->mm);
 
local_irq_save(flags);
-   if ( __this_cpu_read(context_tracking.state) != IN_USER) {
+   if ( __this_cpu_read(context_tracking.state) != state) {
if (__this_cpu_read(context_tracking.active)) {
trace_user_enter(0);
/*
@@ -101,24 +101,31 @@ void context_tracking_user_enter(void)
 * OTOH we can spare the calls to vtime and RCU when 
context_tracking.active
 * is false because we know that CPU is not tickless.
 */
-   __this_cpu_write(context_tracking.state, IN_USER);
+   __this_cpu_write(context_tracking.state, state);
}
local_irq_restore(flags);
 }
+NOKPROBE_SYMBOL(context_tracking_enter);
+
+void context_tracking_user_enter(void)
+{
+   context_tracking_enter(IN_USER);
+}
 NOKPROBE_SYMBOL(context_tracking_user_enter);
 
 /**
- * context_tracking_user_exit - Inform the context tracking that the CPU is
- *  exiting userspace mode and entering the kernel.
+ * context_tracking_exit - Inform the context tracking that the CPU is
+ * exiting user or guest mode and entering the kernel.
  *
- * This function must be called after we entered the kernel from userspace
- * before any use of RCU read side critical section. This potentially include
- * any high level kernel code like syscalls, exceptions, signal handling, 
etc...
+ * This function must be called after we entered the kernel from user or
+ * guest space before any use of RCU read side critical section. This
+ * potentially include any high level kernel code like syscalls, exceptions,
+ * signal handling, etc...
  *
  * This call supports re-entrancy. This way it can be called fro

[PATCH 4/5] nohz,kvm: export context_tracking_user_enter/exit

2015-02-10 Thread riel
From: Rik van Riel 

Export context_tracking_user_enter/exit so it can be used by KVM.

Signed-off-by: Rik van Riel 
---
 kernel/context_tracking.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/context_tracking.c b/kernel/context_tracking.c
index 0e4e318d5ea4..5bdf1a342ab3 100644
--- a/kernel/context_tracking.c
+++ b/kernel/context_tracking.c
@@ -108,6 +108,7 @@ void context_tracking_enter(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_enter);
+EXPORT_SYMBOL_GPL(context_tracking_enter);
 
 void context_tracking_user_enter(void)
 {
@@ -155,6 +156,7 @@ void context_tracking_exit(enum ctx_state state)
local_irq_restore(flags);
 }
 NOKPROBE_SYMBOL(context_tracking_exit);
+EXPORT_SYMBOL_GPL(context_tracking_exit);
 
 void context_tracking_user_exit(void)
 {
-- 
1.9.3



[PATCH 2/5] nohz: add stub context_tracking_is_enabled

2015-02-10 Thread riel
From: Rik van Riel 

Code elsewhere in this series is conditional on whether or not context
tracking is enabled, so we need a stub function that reports context
tracking as disabled when CONFIG_CONTEXT_TRACKING is not set.

Signed-off-by: Rik van Riel 
---
 include/linux/context_tracking_state.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/context_tracking_state.h 
b/include/linux/context_tracking_state.h
index 97a81225d037..72ab10fe1e46 100644
--- a/include/linux/context_tracking_state.h
+++ b/include/linux/context_tracking_state.h
@@ -39,6 +39,8 @@ static inline bool context_tracking_in_user(void)
 #else
 static inline bool context_tracking_in_user(void) { return false; }
 static inline bool context_tracking_active(void) { return false; }
+static inline bool context_tracking_is_enabled(void) { return false; }
+static inline bool context_tracking_cpu_is_enabled(void) { return false; }
 #endif /* CONFIG_CONTEXT_TRACKING */
 
 #endif
-- 
1.9.3



[PATCH 0/2] cpusets,isolcpus: resolve conflict between cpusets and isolcpus

2015-02-23 Thread riel
Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.



[PATCH 1/2] cpusets,isolcpus: exclude isolcpus from load balancing in cpusets

2015-02-23 Thread riel
From: Rik van Riel 

Ensure that cpus specified with the isolcpus= boot commandline
option stay outside of the load balancing in the kernel scheduler.

Operations like load balancing can introduce unwanted latencies,
which is exactly what the isolcpus= commandline is there to prevent.

Previously, simply creating a new cpuset, without even touching the
cpuset.cpus field inside the new cpuset, would undo the effects of
isolcpus=, by creating a scheduler domain spanning the whole system,
and setting up load balancing inside that domain. The cpuset root
cpuset.cpus file is read-only, so there was not even a way to undo
that effect.

This does not impact the majority of cpusets users, since isolcpus=
is a fairly specialized feature used for realtime purposes.

Cc: Peter Zijlstra 
Cc: Clark Williams 
Cc: Li Zefan 
Cc: Ingo Molnar 
Cc: Luiz Capitulino 
Cc: cgro...@vger.kernel.org
Signed-off-by: Rik van Riel 
---
 include/linux/sched.h |  2 ++
 kernel/cpuset.c   | 13 +++--
 kernel/sched/core.c   |  2 +-
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index cb5cdc777c8a..af1b32a5ddcc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1038,6 +1038,8 @@ static inline struct cpumask *sched_domain_span(struct 
sched_domain *sd)
 extern void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
struct sched_domain_attr *dattr_new);
 
+extern cpumask_var_t cpu_isolated_map;
+
 /* Allocate an array of sched domains, for partition_sched_domains(). */
 cpumask_var_t *alloc_sched_domains(unsigned int ndoms);
 void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 64b257f6bca2..1ad63fa37cb4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -625,6 +625,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
int csn;/* how many cpuset ptrs in csa so far */
int i, j, k;/* indices for partition finding loops */
cpumask_var_t *doms;/* resulting partition; i.e. sched domains */
+   cpumask_var_t non_isolated_cpus;  /* load balanced CPUs */
struct sched_domain_attr *dattr;  /* attributes for custom domains */
int ndoms = 0;  /* number of sched domains in result */
int nslot;  /* next empty doms[] struct cpumask slot */
@@ -634,6 +635,10 @@ static int generate_sched_domains(cpumask_var_t **domains,
dattr = NULL;
csa = NULL;
 
+   if (!alloc_cpumask_var(&non_isolated_cpus, GFP_KERNEL))
+   goto done;
+   cpumask_andnot(non_isolated_cpus, cpu_possible_mask, cpu_isolated_map);
+
/* Special case for the 99% of systems with one, full, sched domain */
if (is_sched_load_balance(&top_cpuset)) {
ndoms = 1;
@@ -646,7 +651,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
*dattr = SD_ATTR_INIT;
update_domain_attr_tree(dattr, &top_cpuset);
}
-   cpumask_copy(doms[0], top_cpuset.effective_cpus);
+   cpumask_and(doms[0], top_cpuset.effective_cpus,
+non_isolated_cpus);
 
goto done;
}
@@ -669,7 +675,8 @@ static int generate_sched_domains(cpumask_var_t **domains,
 * the corresponding sched domain.
 */
if (!cpumask_empty(cp->cpus_allowed) &&
-   !is_sched_load_balance(cp))
+   !(is_sched_load_balance(cp) &&
+ cpumask_intersects(cp->cpus_allowed, non_isolated_cpus)))
continue;
 
if (is_sched_load_balance(cp))
@@ -751,6 +758,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
 
if (apn == b->pn) {
cpumask_or(dp, dp, b->effective_cpus);
+   cpumask_and(dp, dp, non_isolated_cpus);
if (dattr)
update_domain_attr_tree(dattr + nslot, 
b);
 
@@ -763,6 +771,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
BUG_ON(nslot != ndoms);
 
 done:
+   free_cpumask_var(non_isolated_cpus);
kfree(csa);
 
/*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 97fe79cf613e..6069f3703240 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5831,7 +5831,7 @@ cpu_attach_domain(struct sched_domain *sd, struct 
root_domain *rd, int cpu)
 }
 
 /* cpus with isolated domains */
-static cpumask_var_t cpu_isolated_map;
+cpumask_var_t cpu_isolated_map;
 
 /* Setup the mask of cpus configured for isolated domains */
 static int __init isolated_cpu_setup(char *str)
-- 
1.9.3


[PATCH 7/9] x86/fpu: rename lazy restore functions to "register state valid"

2016-10-04 Thread riel
From: Rik van Riel 

Name the functions after the state they track, rather than the function
they currently enable. This should make it more obvious when we use the
fpu_register_state_valid function for something else in the future.
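
The invariant behind the new name can be modeled in userspace
(illustrative only; the real code uses a per-CPU variable and the
kernel's struct fpu, and the names below mirror the patch):

#include <assert.h>
#include <stdio.h>

#define NR_CPUS 4

struct fpu {
	int last_cpu;	/* CPU whose registers last held this context, -1 if none */
};

/* Per-CPU pointer to the context whose state is live in that CPU's registers. */
static struct fpu *fpu_fpregs_owner_ctx[NR_CPUS];

static int fpregs_state_valid(struct fpu *fpu, int cpu)
{
	return fpu_fpregs_owner_ctx[cpu] == fpu && fpu->last_cpu == cpu;
}

static void __cpu_invalidate_fpregs_state(int cpu)
{
	fpu_fpregs_owner_ctx[cpu] = NULL;	/* invalidate from the CPU side */
}

static void __fpu_invalidate_fpregs_state(struct fpu *fpu)
{
	fpu->last_cpu = -1;			/* invalidate from the task side */
}

int main(void)
{
	struct fpu task_fpu = { .last_cpu = 2 };

	fpu_fpregs_owner_ctx[2] = &task_fpu;
	assert(fpregs_state_valid(&task_fpu, 2));	/* restore can be skipped */

	__fpu_invalidate_fpregs_state(&task_fpu);	/* in-memory state changed */
	assert(!fpregs_state_valid(&task_fpu, 2));

	task_fpu.last_cpu = 2;
	__cpu_invalidate_fpregs_state(2);		/* registers clobbered on the CPU */
	assert(!fpregs_state_valid(&task_fpu, 2));

	puts("register state validity model OK");
	return 0;
}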

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu/internal.h | 26 --
 arch/x86/kernel/fpu/core.c  |  4 ++--
 arch/x86/kernel/smpboot.c   |  2 +-
 3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 499d6ed0e376..d2cfe16dd9fa 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -479,18 +479,32 @@ extern int copy_fpstate_to_sigframe(void __user *buf, 
void __user *fp, int size)
 DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 
 /*
+ * The in-register FPU state for an FPU context on a CPU is assumed to be
+ * valid if the fpu->last_cpu matches the CPU, and the fpu_fpregs_owner_ctx
+ * matches the FPU.
+ *
+ * If the FPU register state is valid, the kernel can skip restoring the
+ * FPU state from memory.
+ *
+ * Any code that clobbers the FPU registers or updates the in-memory
+ * FPU state for a task MUST let the rest of the kernel know that the
+ * FPU registers are no longer valid for this task. Calling either of
+ * these two invalidate functions is enough, use whichever is convenient.
+ *
  * Must be run with preemption disabled: this clears the fpu_fpregs_owner_ctx,
  * on this CPU.
- *
- * This will disable any lazy FPU state restore of the current FPU state,
- * but if the current thread owns the FPU, it will still be saved by.
  */
-static inline void __cpu_disable_lazy_restore(unsigned int cpu)
+static inline void __cpu_invalidate_fpregs_state(unsigned int cpu)
 {
per_cpu(fpu_fpregs_owner_ctx, cpu) = NULL;
 }
 
-static inline int fpu_want_lazy_restore(struct fpu *fpu, unsigned int cpu)
+static inline void __fpu_invalidate_fpregs_state(struct fpu *fpu)
+{
+   fpu->last_cpu = -1;
+}
+
+static inline int fpregs_state_valid(struct fpu *fpu, unsigned int cpu)
 {
return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == 
fpu->last_cpu;
 }
@@ -588,7 +602,7 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu 
*new_fpu, int cpu)
} else {
old_fpu->last_cpu = -1;
if (fpu.preload) {
-   if (fpu_want_lazy_restore(new_fpu, cpu))
+   if (fpregs_state_valid(new_fpu, cpu))
fpu.preload = 0;
else
prefetch(&new_fpu->state);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 6a37d525bdbe..25a45ddfdbcf 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -336,7 +336,7 @@ void fpu__activate_fpstate_write(struct fpu *fpu)
 
if (fpu->fpstate_active) {
/* Invalidate any lazy state: */
-   fpu->last_cpu = -1;
+   __fpu_invalidate_fpregs_state(fpu);
} else {
fpstate_init(&fpu->state);
trace_x86_fpu_init_state(fpu);
@@ -379,7 +379,7 @@ void fpu__current_fpstate_write_begin(void)
 * ensures we will not be lazy and skip a XRSTOR in the
 * future.
 */
-   fpu->last_cpu = -1;
+   __fpu_invalidate_fpregs_state(fpu);
 }
 
 /*
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 42a93621f5b0..ca4c4ca2f6af 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -,7 +,7 @@ int native_cpu_up(unsigned int cpu, struct task_struct 
*tidle)
return err;
 
/* the FPU context is blank, nobody can own it */
-   __cpu_disable_lazy_restore(cpu);
+   __cpu_invalidate_fpregs_state(cpu);
 
common_cpu_up(cpu, tidle);
 
-- 
2.7.4



[PATCH 3/9] x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction

2016-10-04 Thread riel
From: Andy Lutomirski 

Now that lazy mode is gone, we don't need to distinguish which
xfeatures require eager mode.

Signed-off-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/fpu/xstate.h | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h 
b/arch/x86/include/asm/fpu/xstate.h
index d4957ac72b48..1b2799e0699a 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -21,20 +21,16 @@
 /* Supervisor features */
 #define XFEATURE_MASK_SUPERVISOR (XFEATURE_MASK_PT)
 
-/* Supported features which support lazy state saving */
-#define XFEATURE_MASK_LAZY (XFEATURE_MASK_FP | \
+/* All currently supported features */
+#define XCNTXT_MASK(XFEATURE_MASK_FP | \
 XFEATURE_MASK_SSE | \
 XFEATURE_MASK_YMM | \
 XFEATURE_MASK_OPMASK | \
 XFEATURE_MASK_ZMM_Hi256 | \
 XFEATURE_MASK_Hi16_ZMM  | \
-XFEATURE_MASK_PKRU)
-
-/* Supported features which require eager state saving */
-#define XFEATURE_MASK_EAGER(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR)
-
-/* All currently supported features */
-#define XCNTXT_MASK(XFEATURE_MASK_LAZY | XFEATURE_MASK_EAGER)
+XFEATURE_MASK_PKRU | \
+XFEATURE_MASK_BNDREGS | \
+XFEATURE_MASK_BNDCSR)
 
 #ifdef CONFIG_X86_64
 #define REX_PREFIX "0x48, "
-- 
2.7.4



[PATCH 9/9] x86/fpu: split old & new fpu code paths

2016-10-04 Thread riel
From: Rik van Riel 

Now that CR0.TS is no longer being manipulated, we can simplify
switch_fpu_prepare by no longer nesting the handling of the new
fpu inside the two branches for the old FPU.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu/internal.h | 22 --
 1 file changed, 8 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index aa7a117b43f8..ef52935f8a17 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -581,23 +581,17 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu 
*new_fpu, int cpu)
/* But leave fpu_fpregs_owner_ctx! */
old_fpu->fpregs_active = 0;
trace_x86_fpu_regs_deactivated(old_fpu);
+   } else
+   old_fpu->last_cpu = -1;
 
-   /* Don't change CR0.TS if we just switch! */
-   if (fpu.preload) {
-   fpregs_activate(new_fpu);
-   trace_x86_fpu_regs_activated(new_fpu);
+   if (fpu.preload) {
+   if (fpregs_state_valid(new_fpu, cpu))
+   fpu.preload = 0;
+   else
prefetch(&new_fpu->state);
-   }
-   } else {
-   old_fpu->last_cpu = -1;
-   if (fpu.preload) {
-   if (fpregs_state_valid(new_fpu, cpu))
-   fpu.preload = 0;
-   else
-   prefetch(&new_fpu->state);
-   fpregs_activate(new_fpu);
-   }
+   fpregs_activate(new_fpu);
}
+
return fpu;
 }
 
-- 
2.7.4



[PATCH 8/9] x86/fpu: remove __fpregs_(de)activate

2016-10-04 Thread riel
From: Rik van Riel 

Now that fpregs_activate and fpregs_deactivate do nothing except
call the double underscored versions of themselves, we can get
rid of the double underscore versions.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu/internal.h | 25 +++--
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index d2cfe16dd9fa..aa7a117b43f8 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -509,8 +509,11 @@ static inline int fpregs_state_valid(struct fpu *fpu, 
unsigned int cpu)
return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == 
fpu->last_cpu;
 }
 
-
-static inline void __fpregs_deactivate(struct fpu *fpu)
+/*
+ * These generally need preemption protection to work,
+ * do try to avoid using these on their own.
+ */
+static inline void fpregs_deactivate(struct fpu *fpu)
 {
WARN_ON_FPU(!fpu->fpregs_active);
 
@@ -519,7 +522,7 @@ static inline void __fpregs_deactivate(struct fpu *fpu)
trace_x86_fpu_regs_deactivated(fpu);
 }
 
-static inline void __fpregs_activate(struct fpu *fpu)
+static inline void fpregs_activate(struct fpu *fpu)
 {
WARN_ON_FPU(fpu->fpregs_active);
 
@@ -544,20 +547,6 @@ static inline int fpregs_active(void)
 }
 
 /*
- * These generally need preemption protection to work,
- * do try to avoid using these on their own.
- */
-static inline void fpregs_activate(struct fpu *fpu)
-{
-   __fpregs_activate(fpu);
-}
-
-static inline void fpregs_deactivate(struct fpu *fpu)
-{
-   __fpregs_deactivate(fpu);
-}
-
-/*
  * FPU state switching for scheduling.
  *
  * This is a two-stage process:
@@ -595,7 +584,7 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu 
*new_fpu, int cpu)
 
/* Don't change CR0.TS if we just switch! */
if (fpu.preload) {
-   __fpregs_activate(new_fpu);
+   fpregs_activate(new_fpu);
trace_x86_fpu_regs_activated(new_fpu);
prefetch(&new_fpu->state);
}
-- 
2.7.4



[PATCH 5/9] x86/fpu: remove fpu.counter

2016-10-04 Thread riel
From: Rik van Riel 

With the lazy FPU code gone, we no longer use the counter field
in struct fpu for anything. Get rid of it.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/fpu/internal.h |  3 ---
 arch/x86/include/asm/fpu/types.h| 11 ---
 arch/x86/include/asm/trace/fpu.h|  5 +
 arch/x86/kernel/fpu/core.c  |  3 ---
 4 files changed, 1 insertion(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 7801d32347a2..499d6ed0e376 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -581,16 +581,13 @@ switch_fpu_prepare(struct fpu *old_fpu, struct fpu 
*new_fpu, int cpu)
 
/* Don't change CR0.TS if we just switch! */
if (fpu.preload) {
-   new_fpu->counter++;
__fpregs_activate(new_fpu);
trace_x86_fpu_regs_activated(new_fpu);
prefetch(&new_fpu->state);
}
} else {
-   old_fpu->counter = 0;
old_fpu->last_cpu = -1;
if (fpu.preload) {
-   new_fpu->counter++;
if (fpu_want_lazy_restore(new_fpu, cpu))
fpu.preload = 0;
else
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 48df486b02f9..e31332d6f0e8 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -322,17 +322,6 @@ struct fpu {
unsigned char   fpregs_active;
 
/*
-* @counter:
-*
-* This counter contains the number of consecutive context switches
-* during which the FPU stays used. If this is over a threshold, the
-* lazy FPU restore logic becomes eager, to save the trap overhead.
-* This is an unsigned char so that after 256 iterations the counter
-* wraps and the context switch behavior turns lazy again; this is to
-* deal with bursty apps that only use the FPU for a short time:
-*/
-   unsigned char   counter;
-   /*
 * @state:
 *
 * In-memory copy of all FPU registers that we save/restore
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 9217ab1f5bf6..342e59789fcd 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -14,7 +14,6 @@ DECLARE_EVENT_CLASS(x86_fpu,
__field(struct fpu *, fpu)
__field(bool, fpregs_active)
__field(bool, fpstate_active)
-   __field(int, counter)
__field(u64, xfeatures)
__field(u64, xcomp_bv)
),
@@ -23,17 +22,15 @@ DECLARE_EVENT_CLASS(x86_fpu,
__entry->fpu= fpu;
__entry->fpregs_active  = fpu->fpregs_active;
__entry->fpstate_active = fpu->fpstate_active;
-   __entry->counter= fpu->counter;
if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
__entry->xfeatures = fpu->state.xsave.header.xfeatures;
__entry->xcomp_bv  = fpu->state.xsave.header.xcomp_bv;
}
),
-   TP_printk("x86/fpu: %p fpregs_active: %d fpstate_active: %d counter: %d 
xfeatures: %llx xcomp_bv: %llx",
+   TP_printk("x86/fpu: %p fpregs_active: %d fpstate_active: %d xfeatures: 
%llx xcomp_bv: %llx",
__entry->fpu,
__entry->fpregs_active,
__entry->fpstate_active,
-   __entry->counter,
__entry->xfeatures,
__entry->xcomp_bv
)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 036e14fe3b77..6a37d525bdbe 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -222,7 +222,6 @@ EXPORT_SYMBOL_GPL(fpstate_init);
 
 int fpu__copy(struct fpu *dst_fpu, struct fpu *src_fpu)
 {
-   dst_fpu->counter = 0;
dst_fpu->fpregs_active = 0;
dst_fpu->last_cpu = -1;
 
@@ -430,7 +429,6 @@ void fpu__restore(struct fpu *fpu)
trace_x86_fpu_before_restore(fpu);
fpregs_activate(fpu);
copy_kernel_to_fpregs(&fpu->state);
-   fpu->counter++;
trace_x86_fpu_after_restore(fpu);
kernel_fpu_enable();
 }
@@ -448,7 +446,6 @@ EXPORT_SYMBOL_GPL(fpu__restore);
 void fpu__drop(struct fpu *fpu)
 {
preempt_disable();
-   fpu->counter = 0;
 
if (fpu->fpregs_active) {
/* Ignore delayed exceptions from user space */
-- 
2.7.4



[PATCH 6/9] x86/fpu,kvm: remove kvm vcpu->fpu_counter

2016-10-04 Thread riel
From: Rik van Riel 

With the removal of the lazy FPU code, this field is no longer used.
Get rid of it.

Signed-off-by: Rik van Riel 
---
 arch/x86/kvm/x86.c   | 4 +---
 include/linux/kvm_host.h | 1 -
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 59d7761fd6df..2c7e775d7295 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7348,10 +7348,8 @@ void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
-   if (!vcpu->guest_fpu_loaded) {
-   vcpu->fpu_counter = 0;
+   if (!vcpu->guest_fpu_loaded)
return;
-   }
 
vcpu->guest_fpu_loaded = 0;
copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9c28b4d4c90b..4e6905cd1e8e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -224,7 +224,6 @@ struct kvm_vcpu {
 
int fpu_active;
int guest_fpu_loaded, guest_xcr0_loaded;
-   unsigned char fpu_counter;
struct swait_queue_head wq;
struct pid *pid;
int sigset_active;
-- 
2.7.4



[PATCH 0/9] x86/fpu: remove lazy FPU mode & various FPU cleanups

2016-10-04 Thread riel
This series removes lazy FPU mode, and cleans up various bits
and pieces around the FPU code.

I have run this through a basic floating point test that involves
about 1.5 billion context switches and 45 minutes of swapping at
250MB/second.

This seems to tease out bugs fairly well, though I would not mind
an actual floating point test suite...



[PATCH 4/9] x86/fpu: Remove use_eager_fpu()

2016-10-04 Thread riel
From: Andy Lutomirski 

This removes all the obvious code paths that depend on lazy FPU mode.
It shouldn't change the generated code at all.

Signed-off-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/crypto/crc32c-intel_glue.c | 17 -
 arch/x86/include/asm/fpu/internal.h | 34 +
 arch/x86/kernel/fpu/core.c  | 38 +
 arch/x86/kernel/fpu/signal.c|  8 +++-
 arch/x86/kernel/fpu/xstate.c|  9 -
 arch/x86/kvm/cpuid.c|  4 +---
 arch/x86/kvm/x86.c  | 10 --
 7 files changed, 14 insertions(+), 106 deletions(-)

diff --git a/arch/x86/crypto/crc32c-intel_glue.c 
b/arch/x86/crypto/crc32c-intel_glue.c
index 715399b14ed7..c194d5717ae5 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -48,21 +48,13 @@
 #ifdef CONFIG_X86_64
 /*
  * use carryless multiply version of crc32c when buffer
- * size is >= 512 (when eager fpu is enabled) or
- * >= 1024 (when eager fpu is disabled) to account
+ * size is >= 512 to account
  * for fpu state save/restore overhead.
  */
-#define CRC32C_PCL_BREAKEVEN_EAGERFPU  512
-#define CRC32C_PCL_BREAKEVEN_NOEAGERFPU1024
+#define CRC32C_PCL_BREAKEVEN   512
 
 asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
unsigned int crc_init);
-static int crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_EAGERFPU;
-#define set_pcl_breakeven_point()  \
-do {   \
-   if (!use_eager_fpu())   \
-   crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU; \
-} while (0)
 #endif /* CONFIG_X86_64 */
 
 static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t 
length)
@@ -185,7 +177,7 @@ static int crc32c_pcl_intel_update(struct shash_desc *desc, 
const u8 *data,
 * use faster PCL version if datasize is large enough to
 * overcome kernel fpu state save/restore overhead
 */
-   if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) {
+   if (len >= CRC32C_PCL_BREAKEVEN && irq_fpu_usable()) {
kernel_fpu_begin();
*crcp = crc_pcl(data, len, *crcp);
kernel_fpu_end();
@@ -197,7 +189,7 @@ static int crc32c_pcl_intel_update(struct shash_desc *desc, 
const u8 *data,
 static int __crc32c_pcl_intel_finup(u32 *crcp, const u8 *data, unsigned int 
len,
u8 *out)
 {
-   if (len >= crc32c_pcl_breakeven && irq_fpu_usable()) {
+   if (len >= CRC32C_PCL_BREAKEVEN && irq_fpu_usable()) {
kernel_fpu_begin();
*(__le32 *)out = ~cpu_to_le32(crc_pcl(data, len, *crcp));
kernel_fpu_end();
@@ -256,7 +248,6 @@ static int __init crc32c_intel_mod_init(void)
alg.update = crc32c_pcl_intel_update;
alg.finup = crc32c_pcl_intel_finup;
alg.digest = crc32c_pcl_intel_digest;
-   set_pcl_breakeven_point();
}
 #endif
return crypto_register_shash(&alg);
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 8852e3afa1ad..7801d32347a2 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -60,11 +60,6 @@ extern u64 fpu__get_supported_xfeatures_mask(void);
 /*
  * FPU related CPU feature flag helper routines:
  */
-static __always_inline __pure bool use_eager_fpu(void)
-{
-   return true;
-}
-
 static __always_inline __pure bool use_xsaveopt(void)
 {
return static_cpu_has(X86_FEATURE_XSAVEOPT);
@@ -501,24 +496,6 @@ static inline int fpu_want_lazy_restore(struct fpu *fpu, 
unsigned int cpu)
 }
 
 
-/*
- * Wrap lazy FPU TS handling in a 'hw fpregs activation/deactivation'
- * idiom, which is then paired with the sw-flag (fpregs_active) later on:
- */
-
-static inline void __fpregs_activate_hw(void)
-{
-   if (!use_eager_fpu())
-   clts();
-}
-
-static inline void __fpregs_deactivate_hw(void)
-{
-   if (!use_eager_fpu())
-   stts();
-}
-
-/* Must be paired with an 'stts' (fpregs_deactivate_hw()) after! */
 static inline void __fpregs_deactivate(struct fpu *fpu)
 {
WARN_ON_FPU(!fpu->fpregs_active);
@@ -528,7 +505,6 @@ static inline void __fpregs_deactivate(struct fpu *fpu)
trace_x86_fpu_regs_deactivated(fpu);
 }
 
-/* Must be paired with a 'clts' (fpregs_activate_hw()) before! */
 static inline void __fpregs_activate(struct fpu *fpu)
 {
WARN_ON_FPU(fpu->fpregs_active);
@@ -554,22 +530,17 @@ static inline int fpregs_active(void)
 }
 
 /*
- * Encapsulate the CR0.TS handling together with the
- * software flag.
- *
  * These generally need preemption protection to work,
  * do try to avoid using these on

[PATCH 1/9] x86/crypto: Remove X86_FEATURE_EAGER_FPU ifdef from the crc32c code

2016-10-04 Thread riel
From: Rik van Riel 

From: Andy Lutomirski 

The crypto code was checking both use_eager_fpu() and
defined(X86_FEATURE_EAGER_FPU).  The latter was nonsensical, so
remove it.  This will avoid breakage when we remove
X86_FEATURE_EAGER_FPU.

Signed-off-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/crypto/crc32c-intel_glue.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/arch/x86/crypto/crc32c-intel_glue.c 
b/arch/x86/crypto/crc32c-intel_glue.c
index 0857b1a1de3b..715399b14ed7 100644
--- a/arch/x86/crypto/crc32c-intel_glue.c
+++ b/arch/x86/crypto/crc32c-intel_glue.c
@@ -58,16 +58,11 @@
 asmlinkage unsigned int crc_pcl(const u8 *buffer, int len,
unsigned int crc_init);
 static int crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_EAGERFPU;
-#if defined(X86_FEATURE_EAGER_FPU)
 #define set_pcl_breakeven_point()  \
 do {   \
if (!use_eager_fpu())   \
crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU; \
 } while (0)
-#else
-#define set_pcl_breakeven_point()  \
-   (crc32c_pcl_breakeven = CRC32C_PCL_BREAKEVEN_NOEAGERFPU)
-#endif
 #endif /* CONFIG_X86_64 */
 
 static u32 crc32c_intel_le_hw_byte(u32 crc, unsigned char const *data, size_t 
length)
-- 
2.7.4



[PATCH 2/9] x86/fpu: Hard-disable lazy fpu mode

2016-10-04 Thread riel
From: Andy Lutomirski 

Since commit 58122bf1d856 ("x86/fpu: Default eagerfpu=on on all
CPUs") in Linux 4.6, eager FPU mode has been the default on all x86
systems, and no one has reported any regressions.

This patch removes the ability to enable lazy mode: use_eager_fpu()
becomes "return true" and all of the FPU mode selection machinery is
removed.

Signed-off-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/cpufeatures.h  |  2 +-
 arch/x86/include/asm/fpu/internal.h |  2 +-
 arch/x86/kernel/fpu/init.c  | 91 ++---
 3 files changed, 5 insertions(+), 90 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index 1188bc849ee3..b212b862314a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -104,7 +104,7 @@
 #define X86_FEATURE_EXTD_APICID( 3*32+26) /* has extended APICID (8 
bits) */
 #define X86_FEATURE_AMD_DCM ( 3*32+27) /* multi-node processor */
 #define X86_FEATURE_APERFMPERF ( 3*32+28) /* APERFMPERF */
-#define X86_FEATURE_EAGER_FPU  ( 3*32+29) /* "eagerfpu" Non lazy FPU restore */
+/* free, was #define X86_FEATURE_EAGER_FPU ( 3*32+29) * "eagerfpu" Non 
lazy FPU restore */
 #define X86_FEATURE_NONSTOP_TSC_S3 ( 3*32+30) /* TSC doesn't stop in S3 state 
*/
 
 /* Intel-defined CPU features, CPUID level 0x0001 (ecx), word 4 */
diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 2737366ea583..8852e3afa1ad 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -62,7 +62,7 @@ extern u64 fpu__get_supported_xfeatures_mask(void);
  */
 static __always_inline __pure bool use_eager_fpu(void)
 {
-   return static_cpu_has(X86_FEATURE_EAGER_FPU);
+   return true;
 }
 
 static __always_inline __pure bool use_xsaveopt(void)
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 2f2b8c7ccb85..1a09d133c801 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -15,10 +15,7 @@
  */
 static void fpu__init_cpu_ctx_switch(void)
 {
-   if (!boot_cpu_has(X86_FEATURE_EAGER_FPU))
-   stts();
-   else
-   clts();
+   clts();
 }
 
 /*
@@ -233,82 +230,16 @@ static void __init 
fpu__init_system_xstate_size_legacy(void)
 }
 
 /*
- * FPU context switching strategies:
- *
- * Against popular belief, we don't do lazy FPU saves, due to the
- * task migration complications it brings on SMP - we only do
- * lazy FPU restores.
- *
- * 'lazy' is the traditional strategy, which is based on setting
- * CR0::TS to 1 during context-switch (instead of doing a full
- * restore of the FPU state), which causes the first FPU instruction
- * after the context switch (whenever it is executed) to fault - at
- * which point we lazily restore the FPU state into FPU registers.
- *
- * Tasks are of course under no obligation to execute FPU instructions,
- * so it can easily happen that another context-switch occurs without
- * a single FPU instruction being executed. If we eventually switch
- * back to the original task (that still owns the FPU) then we have
- * not only saved the restores along the way, but we also have the
- * FPU ready to be used for the original task.
- *
- * 'lazy' is deprecated because it's almost never a performance win
- * and it's much more complicated than 'eager'.
- *
- * 'eager' switching is by default on all CPUs, there we switch the FPU
- * state during every context switch, regardless of whether the task
- * has used FPU instructions in that time slice or not. This is done
- * because modern FPU context saving instructions are able to optimize
- * state saving and restoration in hardware: they can detect both
- * unused and untouched FPU state and optimize accordingly.
- *
- * [ Note that even in 'lazy' mode we might optimize context switches
- *   to use 'eager' restores, if we detect that a task is using the FPU
- *   frequently. See the fpu->counter logic in fpu/internal.h for that. ]
- */
-static enum { ENABLE, DISABLE } eagerfpu = ENABLE;
-
-/*
  * Find supported xfeatures based on cpu features and command-line input.
  * This must be called after fpu__init_parse_early_param() is called and
  * xfeatures_mask is enumerated.
  */
 u64 __init fpu__get_supported_xfeatures_mask(void)
 {
-   /* Support all xfeatures known to us */
-   if (eagerfpu != DISABLE)
-   return XCNTXT_MASK;
-
-   /* Warning of xfeatures being disabled for no eagerfpu mode */
-   if (xfeatures_mask & XFEATURE_MASK_EAGER) {
-   pr_err("x86/fpu: eagerfpu switching disabled, disabling the 
following xstate features: 0x%llx.\n",
-   xfeatures_mask & XFEATURE_MASK_EAGER);
-   }
-
-   /* Return a mask that masks out all features requiring eagerfpu mode */
-   return ~XFEATURE_MA

[PATCH 3/5] x86: ascii armor the x86_64 boot init stack canary

2017-05-24 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/stackprotector.h 
b/arch/x86/include/asm/stackprotector.h
index dcbd9bcce714..8abedf1d650e 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -74,6 +74,7 @@ static __always_inline void boot_init_stack_canary(void)
get_random_bytes(&canary, sizeof(canary));
tsc = rdtsc();
canary += tsc + (tsc << 32UL);
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
 #ifdef CONFIG_X86_64
-- 
2.9.3



[PATCH 4/5] arm64: ascii armor the arm64 boot init stack canary

2017-05-24 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree.

Signed-off-by: Rik van Riel 
---
 arch/arm64/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/stackprotector.h 
b/arch/arm64/include/asm/stackprotector.h
index fe5e287dc56b..b86a0865ddf1 100644
--- a/arch/arm64/include/asm/stackprotector.h
+++ b/arch/arm64/include/asm/stackprotector.h
@@ -30,6 +30,7 @@ static __always_inline void boot_init_stack_canary(void)
/* Try to get a semi random initial value. */
get_random_bytes(&canary, sizeof(canary));
canary ^= LINUX_VERSION_CODE;
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
__stack_chk_guard = current->stack_canary;
-- 
2.9.3



[PATCH 2/5] fork,random: use get_random_canary to set tsk->stack_canary

2017-05-24 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree.

Signed-off-by: Rik van Riel 
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index aa1076c5e4a9..b3591e9250a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -560,7 +560,7 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig, int node)
set_task_stack_end_magic(tsk);
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-   tsk->stack_canary = get_random_long();
+   tsk->stack_canary = get_random_canary();
 #endif
 
/*
-- 
2.9.3



[PATCH 1/5] random,stackprotect: introduce get_random_canary function

2017-05-24 Thread riel
From: Rik van Riel 

Introduce the get_random_canary function, which provides a random
unsigned long canary value with the first byte zeroed out on 64
bit architectures, in order to mitigate non-terminated C string
overflows.

The null byte both prevents C string functions from reading the
canary, and from writing it if the canary value were guessed or
obtained through some other means.

Reducing the entropy by 8 bits is acceptable on 64-bit systems,
which will still have 56 bits of entropy left, but not on 32
bit systems, so the "ascii armor" canary is only implemented on
64-bit systems.
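
For illustration, a small userspace demo of what the zeroed byte buys
(assumes a little-endian 64-bit machine; the canary value is made up,
the mask matches the one introduced below):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CANARY_MASK 0xffffffffffffff00UL

int main(void)
{
	uint64_t canary = 0x1122334455667788UL & CANARY_MASK;
	char probe[sizeof(canary) + 1];

	/* Lay the canary out in memory the way it would sit on the stack. */
	memcpy(probe, &canary, sizeof(canary));
	probe[sizeof(canary)] = '\0';

	/*
	 * On little endian the first byte in memory is the zeroed one, so
	 * C string functions reading through the canary stop immediately,
	 * and an overflowing string copy cannot reproduce the canary
	 * without embedding a NUL of its own.
	 */
	printf("canary: %#llx, bytes a string read leaks: %zu\n",
	       (unsigned long long)canary, strlen(probe));
	return 0;
}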

Inspired by the "ascii armor" code in the old execshield patches,
and Daniel Micay's linux-hardened tree.

Signed-off-by: Rik van Riel 
---
 include/linux/random.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/include/linux/random.h b/include/linux/random.h
index ed5c3838780d..1fa0dc880bd7 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -57,6 +57,27 @@ static inline unsigned long get_random_long(void)
 #endif
 }
 
+/*
+ * On 64-bit architectures, protect against non-terminated C string overflows
+ * by zeroing out the first byte of the canary; this leaves 56 bits of entropy.
+ */
+#ifdef CONFIG_64BIT
+# ifdef __LITTLE_ENDIAN
+#  define CANARY_MASK 0xffffffffffffff00UL
+# else /* big endian, 64 bits: */
+#  define CANARY_MASK 0x00ffffffffffffffUL
+# endif
+#else /* 32 bits: */
+# define CANARY_MASK 0xffffffffUL
+#endif
+
+static inline unsigned long get_random_canary(void)
+{
+   unsigned long val = get_random_long();
+
+   return val & CANARY_MASK;
+}
+
 unsigned long randomize_page(unsigned long start, unsigned long range);
 
 u32 prandom_u32(void);
-- 
2.9.3



[PATCH v2 0/5] stackprotector: ascii armor the stack canary

2017-05-24 Thread riel
Zero out the first byte of the stack canary value on 64 bit systems,
in order to mitigate unterminated C string overflows.

The null byte both prevents C string functions from reading the
canary, and from writing it if the canary value were guessed or
obtained through some other means.

Reducing the entropy by 8 bits is acceptable on 64-bit systems,
which will still have 56 bits of entropy left, but not on 32
bit systems, so the "ascii armor" canary is only implemented on
64-bit systems.

Inspired by the "ascii armor" code in execshield and Daniel Micay's
linux-hardened tree.

Also see https://github.com/thestinger/linux-hardened/

v2:
 - improve changelogs
 - address Ingo's coding style comments



[PATCH 5/5] sh64: ascii armor the sh64 boot init stack canary

2017-05-24 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and Daniel Micay's linux-hardened tree.

Signed-off-by: Rik van Riel 
---
 arch/sh/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/sh/include/asm/stackprotector.h 
b/arch/sh/include/asm/stackprotector.h
index d9df3a76847c..141515a43b78 100644
--- a/arch/sh/include/asm/stackprotector.h
+++ b/arch/sh/include/asm/stackprotector.h
@@ -19,6 +19,7 @@ static __always_inline void boot_init_stack_canary(void)
/* Try to get a semi random initial value. */
get_random_bytes(&canary, sizeof(canary));
canary ^= LINUX_VERSION_CODE;
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
__stack_chk_guard = current->stack_canary;
-- 
2.9.3



[PATCH 2/5] fork,random: use get_random_canary to set tsk->stack_canary

2017-05-19 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and PaX/grsecurity.

Signed-off-by: Rik van Riel 
---
 kernel/fork.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index aa1076c5e4a9..b3591e9250a8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -560,7 +560,7 @@ static struct task_struct *dup_task_struct(struct 
task_struct *orig, int node)
set_task_stack_end_magic(tsk);
 
 #ifdef CONFIG_CC_STACKPROTECTOR
-   tsk->stack_canary = get_random_long();
+   tsk->stack_canary = get_random_canary();
 #endif
 
/*
-- 
2.9.3



[PATCH 3/5] x86: ascii armor the x86_64 boot init stack canary

2017-05-19 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and PaX/grsecurity.

Signed-off-by: Rik van Riel 
---
 arch/x86/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/include/asm/stackprotector.h 
b/arch/x86/include/asm/stackprotector.h
index dcbd9bcce714..8abedf1d650e 100644
--- a/arch/x86/include/asm/stackprotector.h
+++ b/arch/x86/include/asm/stackprotector.h
@@ -74,6 +74,7 @@ static __always_inline void boot_init_stack_canary(void)
get_random_bytes(&canary, sizeof(canary));
tsc = rdtsc();
canary += tsc + (tsc << 32UL);
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
 #ifdef CONFIG_X86_64
-- 
2.9.3



[PATCH 1/5] random,stackprotect: introduce get_random_canary function

2017-05-19 Thread riel
From: Rik van Riel 

Introduce the get_random_canary function, which provides a random
unsigned long canary value with the first byte zeroed out on 64
bit architectures, in order to mitigate non-terminated C string
overflows.

Inspired by the "ascii armor" code in the old execshield patches,
and the current PaX/grsecurity code base.

Signed-off-by: Rik van Riel 
---
 include/linux/random.h | 20 
 1 file changed, 20 insertions(+)

diff --git a/include/linux/random.h b/include/linux/random.h
index ed5c3838780d..765a992c6774 100644
--- a/include/linux/random.h
+++ b/include/linux/random.h
@@ -57,6 +57,26 @@ static inline unsigned long get_random_long(void)
 #endif
 }
 
+/*
+ * On 64 bit architectures, protect against non-terminated C string overflows
+ * by zeroing out the first byte of the canary; this leaves 56 bits of entropy.
+ */
+#ifdef CONFIG_64BIT
+#ifdef __LITTLE_ENDIAN
+#define CANARY_MASK 0xffffffffffffff00UL
+#else /* big endian 64 bits */
+#define CANARY_MASK 0x00ffffffffffffffUL
+#endif
+#else /* 32 bits */
+#define CANARY_MASK 0xffffffffUL
+#endif
+static inline unsigned long get_random_canary(void)
+{
+   unsigned long val = get_random_long();
+
+   return val & CANARY_MASK;
+}
+
 unsigned long randomize_page(unsigned long start, unsigned long range);
 
 u32 prandom_u32(void);
-- 
2.9.3



[PATCH 4/5] arm64: ascii armor the arm64 boot init stack canary

2017-05-19 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and PaX/grsecurity.

Signed-off-by: Rik van Riel 
---
 arch/arm64/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/stackprotector.h 
b/arch/arm64/include/asm/stackprotector.h
index fe5e287dc56b..b86a0865ddf1 100644
--- a/arch/arm64/include/asm/stackprotector.h
+++ b/arch/arm64/include/asm/stackprotector.h
@@ -30,6 +30,7 @@ static __always_inline void boot_init_stack_canary(void)
/* Try to get a semi random initial value. */
get_random_bytes(&canary, sizeof(canary));
canary ^= LINUX_VERSION_CODE;
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
__stack_chk_guard = current->stack_canary;
-- 
2.9.3



[PATCH 5/5] sh64: ascii armor the sh64 boot init stack canary

2017-05-19 Thread riel
From: Rik van Riel 

Use the ascii-armor canary to prevent unterminated C string overflows
from being able to successfully overwrite the canary, even if they
somehow obtain the canary value.

Inspired by execshield ascii-armor and PaX/grsecurity.

Signed-off-by: Rik van Riel 
---
 arch/sh/include/asm/stackprotector.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/sh/include/asm/stackprotector.h 
b/arch/sh/include/asm/stackprotector.h
index d9df3a76847c..141515a43b78 100644
--- a/arch/sh/include/asm/stackprotector.h
+++ b/arch/sh/include/asm/stackprotector.h
@@ -19,6 +19,7 @@ static __always_inline void boot_init_stack_canary(void)
/* Try to get a semi random initial value. */
get_random_bytes(&canary, sizeof(canary));
canary ^= LINUX_VERSION_CODE;
+   canary &= CANARY_MASK;
 
current->stack_canary = canary;
__stack_chk_guard = current->stack_canary;
-- 
2.9.3



stackprotector: ascii armor the stack canary

2017-05-19 Thread riel
Zero out the first byte of the stack canary value on 64 bit systems,
in order to prevent unterminated C string overflows from being able
to successfully overwrite the canary, even if an attacker somehow
guessed or obtained the canary value.

Inspired by execshield ascii-armor and PaX/grsecurity.

Thanks to Daniel Micay for extracting code of similar functionality
from PaX/grsecurity and making it easy to find in his linux-hardened
git tree on https://github.com/thestinger/linux-hardened/


